Android中的软件Watchdog

本文详细介绍了 Android 系统中 Watchdog 机制的工作原理及其关键组件。Watchdog 能够监视 SystemServer 中各 Service 的运行状态,一旦发现 Service 响应超时,将采取 dump 现场信息甚至重启 SystemServer 的措施来保证系统稳定性。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

文章转载于Android中的软件Watchdog,在此基础上增加了源文件的代码注释

WatchDog工作原理也是讲解Watchdog不错的文章

由于Android的SystemServer内有一票重要Service,所以在进程内有一个软件实现的Watchdog机制,用于监视SystemServer中各Service是否正常工作。如果超过一定时间(默认30秒),就dump现场便于分析,再超时(默认60秒)就重启SystemServer保证系统可用性。同时logcat中会打印类似下面信息:

W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager), Blocked in handler on WindowManager thread (WindowManager)


主要实现代码位于/frameworks/base/services/core/java/com/android/server/Watchdog.java和/frameworks/base/core/jni/android_server_Watchdog.cpp。大体的框架很简单。Watchdog是SystemServer中的独立线程,它隔一定时间间隔会向各监视线程调度一次检查操作。这个检查操作当中会调用已注册的Monitor对象。如果Monitor对象上产生死锁,或是关键线程hang住,那么该检查必定不能按时结束,这样就被Watchdog检查到。


先来看看总体类图。因为是唯一的,Watchdog实现为Singleton。其中维护HandlerChecker数组,对应要检查的线程。HandlerChecker数组中有Monitor数组,对应要检查的Monitor对象。要接受检查的对象需要实现Monitor接口。



初始化是从SystemServer的startOtherServices()中开始的,其大体流程如下:


首先,在SystemServer.java中,会创建Watchdog并启动。


  
  1. 472 Slog.i(TAG, “Init Watchdog”);
  2. 473 final Watchdog watchdog = Watchdog.getInstance();
  3. 474 watchdog.init(context, mActivityManagerService);
  4. 1120 Watchdog.getInstance().start();
在Watchdog的构造函数中,会为每个要检查的线程创建HandlerChecker,并加到mHandlerCheckers这个队列中。首先是FgThread。它继承自ServiceThread,是一个单例,负责那些常规的前台操作,它不应该被后台的操作所阻塞。在Watchdog.java中:

  
  1. 215 mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
  2. 216 "foreground thread", DEFAULT_TIMEOUT);
  3. 217 mHandlerCheckers.add(mMonitorChecker);

接下来,对于System Server的主线程,UI线程,IO线程和Display线程,分别做相同操作,这坨线程和FgThread一样都继承自ServiceThread。在init()函数中,接下来会调用registerReceiver()来注册系统重启的BroadcastReceiver。在收到系统重启广播时会执行RebootRequestReceiver的onReceive()函数,继而调用rebootSystem()重启系统。它允许其它模块(如CTS)通过发广播来让系统重启。


然后,各个需要被Watchdog监视的Service需要将自己进行注册。它们都实现了Watchdog.Monitor接口,其中主要是monitor()函数。例如ActivityManagerService:


  
  1. 2150 Watchdog.getInstance().addMonitor( this);
  2. 2151 Watchdog.getInstance().addThread(mHandler);
其中addMonitor()将自身放到foreground thread的HandlerChecker的monitor队列中,addThread()根据当前线程的Handler创建HandlerChecker并放到mHandlerCheckers队列中。monitor()的实现一般很简单,只是尝试去获得锁再释放锁。如果有deadlock,就会卡住无法返回。

  
  1. 18767 public void monitor() {
  2. 18768 synchronized ( this) { }
  3. 18769 }

回到SystemServer中,通过Watchdog的start()方法启动Watchdog线程,Watchdog.run()被执行。


Watchdog的主体是一个循环。在每一次循环中,所有HandlerChecker的scheduleCheckLocked()函数被调用。其中主要是往被监视线程的Looper中放入HandlerChecker对象,HandlerChecker本身作为Runnable,是线程的可执行体。因此当被监视线程把它从Looper中拿出来后,它的run()函数被调用。然后,Watchdog.run()等待最长30秒后,调用evaluateCheckerCompletionLocked()检查各HandlerChecker结果状态。一个HandlerChecker结果状态有四种,COMPLETED(0),WAITING(1), WAITED_HALF(2)和OVERDUE(3)。分别代表目标Looper已处理monitor,延时小于30秒,延时大于30秒小于60秒,延时大于60秒。最终的总状态是它们的最大值(也就是意味着最坏情况的那种情况)。如果总状态是COMPLETED,WAITING或WAITED_HALF,则进入循环下一轮。注意如果是WAITED_HALF,也就是说等了大于30秒,需要调用AMS.dumpStackTraces()来dump栈。如果状态为WAITED_HALF,进入下一轮循环后,又会等待最长30秒。

假设有线程阻塞,对于阻塞线程的HandlerChecker,它的延迟超过60秒,导致总状态为OVERDUE。这里会调用getBlockedCheckersLocked()和describeCheckersLocked()打印出是哪个Handler阻塞了。在Eventlog中打出信息后,把当前pid和相关进程pid放入待杀列表。然后和上面一样调用AMS.dumpStackTraces()打印栈。之后等待2秒等stacktrace写完。如有需要还会调用dumpKernelStackTraces()将kernel部分的stacktrace打出来。本质上是读取/proc/[pid]/task下的线程的和相应的/proc/[tid]/stack文件。然后调用doSyncRq()通知kernel把阻塞线程信息和backtrace打印出来(通过写/proc/sysrq-trigger)。之后会创建专门的线程来将信息写入到Dropbox,该线程会执行AMS.addErrorToDropBox()。Dropbox是Android中的一套日志记录系统,作用是将系统出错信息记录下来并且保留一段时间,避免被覆盖。当出现crash, wtf, lowmem或者Watchdog被触发都会通过Dropbox来记录。


  
  1. 425 Thread dropboxThread = new Thread( "watchdogWriteToDropbox") {
  2. 426 public void run() {
  3. 427 mActivity.addErrorToDropBox(
  4. 428 "watchdog", null, "system_server", null, null,
  5. 429 subject, null, stack, null);
  6. 430 }
  7. 431 };
  8. 432 dropboxThread.start();
  9. 433 try {
  10. 434 dropboxThread.join( 2000); // wait up to 2 seconds for it to return.
  11. 435 } catch (InterruptedException ignored) {}
DropBoxManagerService在SystemServer的startOtherServices()中添加,信息默认存放路径为/data/system/dropbox。DropBoxManagerService实现了IDropBoxManagerService服务接口,Client通过DropBoxManager来访问服务。在Watchdog中调用AMS.addErrorToDropBox()后,该函数会起工作线程(因为涉及I/O),将前面得到的stack信息dump到Dropbox,并且获取最近的logcat,最后通过DBMS的addText()接口进行存储。 

接下来,如果设置了ActivityController就调用其systemNotResponding()接口(IActivityController是给测试开发时用的接口,用于监视AMS里的行为)。然后判断Debugger是否连着和是否允许restart。如果没有连着debugger且允许restart,就开始大开杀戒了。

  
  1. 467 Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
  2. ...
  3. 476 Slog.w(TAG, "*** GOODBYE!");
  4. 477 Process.killProcess(Process.myPid());
  5. 478 System. exit( 10);
因为Watchdog和SystemServer是同一进程,这里Watchdog kill掉了自己,也就是kill了SystemServer。因它是主要进程,杀掉后会被init重启。

这就是Watchdog的大体流程,回过头看下AMS中dumpStackTraces()的一些细节。参数中的pids包含了本进程,阻塞线程以及phone进程等。NATIVE_STACKS_OF_INTEREST包含了下面三个关键进程。


  
  1. 67 public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
  2. 68 "/system/bin/mediaserver",
  3. 69 "/system/bin/sdcard",
  4. 70 "/system/bin/surfaceflinger"
  5. 71 };
注意虽然它们不在SystemServer中,但因为SystemServer中的Service会用Binder同步调用它们的方法。如果这些进程中阻塞,也可能导致SystemServer中发生阻塞。

dumpStackTraces()实现中先从dalvik.vm.stack-trace-file这个system property中取出trace路径,默认为/data/anr/traces.txt。接着它创建该文件(需要的话),设置属性,最后调用同名函数dumpStackTraces()完成真正的dump工作。dump工作首先会用FileObserver(利用inotify机制)监视trace文件什么时候写完。它会创建一个独立的线程ObserverThread并运行。然后对于前面加入到要dump线程列表中的进程,发送SIGQUIT信号。如果是虚拟机进程,ART中的SignalCatcher::HandleSigQuit()(在/art/runtime/signal_catcher.cc)会被调用来dump信息(DVM也是类似的)。对于前面的核心Service,调用Debug.dumpNativeBacktraceToFile()来输出它们的backtrace。

总结下来,dumpStackTraces()大体流程如下:

可以看到,其中主要收集三类信息:一是关键进程(也就是上面收集的pid)的stacktrace;二是几个关键native服务的stacktrace;三是CPU的使用率。其中一是通过往目标进程发SIGQUIT来获取,因为Java虚拟机的Runtime会捕获SIGQUIT信号打印栈信息。二的原理是向后台debuggerd这个daemon发起申请,让其用ptrace打印目标进程的stacktrace然后用本地socket传回来。部分实现位于android_os_Debug.cpp和/system/core/libcutils/debugger.c。发起申请和接收数据的代码在以下函数:


  
  1. 131 int dump_backtrace_to_file_timeout(pid_t tid, int fd, int timeout_secs) {
  2. 132 int sock_fd = make_dump_request(DEBUGGER_ACTION_DUMP_BACKTRACE, tid, timeout_secs);
  3. ...
  4. 137 /* Write the data read from the socket to the fd. */
  5. ...
  6. 141 while ((n = TEMP_FAILURE_RETRY(read(sock_fd, buffer, sizeof(buffer)))) > 0) {
  7. 142 if (TEMP_FAILURE_RETRY(write(fd, buffer, n)) != n) {
  8. 143 result = -1;
  9. 144 break;
  10. 145 }
  11. 146 }
  12. ...
最后使用ProcessCpuTracker类测量CPU使用率。它主要是通过读系统的/proc/[pid]/stat文件。里边可以读到进程所占用的时间(user mode和kernel mode)。统计半秒后,排序后输出最占CPU的前几名的stacktrace以便分析谁可能是罪魁祸首。

总得来说,Watchdog是一个软件实现的检测SystemServer进程内死锁或挂起问题,并能够从中恢复的机制。除了Watchdog外,Android中还有一些自检容错及出错信息收集机制,前者有ANR,OOM Killer,init中的重启机制等,后者有Dropbox,Debuggerd,Eventlog,Bugreport等。除此之外,其它的信息查看和调试命令就不计其数了,如dumpsys, dumpstate, showslab, procrank, procmem, latencytop, librank, schedtop, svc, am ,wm, atrace, proc, pm, service, getprop/setprop, logwrapper, input, getevent/sendevent等。充分利用这些工具能够有效提高分析问题的效率。

以下是注释过的Watchdog类

/*
* Copyright (C) 2014 MediaTek Inc.
* Modification based on code covered by the mentioned copyright
* and/or permission notice(s).
*/
/*
 * Copyright (C) 2008 The Android Open Source Project
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.android.server;

import android.app.IActivityController;
import android.os.Binder;
import android.os.RemoteException;
import com.android.server.am.ActivityManagerService;

import android.content.BroadcastReceiver;
import android.content.ContentResolver;
import android.content.Context;
import android.content.Intent;
import android.content.IntentFilter;
import android.os.Debug;
import android.os.Handler;
import android.os.IPowerManager;
import android.os.Looper;
import android.os.Process;
import android.os.ServiceManager;
import android.os.SystemClock;
import android.os.SystemProperties;
import android.util.EventLog;
import android.util.Log;
import android.util.Slog;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;

import com.mediatek.aee.ExceptionLog;

/**
 * This class calls its monitor every minute. Killing this process if they don't return
 * 用于监视SystemServer中各Service是否正常工作。如果超过一定时间(默认30秒),
 * 就dump现场便于分析,再超时(默认60秒)就重启SystemServer保证系统可用性。
 * **/
public class Watchdog extends Thread {
    static final String TAG = "Watchdog";

    // Set this to true to use debug default values.
    static final boolean DB = false;

    // Set this to true to have the watchdog record kernel thread stacks when it fires
    static final boolean RECORD_KERNEL_THREADS = true;

    static final long DEFAULT_TIMEOUT = DB ? 10*1000 : 60*1000;
    static final long CHECK_INTERVAL = DEFAULT_TIMEOUT / 2;

    // These are temporally ordered: larger values as lateness increases
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;
    static final int TIME_SF_WAIT = 20000;

    // Which native processes to dump into dropbox's stack traces
    public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
        "/system/bin/audioserver",
        "/system/bin/cameraserver",
        "/system/bin/drmserver",
        "/system/bin/mediadrmserver",
        "/system/bin/mediaserver",
        "/system/bin/sdcard",
        "/system/bin/surfaceflinger",
        "media.codec",     // system/bin/mediacodec
        "media.extractor", // system/bin/mediaextractor
        "com.android.bluetooth",  // Bluetooth service
    };

    static Watchdog sWatchdog;
    ExceptionLog exceptionHWT;

    /* This handler will be used to post message back onto the main thread */
    //handler集合,检测线程的loop是否阻塞
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
    //monitor集合,检测是否死锁
    final HandlerChecker mMonitorChecker;
    ContentResolver mResolver;
    ActivityManagerService mActivity;

    int mPhonePid;
    IActivityController mController;
    boolean mAllowRestart = true;
    public long GetSFStatus() {
        if (exceptionHWT != null) {
         return exceptionHWT.SFMatterJava(0, 0);
        } else {
        return 0;
    }
    }

    public static int GetSFReboot() {
        return SystemProperties.getInt("service.sf.reboot", 0);
    }

    public static void SetSFReboot() {
        int OldTime = SystemProperties.getInt("service.sf.reboot", 0);
        OldTime = OldTime + 1;
        if (OldTime > 9) OldTime = 9;
        SystemProperties.set("service.sf.reboot", String.valueOf(OldTime));
    }


    /**
     * Used for checking status of handle threads and scheduling monitor callbacks.
     * 通过Handler的looper的MessageQueue来判断该线程是否卡住。
     */
    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName;
        private final long mWaitMax;
        /* Monitor的实现类可以通过addMonitor()把自己加入这个集合,以便watchdog在轮询的过程中可以通过scheduleCheckLocked回调*/
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

        public void addMonitor(Monitor monitor) {
            mMonitors.add(monitor);
        }

        public void scheduleCheckLocked() {
            //mMonitors集合为空,且loop正在轮询,返回
            //isPolling()返回当前的looper线程是否在polling工作来做,这个是检测loop是否存活的方法,也就是是否被阻塞了。
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread.  Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;//设为false,说明回调正在执行
            mCurrentMonitor = null;
            /*记录开始check monitor的时间,这里之所以记录开始的时间
             * 是因为如果到了等待的最大时间mWaitMax,mCompleted还是false,就会返回超时OVERDUE
             */
            mStartTime = SystemClock.uptimeMillis();
            //插入消息到mHandler的消息队列头
            mHandler.postAtFrontOfQueue(this);
        }

        public boolean isOverdueLocked() {
            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
        }

        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }

        public Thread getThread() {
            return mHandler.getLooper().getThread();
        }

        public String getName() {
            return mName;
        }

        public String describeBlockedStateLocked() {
            if (mCurrentMonitor == null) {
                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
            } else {
                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                        + " on " + mName + " (" + getThread().getName() + ")";
            }
        }

        /* 执行mMonitors集合的所有回调*/
        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;//设为true说明回调执行结束
                mCurrentMonitor = null;
            }
        }
    }

    final class RebootRequestReceiver extends BroadcastReceiver {
        @Override
        public void onReceive(Context c, Intent intent) {
            if (intent.getIntExtra("nowait", 0) != 0) {
                rebootSystem("Received ACTION_REBOOT broadcast");
                return;
            }
            Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
        }
    }

    /** Monitor for checking the availability of binder threads. The monitor will block until
     * there is a binder thread available to process in coming IPCs to make sure other processes
     * can still communicate with the service.
     */
    private static final class BinderThreadMonitor implements Watchdog.Monitor {
        @Override
        public void monitor() {
            Binder.blockUntilThreadAvailable();
        }
    }

    public interface Monitor {
        void monitor();
    }

    /**
     * 单例模式,但是这个地方没有,应该是不会发生并发
     * 从代码看,确实也只有SystemServer类调用了
     * @return
     */
    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

    /* watchdog是Thread的子类*/
    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        /**
         * 对于System Server的主线程,UI线程,IO线程和Display线程,分别做相同操作,这坨线程和FgThread一样都继承自ServiceThread。
         */
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
        //if (SystemProperties.get("ro.have_aee_feature").equals("1")) {
            exceptionHWT = new ExceptionLog();
        //}

    }

    public void init(Context context, ActivityManagerService activity) {
        /**
         * 对于每一个应用程序来说,如果想要访问内容提供器中的共享数据,就一定要借助ContentResolver类,
         * 可以通过Context中的getContentResolver()方法获取到该类的实例,ContentResolver中提供了一系列的方法用于对数据进行CRUD
         *操作.其中insert()方法用于添加数据,update()方法用于更新数据,delete()方法用于删除数据,query()方法用于查询数据.
         *不同于SQLiteDatabase,ContentResolver中的增删改查方法都是不接收表名参数的,而是使用一个Uri参数代替,这个参数被称为内容URI,
         *内容URI给内容提供器中的数据建立了唯一标识符,它主要由两部分组成:authority和path,authority是用于对不同的应用程序做区分的,一般为了避免冲突,
         *都会采用程序包名的方式来进行命名,比如某个程序的包名是com.example.app,
         *那么该程序对应的authority就可以命名为com.example.app.procider.path则是用于对同一个应用程序中不同的表做区分的,通常会添加到authority后面,
         *比如某个应用的数据库里面存在两张表:table1和table2,这时就可以将path分别命名为/table1和table2,然后把authority和path进行组合,
         *内容URI就变成了 com.example.app.procider/table1和com.example.app.procider/table2.
         *
         * 不过mResolver在这个类中好像没有使用
         */
        mResolver = context.getContentResolver();
        /**
         * 获取ActivityManagerService,为了使用addErrorToDropBox()方法
         * 此方法用于调用DropBoxManager记录日志。DropBoxManager 是 Android 在 Froyo(API level 8) 引入的用来持续化存储系统数据的机制,
         * 主要用于记录 Android 运行过程中, 内核, 系统进程, 用户进程等出现严重问题时的 log, 可以认为这是一个可持续存储的系统级别的 logcat.
         */
        mActivity = activity;

        /* 注册重启系统的广播 */
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);

        if(exceptionHWT!= null){
                exceptionHWT.WDTMatterJava(0);
        }
    }

    public void processStarted(String name, int pid) {
        synchronized (this) {
            if ("com.android.phone".equals(name)) {
                mPhonePid = pid;
            }
        }
    }

    public void setActivityController(IActivityController controller) {
        synchronized (this) {
            mController = controller;
        }
    }

    public void setAllowRestart(boolean allowRestart) {
        synchronized (this) {
            mAllowRestart = allowRestart;
        }
    }

    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            mMonitorChecker.addMonitor(monitor);
        }
    }

    public void addThread(Handler thread) {
        addThread(thread, DEFAULT_TIMEOUT);
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Threads can't be added once the Watchdog is running");
            }
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }

    /**
     * Perform a full reboot of the system.
     */
    void rebootSystem(String reason) {
        Slog.i(TAG, "Rebooting system because: " + reason);
        IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
        try {
            pms.reboot(false, reason, false);
        } catch (RemoteException ex) {
        }
    }

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }

    private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
        ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            if (hc.isOverdueLocked()) {
                checkers.add(hc);
            }
        }
        return checkers;
    }

    private String describeCheckersLocked(ArrayList<HandlerChecker> checkers) {
        StringBuilder builder = new StringBuilder(128);
        for (int i=0; i<checkers.size(); i++) {
            if (builder.length() > 0) {
                builder.append(", ");
            }
            builder.append(checkers.get(i).describeBlockedStateLocked());
        }
        return builder.toString();
    }

    @Override
    public void run() {
        boolean waitedHalf = false;
        boolean mSFHang = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            String subject;
            mSFHang = false;
            if (exceptionHWT != null && waitedHalf == false ) {
                exceptionHWT.WDTMatterJava(300);
            }
            final boolean allowRestart;
            int debuggerWasConnected = 0;

            Slog.w(TAG, "SWT Watchdog before synchronized:" + SystemClock.uptimeMillis());

            synchronized (this) {

                Slog.w(TAG, "SWT Watchdog after synchronized:" + SystemClock.uptimeMillis());

                long timeout = CHECK_INTERVAL;
                long SFHangTime;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                //MTK enhance
                SFHangTime = GetSFStatus();
                Slog.w(TAG, "**Get SF Time **" + SFHangTime);
                if (SFHangTime > TIME_SF_WAIT * 2) {
                    Slog.v(TAG, "**SF hang Time **" + SFHangTime);
                    mSFHang = true;

                } //@@
                else {
                    final int waitState = evaluateCheckerCompletionLocked();
                    if (waitState == COMPLETED) {
                        // The monitors have returned; reset
                        waitedHalf = false;
                        //CputimeEnable(new String("0"));
                        continue;
                    } else if (waitState == WAITING) {
                        // still waiting but within their configured intervals; back off and recheck
                       // CputimeEnable(new String("0"));
                        continue;
                    } else if (waitState == WAITED_HALF) {
                        if (!waitedHalf) {
                            // We've waited half the deadlock-detection interval.  Pull a stack
                            // trace and wait another half.
                        if (exceptionHWT != null) {
                            exceptionHWT.WDTMatterJava(360);
                        }
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                            waitedHalf = true;
                        }
                        continue;
                    }
                }
                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            Slog.e(TAG, "**SWT happen **" + subject);
            if (mSFHang && subject.isEmpty()) {
                subject = "surfaceflinger  hang.";
            }
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            if (exceptionHWT != null) {
                 exceptionHWT.WDTMatterJava(420);
            }
            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            /// M: WDT debug enhancement
            /// need to wait the AEE dumps all info, then kill system server @{
            /*
            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                name, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}
            */
            Slog.v(TAG, "** save all info before killnig system server **");
            mActivity.addErrorToDropBox("watchdog", null, "system_server", null, null, subject, null, null, null);

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if ((mSFHang == false) && (controller != null)) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    Slog.i(TAG, "Binder.setDumpDisabled");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                    Slog.i(TAG, "Activity controller requested to reboot");
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                /// @}

                Slog.w(TAG, "*** GOODBYE!");
                // MTK enhance
                if (mSFHang)
                {
                    Slog.w(TAG, "SF hang!");
                    if (GetSFReboot() > 3)
                    {
                        Slog.w(TAG, "SF hang reboot time larger than 3 time, reboot device!");
                        rebootSystem("Maybe SF driver hang,reboot device.");
                    }
                    else
                    {
                        SetSFReboot();
                    }
                }
                //@

                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }

    private void doSysRq(char c) {
        try {
            FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
            sysrq_trigger.write(c);
            sysrq_trigger.close();
        } catch (IOException e) {
            Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
        }
    }

    private File dumpKernelStackTraces() {
        String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
        if (tracesPath == null || tracesPath.length() == 0) {
            return null;
        }

        native_dumpKernelStacks(tracesPath);
        return new File(tracesPath);
    }

    private native void native_dumpKernelStacks(String tracesPath);
}
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值