Picking Up The Pieces After Your LINUX System Crashes
Alan Boda
Hewlett-Packard Company
Introduction
Did my system crash or hang?
Why did my system crash or hang?
What do I save?
What should I do next time?
What should I do now?
Alan Boda - HP 8/15/2006
What we will be discussing
System Admin goals
Crash analogy
Difference between a crash and a hang
Environment Scenarios
Tools to have in place now
What data to gather and how to gather it
What to do before and after the reboot
Reconstructing the crash scene
Ways to look at gathered data
What we won’t be discussing
Crash dump analysis
Tool installation and configuration details
System or Application Performance
Tuning
Goals as System Administrator
Prepare system so it can tell you what happened if it hangs or crashes
Reconstruct the Crash/Hang Scene
Develop emergency procedures
[Diagram labels: sar, environment, logs, dumps, track, record updates]
Car Crash Analogy
Accident Reconstruction Consultants
Evidence and Clues
Making the accident scene tell its story
More clues = clearer picture
Goal: Prep system to tell you what happened
Car Crash vs. System Crash
skid marks -> performance degradation
weather -> system environment changes
blown tire -> failures in storage, fan, power supply, etc.
eyewitnesses -> system administrator or user who saw the failure or hang
survivors -> Do any processes respond?
– Login response?
– Db/sql query response?
– Ping response?
System Tools to Configure Now!
SysRq
aka magic keys
used during hang/freeze situations
Alt-SysRq-<command key> sequence
provides memory, stack trace, process info
commands to sync disks, crash system
logs to messages and netdump (RHEL) – best effort
RHEL and SLES kernels are built with SysRq support, but it is not enabled by default.
To verify if enabled:
# cat /proc/sys/kernel/sysrq
(1 = enabled, 0 = disabled)
Security risk: anyone at the console can sync disks or crash the system
SysRq Configuration
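Set /proc/sys/kernel/sysrq to 1:
# echo 1 > /proc/sys/kernel/sysrq
-or-
# sysctl -w kernel.sysrq=1
To keep the setting across reboots, set kernel.sysrq = 1 in /etc/sysctl.conf and reload it:
# sysctl -p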
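The verification step can be wrapped in a small helper for use in scripts. This is a minimal sketch, not a vendor tool: the function name sysrq_state is hypothetical, and the optional file argument exists only so the function can be tried against a test file. Values other than 0 and 1 are reported as-is, on the assumption that some newer kernels treat the setting as a bitmask.

```shell
#!/bin/sh
# sysrq_state [FILE]: print the SysRq state read from FILE
# (default: /proc/sys/kernel/sysrq). Hypothetical helper name.
sysrq_state() {
    file="${1:-/proc/sys/kernel/sysrq}"
    [ -r "$file" ] || { echo "unknown"; return 2; }
    value=$(cat "$file")
    case "$value" in
        0) echo "disabled"; return 1 ;;
        1) echo "enabled"; return 0 ;;
        # Assumption: any other number is a partial (bitmask) setting.
        *) echo "partial ($value)"; return 0 ;;
    esac
}
```

Calling sysrq_state with no argument checks the running system; the exit status is nonzero when SysRq is disabled or the file cannot be read.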
Sample SysRq Output
SysRq : Show Regs
Pid/TGid: 0/0, comm: swapper
EIP: 0060:[<c0109129>] CPU: 1
EIP is at default_idle [kernel] 0x29 (2.4.21-20.ELsmp)
ESP: 080b:c01091c2 EFLAGS: 00000246 Tainted: P
EAX: 00000000 EBX: c0109100 ECX: c043bc80 EDX: c9b20000
ESI: c9b20000 EDI: c9b20000 EBP: c0109100 DS: 0068 ES: 0068 FS: 0000 GS: 0000
CR0: 8005003b CR2: b729f000 CR3: 376c9900 CR4: 000006f0
Call Trace: [<c01091c2>] cpu_idle [kernel] 0x42 (0xc9b21fb0)
[<c01291c3>] call_console_drivers [kernel] 0x63 (0xc9b21fc4)
[<c01294f3>] printk [kernel] 0x153 (0xc9b21ffc)
Zone:Normal freepages:108783 min: 1279 low: 4544 high: 6304
Zone:HighMem freepages:1209405 min: 255 low: 20990 high: 31485
Free pages: 1321089 (1209405 HighMem)
( Active: 78806/14876, inactive_laundry: 4493, inactive_clean: 0, free: 1321089 )
…
sysstat
package containing iostat, sadc, sar, mpstat
System activity data collected
snapshots taken every 10 minutes
saves 7 days of reports by default (RHEL)
to verify:
# rpm -qa | grep sysstat
sysstat-5.0.1-35.4
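Having the package installed does not prove that data is being collected. A rough check, sketched below, is to see whether a binary sa file was written recently. The function name sa_is_fresh and the 20-minute default are assumptions chosen to allow for one missed 10-minute snapshot; the -mmin test is supported by GNU and BSD find but is not in POSIX.

```shell
#!/bin/sh
# sa_is_fresh [DIR] [MINUTES]: succeed if DIR (default /var/log/sa) holds a
# binary sa data file modified within the last MINUTES (default 20).
# Hypothetical helper; text reports (sar*) are ignored because they are
# only written once a day.
sa_is_fresh() {
    dir="${1:-/var/log/sa}"
    mins="${2:-20}"
    [ -d "$dir" ] || return 2
    recent=$(find "$dir" -type f -name 'sa*' ! -name 'sar*' \
                  -mmin "-$mins" 2>/dev/null | head -n 1)
    [ -n "$recent" ]
}
```

For example, `sa_is_fresh || echo "sysstat is not collecting"` could be run from a monitoring job.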
Contents of /var/log/sa (RHEL)
# ls -l /var/log/sa
total 4060
-rw-r--r-- 1 root root 207600 Jan 20 23:50 sa20
-rw-r--r-- 1 root root 207600 Jan 21 23:50 sa21
-rw-r--r-- 1 root root 207600 Jan 22 23:50 sa22
-rw-r--r-- 1 root root 207600 Jan 23 23:50 sa23
-rw-r--r-- 1 root root 207600 Jan 24 23:50 sa24
-rw-r--r-- 1 root root 207600 Jan 25 23:50 sa25
-rw-r--r-- 1 root root 207600 Jan 26 23:50 sa26
-rw-r--r-- 1 root root 207600 Jan 27 23:50 sa27
-rw-r--r-- 1 root root 88080 Jan 28 10:00 sa28
-rw-r--r-- 1 root root 287976 Jan 20 23:53 sar20
-rw-r--r-- 1 root root 287976 Jan 21 23:53 sar21
-rw-r--r-- 1 root root 287976 Jan 22 23:53 sar22
-rw-r--r-- 1 root root 287976 Jan 23 23:53 sar23
-rw-r--r-- 1 root root 287976 Jan 24 23:53 sar24
-rw-r--r-- 1 root root 287976 Jan 25 23:53 sar25
-rw-r--r-- 1 root root 287976 Jan 26 23:53 sar26
-rw-r--r-- 1 root root 287976 Jan 27 23:53 sar27
Contents of /var/log/sa (SLES)
# ls /var/log/sa
. sa.2006_01_13 sa.2006_01_24 sar.2006_01_10 sar.2006_01_21
.. sa.2006_01_14 sa.2006_01_25 sar.2006_01_11 sar.2006_01_22
sa.2006_01_04 sa.2006_01_15 sa.2006_01_26 sar.2006_01_12 sar.2006_01_23
sa.2006_01_05 sa.2006_01_16 sa.2006_01_27 sar.2006_01_13 sar.2006_01_24
sa.2006_01_06 sa.2006_01_17 sa.2006_01_28 sar.2006_01_14 sar.2006_01_25
sa.2006_01_07 sa.2006_01_18 sar.2006_01_04 sar.2006_01_15 sar.2006_01_26
sa.2006_01_08 sa.2006_01_19 sar.2006_01_05 sar.2006_01_16 sar.2006_01_27
sa.2006_01_09 sa.2006_01_20 sar.2006_01_06 sar.2006_01_17
sa.2006_01_10 sa.2006_01_21 sar.2006_01_07 sar.2006_01_18
sa.2006_01_11 sa.2006_01_22 sar.2006_01_08 sar.2006_01_19
sa.2006_01_12 sa.2006_01_23 sar.2006_01_09 sar.2006_01_20
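The two listings use different file names for the same data: saDD on RHEL and sa.YYYY_MM_DD on SLES. A script that has to work on both can hide the difference behind one function. The sketch below is illustrative only; the function name sa_file_for and the rhel/sles style keywords are made up for this example.

```shell
#!/bin/sh
# sa_file_for YYYY-MM-DD rhel|sles [DIR]: print the sa data file name
# for the given day, following the naming shown in the listings above.
# Hypothetical helper name and style keywords.
sa_file_for() {
    ymd="$1"
    style="$2"
    dir="${3:-/var/log/sa}"
    year=${ymd%%-*}        # text before the first '-'
    rest=${ymd#*-}         # MM-DD
    month=${rest%%-*}
    day=${rest#*-}
    case "$style" in
        rhel) echo "$dir/sa$day" ;;
        sles) echo "$dir/sa.${year}_${month}_${day}" ;;
        *)    echo "unknown style: $style" >&2; return 1 ;;
    esac
}
```

The result can be passed straight to sar, for example `sar -A -f "$(sa_file_for 2006-01-20 rhel)"`.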
Sample sar report
# more sar20
Linux 2.4.21-37.ELsmp (karp.alf.cpqcorp.net) 2006-01-20
00:00:00 proc/s
00:10:00 0.03
00:20:00 0.01
00:30:00 0.01
00:40:00 0.01
00:50:00 0.01
01:00:00 0.01
01:10:00 0.03
01:20:00 0.01
01:30:00 0.01
01:40:00 0.01
01:50:00 0.01
02:00:00 0.01
02:10:00 0.03
02:20:00 0.01
02:30:00 0.01
02:40:00 0.01
System Management Tools
snmp-based tools
Examples:
– IBM: IBM Director Agents
– Dell: OpenManage Server Administrator
– HP: Insight Manager and Agents
Agents monitor and log to system logs
Predictive fault (if supported by driver)
Other tools
Special situations
vendor-specific cron script to gather
– /proc/meminfo
– top
– /proc/slabinfo
– vmstat
– netstat
– interrupt
– lsof
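Where no vendor script is available, a home-grown collector along the following lines could be run from cron. This is a sketch under assumptions, not the vendor-specific script the slide refers to: the function name collect_snapshot and the output layout are invented here, and each tool is skipped if it is not installed or not readable by the calling user.

```shell
#!/bin/sh
# collect_snapshot OUTDIR: write one timestamped set of statistics under
# OUTDIR and print the directory created. Hypothetical helper.
collect_snapshot() {
    outdir="$1"
    stamp=$(date +%Y%m%d-%H%M%S)
    dest="$outdir/$stamp"
    mkdir -p "$dest" || return 1

    # Copy the /proc files that exist and are readable.
    for f in /proc/meminfo /proc/slabinfo /proc/interrupts; do
        [ -r "$f" ] && cat "$f" > "$dest/$(basename "$f")" 2>/dev/null
    done

    # Run each tool only if it is installed; keep going on errors.
    command -v vmstat  >/dev/null 2>&1 && vmstat 1 2    > "$dest/vmstat"  2>&1
    command -v top     >/dev/null 2>&1 && top -b -n 1   > "$dest/top"     2>&1
    command -v netstat >/dev/null 2>&1 && netstat -an   > "$dest/netstat" 2>&1
    command -v lsof    >/dev/null 2>&1 && lsof -n       > "$dest/lsof"    2>&1

    echo "$dest"
}
```

Old snapshot directories are not pruned here; a real deployment would need some form of rotation so the collector cannot fill the disk.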
Crash Dump Issues
Inconsistent crash dump methods
Standard kernel
deadlocks
resources
network throughput for network-based dumps
assumes trusted kernel state
where to dump
ASR interference
Crash Dump tools
netdump - RHEL
diskdump - RHEL
LKCD - SLES
mkdump
kdump
Netdump
dumps to remote disk
nic must support polled operation
log file of panic, oops and other SysRq output
Verify:
– # service netdump status
– # service netdump-server status
– Check /etc/sysconfig/netdump
DEV=eth0 (or other nic)
NETDUMPADDR={netdump-server IP}
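The configuration check can be scripted by reading the two settings back out of the file. The sketch below assumes plain KEY=value lines as shown above; the function name netdump_setting is hypothetical, and quoted values are returned with their quotes intact.

```shell
#!/bin/sh
# netdump_setting KEY [FILE]: print the value assigned to KEY in FILE
# (default /etc/sysconfig/netdump). Commented-out lines are ignored;
# if KEY is set more than once, the last assignment wins.
netdump_setting() {
    key="$1"
    file="${2:-/etc/sysconfig/netdump}"
    [ -r "$file" ] || return 1
    sed -n "s/^[[:space:]]*$key=//p" "$file" | tail -n 1
}
```

An empty result for DEV or NETDUMPADDR suggests the client has not been configured yet.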
Diskdump
dumps to local disk
limited controllers
dump levels
Available as of RHEL 3 U3
Verify:
LKCD
dumps to local disk
can also dump to netdump-server (default)
different dump levels
Verify:
– # lkcd query
mkdump
minikernel dump (based on mkexec)
OpenSource
Uses netdump and LKCD dump format
kdump
kexec-based kernel crash dump mechanism
OpenSource
Use crash to analyze dump file
Dump Suggestions
Disk-based dumps
Network-based dumps
Automatic Server Recovery (ASR)
timeouts
Synchronize time
Best effort
Test out Dump
Enable the magic sysrq key
# sysctl -w kernel/sysrq=1
Enable panic_on_oops
# sysctl -w kernel/panic_on_oops=1
netdump: check to see if netlog is working
# echo h > /proc/sysrq-trigger
netdump: Test SysRq writes to netdump log file
# echo m > /proc/sysrq-trigger
Sync all mounted file systems
# echo s > /proc/sysrq-trigger
Crash the system
– # echo c > /proc/sysrq-trigger (RHEL)
– # echo d > /proc/sysrq-trigger (SLES)
– crash.c (RHEL)
diskdump – check /var/crash/127.0.0.1-<date>
lkcd – check /var/log/dump/
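When rehearsing these steps it is easy to type the crash key by mistake. One possible guard, sketched here, is a wrapper that only passes through the informational keys (h, m, p, t) and sync (s), and refuses everything else, including c and d. The function name sysrq_send is hypothetical, and the optional second argument exists only so the wrapper can be exercised against an ordinary file.

```shell
#!/bin/sh
# sysrq_send KEY [TRIGGER]: write a non-destructive SysRq key to TRIGGER
# (default /proc/sysrq-trigger). Hypothetical helper; deliberately refuses
# keys that crash, dump, or reboot the system.
sysrq_send() {
    key="$1"
    trigger="${2:-/proc/sysrq-trigger}"
    case "$key" in
        h|m|p|s|t) ;;    # help, memory, registers, sync, tasks
        *) echo "refusing SysRq key '$key'" >&2; return 1 ;;
    esac
    printf '%s\n' "$key" > "$trigger"
}
```

The real crash test still has to be done by hand with the commands above, which keeps the destructive step a deliberate one.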
System Snapshot
Take snapshot of working system now
Run normal working load while taking snapshot
Will discuss tools one can use shortly
Now that the System has Crashed or Hung
After the Reboot
kernel (uname -a)
loaded modules (lsmod)
bus information (lspci -v)
boot information (dmesg)
system logs (/var/log/*)
memory (/proc/meminfo)
cpu (/proc/cpuinfo)
disk (/proc/scsi/scsi)
disk partition (/proc/partitions)
installed RPMs (/var/log/rpminfo, "rpm -qa")
time of hang or crash
cpu details (dmidecode)
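If none of the packaged snapshot tools is installed, most of the items in this list can be captured by hand with a short script. The following is a sketch only: the function names save_cmd and post_reboot_snapshot and the output file names are invented for illustration, commands that are not installed are skipped, and /var/log is left to be copied separately because of its size.

```shell
#!/bin/sh
# save_cmd DEST NAME COMMAND...: run COMMAND if it is installed and save
# its output (including errors) to DEST/NAME.txt. Hypothetical helper.
save_cmd() {
    dest="$1"; name="$2"; shift 2
    command -v "$1" >/dev/null 2>&1 || return 0
    "$@" > "$dest/$name.txt" 2>&1
    return 0
}

# post_reboot_snapshot DEST: gather the post-reboot items into DEST.
post_reboot_snapshot() {
    dest="$1"
    mkdir -p "$dest" || return 1
    date > "$dest/collected-at.txt"
    save_cmd "$dest" uname     uname -a
    save_cmd "$dest" lsmod     lsmod
    save_cmd "$dest" lspci     lspci
    save_cmd "$dest" dmesg     dmesg
    save_cmd "$dest" rpm-qa    rpm -qa
    save_cmd "$dest" dmidecode dmidecode
    for f in meminfo cpuinfo scsi/scsi partitions; do
        [ -r "/proc/$f" ] && \
            cat "/proc/$f" > "$dest/proc-$(echo "$f" | tr / -).txt"
    done
    return 0
}
```

Running the same function on the healthy system beforehand gives a baseline to diff against after a crash.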
Snapshot Tools for After Reboot
sysreport (RHEL)
sitar (SLES)
config.sh (SLES)
cfg2html (OpenSource)
h/w diagnostic tools (vendor-specific)
sysreport
Verify: # rpm -qa | grep sysreport
What does it generate?
# ls -w 50
boot free ksyms mount rpm-Va
date hardware.py lib proc uname
df hostname ls-boot ps uptime
etc ifconfig lsmod pstree var
fdisk-l installed-rpms lspci route
sitar
Generates various reports detailing
– add-ons
– installed packages
– system info
– yast installed packages
Reports created in different formats
sitar-generated files
# sitar
# ls /tmp/sitar-fwills.america.cpqcorp.net-2006020104/
.
..
sitar-addon-fwills.america.cpqcorp.net-yast2.sel
sitar-fwills.america.cpqcorp.net-yast1.sel
sitar-fwills.america.cpqcorp.net.html
sitar-fwills.america.cpqcorp.net.sdocbook.xml
sitar-fwills.america.cpqcorp.net.tex
sitar-sles-fwills.america.cpqcorp.net-yast2.sel
Sitar .html report
fwills.america.cpqcorp.net, Wed Feb 1 04:48:57 2006
Linux fwills 2.6.5-7.193-default #1 Wed Jul 20 14:39:18 UTC 2005 i686 i686 i386 GNU/Linux
SUSE LINUX Enterprise Server 9 (i586)
Table of Contents
1. General Information
2. CPU
…
1. General Information
Hostname fwills.america.cpqcorp.net
Operating System SUSE LINUX Enterprise Server 9 (i586)
UName Linux fwills 2.6.5-7.193-default #1 Wed Jul 20 14:39:18 UTC 2005 i686 i686 i386 GNU/Linux
Date Wed Feb 1 04:48:57 2006
Main Memory 385976 KByte
Cmdline root=/dev/sda2 vga=0x317 selinux=0 resume=/dev/sda3 elevator=cfq splash=silent
Load 0.00 0.00 0.00 1/79 2084
Uptime (minutes hours days) 181715 3028 124
Idletime (minutes hours days) 37934 632 26
2. CPU
config.sh
What does it generate?
# ls -w 50
. iscsi.txt performance.txt
.. lvm.txt rcd.txt
boot.txt messages.txt release.txt
chkconfig.txt modules.txt rpm.txt
config.sh.txt mpio.txt rug.txt
cron.txt ncp.txt scsi.txt
env.txt network.txt siga.txt
evms.txt nss.txt softraid.txt
hwinfo.txt pam.txt y2log.txt
Tools to view sysstat data
sar
isag
sarcheck
Sample sar commands
# sar -u 2 4
Linux 2.4.21-37.ELsmp (karp.alf.cpqcorp.net) 02/01/2006
# cd /var/log/sa
# sar -A -f sa01 > sar01-new
# ls -l sa*01*
-rw-r--r-- 1 root root 132720 Feb 1 15:10 sa01
-rw-r--r-- 1 root root 181575 Feb 1 15:08 sar01-new
Reconstructing the Crash Scene
Check logs
/var/log/messages
– Search for kernel load entry
– Work backwards and look for:
Errors or Warnings
Oops messages with trace output
Oops
Oct 30 00:05:34 karp kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000008
Oct 30 00:05:34 karp kernel: printing eip:
Oct 30 00:05:34 karp kernel: c011ec5d
Oct 30 00:05:34 karp kernel: *pde = 2aefa001
Oct 30 00:05:34 karp kernel: Oops: 0000
Oct 30 00:05:34 karp kernel: Kernel 2.4.9-e.38enterprise
Oct 30 00:05:34 karp kernel: CPU: 1
Oct 30 00:05:34 karp kernel: EIP: 0010:[get_module_list+61/816] Tainted: P
Oct 30 00:05:34 karp kernel: EIP: 0010:[<c011ec5d>] Tainted: P
Oct 30 00:05:34 karp kernel: EFLAGS: 00010246
Oct 30 00:05:34 karp kernel: EIP is at get_module_list [kernel] 0x3d
Oct 30 00:50:00 karp syslogd 1.4.1: restart.
Oct 30 00:50:00 karp syslog: syslogd startup succeeded
Oct 30 00:50:00 karp kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 30 00:50:00 karp kernel: Inspecting /boot/System.map-2.4.9-e.38enterprise
Oct 30 00:50:00 karp syslog: klogd startup succeeded
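In a log like the one above, the gap between the last Oops line and the syslogd restart marks the crash. Working backwards from the boot marker can be automated roughly as follows. This sketch assumes the boot shows up as a "syslogd ... restart" or "kernel: Linux version" line, which varies by distribution and syslog daemon; the function name lines_before_boot and the keyword list are illustrative, not exhaustive.

```shell
#!/bin/sh
# lines_before_boot LOGFILE [COUNT]: find the last boot marker in LOGFILE
# and print suspicious lines among the COUNT (default 50) lines before it.
# Hypothetical helper; marker and keyword patterns are assumptions.
lines_before_boot() {
    log="$1"
    count="${2:-50}"
    boot=$(grep -n -E 'syslogd [0-9.]+: restart|kernel: Linux version' "$log" \
           | tail -n 1 | cut -d: -f1)
    [ -n "$boot" ] || return 1          # no boot marker found
    [ "$boot" -gt 1 ] || return 0       # nothing precedes the marker
    head -n $((boot - 1)) "$log" | tail -n "$count" \
        | grep -i -E 'oops|panic|error|warn|unable to handle'
}
```

For example, `lines_before_boot /var/log/messages 200` widens the window when the log was busy before the crash.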
ksymoops
>>EIP; c0113f8c <sys_init_module+49c/4d0>
Trace; c011d3f5 <sys_mremap+295/370>
Trace; c011af5f <do_generic_file_read+5bf/5f0>
Trace; c011afe9 <file_read_actor+59/60>
Trace; c011d2bc <sys_mremap+15c/370>
Trace; c010e80f <do_sigaltstack+ff/1a0>
Trace; c0107c39 <overflow+9/c>
Trace; c0107b30 <tracesys+1c/23>
Trace; 00001000 Before first symbol
SAR Data Example
15:20:00 dentunusd file-sz %file-sz inode-sz super-sz %super-sz dquot-sz %dquot-sz rtsig-sz %rtsig-sz
15:30:00 1866554 2728 2.08 2055024 0 0.00 0 0.00 1 0.10
15:40:01 1866909 2785 2.12 2055020 0 0.00 0 0.00 1 0.10
…
17:20:00 1870217 2786 2.13 2055019 0 0.00 0 0.00 1 0.10
17:30:00 1870516 2762 2.11 2055022 0 0.00 0 0.00 1 0.10
17:40:00 1870848 2785 2.12 2055019 0 0.00 0 0.00 1 0.10
17:50:00 1569671 2156 1.64 1743619 0 0.00 0 0.00 1 0.10
18:00:00 1570730 1984 1.51 1744880 0 0.00 0 0.00 1 0.10
18:10:00 1571240 1792 1.37 1745241 0 0.00 0 0.00 1 0.10
18:20:00 1571768 1510 1.15 1745796 0 0.00 0 0.00 1 0.10
18:30:00 1572100 1483 1.13 1745826 0 0.00 0 0.00 1 0.10
18:40:00 1573175 16 0.01 1747980 0 0.00 0 0.00 1 0.10
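Sudden changes such as the drop in dentunusd at 17:50 are easy to miss when scanning a long report by eye. A small awk filter can flag them. The sketch below is one possible approach and not part of sysstat: the function name flag_jumps is made up, the column number counts the timestamp as column 1, and the threshold is a percentage change between consecutive samples.

```shell
#!/bin/sh
# flag_jumps COLUMN PERCENT: read sar text output on stdin and print the
# samples where COLUMN changed by more than PERCENT since the previous
# sample. Hypothetical helper; header and non-sample lines are skipped.
flag_jumps() {
    awk -v col="$1" -v pct="$2" '
        # Only lines that start with an HH:MM:SS timestamp are samples.
        $1 !~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ { next }
        # Header rows carry column names instead of numbers.
        $col !~ /^[0-9.]+$/ { next }
        seen && prev > 0 {
            change = ($col - prev) * 100 / prev
            if (change > pct || change < -pct)
                printf "%s %s -> %s (%+.1f%%)\n", $1, prev, $col, change
        }
        { prev = $col; seen = 1 }
    '
}
```

With the data above, `flag_jumps 2 10` watches dentunusd and `flag_jumps 3 10` watches file-sz; the input would typically come from `sar -v -f` on the relevant sa file.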
ISAG View of Sar Data
The Question of Debuggers
Torvalds quote
“I do see some good points in a kernel debugger, but I have yet to be
convinced that the good things outweigh the bad. The only valid uses of
debuggers is to get a stack backtrace and a register dump, imho, and
that is what you get from a kernel panic anyway (and the ksymoops.cc
program will actually make it readable for others than just me ;-)
I'm afraid that I've seen too many people fix bugs by looking at
debugger output, and that almost inevitably leads to fixing the symptoms
rather than the underlying problems.”
Ref: http://www.ussg.iu.edu/hypermail/linux/kernel/9510/0103.html
In The Dumps
Recover the vmcore
Tools to analyze:
– netdump / diskdump: use crash
– LKCD: use lcrash or crash
– mkdump: use lcrash or crash
Key items: process stacks, system calls
Requirements needed from crashed system
Netdump
Check netdump log file first
– Oops or panic messages
– loaded modules
– SysRq memory, trace, process info
– Stack trace
Use “crash” on vmcore
syslog
Ref: /usr/share/doc/netdump-*
Netdump-server Files
# ls -l /var/crash/16.113.5.104-2003-12-15-12:21
total 141108
-rw------- 1 netdump netdump 63067 Dec 15 2003 log
-rw------- 1 netdump netdump 134205440 Dec 15 2003 vmcore
# du -sk /var/crash/16.113.5.104-2003-12-15-12:21/vmcore
131192 /var/crash/16.113.5.104-2003-12-15-12:21/vmcore
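After a crash, the first thing to confirm on the netdump server is that the newest directory actually contains both files. A rough check is sketched below; the function name latest_dump is hypothetical, and it relies on ls -t ordering and on directory names without spaces, which holds for the IP-and-date names shown above.

```shell
#!/bin/sh
# latest_dump [CRASHDIR]: find the most recently modified dump directory
# under CRASHDIR (default /var/crash) and report whether its log and
# vmcore files exist and are non-empty. Hypothetical helper.
latest_dump() {
    base="${1:-/var/crash}"
    newest=$(ls -1dt "$base"/*/ 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    newest=${newest%/}                 # drop the trailing slash
    for f in log vmcore; do
        if [ -s "$newest/$f" ]; then
            echo "$newest/$f present"
        else
            echo "$newest/$f MISSING"
        fi
    done
}
```

A missing or empty vmcore with a populated log usually still leaves the Oops text to work with, since netdump is best effort.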
Netdump-server log
# more log
Oops: 0002
Kernel 2.4.9-e.3
CPU: 0
EIP: 0010:[<c8a44076>] Tainted: P
EFLAGS: 00010282
EIP is at init_module [crash] 0x16
eax: 00000013 ebx: c8a44000 ecx: 00000000 edx: c543e000
esi: 00000000 edi: 00000000 ebp: c3149f28 esp: c3149f20
ds: 0018 es: 0018 ss: 0018
Process insmod (pid: 5619, stackpage=c3149000)
Stack: 00000000 00000060 00000060 c0118eb5 00000000 c36fb000 00000098 c35c6000
00000060 ffffffea 00000005 c468b740 00000060 c8a3f000 c8a44060 000002e8
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace: [<c0118eb5>] sys_init_module [kernel] 0x535
[<c8a44060>] init_module [crash] 0x0
[<c0106f03>] system_call [kernel] 0x33
Code: c6 05 00 00 00 00 00 b8 00 00 00 00 c9 c3 63 72 61 73 68 69
< netdump activated - performing handshake with the client. >
Disk Dump
LKCD
check /var/log/dump/n
Use lcrash or crash to analyze vmcore
Ref: /usr/share/doc/packages/lkcdutils/
What next?
H/W vendor
System Service Provider
OS vendor
OpenSource Community
Summary
Questions & Answers
???
Make your system talk
Prepare your system now so it can tell you what happened!
Appendix – More Information
http://www.novell.com/coolsolutions/tools/16106.html -- SLES config.sh
http://come.to/cfg2html -- cfg2html utility to gather system information
http://www.linuxtroubleshooting.com/wiki/index.php?title=Main_Page -- Linux troubleshooting tools
http://www.volny.cz/linux_monitor/isag/ -- isag
http://rpmfind.net//linux/RPM/contrib/noarch/noarch/isag-4.1.1-1.noarch.html -- isag
http://www.sarcheck.com/sclinux.htm -- sarcheck
http://linuxgazette.net/issue59/nazario.html -- good dmesg description
http://lkcd.sourceforge.net -- lkcd
http://lkcd.sourceforge.net/doc/lcrash.pdf -- lcrash HOWTO
http://lkcd.sourceforge.net/doc/lkcd_tutorial.pdf -- good lkcd tutorial
/usr/share/doc/packages/lkcdutils/README.SuSE -- LKCD setup
http://www.novell.com/coolsolutions/feature/14813.html -- SLES lkcd
http://support.novell.com/cgi-bin/search/searchtid.cgi?10099561.htm -- SLES lkcd howto
/usr/share/doc/diskdumputils-*/README -- diskdump setup
http://www.redhat.com/support/wpapers/redhat/netdump/ -- netdump
/usr/share/doc/netdump*/README* -- netdump / netdump-server
http://www.linuxforums.org/forum/peripherals-hardware/35963-cpu-naming-schemes-x86-386-486-586-amd-64-ia64-em64t.html?highlight=naming+schemes -- good cpu chip reference
http://mkdump.sourceforge.net -- mkdump
http://lse.sourceforge.net/kdump/ -- kdump
http://www.linuxdevcenter.com/lpt/a/1319 -- “Linux System Failure Post-Mortem”, by Jennifer Vesperman (O’Reilly Network)
http://www.die.net/doc/linux/man/man5/proc.5.html -- manpage for /proc details
http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0509wright/?ca=dgr-lnxw06DB2Linux -- good article on Linux memory utilization
http://www.ataassociates.com/Process.htm -- accident reconstruction
Appendix - Vocabulary
AMD64/EM64T – Similar x86 architectures with 64-bit registers, collectively known as x86_64
ARC – Accident Reconstruction Consultant
ASR - Automatic Server Recovery
IA64 – 64-bit Intel Itanium architecture
ISAG – Interactive System Activity Grapher
lkcd – Linux Kernel Crash Dump utility
mkdump – minikernel dump utility
RHEL - Red Hat Enterprise Linux
SAR – System Activity Report
SLES - SuSE Linux Enterprise Server
SysRq (aka magic keys) – key sequence intercepted by kernel to perform certain operations
x86 – CPU architecture based on the Intel 80x86 family
Appendix – Pre-Crash Check List
Enable SysRq
Enable sysstat
Enable system management tools
Develop emergency procedures
Train staff in emergency procedures
Configure and enable dump utility
Take system snapshot on loaded/running system
Set up remote console access