UNIX Hints & Hacks
Chapter 4: System Monitoring
When a system crashes, it can write crash files to a crash directory that is already set up on the system, to help you diagnose the problem.
Flavors: AT&T, some newer BSD versions
Check your man pages for savecore to see whether your flavor is supported. Every flavor that is supported is configured a little differently.
If a system crashes unexpectedly, it can be configured to write the contents of memory to the dump device, which in most cases is swap. When the system boots back up and runs the S#savecore script, the script checks the raw swap partition to see whether data was dumped to it. If data is found, a file is created in the crash directory, /usr/adm/crash (/var/adm/crash on some flavors). This file is generally named core.n, unix.n, or vmcore.n. Here is an example from SGI's IRIX:
# cd /var/adm/crash
# ls -al
total 89688
drwxr-xr-x 2 root sys 4096 Sep 9 10:18 ./
drwxr-xr-x 7 adm adm 4096 Oct 18 01:01 ../
-rw-r--r-- 1 root sys 1294 Aug 12 11:28 analysis.0
-rw------- 1 root sys 3968160 Sep 21 11:28 unix.0
-rw------- 1 root sys 41918464 Sep 21 11:38 vmcore.0.comp
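If you want to notice new crash files without browsing the directory by hand, a small shell function can look for the usual names. This is only a minimal sketch: the function name list_crash_files is hypothetical, and the patterns simply assume the core.n, unix.n, and vmcore.n naming convention (including suffixes such as .comp). Adjust the default directory for your flavor.

```shell
# list_crash_files [DIR]
# Print the crash files found in DIR, one per line.
# Hypothetical helper; DIR defaults to /var/adm/crash, which is an
# assumption taken from the IRIX example.
list_crash_files() {
    dir=${1:-/var/adm/crash}
    for f in "$dir"/core.[0-9]* "$dir"/unix.[0-9]* "$dir"/vmcore.[0-9]*
    do
        # A pattern that matches nothing is passed through unchanged,
        # so confirm that each name is a real file before printing it.
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}
```

Running list_crash_files with no argument checks the default directory. If it prints nothing, no crash files are waiting there.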
Like core files, these large crash files (unix.0 and vmcore.0.comp in this listing) are in a binary format. When the crash files are created, some flavors are nice enough to run an analysis on them and build a report for you. Here is one such report, which helps diagnose what the exact problem might have been:
# cat /var/adm/crash/analysis.0
savecore: Created log Sept 21 11:28:12 1998
Dump Header Information
-------------------------------------------------------
uname: IRIX xinu 6.2 03131015 IP22
physical mem: 96 megabytes
phys start: 0x8000000
page size: 4096 bytes
dump version: 1
dump size: 40936 k
crash time: Mon Sep 21 11:28:12 1998
panic string: <0>PANIC: IRIX Killed due to Bus Error
kernel putbuf:
pb 0: ounting filesystem: /
pb 1: <5>NOTICE: Starting XFS recovery on filesystem: / (dev: 128/16)
pb 2: <5>NOTICE: Ending XFS recovery for filesystem: / (dev: 128/16)
pb 3: <4>WARNING: Process [iexplorer] 10768 generated trap, but has signal 11 held or ignored
pb 4: Process has been killed to prevent infinite loop
pb 5: <4>WARNING: Process [iexplorer] 23645 generated trap, but has signal 11 held or ignored
pb 6: Process has been killed to prevent infinite loop
pb 7: Recoverable memory parity error corrected by CPU at 0x9116190 <0x302> code:30
pb 8: Memory Parity Error in SIMM S2
pb 9: GIO Error/Addr 0x400:<TIME > 0x7f242c0
pb 10:
pb 11: <0>PANIC: IRIX Killed due to Bus Error
pb 12: at PC:0x88082ee8 ep:0xffffca20
pb 13:
pb 14:
pb 15: Dumping to dev 0x2000011 at block 0, space: 0x27fa pages
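When you only need the headline facts from a report like this, you can pull out the crash time and panic string instead of reading the whole file. The function below is a rough sketch, not part of savecore: the name crash_summary is hypothetical, and it assumes the report keeps each field on its own line with the labels shown above.

```shell
# crash_summary [REPORT]
# Print the "crash time" and "panic string" lines from an analysis report.
# Hypothetical helper; the default path and the field labels are
# assumptions taken from the IRIX example report.
crash_summary() {
    report=${1:-/var/adm/crash/analysis.0}
    # Allow leading white space in case the report indents its fields.
    grep -E '^[[:space:]]*(crash time|panic string):' "$report"
}
```

Because the pattern is anchored at the start of the line, the repeated PANIC message inside the kernel putbuf (the pb lines) is not printed a second time.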
This is a great feature that you should use whenever possible. When a system goes down, syslog might not have time or the capability to log why the system crashed. In this case, you do not see any problems reported in the system log files, unless the problems have been developing over time. Because the contents of memory are dumped, it's more likely that you'll know why the system crashed.
Because these are also core-type files written in binary, you can use the strings command to pull the readable text out of them. In most cases there is useful information in the crash file. Although the vendors often have their own tools for extracting problems from crash files, they don't release these tools to system administrators. Your best chance is to grep for any errors that might have been written to the crash file.
# cd /var/adm/crash
# ls -al
-rw------- 1 root sys 3968160 Sep 21 11:28 unix.0
# strings unix.0 | grep -i error | more
panic string: <0>PANIC: IRIX Killed due to Bus Error
pb 7: Recoverable memory parity error corrected by CPU at 0x9116190 <0x302> code:30
pb 8: Memory Parity Error in SIMM S2
pb 9: GIO Error/Addr 0x400:<TIME > 0x7f242c0
pb 11: <0>PANIC: IRIX Killed due to Bus Error
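Errors are not the only lines worth catching, since a crash file can also hold panic and warning messages. The function below is one possible way to wrap the same strings and grep pipeline so that you can search for several keywords at once. The name scan_crash_file and the default keyword list are hypothetical choices, and the tr fallback is only a rough stand-in for systems where strings is not installed.

```shell
# scan_crash_file FILE [PATTERN]
# Print the readable lines in FILE that match PATTERN, ignoring case.
# Hypothetical helper; PATTERN is an extended regular expression and
# defaults to a small, assumed list of keywords.
scan_crash_file() {
    file=$1
    pattern=${2:-'error|panic|warning'}
    if command -v strings >/dev/null 2>&1; then
        strings "$file"
    else
        # Rough fallback: turn every nonprintable byte into a newline.
        LC_ALL=C tr -c '[:print:]\n' '\n' < "$file"
    fi | grep -i -E "$pattern"
}
```

For example, scan_crash_file unix.0 | more searches with the default keywords, and scan_crash_file unix.0 'parity|SIMM' narrows the search to memory problems.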
© Copyright Macmillan USA. All rights reserved.