Chapter 4: System Monitoring: 4.11 Monitoring Crash Files

UNIX Hints & Hacks

Chapter 4: System Monitoring

Sections in this Chapter:
4.1 Monitoring at Boot Time	4.5 Mail a Process	4.9 Monitoring with ping	4.13 Checking the Time
4.2 Starting with a Fresh Install	4.6 Watching the Disk Space	4.10 Monitoring Core Files
4.3 Monitor with tail	4.7 Find the Disk Hog	4.11 Monitoring Crash Files
4.4 Cut the Log in Half	4.8 Watching by grepping the Difference	4.12 Remember Daylight Savings Time

4.11 Monitoring Crash Files

4.11.1 Description

When a system crashes, it can save crash files in a preconfigured crash directory to help you diagnose the problem.

Example

Flavors: AT&T, some newer BSD versions

Check the savecore man page to see whether your flavor is supported. Each supported flavor is configured a little differently.

If a system crashes unexpectedly, it can be configured to write the contents of memory to the dump device, which in most cases is swap. When the system boots back up and runs the S#savecore script, it checks the raw swap partition to see whether data was dumped to it. If data is found, files are created in the crash directory (/usr/adm/crash or /var/adm/crash, depending on the flavor). These files are generally named core.n, unix.n, or vmcore.n. Here is an example from SGI's IRIX:

# cd /var/adm/crash
# ls -al
total 89688
drwxr-xr-x    2 root   sys      4096 Sep  9 10:18 ./
drwxr-xr-x    7 adm    adm      4096 Oct 18 01:01 ../
-rw-r--r--    1 root   sys      1294 Aug 12  11:28 analysis.0
-rw-------    1 root   sys   3968160 Sep 21  11:28 unix.0
-rw-------    1 root   sys   41918464 Sep 21  11:38 vmcore.0.comp

Like core files, these large crash files (unix.0 and vmcore.0.comp) are in a binary format. When the crash files are created, some flavors also run an analysis on them and build a report for you. Here is one such report, which helps diagnose what the problem might have been:

# cat /var/adm/crash/analysis.0

savecore: Created log Sept 21 11:28:12 1998

               Dump Header Information
-------------------------------------------------------
 uname:        IRIX xinu 6.2 03131015 IP22
 physical mem: 96 megabytes
 phys start:   0x8000000
 page size:    4096 bytes
 dump version: 1
 dump size:    40936 k
 crash time:   Mon Sep 21 11:28:12 1998
 panic string: <0>PANIC: IRIX Killed due to Bus Error
 kernel putbuf:
   pb 0: ounting filesystem: /
   pb 1: <5>NOTICE: Starting XFS recovery on filesystem: / (dev: 128/16)
   pb 2: <5>NOTICE: Ending XFS recovery for filesystem: / (dev: 128/16)
   pb 3: <4>WARNING: Process [iexplorer] 10768 generated trap, but has signal 11 held or ignored
   pb 4: Process has been killed to prevent infinite loop
   pb 5: <4>WARNING: Process [iexplorer] 23645 generated trap, but has signal 11 held or ignored
   pb 6: Process has been killed to prevent infinite loop
   pb 7: Recoverable memory parity error corrected by CPU at 0x9116190 <0x302> code:30
   pb 8: Memory Parity Error in SIMM  S2
   pb 9: GIO Error/Addr 0x400:<TIME > 0x7f242c0
   pb 10:
   pb 11: <0>PANIC: IRIX Killed due to Bus Error
   pb 12:      at PC:0x88082ee8 ep:0xffffca20
   pb 13:
   pb 14:
   pb 15: Dumping to dev 0x2000011 at block 0, space: 0x27fa pages
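
If your flavor writes a report like the one above, the panic string is usually the first thing worth pulling out. The following is a minimal sketch, assuming the report keeps the "panic string:" label shown in this IRIX example. The function name panic_string and the default path are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch: print the panic string from a savecore analysis report.
# Assumes the report contains a line of the form " panic string: <text>",
# as in the IRIX sample above; other flavors may label it differently.

# panic_string [REPORT]
panic_string() {
    report=${1:-/var/adm/crash/analysis.0}   # default path is an assumption
    [ -r "$report" ] || { echo "cannot read $report" >&2; return 1; }
    # Strip the label and leading whitespace, print only the matching line.
    sed -n 's/^[[:space:]]*panic string:[[:space:]]*//p' "$report"
}
```

Calling panic_string /var/adm/crash/analysis.0 on the report above would print only the text that follows the label.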

Reason

This is a great feature that you should use whenever possible. When a system goes down, syslog might not have the time or the ability to log why the system crashed. In that case, the system log files show no problems unless the problems had been developing over time. Because the contents of memory are dumped, you are more likely to find out why the system crashed.
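
Crash files only help if someone notices them. Here is a minimal sketch of a check you could run from cron or a login script, assuming the core.n, unix.n, and vmcore.n naming described earlier. The function name list_crash_files and the default directory are hypothetical; adjust both for your flavor.

```shell
#!/bin/sh
# Sketch: list crash files found in a crash directory.
# Assumes files are named core.N, unix.N, or vmcore.N, possibly with a
# suffix such as .comp; your flavor may use other names.

# list_crash_files [DIR]
list_crash_files() {
    dir=${1:-/var/adm/crash}   # default directory is an assumption
    [ -d "$dir" ] || return 1
    for f in "$dir"/core.[0-9]* "$dir"/unix.[0-9]* "$dir"/vmcore.[0-9]*; do
        # A pattern with no match stays literal, so keep only real files.
        [ -f "$f" ] && basename "$f"
    done
    return 0
}
```

If the function prints anything, the system has dumped at least once; the output could be mailed to an administrator in the same way as the other checks in this chapter.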

Real World Experience

Because these are also core-type files written in binary, you might want to run the strings command on them. In most cases there is useful information in the crash file. Although vendors often have their own tools for extracting problems from crash files, they don't release these tools to system administrators. Your best chance is to grep for any errors that might have been written to the crash file.

# cd /var/adm/crash
# ls -al
-rw-------    1 root     sys      3968160 Sep 21  11:28 unix.0

# strings unix.0 | grep -i error | more

 panic string: <0>PANIC: IRIX Killed due to Bus Error
   pb 7: Recoverable memory parity error corrected by CPU at 0x9116190 <0x302> code:30
   pb 8: Memory Parity Error in SIMM  S2
   pb 9: GIO Error/Addr 0x400:<TIME > 0x7f242c0
   pb 11: <0>PANIC: IRIX Killed due to Bus Error
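
The same strings and grep pipeline can be wrapped in a small function so that it is easy to repeat with different patterns. This is a sketch only; the function name scan_crash_file and the default pattern "error" are assumptions, not part of any vendor tool.

```shell
#!/bin/sh
# Sketch: pull printable strings out of a binary crash file and keep
# the lines that match a pattern, ignoring case.

# scan_crash_file FILE [PATTERN]
scan_crash_file() {
    file=$1
    pattern=${2:-error}        # default pattern is an assumption
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # strings extracts printable runs; grep -i filters them.
    strings "$file" | grep -i "$pattern"
}
```

For example, scan_crash_file unix.0 panic would show only the lines that mention a panic. The function returns a nonzero status when nothing matches, which makes it easy to test from a script.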
