UNIX Hints & Hacks

Chapter 4: System Monitoring

 

Sections in this Chapter:

4.1 Monitoring at Boot Time
4.2 Starting with a Fresh Install
4.3 Monitor with tail
4.4 Cut the Log in Half
4.5 Mail a Process
4.6 Watching the Disk Space
4.7 Find the Disk Hog
4.8 Watching by grepping the Difference
4.9 Monitoring with ping
4.10 Monitoring Core Files
4.11 Monitoring Crash Files
4.12 Remember Daylight Savings Time

4.11 Monitoring Crash Files

4.11.1 Description

When a system crashes, it can write crash files to a crash directory that is already set up on the system. These files help diagnose the problem.

Example

Flavors: AT&T, some newer BSD versions

Check your man pages for savecore to see whether your flavor is supported. Every flavor that is supported is configured a little differently.

If a system crashes unexpectedly, it can be configured to write the contents of memory to the dump device, which in most cases is swap. When the system boots back up and runs the S#savecore startup script, it checks the raw swap partition to see whether data was dumped to it. If data is found, a file is created in the crash directory, such as /usr/adm/crash (/var/adm/crash in the IRIX example that follows). This file generally takes the name core.n, unix.n, or vmcore.n. Here is an example from SGI's IRIX:

# cd /var/adm/crash
# ls -al
total 89688
drwxr-xr-x    2 root   sys      4096 Sep  9 10:18 ./
drwxr-xr-x    7 adm    adm      4096 Oct 18 01:01 ../
-rw-r--r--    1 root   sys      1294 Aug 12  11:28 analysis.0
-rw-------    1 root   sys   3968160 Sep 21  11:28 unix.0
-rw-------    1 root   sys   41918464 Sep 21  11:38 vmcore.0.comp

Similar to core files, these large crash files (unix.0 and vmcore.0.comp) are in a binary format. When the crash files are created, some flavors also run an analysis on them and build a report for you. Here is one such report, which helps diagnose what the problem might have been:

# cat /var/adm/crash/analysis.0
savecore: Created log Sept 21 11:28:12 1998
Dump Header Information
-------------------------------------------------------
uname: IRIX xinu 6.2 03131015 IP22
physical mem: 96 megabytes
phys start: 0x8000000
page size: 4096 bytes
dump version: 1
dump size: 40936 k
crash time: Mon Sep 21 11:28:12 1998
panic string: <0>PANIC: IRIX Killed due to Bus Error
kernel putbuf:
pb 0: ounting filesystem: /
pb 1: <5>NOTICE: Starting XFS recovery on filesystem: / (dev: 128/16)
pb 2: <5>NOTICE: Ending XFS recovery for filesystem: / (dev: 128/16)
pb 3: <4>WARNING: Process [iexplorer] 10768 generated trap, but has signal 11 held or ignored
pb 4: Process has been killed to prevent infinite loop
pb 5: <4>WARNING: Process [iexplorer] 23645 generated trap, but has signal 11 held or ignored
pb 6: Process has been killed to prevent infinite loop
pb 7: Recoverable memory parity error corrected by CPU at 0x9116190 <0x302> code:30
pb 8: Memory Parity Error in SIMM S2
pb 9: GIO Error/Addr 0x400:<TIME > 0x7f242c0
pb 10:
pb 11: <0>PANIC: IRIX Killed due to Bus Error
pb 12: at PC:0x88082ee8 ep:0xffffca20
pb 13:
pb 14:
pb 15: Dumping to dev 0x2000011 at block 0, space: 0x27fa pages
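Checking the crash directory by hand is easy to forget, so the check could be scripted. The following is only a minimal sketch: the function names find_crash_files and report_crash_files are hypothetical, and the default directory and the file name patterns are assumptions taken from the examples in this section. Adjust both for your flavor.

```shell
#!/bin/sh
# Sketch only: the function names, the default directory, and the file name
# patterns are assumptions based on the examples in this section.

# Print every crash-related file found in the given crash directory.
find_crash_files() {
    dir=${1:-/var/adm/crash}
    # A missing directory usually means crash dumps are not set up here.
    [ -d "$dir" ] || return 1
    for f in "$dir"/core.* "$dir"/unix.* "$dir"/vmcore.* "$dir"/analysis.*; do
        # An unmatched pattern stays literal, so confirm the file exists.
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}

# Print a short report, but only when crash files are present.
report_crash_files() {
    found=$(find_crash_files "$1") || return 1
    if [ -n "$found" ]; then
        echo "Crash files found on $(uname -n):"
        echo "$found"
    fi
}
```

Calling report_crash_files /var/adm/crash prints nothing when the directory is clean, so a periodic job could mail the report to an administrator only when the output is not empty.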

Reason

This is a great feature that you should use whenever possible. When a system goes down, syslog might not have time or the capability to log why the system crashed. In this case, you do not see any problems reported in the system log files, unless the problems have been developing over time. Because the contents of memory are dumped, it's more likely that you'll know why the system crashed.

Real World Experience

Because these are also core-type files written in binary, you might want to use the strings command. In most cases there is useful information in the crash file. Although the vendors often have their own set of tools to extract problems from the crash files, they don't release these tools to system administrators. Your best chance is to try to grep out any errors that might have been written to the crash file.

# cd /var/adm/crash
# ls -al
-rw-------    1 root     sys      3968160 Sep 21 11:28 unix.0
# strings unix.0 | grep -i error | more
panic string: <0>PANIC: IRIX Killed due to Bus Error
pb 7: Recoverable memory parity error corrected by CPU at 0x9116190 <0x302> code:30
pb 8: Memory Parity Error in SIMM S2
pb 9: GIO Error/Addr 0x400:<TIME > 0x7f242c0
pb 11: <0>PANIC: IRIX Killed due to Bus Error
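The strings-and-grep approach above can be wrapped in a small function so that it is easy to repeat on each crash file. This is a sketch, not a vendor tool: the name crash_strings and the 'panic|error' pattern are assumptions chosen for illustration, and you may need other keywords for your flavor.

```shell
#!/bin/sh
# Sketch only: crash_strings and its search pattern are illustrative
# assumptions, not part of any vendor tool set.

# Print the unique readable lines in a binary crash file that mention
# a panic or an error, in any capitalization.
crash_strings() {
    file=$1
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # strings extracts runs of printable characters from the binary file,
    # grep keeps the lines that match, and sort -u drops duplicates.
    strings "$file" | grep -i -E 'panic|error' | sort -u
}
```

For example, crash_strings /var/adm/crash/unix.0 | more pages through the matching lines. Searching for error rather than errors matters here, because none of the messages in the report use the plural form.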

© Copyright Macmillan USA. All rights reserved.