Chapter 4: System Monitoring: 4.1 Monitoring at Boot Time

UNIX Hints & Hacks

Chapter 4: System Monitoring



There is never a time in a UNIX administrator's career when systems do not have to be monitored. Every system that an administrator touches has to be monitored at some point and in some way. It might be for security, performance, troubleshooting, disk utilization, or hardware failures. Whatever the reason, the system has to be monitored.

One of the most common reasons is to gather system utilization information to generate reports and pretty graphs for management. The information that is collected should also be a good indication of the direction a system is going. Over a period of time, a pattern begins to emerge on most systems. As applications and other pieces of software get upgraded and the environment starts growing with users and other devices, the need for more memory, faster CPUs, and more disk space is always there.

In a fantasy world, you could go to management, say you need more disk space or more memory, and they would simply say okay. In the real world, however, every company follows a budget. If it isn't in the budget, it doesn't get purchased. That is good only when you have a vendor bugging you to buy his products: tell him in all honesty that no money has been budgeted for his product. Usually he won't call again until the day after the new fiscal year begins.

So how does something get in the budget? You have to learn a very important word: justification. If the need isn't justified, forget it. The best way to justify something is with proof. That is where system monitoring comes in. When a system is monitored from the beginning, the data cannot be disputed. Depending on the services that the UNIX workstation or server provides, there are instances when you can charge some of the expense of a new system back to another department or group in the company. Managers love this, because support from other departments corroborates the need for a new system when they go to get approval for it.

In almost all cases, the key pieces of a system (CPU, memory, and disk) exhibit their own unique growth patterns. Although every system and environment is different in its own right, there are still patterns that follow a trend as a system becomes overloaded. Here are some patterns you can watch out for.

CPUs behave very differently depending on the applications or services running on them. A rendering system can have 18 CPUs, be 98% utilized at all times, and be normal. Many systems that start with small load averages could begin to see a slow rise or even periods of sharp spikes. The load averages can rise or drop dramatically throughout normal business hours. Eventually, the drops become fewer and the CPUs become increasingly overused.

In monitoring memory, you typically might see a couple of things happen. Memory usage steadily increases as applications are upgraded, more services are added, or user productivity increases over time. As memory usage increases and more information starts getting swapped out, the system slows. There is another possibility: a new user uses the system for a different purpose, or a new application is installed, and a need for more memory appears overnight.

As time goes by, keep an eye on the disk space. It often does something very interesting. It is pretty easy to tell when the need for more disk space arises. At first, disk usage fluctuates up and down, but it increases a lot more than it ever decreases. Most users do not pay any attention to the percentage of disk space that is available; they seem to believe there is an endless amount until it runs out. When it does run out and the disks are full, users normally clear up only 5-10% of the space, because everything is vital to them. When they don't accept alternative possibilities (archiving, recordable CDs, and so on), all you can do is sit back and watch the users struggle for disk space in the last 10%. By this time it is too late and users begin to take it out on you, asking you to work miracles and find more space that doesn't exist.
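As a rough illustration of watching for that last 10%, the sketch below filters df output and reports filesystems at or above a chosen percentage. It is only a sketch under stated assumptions: the function name disk_over is hypothetical, the input is assumed to be in the POSIX df -kP column layout, awk is assumed to support -v, and mount points are assumed to contain no spaces.

```shell
#!/bin/sh
# Hypothetical filter: read `df -kP` style output on stdin and print
# "<mount point> <percent used>" for filesystems at or above a threshold.
# Assumes the POSIX layout: Filesystem, blocks, Used, Available, Capacity, Mounted on.
disk_over() {
    limit=$1
    awk -v limit="$limit" '
        NR > 1 {                        # skip the header line
            pct = $5
            sub(/%$/, "", pct)          # strip the trailing percent sign
            if (pct + 0 >= limit + 0) print $6, pct
        }'
}

# Example: df -kP | disk_over 90
```

Run regularly and saved with a date, output like this gives the kind of record that makes the trend hard to dispute.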

4.1 Monitoring at Boot Time

4.1.1 Description

Why settle for a little information at boot time? There are ways to make the boot process send more information to the console.

Example One: The rc Files

Flavor: BSD

The boot process is filled with informational echo statements throughout the boot scripts that bring the UNIX operating system up. If you want to see more information as the server is booting, you can add more echo statements to the various boot scripts: /etc/rc.boot, /etc/rc.local, and /etc/rc. These files are very well commented from beginning to end. Simply turn the comments into echo statements, and the next time you boot you will see the console filled with information.

Note - Before touching any of the rc files, make backup copies of each one.
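One way to follow this note is a small helper that copies each rc file aside before it is edited. This is a minimal sketch, not part of any vendor's scripts: the function name backup_rc_files and the .orig suffix are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy each named file to <file>.orig before editing.
# An existing .orig copy is left alone so the pristine version is never overwritten.
backup_rc_files() {
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "skipping $f (not found)"
        elif [ -f "$f.orig" ]; then
            echo "keeping existing $f.orig"
        else
            # -p preserves the mode and timestamps of the original
            cp -p "$f" "$f.orig" && echo "saved $f.orig"
        fi
    done
}

# Example (as root): backup_rc_files /etc/rc /etc/rc.boot /etc/rc.local
```

If an edit leaves the system unable to boot cleanly, the .orig copy can be put back from single-user mode.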

# vi /etc/rc.local

#
# Trying to add a default route...
#
if [ ! -f /sbin/route -a -f /etc/defaultrouter ]; then
       route -f add default `cat /etc/defaultrouter` 1
fi

This excerpt from /etc/rc.local can be modified to announce what is about to happen. Change the comment to an echo statement that is displayed when the system boots, to better inform you of what is happening in the boot process:

#
echo "Trying to add a default route..."
#
if [ ! -f /sbin/route -a -f /etc/defaultrouter ]; then
       route -f add default `cat /etc/defaultrouter` 1
fi

In some instances you might run into a problem booting the system. It sits there hanging, trying to finish a command. This can usually be attributed to some sort of network problem. Finding which network daemon or system call it is hung on can be tricky. As the system is booting, messages are displayed on the console only after a command has been executed. This is okay, but if a command hangs, there is no way to know which command it hung on.

This is where inserting echo statements prior to the execution of a command or daemon can be a great benefit. Knowing what is about to be executed can provide the answers to what you are looking for. Look at the excerpts from the /etc/rc and /etc/rc.local files:

if [ -f /usr/etc/inetd ]; then
       inetd;                  echo -n ' inetd'
fi
if [ -f /usr/lib/lpd ]; then
       rm -f /dev/printer /var/spool/lpd.lock
       /usr/lib/lpd;           echo -n ' printer'
fi

if [ -f /usr/etc/in.named -a -f /etc/named.boot ]; then
       in.named;               echo -n ' named'
fi

if [ -f /usr/etc/biod ]; then
       biod 4;                 echo -n ' biod'
fi

Additions can be made to these excerpts so that you know what is starting before it actually gets executed by the system:

if [ -f /usr/etc/inetd ]; then
       echo -n 'Starting inetd: '
       inetd;                  echo ' inetd started.'
fi
if [ -f /usr/lib/lpd ]; then
       rm -f /dev/printer /var/spool/lpd.lock
       echo -n 'Starting lpd: '
       /usr/lib/lpd;           echo ' printers started.'
fi

if [ -f /usr/etc/in.named -a -f /etc/named.boot ]; then
       echo -n 'Starting DNS: '
       in.named;               echo ' named started.'
fi

if [ -f /usr/etc/biod ]; then
       echo -n 'Starting Biod: '
       biod 4;                 echo ' biod started.'
fi
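The same pattern can be wrapped in a small shell function so that every daemon is announced before it starts. This is a minimal sketch and not something found in the stock rc files: the name announce is hypothetical, and printf is used in place of echo -n on the assumption that it is available, since echo -n does not behave the same way in every shell.

```shell
#!/bin/sh
# Hypothetical helper: print what is about to run, run it, then report the result.
# If the command hangs, the last "Starting ..." line on the console names the culprit.
announce() {
    label=$1
    shift
    printf 'Starting %s: ' "$label"
    if "$@"; then
        echo "$label started."
    else
        # $? still holds the exit status of the command tested by "if"
        echo "$label FAILED (exit status $?)."
    fi
}

# Example use inside an rc script:
#   [ -f /usr/etc/inetd ] && announce inetd inetd
#   [ -f /usr/etc/biod ]  && announce biod biod 4
```

Because the label is printed before the command runs, the design keeps the key property of the edited excerpts: the console shows the name first, and the result only afterward.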

Example Two: The rc "S" Scripts

Flavor: AT&T

When your system boots to a multiuser state and executes the /etc/rc2 script, it is a simple task to modify the startup script to expand on the information that is sent to the console. A simple echo statement is the only necessary addition to the /etc/rc2 script.

# vi /etc/rc2


# Execute all package initialization scripts
# (i.e.: mount the filesystems, start the daemons, etc)
#
if [ -d /etc/rc2.d ]
then
  for f in /etc/rc2.d/S*
  {
    if [ -s ${f} ]
    then
        echo $f
        /sbin/sh ${f} start
    fi
  }
fi

In this excerpt from the /etc/rc2 script, if the directory /etc/rc2.d exists, the loop steps through the directory, echoes the name of each non-empty script that starts with an S to the console, and then executes that script with the start argument.
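Some shells may not accept the brace form of the for loop used in that excerpt, so the same logic can also be written with do and done. The sketch below is an illustration only: run_start_scripts and its directory argument are hypothetical names introduced here so that the loop can be tried on a scratch directory rather than on the real /etc/rc2.d, and /bin/sh stands in for /sbin/sh.

```shell
#!/bin/sh
# Hypothetical wrapper around the rc2 loop: announce each S script, then run it.
run_start_scripts() {
    dir=$1
    [ -d "$dir" ] || return 0
    for f in "$dir"/S*; do
        # -s is true only for files that exist and are not empty
        if [ -s "$f" ]; then
            echo "$f"
            /bin/sh "$f" start
        fi
    done
}

# Example: run_start_scripts /etc/rc2.d
```

The shell expands S* in sorted order, which is why the numbers in names such as S20 and S72 control the start sequence.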

Reason

More information is always better. When it comes to troubleshooting a system that won't boot all the way, the more you can learn during the boot process, the more quickly you can find the problem.

Real World Experience

Most workstations come up within two to five minutes. The boot process on workstations is faster than on their multiprocessing server counterparts, and problems are often quicker to diagnose. In most cases you would never need to turn on the expanded monitoring as described. Many users who come from the MS-DOS world are intimidated by the flood of messages that echo to the console when it is turned on. They usually comment on how intense and slow the boot process seems compared with MS-DOS and Windows when monitoring of this type takes place.

This is really a time saver at the server level. Breaking out of a server that is hanging at boot time is especially dangerous when you don't know at what point or where the system hung. Having the system echo more information is not for the user's benefit; it is for yours. So activate it and let it all scroll by.

Other Resources

Man pages:

rc, rc2, rc.boot, rc.local