Chapter 4: System Monitoring: 4.2 Starting with a Fresh Install

UNIX Hints & Hacks

Chapter 4: System Monitoring

Sections in this Chapter:
4.1 Monitoring at Boot Time	4.5 Mail a Process	4.9 Monitoring with ping	4.13 Checking the Time
4.2 Starting with a Fresh Install	4.6 Watching the Disk Space	4.10 Monitoring Core Files
4.3 Monitor with tail	4.7 Find the Disk Hog	4.11 Monitoring Crash Files
4.4 Cut the Log in Half	4.8 Watching by grepping the Difference	4.12 Remember Daylight Savings Time

4.2 Starting with a Fresh Install

4.2.1 Description

All monitoring needs a base point from which to start. After your system is built or rebuilt and is ready to go into production, start taking snapshots of it. If you wait until the users are on the system, you never have a base set of figures from which to judge whether the system is overloaded. No two systems exhibit the same set of numbers. You can see this when various systems are compared with the uptime command; they all report different numbers.

Platform: HP--K460, 500MB memory, 2 CPUs, multiple database applications, 62 users

BASE:         7:35am  up 1 day,  2:52,   1 user, load average: 0.53, 0.29, 0.14
PRODUCTION:  10:17am  up 13 days,  3:06,  62 user, load average: 2.53, 2.29, 2.14

Platform: Linux--P166, 64MB memory, 1 CPU, Web Server, 3 users

BASE:         6:40am  up 1 day, 5:12,  1 user, load average: 0.00, 0.00, 0.00
PRODUCTION:  11:04am  up 2 days, 10:09,  3 user, load average: 0.23, 0.28, 0.31

Platform: SGI--Onyx2, 1GB memory, 16 CPUs, render server, 17 users

BASE:         7:26am  up 18:23,  1 users, load average: 0.01, 0.02, 0.02
PRODUCTION:   16:10pm  up 4 days 12:42,  17 users,  load average: 4.06, 4.02, 4.03

Platform: SCO--P150, 64MB memory, 1 CPU, database application, 5 users

BASE:         7:28am  up 1 day, 14:57,  1 users, load average: 0.03, 0.01, 0.01
PRODUCTION:   17:35pm up 2 days, 13:15,  5 users, load average: 0.44, 0.32, 0.42

Platform: Sun Sparc 20, 192MB memory, 1 CPU, Web server, 15 users

BASE:         7:27am  up 1 day(s), 14:59,  1 users, load average: 0.04, 0.05, 0.04
PRODUCTION:   15:20pm  up 4 day(s), 15:22, 15 users, load average: 1.43, 1.43, 1.62

These values were taken from various platforms when the systems were first built, with no users and in a quiet, idle state. The second set of load averages was taken with the systems in full production but not overloaded. You can see how different the values are. Your values will differ too, because so many combinations of applications and system configurations are possible.

When the system has users on it, the values grow. Monitor the load averages and compare them with disk I/O and memory usage (using the vmstat and sar commands); these are the figures to keep an eye on.
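One way to keep such snapshots is to append them to a file, first on the idle fresh build and later under production load. The following is only a sketch: the function name take_snapshot and the file name are hypothetical, and the arguments given to vmstat and sar (a one-second sample) are assumptions that may need adjusting for your flavor.

```shell
#!/bin/sh
# Sketch: append one timestamped snapshot to a baseline file.
# take_snapshot and the file name passed to it are hypothetical.

take_snapshot() {
    out=$1
    {
        echo "=== `uname -n` `date`"
        # Each tool is optional; skip any that this system lacks.
        for cmd in "uptime" "vmstat 1 2" "sar -u 1 1"
        do
            tool=`echo "$cmd" | cut -d" " -f1`
            if command -v "$tool" >/dev/null 2>&1; then
                echo "--- $cmd"
                $cmd 2>&1 || echo "($tool failed)"
            fi
        done
    } >> "$out"
}

# Example: take_snapshot /var/tmp/baseline.log
```

Run it once before the users arrive and again at intervals in production, so that the later entries can be read against the first.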

If and when the system peaks and begins to slow down, the uptime load averages should increase. Keep in mind that a load average of 4 can mean an overloaded system on one machine and an underused system on another, as the preceding examples show. When the system does peak and you have a base number to judge from, you can better tell whether the system is on its way to being overloaded or is underused.
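Judging against the base number can be sketched with two small helpers: one pulls the one-minute load average out of a saved uptime line, the other prints it next to the baseline. The names one_min_load and load_delta are hypothetical, and the sed pattern assumes the "load average:" wording shown in the listings above.

```shell
#!/bin/sh
# Sketch: compare a current load average with the recorded baseline.
# one_min_load and load_delta are hypothetical helper names.

# Read an uptime line on stdin and print its 1-minute load average.
one_min_load() {
    sed -e 's/.*load average[s]*: *//' -e 's/[, ].*//'
}

# load_delta BASE NOW: print both values and the change between them.
load_delta() {
    awk -v base="$1" -v now="$2" \
        'BEGIN { printf "base=%.2f now=%.2f delta=%+.2f\n", base, now, now - base }'
}

# Example with the HP figures above:
#   load_delta 0.53 2.53
```

The same delta means different things on different machines, so read it against that system's own history rather than against a fixed rule.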

Real World Experience

Monitoring the performance level of a production system is taken seriously in my world, as it should be in all worlds. If you know what the load average of your system is when it peaks and slows the system down, you can monitor the system with a very simplified script.

Example One

Flavor: AT&T

Shell: sh

Syntax:

uptime
cut [-d char] [-f value]
ps [-o options]
Mail [-s string] address
sleep value

One way is to execute the uptime command and monitor the load averages. If the load average reaches a certain threshold, the script can email you a message.

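#! /bin/sh
# peak: mail the process table when the 1-minute load average exceeds $1
MAX=$1
# Load averages are decimals, so awk does the comparison (test's -gt takes integers only)
while [ 1 ]
do
 LOAD=`uptime | sed 's/.*load average[s]*: *//' | cut -d"," -f1`
 if echo "$LOAD $MAX" | awk '{ if ($1 > $2) exit 0; exit 1 }'; then
   ps -e -o user -o pid -o pcpu -o comm | Mail -s "`hostname` OVERLOADED"  root@rocket.ugu.com
 fi
 sleep 5
done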

Line 1: Define the shell to be used.

Line 3: Pass the maximum threshold value into the variable $MAX.

Line 5: Begin monitoring endlessly.

Line 7: Get the uptime load average value.

Line 8: Check whether the load average exceeds the threshold level.

Line 9: If the threshold is exceeded, send an email to the system administrator with a copy of the process table containing the user, PID, percentage of CPU being used, and the command. The options to the ps command differ a little from flavor to flavor; see your man pages for the arguments needed on your specific platform.

Line 11: Pause for five seconds so that the script doesn't add to the load. (The value can be modified to fit your needs.)
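Because uptime reports load averages as decimals such as 2.53, the threshold check can be tried out on its own with sample values before the script is left running. This is a minimal sketch; the function name load_exceeds is hypothetical, and awk is used here because it compares decimal numbers.

```shell
#!/bin/sh
# Sketch: decimal-aware threshold check (load_exceeds is a hypothetical name).

# load_exceeds LOAD MAX: succeed when LOAD is numerically greater than MAX.
load_exceeds() {
    echo "$1 $2" | awk '{ if ($1 > $2) exit 0; exit 1 }'
}

# Example with figures from the listings above:
if load_exceeds 4.06 4; then
    echo "4.06 is over a threshold of 4"
fi
```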

To execute the script (saved here under the name peak), pass the threshold as its argument:

% peak 4

The value 4 was previously determined to be the load average at which one particular system was overloaded. Your threshold will almost certainly be different. When the threshold is exceeded, a listing of the process table is emailed to you:

   USER   PID %CPU COMMAND
gtromero  2299    3 view_serv
 vobadm  2300    2 vob_serve
  steve  6239    0 csh
   root  7517    0 db_server
 vobadm  2539    0 vob_serve
   root  7091   10 nsrexecd
   root  2744    0 rlogind
   root  7516    4 vobrpc_se
  bmaca  6607    0 csh
   root  4054    5 vobrpc_se
   sach  4899    0 emacs-19.
  medca  2745    0 tcsh
   root  7092    0 save
  medca  2766    0 view_serv
.
.
.etc...

Be careful: depending on the sleep value, if you are not around when the system overloads, hundreds of emails might pile up and fill your mailbox. To avoid this, either increase the sleep value or add a function that sends mail no more than two or three times.
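A cap on the number of messages could look like the following sketch. The names MAXMAIL, SENT, and may_send are hypothetical, and the limit of three is only an example value.

```shell
#!/bin/sh
# Sketch: allow at most MAXMAIL alert messages per run of the script.
# MAXMAIL, SENT, and may_send are hypothetical names.

MAXMAIL=3      # assumed cap; adjust to taste
SENT=0         # messages sent so far

# Succeed (and count the message) while the cap has not been reached.
may_send() {
    if [ "$SENT" -lt "$MAXMAIL" ]; then
        SENT=`expr $SENT + 1`
        return 0
    fi
    return 1
}

# In the monitoring loop, the mail command would be wrapped like this:
#   if may_send; then
#       ps ... | Mail -s "`hostname` OVERLOADED" address
#   fi
```

Once the cap is reached the loop keeps running but stays silent until the script is restarted.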

Other Resources

Man pages:

cut, mail, ps, sleep, uptime
