UNIX Hints & Hacks
Chapter 4: System Monitoring

All monitoring needs a base point from which to start. After your system is built or rebuilt and is ready to go into production, start taking snapshots of the system. If you wait until the users get into the system, you never really have a base set of figures to work from when judging whether the system is overloaded. No two systems exhibit the same set of numbers. You can see this when various systems are compared with the uptime command: they all report different numbers.
Platform: HP--K460, 500MB memory, 2 CPUs, multiple database applications, 62 users
BASE: 7:35am up 1 day, 2:52, 1 user, load average: 0.53, 0.29, 0.14
PRODUCTION: 10:17am up 13 days, 3:06, 62 user, load average: 2.53, 2.29, 2.14

Platform: Linux--P166, 64MB memory, 1 CPU, Web Server, 3 users
BASE: 6:40am up 1 day, 5:12, 1 user, load average: 0.00, 0.00, 0.00
PRODUCTION: 11:04am up 2 days, 10:09, 3 user, load average: 0.23, 0.28, 0.31

Platform: SGI--Onyx2, 1GB memory, 16 CPUs, render server, 17 users
BASE: 7:26am up 18:23, 1 users, load average: 0.01, 0.02, 0.02
PRODUCTION: 16:10pm up 4 days 12:42, 17 users, load average: 4.06, 4.02, 4.03

Platform: SCO--P150, 64MB memory, 1 CPU, database application, 5 users
BASE: 7:28am up 1 day, 14:57, 1 users, load average: 0.03, 0.01, 0.01
PRODUCTION: 17:35pm up 2 days, 13:15, 5 users, load average: 0.44, 0.32, 0.42

Platform: Sun Sparc 20, 192MB memory, 1 CPU, Web server, 15 users
BASE: 7:27am up 1 day(s), 14:59, 1 users, load average: 0.04, 0.05, 0.04
PRODUCTION: 15:20pm up 4 day(s), 15:22, 15 users, load average: 1.43, 1.43, 1.62
The BASE values were taken from various platforms when the systems were first built, with no users and in a nice, quiet, idle state. The PRODUCTION values were taken with the systems in a full production state, but not overloaded. You can see how different the values are. Your values will differ too, because so many combinations of applications and system configurations can be set up and running.
When the system has users on it, you see the values grow. Monitor these values and compare them to disk I/O and memory usage (with the vmstat and sar commands). These are the values you need to keep an eye on.
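One way to keep those snapshots is to pull the three load averages out of each uptime line and log them with a timestamp. The following is only a sketch under stated assumptions: the function names load_averages and snapshot and the log path are hypothetical, and the sed pattern assumes the output contains the text "load average:" as in the samples above.

```shell
#!/bin/sh
# Hypothetical helper: print the three load averages found in one line
# of uptime output, separated by single spaces. Assumes the line
# contains "load average:" (some systems print "load averages:").
load_averages() {
    echo "$1" | sed -e 's/.*load averages*: *//' -e 's/,/ /g' -e 's/  */ /g'
}

# Hypothetical helper: append a timestamped snapshot to a log file.
# The default path is only an example; pick one that suits your site.
snapshot() {
    log=${1:-/tmp/load_baseline.log}
    line=`uptime`
    now=`date`
    loads=`load_averages "$line"`
    echo "$now $loads" >> "$log"
}
```

Calling snapshot from cron while the system is still idle, and again once it is in production, would give you both sets of figures to compare later.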
If and when the system does peak and begins to slow down, the uptime load average numbers should increase. Still, keep in mind that a load average of 4 might mean an overloaded system on one machine and an underused one on another, as seen in the preceding examples. When the system does peak and you have a base number to judge from, you can better understand when the system is on its way to being overloaded or underused.
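To judge a reading against your base number, you can simply subtract one from the other. This is a small sketch, not a method prescribed here: the function name load_growth is hypothetical, and awk is assumed because sh arithmetic handles whole numbers only.

```shell
#!/bin/sh
# Hypothetical helper: print how far a current load average ($1) sits
# above a recorded base value ($2), to two decimal places.
load_growth() {
    awk -v cur="$1" -v base="$2" 'BEGIN { printf "%.2f\n", cur - base }'
}

load_growth 2.53 0.53   # HP sample: prints 2.00
load_growth 4.06 0.01   # SGI sample: prints 4.05
```

The same growth can mean very different things on different machines, so the result is only useful next to that system's own history.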
Monitoring the performance level of a production system is taken seriously in my world, as it should be in all worlds. If you know what the load average of your system is when it peaks and slows the system down, you can monitor the system with a very simplified script.
Flavor: AT&T
Shell: sh
Syntax:
uptime
cut [-d char] [-f value]
ps [-o options]
Mail [-s string] address
sleep value
One way is to execute the uptime command and monitor the load averages. If the load average reaches a certain threshold, the script can email you a message.
#! /bin/sh
# The load is cut to its whole-number part because test compares integers only.
MAX=$1

while [ 1 ]
do
    LOAD=`uptime | cut -d":" -f4 | cut -d"," -f1 | cut -d"." -f1`
    if [ $LOAD -ge $MAX ]; then
        ps -ef -o user -o pid -o pcpu -o comm | Mail -s "`hostname` OVERLOADED" root@rocket.ugu.com
    fi
    sleep 5
done
Line 1: Define the shell to be used.
Line 3: Pass the maximum threshold value into the variable $MAX.
Line 5: Begin monitoring endlessly.
Line 7: Get the uptime load average value.
Line 8: Check whether the load average exceeds the threshold level.
Line 9: If the threshold is exceeded, send an email to the system administrator with a copy of the process table containing the user, PID, percentage of CPU being used, and the command. The options in the ps command differ a little from flavor to flavor; see your man pages for the arguments needed on your specific platform.
Line 11: Pause for five seconds so that the script doesn't add to the load. (The value can be modified to fit your needs.)
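The -gt and -ge operators of test compare whole numbers, so a load average such as 2.53 has to be either truncated or compared with another tool. A minimal sketch of a decimal-aware check follows, assuming awk is available; the function name load_exceeds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) when the load in $1 is
# strictly greater than the threshold in $2, comparing as decimals.
# Adding 0 forces awk to treat both values as numbers, not strings.
load_exceeds() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 > max + 0) }'
}

if load_exceeds 4.06 4; then
    echo "over threshold"
fi
```

With a check like this, the threshold passed to the script could itself be a decimal value such as 3.5.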
To execute the script, use the following command:
% peak 4
The value 4 was predetermined to be the maximum load average at which a particular system was overloaded. Your threshold value will almost certainly be different. If the threshold is reached, you should receive a listing of the process table by email:
USER      PID   %CPU  COMMAND
gtromero  2299  3     view_serv
vobadm    2300  2     vob_serve
steve     6239  0     csh
root      7517  0     db_server
vobadm    2539  0     vob_serve
root      7091  10    nsrexecd
root      2744  0     rlogind
root      7516  4     vobrpc_se
bmaca     6607  0     csh
root      4054  5     vobrpc_se
sach      4899  0     emacs-19.
medca     2745  0     tcsh
root      7092  0     save
medca     2766  0     view_serv
. . .etc...
Be careful; depending on what your sleep value is set to, if you are not around, you might have hundreds of emails waiting for you and filling up your mailbox. To correct this, either increase the value of the sleep or add on a function that mails only two or three times at the maximum.
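The mail-limiting function mentioned above could look something like the following. It is a sketch only: the names MAXMAIL, SENT, and may_mail are hypothetical, and the cap of three is an assumed value.

```shell
#!/bin/sh
# Hypothetical sketch: allow at most MAXMAIL alerts per run of the script.
MAXMAIL=3      # assumed cap; adjust to fit your needs
SENT=0         # number of alerts permitted so far

# Succeeds while fewer than MAXMAIL alerts have been permitted, and
# counts each one. Call it in the main shell, not inside a pipeline,
# or the updated counter is lost in a subshell.
may_mail() {
    if [ "$SENT" -lt "$MAXMAIL" ]; then
        SENT=`expr $SENT + 1`
        return 0
    fi
    return 1
}
```

Inside the monitoring loop, you would wrap the ps and Mail line in a test such as "if may_mail; then ... fi" so that the fourth and later alerts are skipped.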
Man pages:
cut, mail, ps, sleep, uptime
© Copyright Macmillan USA. All rights reserved.