UNIX Hints & Hacks
Chapter 9: Users
In working with UNIX servers, outages are a way of life and working around users' schedules can sometimes be a difficult chore. There are three types of outages you will encounter: routinely scheduled outages, emergency outages, and unexpected outages.
Routinely scheduled outage: In a production environment, these are fixed outages that take place at a specific time and day of the month, and all users should be aware of them. They give an administrator a window of opportunity in a 24/7 environment. Users are told that the system may be unavailable to them during these times. The windows are typically scheduled for when the environment and the system will be impacted and utilized the least, and backups and batch jobs are often scheduled around them. Workstations can also have scheduled outages for software and operating system updates, patches, and other preventative measures.
Emergency outages: These outages are typically scheduled within 6 to 24 hours of discovering a hardware or software failure that disrupts or impacts the server but doesn't render it useless. Not all users will be aware of the failure. Examples include a tape device that becomes inactive or zombie processes that need cleaning up.
Unexpected outages: An unexpected outage can affect many users, and they usually know right away when it happens. The cause is usually a failure of the drives, CPU, memory, or some other piece of hardware in the system. A crash of the operating system can also force an unexpected outage. These outages can take place on both workstations and servers. On workstations, typically one or two users are affected; when a server takes an unexpected outage, thousands of users can potentially be affected.
The biggest question on every user's mind is how long the system will be down. Every outage is different and each will take a different amount of time to complete. Unless you have done similar outages in the past you will have to use your own judgment. Here is a generic timetable I put together that you may be able to apply to your own environment.
Reboot time: Every platform's shutdown and boot cycle takes a different amount of time. By now you should have rebooted your systems enough to have an idea of how long the various systems in your environment take to boot. This can range from two minutes for a workstation to 20-40 minutes for a large multiprocessing server that may not shut down gracefully and has to progress through filesystem checks. Allow twice the time it takes to reboot the system in case it doesn't boot on the first attempt after the changes.
New hardware installation: For simple plug-and-play hardware devices, allocate 15 minutes. For complete RAID arrays or rack mounting, allow 30-60 minutes per rack. Add the equivalent time of three reboots in case the hardware isn't recognized by the system, and add another 30 minutes for troubleshooting and diagnostics in case something goes wrong. You can also ask a support engineer who does this on a regular basis for a rough estimate of the time it will take.
Replacement hardware: Most hardware today is plug and play, even in the UNIX world. If the devices are not hot swappable, allocate 20 minutes for memory. CPUs can take from 10 to 60 minutes. Typically, it is faster to replace CPUs on servers than on workstations. Allocate the time of three reboots in case the hardware isn't recognized by the system, and add another 30 minutes for troubleshooting and diagnostics in case something goes wrong. A support engineer can provide a rough estimate of the time it is going to take.
Patches: Most patches today can be installed while the system is in a multiuser state. The outage will only be as long as it takes to reboot. However, you should double the amount of time for the outage to cover a back-out plan if the installation process fails. If the patches require taking the system to a single-user state, use the amount of time for three reboots, plus 15 minutes for the patch install (more if the patches are over 50 MB), and 15 minutes to back out if necessary.
Software installations: Most software installations can be done while the system is in a multiuser state and require only the time for a reboot of the system, plus an extra 20 minutes for troubleshooting in case something goes wrong.
Vendors: If you don't know the vendor support engineer who is sent out to your site to work on the outage, allocate an extra 30-40 minutes for troubleshooting your environment. Engineers who are unfamiliar with your configuration will tend to spend time analyzing it and will be a little slower than a support engineer who knows your systems.
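The arithmetic in the timetable above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, all values are in minutes, and the defaults are simply the figures quoted in the list. Measure the reboot time for your own systems and adjust the rest to your environment.

```python
# Rough outage estimates, in minutes, based on the generic timetable above.
# Function names are hypothetical; reboot_min is the measured time for one
# reboot of the system in question.

def reboot_allowance(reboot_min):
    """Reboot time: allow twice the measured reboot time."""
    return 2 * reboot_min


def new_hardware_allowance(reboot_min, install_min=15):
    """New hardware: install time (15 for plug-and-play, 30-60 per rack),
    plus three reboots, plus 30 minutes for troubleshooting."""
    return install_min + 3 * reboot_min + 30


def multiuser_patch_allowance(reboot_min):
    """Patches installed in multiuser state: one reboot, doubled to
    cover a back-out."""
    return 2 * reboot_min


def single_user_patch_allowance(reboot_min, install_min=15, backout_min=15):
    """Patches needing single-user state: three reboots, plus the install,
    plus the back-out."""
    return 3 * reboot_min + install_min + backout_min


def software_install_allowance(reboot_min):
    """Software installs: one reboot plus 20 minutes for troubleshooting."""
    return reboot_min + 20


if __name__ == "__main__":
    reboot = 10  # hypothetical server that takes 10 minutes to reboot
    print("new hardware:", new_hardware_allowance(reboot), "minutes")
    print("single-user patch:", single_user_patch_allowance(reboot), "minutes")
```

If an unfamiliar vendor engineer is doing the work, add the extra 30-40 minutes from the list to whichever figure applies.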
Here is a way that you can run late on a scheduled outage and still look like a UNIX guru! There is a formula that a technician, James Doohan, used while determining the amount of time an outage would take when he was based on a ship.
He concluded that you could take the real time it would take to perform the outage and multiply it by two. This would be the allowable total time the outage should not exceed. If the real time exceeded two hours, then you would instead take the real time and add half of its value, and this too would equal the allowable total time the outage should not exceed. The equation would appear as follows:
Where,
RT = Real estimated time for the outage.
TOTAL = Total time for the outage to take place.
And,
If RT < 2 hours, then TOTAL = RT * 2
If RT >= 2 hours, then TOTAL = RT + (RT / 2)
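As a minimal sketch, the formula can be written as a small Python function. The function name is hypothetical, times are in hours, and the two-hour boundary is treated as inclusive, following the condition in the equation.

```python
def allowable_outage_hours(real_hours):
    """Return TOTAL, the time an outage should not exceed, given RT.

    RT under two hours is doubled; RT of two hours or more gets an
    extra half of RT added instead.
    """
    if real_hours < 0:
        raise ValueError("real_hours must not be negative")
    if real_hours >= 2:
        return real_hours + real_hours / 2
    return real_hours * 2


if __name__ == "__main__":
    for rt in (0.5, 1, 2, 4):
        print(rt, "hours of real work ->", allowable_outage_hours(rt), "hours")
```

Note that estimates just under two hours get a larger allowance than an estimate of exactly two hours (1.9 hours becomes 3.8, while 2 hours becomes 3), so use some judgment near the boundary.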
Although Jim's superiors never knew of his formula, they sometimes considered his estimates to be a lot of time to devote to outages, but they often allowed the time. Jim's thinking was that by allocating as much as twice the amount of time to certain outages, he would have enough time to fix any problems that developed from the changes that occurred. What makes this such a good formula is that if you finish before or at the time you originally estimated, everyone will think you are ahead of the game and a total UNIX guru! If you run over the time you originally estimated, you still have a buffer zone for finishing the job. If you finish before the end of the actual scheduled outage, you still look like a guru for finishing on schedule. Try it. It actually works!
Users deserve to be notified of changes in the status of a system or server that can affect their workflow and productivity. If you don't notify them, your phone and any help desk system you may have in place will suffer a large number of phone calls. The amount of notice that you give to users depends on the type of outage.
Routinely scheduled outage: All users should be aware of fixed-schedule outages. Because these windows of opportunity are not utilized on every scheduled date, it is best to send out a notice three days prior to the outage. This will give users plenty of notice that the server will be down at the specific date and time. There will always be a few users who never get the notice and still call you or the help desk.
Emergency outages: These are the hardest outages to schedule because you provide users with only 6 to 24 hours' notice. They typically hate this form of outage and will always try to push you off to a later date. The only problem is that the longer you wait, the worse things can get for the system. You sometimes need the backing of management because these outages disrupt the users' productivity.
Unexpected outages: These are unavoidable, and you are forced to hold the notification to the users until after the workstation or server is back online. When you do send out the update notices, be truthful and honest about what took place. Don't get too technical. A generic description is fine for the user, but management will require technical descriptions and plans for future prevention. Therefore, you will need to prepare two notices.
From time to time, an administrator might consider rebooting a production system during production hours and disguising it as an unexpected outage. There are times that we would love to do this, but we cannot. Anyone who has tried it will admit that whatever can go wrong will go wrong. Wait and schedule an emergency outage.
You should try to avoid too many emergency outages. If you schedule an outage once a week on a high-availability server, management will want to know why the system is going down so much. The worst part is that you will find yourself in one meeting after another explaining to various people what is going on with the servers. If you are the one who picked out the server, you have even more on the line than just explaining the outages.
Not all outages will affect all the users. A company I once worked for would always send out notifications of an outage to the entire company. After a while, I was answering more calls from users asking whether they would be affected by the outage than from those who were actually affected by it. The problem was that the notices never said who would be affected.
When you write the outage notice, be clear, concise, informative, and to the point. Provide all the dates, times, and reasons for the outage. The longer and more drawn out the notice is, the more likely users are to brush over it and not read the important information that may impact their workflow. When I write outage notices, I always keep them simple.
Subject: ROCKET - OUTAGE: Jan 25, 8:00-8:30pm
AN OUTAGE IS SCHEDULED ON "ROCKET"
DATE: January 25, 1999
TIME: 8:00pm - 8:30pm (30 Minute Outage)
DESCRIPTION: A reboot is needed to make Y2K patches active on the server ROCKET.
JUSTIFICATION: To bring the system up to a state of being Y2K compliant.
IMPACT: All UNIX users will be affected. Home directories will be unavailable. Printing will be offline.
Please direct all questions and concerns to <Your System Administrator>
Thank you for your patience.
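If you send notices often, a small template helps keep them consistent. The following Python sketch fills in the fields of the sample notice above. The function and parameter names are hypothetical, and the layout is only one possible arrangement of the same fields.

```python
from datetime import datetime, timedelta


def clock(moment):
    """Format a time of day like '8:00pm'."""
    return moment.strftime("%I:%M%p").lstrip("0").lower()


def format_outage_notice(host, start, minutes, description,
                         justification, impact, contact):
    """Return (subject, body) for an outage notice in the style shown above."""
    end = start + timedelta(minutes=minutes)
    window = f"{clock(start)} - {clock(end)}"
    subject = f"{host} - OUTAGE: {start:%b} {start.day}, {window}"
    body = "\n".join([
        f'AN OUTAGE IS SCHEDULED ON "{host}"',
        f"DATE: {start:%B} {start.day}, {start.year}",
        f"TIME: {window} ({minutes} Minute Outage)",
        f"DESCRIPTION: {description}",
        f"JUSTIFICATION: {justification}",
        f"IMPACT: {impact}",
        f"Please direct all questions and concerns to {contact}",
        "Thank you for your patience.",
    ])
    return subject, body


if __name__ == "__main__":
    subject, body = format_outage_notice(
        host="ROCKET",
        start=datetime(1999, 1, 25, 20, 0),
        minutes=30,
        description="A reboot is needed to make Y2K patches active "
                    "on the server ROCKET.",
        justification="To bring the system up to a state of being "
                      "Y2K compliant.",
        impact="All UNIX users will be affected.",
        contact="<Your System Administrator>",
    )
    print("Subject:", subject)
    print(body)
```

Keeping the IMPACT field mandatory in the template is a simple way to make sure every notice says who will be affected.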
There are several ways that you can notify users that there will be an outage. Depending on the severity of the outage, you may want to plan more than one way of notifying the users who will be affected.
Email: Because just about everyone at a company, education facility, or corporation has email, this is the best way to notify all the users that will be affected. Distributed mailing lists can be set up for each system containing a list of any users that would be affected by the outage.
/etc/motd: The message-of-the-day file is useful for notifying all the users that physically connect and log in to the server. If the user is accessing the system through a client-side application, the user will never see the notification displayed in the message of the day.
wall: The UNIX wall command will send a notification to all the terminal sessions that various users have open. Again, you risk not notifying those users who are connecting to the server through a client-side application. If this method is still useful, run this command twice on the day of the outage to let the users know it will be happening.
Intranet Web sites : If you have an intranet Web site that is widely used by the user community, post a bold message on the home page of the site notifying the users of the outage.
Client-side applications : Some third-party applications and home-grown software have the ability to broadcast messages to all the users who are using the client-side application. If you have this ability, it is a fantastic way to make sure that all users who are logged into the server will see the messages.
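Two of the methods above, /etc/motd and wall, are easy to script. The Python sketch below is a minimal illustration with hypothetical function names. It assumes the message of the day is a plain file that you may append to (this usually requires root, and some systems generate the file dynamically) and that a wall command is on the PATH and reads its message from standard input.

```python
import os
import subprocess
import tempfile


def post_motd(notice, motd_path="/etc/motd"):
    """Append the notice to the message-of-the-day file."""
    with open(motd_path, "a") as motd:
        motd.write(notice.rstrip("\n") + "\n")


def broadcast_wall(notice):
    """Send the notice to all open terminal sessions by piping it to wall."""
    subprocess.run(["wall"], input=notice, text=True, check=True)


if __name__ == "__main__":
    # Demonstrate against a scratch file rather than the real /etc/motd.
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "motd")
        post_motd("OUTAGE on ROCKET: Jan 25, 8:00pm - 8:30pm", path)
        with open(path) as motd:
            print(motd.read(), end="")
```

Remember to remove the notice from the message-of-the-day file once the outage is over, and keep in mind that neither method reaches users who connect through a client-side application.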
© Copyright Macmillan USA. All rights reserved.