Recovery from failure
This is a cookbook for systems staff who need to restart or otherwise recover the Sun ONE server, but who are not experts in the Sun ONE software. This is a list of things I've had to do in the past.
These instructions are for recovery from a netapp crash. I haven't had a system crash yet, so I don't know what's involved. It should be similar, except that recovery of the databases should not be needed.
Note on the database technology
The Sun ONE software uses the Sleepycat Berkeley DB. For details, see http://www.sleepycat.com. Dirserv uses version 3.2.9. Calserv and apparently mailserv use 2.6.7 for version 5.x. Version 6.0 of calserv and I assume mailserv will use 3.2.9. It appears that they use the transactional DB. This uses log files to improve performance. Transactions are first written to a log file and then to the normal database. In principle this is a very resilient system, as long as its assumptions aren't violated. I believe our environment does not violate the assumptions.
/army/sunone/dbtools has tools that can be used for diagnosis and recovery. In dbtools there are separate directories for each of the applications containing the right version for that application. The .sun directories are pointers to Sun's copy of the tools. I'd probably start with these where they exist, although I've built my own copies from source. For documentation see db tools version 2.6.7 and db tools version 3.2.9.
Note that backups are done using netapp snapshots. It is safe to back up a database using cp (or possibly dd, if cp violates certain assumptions about atomicity by using mmap), as long as the database files are copied before the log files. With the snapshots, however, everything is effectively captured at once. Thus I believe that netapp snapshots are valid backups, as long as you get the whole directory complete with log files.
The Sun code should call the right routines to do recovery after failure. However the code assumes that all database and log files are present. If any are missing, more complex recovery is needed, and it is not clear that the Sun software does this automatically, except for the calendar system's recover utility.
It is fairly important to have at least some log file when doing a normal recovery. That allows partially completed transactions to be backed out.
The example on which this page is based was a netapp failure where the file system went off line during operation. In principle this should result in a recoverable situation. In practice, recovery looped while processing the log files. I ended up deleting the log files. There wasn't a lot of choice, as the applications wouldn't continue otherwise.
I have considered removing log files as part of the abnormal recovery process. However unless we have further experiences like this I'm not going to do so. The only case I've seen where this kind of manual recovery was needed is when the file system went south. I hope we won't see that again.
In principle the log files are needed to restore the databases to consistency. It is probably better to let the system process the log files, even if it ends up looping at the end. At any rate, if you do have to remove log files, I would try doing it one at a time. That is, if there are several log files, I would remove the most recent (the one with the highest number) first, then the next, etc.
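The one-at-a-time removal can be sketched as a small POSIX sh function. This is only a sketch: the function name and the hold directory are hypothetical, and it assumes the log files follow the usual Berkeley DB naming of log. followed by a zero-padded number, so that a plain sort puts the most recent file last. It moves the file rather than deleting it, so it can be put back.

```shell
# Move the highest-numbered log file out of a database directory.
# Assumption: log files are named log.NNNNNNNNNN (zero-padded), so a
# lexical sort is also a numeric sort and the last name is the newest.
# usage: hold_newest_log DB_DIR HOLD_DIR
hold_newest_log() {
    db_dir=$1
    hold_dir=$2
    newest=$(ls "$db_dir" | grep '^log\.[0-9][0-9]*$' | sort | tail -n 1)
    if [ -z "$newest" ]; then
        echo "no log files in $db_dir" >&2
        return 1
    fi
    mkdir -p "$hold_dir" || return 1
    # Move instead of rm, so the file can be restored if this was a mistake.
    mv "$db_dir/$newest" "$hold_dir/" || return 1
    echo "$newest"
}
```

With the server stopped, call it once, retry the startup, and call it again only if recovery still loops.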
[Incidentally, I believe I can greatly simplify this, but it's going to take some testing with simulated failures.]
Notify the users
Make sure you update the rulink web page to tell users that the system is down, and give a predicted uptime. /army/sunone/servers/apache/htdocs/index.html
Consider rebooting
One of the problems would have been prevented had we rebooted. However it's probably best to make the software not start automatically after the reboot. To make sure of this, do "rm /army/sunone/CURRENT" before rebooting.
[The other problems could be prevented by deleting some state files in a restart. I'm going to modify the scripts to do this, but I need more testing first.]
IP address
Make sure the floating IP address is assigned to your system. "ifconfig -a" should show eri0:1 as 128.6.76.199.
Directory server
Get the directory server up. If it isn't up, other things will hang mysteriously. Make sure it stays up.
cd /army/sunone/servers/dirserv/slapd-rulink-ldap1
ps -ef | grep dirserv
If any processes are alive, try shutting the thing down with ./stop-slapd or if necessary kill -9.
./start-slapd.
If it returns, you're probably OK.
Try /bin/ldapsearch -b o=rutgers.edu uid=hedrick (you need the /bin or you'll get some odd thing from /usr/local/bin). Make sure you get a result.
Look in /army/sunone/servers/dirserv/slapd-rulink-ldap1/logs/access. Make sure your query shows there. It will take 30 sec or so for the system to write the entry to disk. (It buffers the logs, to reduce the amount of disk I/O.)
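Because the access log is buffered, it can help to poll for the entry instead of looking once. The following is a minimal sketch in POSIX sh; the function name, the 5 second polling interval, and the default 60 second timeout are my own choices, and it assumes the search filter text (for example uid=hedrick) appears literally in the access log line.

```shell
# Poll a log file until a line matching PATTERN shows up or the timeout passes.
# Returns 0 if a match was found, 1 if the timeout expired first.
# usage: wait_for_entry LOGFILE PATTERN [TIMEOUT_SECONDS]
wait_for_entry() {
    log=$1
    pattern=$2
    timeout=${3:-60}
    waited=0
    while :; do
        if grep "$pattern" "$log" >/dev/null 2>&1; then
            return 0
        fi
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 5
        waited=$((waited + 5))
    done
}
```

For example, "wait_for_entry logs/access uid=hedrick" from the slapd-rulink-ldap1 directory. Note that an older entry for the same uid will also match, so check the timestamp of the line, or search for a uid nobody has queried recently.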
Known failure modes
After a crash, /army/sunone/servers/dirserv/slapd-rulink-ldap1/logs/errors will show the startup. Then it will say:
[27/May/2003:14:50:59 -0400] - Detected Disorderly Shutdown last time Directory
Server was running, recovering database.
That's OK. Watch with ps or top. It should eventually quiet down, and start-slapd should return. If it just keeps running, eventually you'll give up. You have two reasonable approaches:
- remove the log file from /army/sunone/servers/dirserv/slapd-rulink-ldap1/db. This contains state information that could confuse things if it's wrong.
- move /army/sunone/servers/dirserv/slapd-rulink-ldap1/db out of the way and copy the previous night's from /army/sunone/.snapshot/nightly.0/servers/dirserv/slapd-rulink-ldap1/db. I'd do "cp -rp".
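The second approach could be scripted roughly as follows. This is a sketch, not a tested procedure: the function name and the ".bad.PID" suffix for the set-aside copy are hypothetical, and it assumes the directory server has already been stopped.

```shell
# Set the live database directory aside and replace it with a snapshot copy.
# Prints the name of the set-aside directory on success.
# usage: restore_db_from_snapshot LIVE_DB SNAPSHOT_DB
restore_db_from_snapshot() {
    live=$1
    snap=$2
    aside="$live.bad.$$"    # hypothetical name for the set-aside copy
    if [ ! -d "$snap" ]; then
        echo "no snapshot at $snap" >&2
        return 1
    fi
    # Keep the damaged database; it may still be needed for diagnosis.
    mv "$live" "$aside" || return 1
    # -r copies the whole directory, -p preserves owner, mode, and times.
    cp -rp "$snap" "$live" || return 1
    echo "$aside"
}
```

Here LIVE_DB would be /army/sunone/servers/dirserv/slapd-rulink-ldap1/db and SNAPSHOT_DB the matching directory under /army/sunone/.snapshot/nightly.0. Do an ls -l of the result before restarting.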
One other odd problem I saw, which also resulted in the system looping: it started fine, and responded to the search. Then after about 30 sec it started looping and stopped responding. A "truss -p <pid>" suggested that it was trying to write a log entry to access or errors. Apparently the files were in some odd state between Solaris and the netapp. In /army/sunone/servers/dirserv/slapd-rulink-ldap1/logs/ I renamed access and errors to *.hold. That fixed it. (If there's a possible file lock problem, you want to rename the file rather than deleting it. Otherwise the locked inode may get reallocated to something else.) I believe this kind of problem would not happen after a reboot.
Calendar server
Make sure the directory server is running. If things fail mysteriously, check the directory server again. Really.
First, here's where the action is:
/army/sunone/servers/calserv/SUNWics5/cal/bin
/army/sunone/calstore
The first is the calendar server binaries. The second is the data.
Do "ps -ef | grep calserv". Make sure nothing is running. If it is, try ./stop-cal and if necessary kill -9.
Problems that can occur with the following
I've seen both csdb check and start-cal go into a loop. In one case it was because the directory server wasn't responding. In the other I ended up having to delete the log* and *share files from calstore. I would start by removing *share. They are meaningless after a restart, but if there are problems with them they can cause trouble. log* theoretically has redo log information in it, so deleting them could cause you to lose recent changes. However if you have to remove them, do so. (I'd put these files into another directory rather than actually deleting them.)
Check the database
From the binary area, do "./csdb check". Normally output is fairly brief, just summarizing the number of things it found. If there's any corruption it will say things like "1 instance of corruption found". Output can be quite voluminous in this case. Unfortunately it always returns successful status, so you really have to look at the output to see whether it says anything involving corruption.
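Since the exit status is no help, one option is to save the output and search it. The sketch below is in POSIX sh; the function name is hypothetical, and it simply flags any line containing "corrupt" in any letter case. I have not verified what a clean run prints, so treat a hit as a cue to read the full output, not as a verdict.

```shell
# Scan saved "csdb check" output for any mention of corruption.
# Returns 0 if nothing matched, 1 if something did, 2 if the file is unreadable.
# usage: csdb_output_clean LOGFILE
csdb_output_clean() {
    log=$1
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    if grep -i 'corrupt' "$log" >/dev/null; then
        # Show the matching lines so they can be read right away.
        grep -i 'corrupt' "$log" >&2
        return 1
    fi
    return 0
}
```

A possible use, with /tmp/checklog as an arbitrary file name: run "./csdb check > /tmp/checklog 2>&1" and then "csdb_output_clean /tmp/checklog".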
It's important to do csdb check any time there has been any kind of abnormal event. If you let the system come up with a bad database, you're likely to lose user data later.
csdb check can be run with the system up, although I generally wouldn't start the system until the check has been done.
Fix the database if necessary
WARNING: this step involves copying files. Make very sure that they actually go where you want them to go. It's easy to type the wrong command and have files go one level too high or low. I'd do ls -l of what you think the destination should be, just to make sure you really put it there.
If csdb check finds any problem, you'll need to do csdb rebuild. csdb rebuild looks at a database (by default, the current one, but you can point it to a backup), tries to find all the information it can, and builds a new clean one. The new one is in a subdirectory of the binary area, called rebuilt_db.
Make sure the system is down.
Do "./csdb rebuild", probably with output to a file, e.g. "./csdb rebuild >& /tmp/rebuildlog".
If it rebuilds successfully, do "./csdb list -v rebuilt_db". Look at the number of calendars, events, and tasks. Remember those for comparison.
Look at /army/sunone/servers/calback to see what backups you have. Do "./csdb list -v /army/sunone/.snapshot/.../calstore/" for the last few backups to see how many calendars, events, and tasks there are. If the rebuilt database has significantly less data than the backups, you should suspect that data has been lost. Consider restoring from one of the backups rather than the rebuilt data. Of course if you do this you'll lose recently added events. But if the database is in really bad shape you may do better to lose a half day of work than to come up with a database that's half missing.
If you decide to use a backup, check it first, e.g. "./csdb check /army/sunone/.snapshot/nightly.0/calstore/". You can do "csdb rebuild" on a backup if you need to.
Copy the current database somewhere else: "cp -p /army/sunone/calstore/* SOMEPLACE".
Copy the database you decide to use to that, e.g. "cp -p rebuilt_db/* /army/sunone/calstore"
Remove extraneous files:
- rm /army/sunone/calstore/*share
- rm /army/sunone/calstore/log*
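The copy and cleanup steps above can be sketched as one POSIX sh function, which makes it harder to copy to the wrong level. The function name and argument order are hypothetical, and it assumes the calendar daemons are down. It ends with ls -l so you can see what actually landed in the store.

```shell
# Save the current store, install the chosen database, remove stale files.
# usage: install_calstore SOURCE_DB STORE_DIR SAVE_DIR
install_calstore() {
    src=$1
    store=$2
    save=$3
    [ -d "$src" ] && [ -d "$store" ] || { echo "missing directory" >&2; return 1; }
    mkdir -p "$save" || return 1
    # Keep a copy of the current database before touching it.
    cp -p "$store"/* "$save"/ || return 1
    # Install the database chosen for use (rebuilt_db or a snapshot).
    cp -p "$src"/* "$store"/ || return 1
    # Remove the state and log files, which no longer match the data.
    rm -f "$store"/*share "$store"/log*
    # Show the result so the destination can be checked by eye.
    ls -l "$store"
}
```

For example, "install_calstore rebuilt_db /army/sunone/calstore SOMEPLACE" from the binary area, where SOMEPLACE is wherever you want the old database kept.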
Start the daemons
./start-cal
Point a browser at https://rulink.rutgers.edu:1025 and make sure you get a login screen.
Mail server
The place you want to be is /army/sunone/servers/mailserv/msg-rulink-mail.
Do "ps -ef | grep mailserv". See if anything is running. If so, use ./stop-msg, or kill -9 if you have to.
Now ./start-msg ought to work. If there's a database problem, the first major item, "store" will hang. In this case, look at /army/sunone/mailstore/mboxlist/. As usual, you may have to delete *share or log*. Start with *share.
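Moving the files aside, rather than deleting them, can be sketched as below in POSIX sh. The function name and the hold directory are hypothetical. The pattern is passed as an argument so that *share can be tried first and log* only if that was not enough.

```shell
# Move files matching PATTERN from DIR into HOLD_DIR instead of deleting them.
# usage: hold_matching DIR HOLD_DIR PATTERN
hold_matching() {
    dir=$1
    hold=$2
    pattern=$3
    mkdir -p "$hold" || return 1
    # $pattern is deliberately unquoted so the shell expands the glob.
    for f in "$dir"/$pattern; do
        [ -e "$f" ] || continue    # the pattern matched nothing
        mv "$f" "$hold"/ || return 1
        echo "moved $f"
    done
}
```

With the mail server stopped, this might be run as "hold_matching /army/sunone/mailstore/mboxlist /army/sunone/mailstore/mboxlist.hold '*share'", where mboxlist.hold is just an example name. Quote the pattern so it is expanded inside the function and not by your login shell.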
Final checks
Once the major stuff is up, go back through /etc/rc.3/S9* and make sure no steps have been omitted. In particular, make sure stunnel and the admin servers are running. [or shut everything down and restart, as documented below].
If you deleted /army/sunone/CURRENT, do "echo `hostname` >/army/sunone/CURRENT", so that this system is shown as the current one. "setsystem show" should show the current hostname.
The simplest way to make sure everything is running is to do "setsystem give" and then "setsystem take". That will shut everything down and then restart it cleanly. This will only work if /army/sunone/CURRENT is set to the current system.
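Before running "setsystem give", it may be worth confirming that CURRENT really names this machine. A minimal sketch in POSIX sh follows; the function name is hypothetical, and it only compares the file contents with the output of hostname. It does not look at how setsystem itself reads the file.

```shell
# Succeed only if the CURRENT file names this host.
# usage: current_is_this_host [CURRENT_FILE]
current_is_this_host() {
    file=${1:-/army/sunone/CURRENT}
    if [ ! -r "$file" ]; then
        echo "$file is missing" >&2
        return 1
    fi
    # The file is expected to hold the same string that hostname prints.
    [ "$(cat "$file")" = "$(hostname)" ]
}
```

For example: "current_is_this_host && setsystem give && setsystem take".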
Once you're up, remove the note from index.html saying that the system is down. If you had to reconstruct the database, add something to the "News" box in index.html saying that due to a problem you had to reconstruct data. Describe any expected loss, and give an email contact.
Send email to hedrick@rutgers.edu explaining what you did and why.
For more information, contact
rulink-support@rutgers.edu
© 2007 Rutgers, The State University of New Jersey. All rights reserved.
