Rutgers, The State University of New Jersey

Failure Log

This log is a list of unexpected downtime or failures for rulink, i.e. events that were not during announced test time or other times announced in advance. For May 2007 and before, this log also includes failures for ldap.rutgers.edu.

November 19, 12:11PM. Calendar failed, but was restarted automatically.

November 12, 10:06AM. Calendar failed, but was restarted automatically.

October 29, 11:56AM. Calendar failed, but was restarted automatically.

October 29, 11:51AM. Calendar failed, but was restarted automatically.

October 29, 11:26AM. Calendar failed, but was restarted automatically.

October 29, 11:16AM. Calendar failed, but was restarted automatically.

October 27, 2:01PM. Calendar failed, but was restarted automatically.

October 27, 1:36PM. Calendar failed, but was restarted automatically.

October 26, 10:26PM. Calendar failed, but was restarted automatically.

October 26, 8:16PM. Calendar failed, but was restarted automatically.

October 26, 8:06PM. Calendar failed, but was restarted automatically.

October 9, 9:31AM. Calendar failed, but was restarted automatically.

September 16, 10:41AM. Calendar failed, but was restarted automatically.

September 10, 11:36PM. Calendar failed, but was restarted automatically.

July 15, 11:36PM. Calendar failed, but was restarted automatically.

July 13, 11:36PM. Calendar failed, but was restarted automatically.

June 17, 11:37PM. Calendar failed, but was restarted automatically.

June 16, 11:36PM. Calendar failed, but was restarted automatically.

May 25, 11:36PM. Calendar failed, but was restarted automatically.

May 19, 10:26AM. Calendar failed, but was restarted automatically.

April 23, 12:11PM. Calendar failed, but was restarted automatically.

April 23, 12:06PM. Calendar failed, but was restarted automatically.

April 22, 11:36PM. Calendar failed, but was restarted automatically.

April 9, 11:37PM. Calendar failed, but was restarted automatically.

March 26, 11:36PM. Calendar failed, but was restarted automatically.

March 23, 11:36PM. Calendar failed, but was restarted automatically.

March 20, 11:36PM. Calendar failed, but was restarted automatically.

March 13, 11:36PM. Calendar failed, but was restarted automatically.

March 6, 11:37PM. Calendar failed, but was restarted automatically.

January 31, 12:11AM. Calendar failed, but was restarted automatically.

January 30, 11:31PM. Calendar failed, but was restarted automatically.

January 27, 11:37PM. Calendar failed, but was restarted automatically.

December 16, 11:36PM. Calendar failed, but was restarted automatically.

December 11, 11:36PM. Calendar failed, but was restarted automatically.

December 10, 11:36PM. Calendar failed, but was restarted automatically.

December 8, 11:36PM. Calendar failed, but was restarted automatically.

December 6, 11:36PM. Calendar failed, but was restarted automatically.

December 3, 11:36PM. Calendar failed, but was restarted automatically.

November 23, 11:37PM. Calendar failed, but was restarted automatically.

November 12, 11:37PM. Calendar failed, but was restarted automatically.

November 11, 11:37PM. Calendar failed, but was restarted automatically.

October 14, 11:36PM. Calendar failed, but was restarted automatically.

October 9, 11:37PM. Calendar failed, but was restarted automatically.

September 21, 11:36PM. Calendar failed, but was restarted automatically.

September 20, 11:36PM. Calendar failed, but was restarted automatically.

September 6, 11:36PM. Calendar failed, but was restarted automatically.

September 3, 11:36PM. Calendar failed, but was restarted automatically.

August 24, 11:36PM. Calendar failed, but was restarted automatically.

June 30, 4:57PM. Calendar failed, but was restarted automatically.

June 21, 1:16AM. The mail system failed, but was restarted automatically.

June 2, 11:38PM. Calendar failed, but was restarted automatically.

June 2, 3:41PM. Calendar failed, but was restarted automatically.

June 2, 3:06PM. Calendar failed, but was restarted automatically.

June 2, 1:41PM. Calendar failed, but was restarted automatically.

June 2, 1:21PM. Calendar failed, but was restarted automatically.

June 2, 1:16PM. Calendar failed, but was restarted automatically.

May 30, 12:51PM. Calendar failed, but was restarted automatically.

May 30, 12:41PM. Calendar failed, but was restarted automatically.

May 29, 5:47PM. Calendar failed, but was restarted automatically.

April 27, 11:31PM. Calendar failed, but was restarted automatically.

April 2, 10:00AM. Certain users have been receiving "I/O Errors" when trying to access large mailboxes. It appears that the IMAP process was running out of memory. I've added a second IMAP process. I'm hoping that will keep either of them from getting too big.

April 2, 11:07AM. Calendar failed, but was restarted automatically.

March 19, 11:37PM. Calendar failed, but was restarted automatically.

February 28, 11:31AM. Calendar failed, but was restarted automatically.

February 28, 11:06AM. Calendar failed, but was restarted automatically.

February 27, 4:11PM. Calendar failed, but was restarted automatically.

February 22, 2:56PM. Calendar failed, but was restarted automatically.

February 22, 2:51PM. Calendar failed, but was restarted automatically.

February 22, 1:01PM. Calendar failed, but was restarted automatically.

February 21, 3:26PM. Calendar failed, but was restarted automatically.

February 21, 2:46PM. Calendar failed, but was restarted automatically.

February 21, 2:41PM. Calendar failed, but was restarted automatically.

February 21, 2:16PM. Calendar failed, but was restarted automatically.

February 2, 3:11PM - 4:29. Calendar and uwc were down several times. I was trying to update the SSL certificate, and very strange things happened. I still can't explain them, but the system does seem to be up with all certificates correct.

January 17. Calendar failed around 8am, 10:46am, 12:46am. There was some indication of database problems, so I took it down around 5:30pm to rebuild the database.

January 9, 4:26PM. Calendar failed, but was restarted automatically.

January 7, 4:41PM. The mail system failed, but was restarted automatically.

December 17, 3:56PM. Calendar failed, but was restarted automatically.

December 17, 3:36PM. Calendar failed, but was restarted automatically.

December 17, 3:31PM. Calendar failed, but was restarted automatically.

December 14, 3:54PM. The calendar has been crashing as quickly as we can bring it up. As an experiment I tried disabling access from outside the Rutgers network. It seems to be staying up. (My theory is that someone from outside is attacking it.) You should still be able to get to your calendar through the new user interface, http://rulink.rutgers.edu/uwc.

I have an emergency call in to Sun. I will likely leave it this way until late tonight (Friday night).

December 11, 9:36AM. Calendar failed, but was restarted automatically.

December 10, 11:11AM. Calendar failed, but was restarted automatically.

December 6, ca 10:30-11:00am. The old calendar interface (port 1025) apparently hung and stopped responding to requests. It was restarted, which seems to have fixed the issue.

November 19, ca 11:30-11:45am. The web server hung. Restarted. However it hung again. The problem seems to have been the calendar server. Restarted calendar around 11:45 and things seem OK.

October 18, 3am. The calendar server hung and had to be restarted. Down about 30 min. It appears that the web interface (uwc) may not have worked while the calendar was broken. Also, it appears that spam processing didn't come back up after its nightly restart. Spam processing was not happening from about 3am to 12:45 pm.

September 26, 10am? - 10:30am. uwc was down. It appears that the Java virtual machine ran out of memory. Restarted the web server. This is the first time we've had trouble with the JVM for a long time, but I'm going to schedule an update for uwc this Friday or next. We haven't applied patches for quite a while.

September 21, 10-11pm. Installed latest patches to mail and calendar.

?? - September 21. Calendar crashes: The calendar system is the least reliable part of rulink. The software has a history of crashing when it gets unusual input (e.g. lines that are too long). We have normally had at least one crash a week. We check for these and restart the server. For the last month or so it had gotten worse. I put up the newest version of the code Sept 21. As of 9/26 there have been no crashes since, so I'm going to say tentatively that this problem is solved (or at least back to normal). There were so many that I didn't document calendar crashes here as separate entries. For historical purposes here's a list:

6/5, 6/19, 7/3, 7/6, 7/10, 7/11, 7/12 2 times, 7/13 2 times, 7/16, 7/17, 7/20 2 times, 7/23 2 times, 7/26, 8/3 2 times, 8/9 2 times, 8/15, 8/21, 8/30 2 times, 9/4, 9/6, 9/7, 9/11 2 times, 9/12, 9/16, 9/17, 9/18, 9/20

?? - September 17. We have had periodic slowdowns in mail forwarded by the system. It is now 99% certain that this was due to slowness in spam checking. In May we started checking mail forwarded through rulink for spam, using fairly aggressive checks. Spam checking (at least with the software we're using) is fairly expensive. Once everyone was back in Sept we couldn't handle the load. The failure was subtle. It isn't that all spam checking failed. A few messages took too long. However this caused mail to back up in a way that eventually caused all mail to be delayed.

I have moved to a configuration where we will only do spam checking if the user requests it. The user can request it both for mail that is forwarded through rulink and for mail that is delivered locally, although the mechanism is different.
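
To illustrate how a few slow messages can delay everything behind them, here is a minimal Python sketch of a single scanner working through mail in arrival order. The function name and all of the numbers are made up for illustration; they are not measurements from rulink.

```python
def last_message_delay(n_messages, arrival_interval_s, normal_s, slow_s, slow_every):
    """Simulate one spam scanner handling messages first-come, first-served.

    Every `slow_every`-th message takes `slow_s` seconds to check; the rest
    take `normal_s`. Pass slow_every=0 for "no slow messages".
    Returns how long the last message waited from arrival to completion.
    """
    scanner_free_at = 0.0
    delay = 0.0
    for i in range(n_messages):
        arrives = i * arrival_interval_s
        # A message cannot start until it has arrived AND the scanner is idle.
        start = max(arrives, scanner_free_at)
        is_slow = slow_every > 0 and i % slow_every == 0
        scanner_free_at = start + (slow_s if is_slow else normal_s)
        delay = scanner_free_at - arrives
    return delay


# Hypothetical load: one message per second, half a second per check.
print(last_message_delay(1000, 1.0, 0.5, 120.0, slow_every=0))    # no slow messages
print(last_message_delay(1000, 1.0, 0.5, 120.0, slow_every=100))  # 1 in 100 is slow
```

With these made-up numbers, one 120-second check in every 100 messages is enough for the delay to keep growing, even though 99% of the messages are checked quickly.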

August 31 to 2:45 pm; caught up ca. 8:30pm. Rulink was down. A problem developed in access to the file server that Rulink uses. It is not clear whether this was due to network equipment or the file server itself. At any rate the failure made it impossible for the Rulink software to work. The network was rearranged to bypass the problem.

Rulink was up by around 2:45 pm. However mail was backed up. As of 8:30 pm the backlog seems to be cleared. See the next entry for more information about delays in mail delivery the week of Aug 27.

The events of Aug 14 and 15 were almost certainly early signs of the same problem. Recent calendar problems are almost certainly not connected.

August 15, 2007 15:33 - 16:21, 17:00-17:14. System was effectively down. The file server was not responding. Many of the symptoms were the same as yesterday, but it looks more like the problem with the directory server was a symptom, not the cause. The problem seems to be either with the file server or Solaris. It cleared up mysteriously around 16:20. Staff were unable to find anything wrong with the file server, and other systems using the same file server had no problem. We rebooted at 17:00 in case the problem is with Solaris (which seems possible, though by no means certain).

August 14, 2007 afternoon, ending 15:17. System was very slow. It looks like the problem was that the directory server had grown unreasonably large. I'm going to go back to restarting it daily. I find it a bit odd what effect this had. It caused very high I/O rates, which bogged down every part of the system. Mail sent from rulink was delayed. Those delays seem to have continued until about 20:00.

August 3, 2007 9:30 - 10:00. Calendar crashed several times. It appears that a user was repeatedly trying to edit very long notification messages. I suspect there's a problem with notifications longer than 200 or so characters.

July 27, 2007 I believe the ongoing problem of calendar crashes was fixed by the rebuild on July 23. We had been seeing problems daily before that, and haven't seen it since.

July 26, 2007 6:00 - ca 13:30. We had to restart the system because of maintenance work on the Netapp file server. When it came back up it appeared OK, but actually was generating a continuous stream of bogus email. That swamped the rest of the mail system. It took until 13:30 to clear the half million bogus messages. During this time, the system was working, but mail was very slow.

This was due to a known bug (Bugid 6548178). Around midnight July 27 I used the documented recovery procedure, and installed a workaround to make sure it won't happen again.

July 26, 2007 5:00 - 8:00. System was taken down because of maintenance work on the Network Appliance (Netapp) file server used by all OIRT applications. We didn't give as much notification as usual because originally we thought this downtime wasn't going to be visible.

July 18, 2007 6:45 - 16:30. LCSR Spam filtering was down. Mail was still delivered, so spam would have been treated as normal mail. This is the first time I've seen this, so there was no automatic check, as we have for other components. This weekend I'll add a batch job that checks the spam processing and restarts it if necessary.

July 16, 2007 11:30 - 11:50. Both the web server and calendar server were down. My suspicion is that it was the calendar server that was the cause. I restarted both, but things didn't get back to normal until the calendar system was fully restarted. We've been having lots of calendar server failures over the last week. I'm going to start collecting core files and get Sun involved.

late June - July 2 I don't have exact dates on this, but mail to mailing lists was being delayed. This was a consequence of improvements in spam processing done in May. They stressed a part of the system that didn't previously have to handle much volume. This was fixed by a reconfiguration July 2.

June 26, 2007 Delivery of spam to the inbox for forwarded mail seems to have been hung since 10am June 25. I believe the problem was fixed around 8:15 am. However it took until 8:50am June 27 to catch up on the backlog.

June 4, 2007 2pm - 6pm. Due to the earlier problems, when the calendar system tried to send reminders, the attempt failed. Unfortunately this created a completely invalid situation where the reminder system tried to send huge numbers of reminders. This deluged all the mail infrastructure, and delayed mail delivery. About a dozen people will have lost calendar notifications around 6pm, as a result of recovery processes. Starting around 4pm the calendar system was restarted a number of times.

June 4, 2007 applications on rulink didn't work from about noon to 2pm. It now appears that we had an unintended dependency on RCI's name server, and RCI was down for this period. We've found the cause and removed the dependency. [The Kerberos system attempts to look up an SRV DNS record for the Kerberos domain, even if it is configured so that it doesn't need the information in that record. The faculty Kerberos domain is RCI.RUTGERS.EDU. Unlike the rest of our systems, RCI has its own DNS server. Thus with RCI down, attempts to look up the SRV record for RCI.RUTGERS.EDU hung. We have added the necessary configuration information to turn off these SRV lookups.]

May 31, 2007 5am - 11am. Ldap data for Summer, 2007, courses disappeared in the nightly update. These courses were missing from the administrative data feed. I rebuilt the LDAP database using data from the previous day. We have been assured that the problems will be fixed by the next nightly update.

April 23, 2007 3pm - 3:30pm. All OIRT systems were unavailable due to network problems within Hill Center.

April 22, 2007 8:15pm - 10:15pm. All OIRT systems were unavailable from outside Rutgers, due to network problems with the Rutgers connection to the Internet.

April 11, 2007 around 5pm, a staff member accidentally disconnected the network. All OIRT systems were affected, including RULink and LDAP. Unfortunately simply reconnecting didn't fix things. The systems had hung. They all had to be restarted. It took about 40 min to fix the situation on all systems.

March 3, 2007 10pm, reloaded calendar, and a few minutes later uwc. We needed to update the timezones file for the new daylight savings rules. The uwc reload was because my calendar was showing wrong in uwc. The reload actually didn't do anything. Rather, my personal time zone was set to GMT. I hope I did that myself in the course of testing, and this isn't some system failure. If people suddenly start seeing times 5 hours off, they should check Options General and make sure they are in the right time zone.

Feb 27, 2007 around 8:00, ldap on rulink failed again. I'm now fairly sure that the failure was caused by DNS problems. It appears that the DNS servers are being taken down for maintenance. That should be OK -- every host should have more than one DNS server. But it appears that the rulink ldap server essentially hangs if the primary DNS server isn't working. I'm guessing that it does a reverse lookup for every incoming connection. Until I can find a way to disable this, I've set up a local caching server. That should eliminate any dependence upon a single external DNS server.

Feb 26, 2007 around 17:30, logins failed because the rulink LDAP server failed to respond. It had to be restarted twice. I turned off access from outside, on the theory that perhaps we were the victims of a denial of service attack. I turned it back on a couple of hours later, but I'm now recording where accesses come from, so if it crashes again there's a good chance I can see what query caused it. At about the same time there were problems with the University DNS infrastructure. It's possible that these were related. I've put the address of the Kerberos servers into /etc/hosts. That should lessen or eliminate the dependency of the LDAP server on DNS, though of course the rest of the mail system will still have problems if DNS is down.

Feb 20, 2007 minor issues remained with rulink. Users who reside in domains couldn't login. That was fixed around midnight. Also some configuration issues affecting Outlook addressbooks, also fixed late in the day.

Feb 19, 2007 ldap crashed several times. We're still not sure what happened.

Feb 17-19, 2007 rulink was taken down Feb 16 at 10pm to move to a new server. This was announced, so it wouldn't usually go here. However it took much longer to come up than expected. The process of moving the mail store to a different architecture didn't work. The tools for rebuilding the data didn't solve all the problems. I ended up having to write some of my own tools. The system was enabled for passthrough mail (mail to user@rutgers.edu, where the mail is forwarded) at about 8pm Saturday. Mail on rulink itself was up Saturday afternoon, but had enough trouble that it wasn't usable until Monday. 70% of the users were OK by noon, 100% by about 4pm.

Feb 13, 2007 06:28 - 08:11 - ldap.rutgers.edu was down. It should never be down for more than 5 min, because there's a cron job that should restart it. However I found a problem with it. Fixed around 8am Feb 13.

Feb 12, 2007 ca. 21:04 - ldap2.rutgers.edu (the backup server) was apparently down. I can't determine for how long, because I see no gap in responses to user queries.

Feb 12, 2007 17:19 - 17:59 - ldap.rutgers.edu was down. See entry for Feb 13.

Feb 12, 2007 11:20 restarted the various components to use a new SSL certificate. Should have been done the previous Friday night. I apologize.

November 15, 2006 21:22 - 21:25. ldap.rutgers.edu was down briefly to move files from the Netapp to local disk. This was in preparation for Netapp downtime. As of 21:25 neither ldap nor ldap2 has the netapp mounted.

November 13, 2006 18:40. Restarted the web server. uwc had not been working.

November 13, 2006 13:11-13:20. Rulink rejected mail during certain brief periods in this range. The problem seems to have been failure in the LDAP server. I'm going to put the new ldap code on rulink this weekend. I note that there was a change in Kerberos servers scheduled for today. It is possible that this caused a brief failure in Kerberos service, which might have caused problems for the LDAP Kerberos plugin.

November 3, 2006 09:38-10:17. ldap.rutgers.edu (i.e. no effect on rulink): There was a problem with the new ldap code that caused the primary not to respond to queries. I believe I know what happened, so there will be another attempt to use the new code at 10pm tonight. This time I'll fix the monitoring script to move back to the old code automatically, so any problems should last only 5 min.

November 2, 2006 22:52-22:56. I restarted ldap.rutgers.edu with the new code. This should have taken 5 sec, but the server didn't come down cleanly. It's lucky I did this, since it would almost certainly not have been restarted properly by the cron job later in the morning.

October 31, 2006 ...-8:30. The nightly web server restart didn't work. The server was up, but not responding. I tried a manual restart of the web server. That didn't work either. Because I couldn't fix it, and we were coming up on the start of the day, I thought the best approach was to restart the system. I believe IMAP and mail transport had been working fine, although attempts to bring them down cleanly failed. I didn't check calendar before the restart. But probably the only thing down was uwc. The rest would have been down only during the restart, which was about 10 min.

October 31, 2006 ldap.rutgers.edu has been hanging nightly. I can't tell when it started, but it may have been happening all fall. I check it every 5 min, and do a restart if necessary. So in general the problems haven't been noticed or reported. There was a period of a couple of days when it would be down for 2 hours or so, because it was only partly hung, so the 5 min job wouldn't catch it. I updated the job, and as far as I know the failures are now caught within 5 min. I've been working on the cause. As of Oct 31, I believe I have a fix. I'm testing it carefully, and intend to put it up Nov 3. If the problem becomes more serious I can put it up sooner.
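
The difference between a server that is gone and one that is only partly hung can be sketched in Python as below. This is only an illustration, not the actual 5 min job: the function names are made up, and query_cmd stands for whatever client command performs a real query against the server. The idea is that the check must run a real query under a time limit, since a hung server still has a process and may still accept connections.

```python
import subprocess


def check_service(query_cmd, timeout_s=30):
    """Run a real client query and classify the outcome.

    'ok'   - the query completed successfully
    'down' - the query failed outright (e.g. nothing is listening)
    'hung' - no answer within timeout_s; the process may exist but is stuck
    """
    try:
        result = subprocess.run(
            query_cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except OSError:
        return "down"  # the query command itself could not be started
    return "ok" if result.returncode == 0 else "down"


def needs_restart(state):
    """Both a dead server and a hung one should trigger the restart."""
    return state in ("down", "hung")
```

A job that only looks for a missing process would report "ok" in the hung case, which matches the couple of days when failures lasted 2 hours instead of 5 min.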

October 19, 2006 4:00-4:26. The web server restart done nightly failed: the server failed to die, and thus failed to come back up. I had to kill it manually.

October 9, 2006 Created script to restart the calendar server if necessary. There was some calendar failure in the previous week not recorded here that led me to do that. At this point all the major components are set to be restarted if there's a problem. Except the web server, which hasn't crashed so far.

August 31, 2006 7:30-9:30. It appears that smtp_server crashed at 7:30. It was restarted automatically, but came up without the ability to accept the STARTTLS command. This would cause all attempts to send mail from a desktop system to fail, since we require SSL/TLS. A manual restart brought it back up with no problem. This interacted very badly with Mac mail. Mac mail gave an error, then tried again without TLS. Of course that failed. But I had to restart the Mac mail program to get things reestablished.
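
A check after an automatic restart could confirm that the server still advertises STARTTLS, not just that it is running. The Python sketch below is only an illustration using the standard smtplib module; the function names and the default port are assumptions, not what actually runs on rulink.

```python
import smtplib


def offers_starttls(ehlo_reply: str) -> bool:
    """True if an EHLO reply lists STARTTLS.

    Expects the reply text as smtplib returns it: one extension per line,
    with the numeric reply codes already removed.
    """
    for line in ehlo_reply.splitlines():
        words = line.strip().split()
        if words and words[0].upper() == "STARTTLS":
            return True
    return False


def server_offers_starttls(host: str, port: int = 25, timeout_s: float = 10.0) -> bool:
    """Connect to the server, send EHLO, and inspect the advertised extensions."""
    with smtplib.SMTP(host, port, timeout=timeout_s) as smtp:
        code, reply = smtp.ehlo()
        return 200 <= code < 300 and offers_starttls(reply.decode("ascii", "replace"))
```

If server_offers_starttls returns False right after a restart, the restart job could try again rather than leave the server up in a state where desktop clients cannot send mail.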

August 26, 2006 15:44-15:51. Apparently https: web mail access failed. The cron job restarted mail at 15:51. I'm guessing mshttpd was dead, since the cron job detects only processes that are gone, not processes that are there but unresponsive. According to the operators, the AV system failed at the same time. Not sure what is going on.

August 21-26, 2006 This is a whole saga. The system wasn't down. However there have been problems both delivering mail and receiving it. These are due to suboptimal configuration, and ISPs' response to them. Basically we were generating lots of bogus bounce messages due to lots of bogus spam attempts. Some ISPs got annoyed at this "backscatter" and blacklisted the AV system. Since we sent much of the mail via the AV system, this caused problems outgoing. I believe these were fixed August 26, with a reconfiguration so that we send stuff directly. Incoming mail was slowed down during the same period due to problems with the layer 4 switch between us and the AV systems.

July 12, 2006 8:06 am. Restarted calendar. We were down to 2 processes running.

June 7-8, 2006 9:30 pm - 9:05 am. Mail from rulink to systems outside Rutgers failed. A workaround was installed around 8:30 am, but a couple of messages failed around 9:02 because we prematurely removed the workaround for 30 sec or so. This problem is a leftover from the change in IP address. The Anti-Virus system was not updated to accept mail from the new IP address.

June 7, 2006 7 - 9:30 pm. Rulink was down. This was announced at least a week in advance. We needed to move the computer to a different location. This was both a physical move and a change in IP address. Changes in IP address don't always take effect immediately. If you are accessing RULink from outside Rutgers it could take up to a day for your system to find the new address.

June 6, 2006 9:54 - 10:20am. Calendar was down. Web daemon crashed.

April 25, 2006 3:44 - 4:17 pm. Calendar was down. We don't know what was wrong; the web daemon just crashed. A restart was enough. Interestingly, uwc was working, although it would have failed once I started the restart process.

March 13, 2006 10:30 - 11:30 am. Rulink (the calendar) was mostly hung. The initial report was that people were unable to subscribe to a calendar, both in uwc and the old interface. However shortly the web server hung (I believe because it hit the maximum allowed number of threads, due to operations hanging). At that point it became clear that portions of the calendar system were hung, particularly enpd. The nightly calendar backup was also hung. The safest approach seemed to be to restart calendar and then web. Calendar had been up since Feb 17. Unfortunately at this point I'm not going to be able to get enough information to know why the calendar system hung. That would have required leaving the system hung for an extended period while I investigated.

Mar 1, 2006 7-9am. Network was down for reconfiguration. rulink itself was fine, but no one could get to it.

Feb 26 - Mar 3, 2006 Note on this episode. The issue was entirely with the uwc. The traditional mail and calendar interfaces were fine during this whole period.

We had a week of instability. In retrospect the problem is clear, but it took a while to locate. A patch installed mid-Feb had a subtle bug: connections from uwc to LDAP are never closed. There's a finite number of connections allowed (12 in the default config). When those were exceeded, address book lookups in UWC failed.

As a workaround I set LDAP so it didn't use a fixed size pool. It should then connect separately for each event. That did in fact happen but it still didn't close the connections. So we ran out of file descriptors. That caused no end of random failures both in uwc and the admin PHP interface (because they're both in webserv).

Finally I partially backed out of the patch (T27) and moved to a newly released core patch, -23. That did fix these problems, though it meant I couldn't test code to properly display calendars for which the user has only free-busy access.

On Mar 10, I moved to T29 and a developer hot patch, to try and fix displaying calendars with free-busy access. That worked, but the LDAP problem came back. I found a hack: I've disabled LDAP pooling, but set the directory server to time out idle connections, currently after 4 min. This seems to do the trick, though it causes mail to open and close directory connections a lot more often.
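
The effect of the idle timeout can be shown with a toy Python model of a client that opens a connection for every lookup and never closes it. The class, the limit of 256 connections and the 30 sec lookup interval are invented for illustration; only the 4 min (240 sec) timeout comes from the entry above.

```python
class DirectoryServer:
    """Toy model of a server with a connection limit and an optional idle timeout."""

    def __init__(self, max_connections, idle_timeout_s=None):
        self.max_connections = max_connections
        self.idle_timeout_s = idle_timeout_s
        self.last_used = {}  # connection id -> time of last activity
        self.next_id = 0

    def reap_idle(self, now):
        """Drop connections that have been idle for at least idle_timeout_s."""
        if self.idle_timeout_s is None:
            return
        stale = [c for c, t in self.last_used.items() if now - t >= self.idle_timeout_s]
        for c in stale:
            del self.last_used[c]

    def connect(self, now):
        self.reap_idle(now)
        if len(self.last_used) >= self.max_connections:
            raise ConnectionError("connection limit reached")
        self.next_id += 1
        self.last_used[self.next_id] = now
        return self.next_id


def lookups_served(server, n_lookups, interval_s):
    """A leaky client: one new connection per lookup, never closed."""
    served = 0
    for i in range(n_lookups):
        try:
            server.connect(now=i * interval_s)
        except ConnectionError:
            break
        served += 1
    return served


print(lookups_served(DirectoryServer(256), 1000, 30))                      # no timeout
print(lookups_served(DirectoryServer(256, idle_timeout_s=240), 1000, 30))  # 4 min timeout
```

Without the timeout the leaky client stops once the limit is reached; with it, the server cleans up behind the client and the number of open connections stays small. The cost is the one noted above: connections are opened and closed far more often.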

Feb 8, 2006 4:33am - ca 12:18. ldap.rutgers.edu (it did not affect rulink). I restart ldap nightly at 4:33, because otherwise it becomes unstable after a few days. This morning it did not start. In the time available before 8:30am (when users start coming back), I was unable to figure out what was happening. Thus I copied the entire ldap system from the backup server to the primary. It worked fine. It was up at 8:15am. However in the process I ended up with an SSL cert claiming that the system is ldap2.rutgers.edu. Applications that check the cert might have failed. At 12:18 I did a restart (downtime about 10 sec) to put the right cert in. I have a duplicate server on another system, which will be used to figure out what is going on. I will no doubt end up restarting the server late tonight.

Jan 28, 2006 midnight to about 3:16pm. Unified interface login didn't work reliably. A new certificate was loaded last night. Unfortunately it was signed by a different certificate authority than in the past. Thus portions of the system didn't recognize it. I needed to update the certificate authority information. For some reason the failure didn't show the first time I logged in after a server restart. Since I only tested it once at midnight, I didn't see the problem.

Jan 17, 2006 ca 9:00 - 13:19. Calendar was down. This is a known bug, which Sun promises a fix for. We've installed a workaround that will make sure it doesn't cause trouble. This actually has nothing to do with the new version, except that in putting up the new version I changed a configuration setting (from a non-standard one to a recommended one). The workaround is to put it back to its old value. There are other problems with the new version, but none cause the system to be down in this way.

Jan 13-16, 2006 22:00 Jan 13 on. Took the system down at 10pm Jan 13 to upgrade to JES 2005Q4. Back up around 2am. The upgrade was tedious but straightforward. Unfortunately there were two serious problems with the new version: attempts to look at many calendars in monthly view caused the web server to go into a loop, and in the mail system, some spam rules occasionally don't trigger. The calendar issue is a serious one because it puts the web server into a loop. I had to restart it without warning Monday, 17:20, because otherwise I couldn't get enough CPU time for other tasks that are critical (including the one to fix the problem). Unfortunately I'll have to keep restarting it as necessary, though by now enough calendars have been updated that the problem may not happen as quickly.

I found that the calendar problem is due to a slight change in calendar format. Starting Jan 16 afternoon I ran a script over every calendar to update it. Unfortunately the script is taking a long time. At best it will finish by midnight.

I took down all cron jobs at the start of the upgrade. I'm just now putting back the one that updates ldap. A few users have been unable to login because they are new and the feed script hadn't run. I'm running it manually now, at 17:00 or so. They should be on by 18:00 to 19:00, depending upon how long it takes to run the script.

Jan 4, 2006 10:00-12:00. Mail was delayed, in some cases over an hour. We were being deluged by email from an errant process in administrative computing. It had recovered by 12:00.

Dec 31, 2005 10:00-13:30. This was announced downtime. System was down for a software upgrade. It turned out not to be possible to do the upgrade within the announced time window, so we brought the system back up. The final upgrade will be done in mid-January, 2006.

Nov 2, 2005 23:05. Restarted calendar system, because 2 of 4 server processes had died. Downtime was a couple of minutes.

Oct 30 - 31, 2005 3pm to 3pm. Mail delivery was more or less down due to problems in the AV systems. Those were due to a flood of spam caused by compromised systems in several departments at Rutgers. Rulink itself wasn't affected. But since all mail going in or out of the system goes through the AV boxes, rulink mail might as well have been down.

Oct 4, 2005 9:38. Restarted the calendar web interface. It's supposed to have 4 processes. All but one had died, making it very slow. Downtime was about 30 sec, but users may have had to login again.

Sept 25/26, 2005 15:00 Sept 25 - 11:30 Sept 26. A small fraction of email was rejected with the error message given when an external site attempts to use rulink as a relay. Email from the anti-virus systems was being misclassified as from outside Rutgers. It's not clear what caused this. Restarting the daemon that receives mail fixed it.

June 29, 2005: 13:00-13:45. imap seems to have been down. Actually I believe it was working for existing connections but not permitting new ones, so probably some people would have seen problems but others would have been OK. 13:00 is my best estimate of when it started, based on the logs, but I can't absolutely guarantee that. I ended up restarting the whole mail system, because stopping and starting imap alone didn't fix things.

June 22, 2005: 6:13, ldap.rutgers.edu. Employee information was unavailable when we did the nightly update of data. The new consistency checks worked, so we rejected the zero-length data file. However employee information may be a day out of date.

June 21, 2005: 11:00, ldap.rutgers.edu was down for about 10 min. The 5 min test job believed it was dead, and restarted it. It takes about 10 min to restore the database to consistency after restart. I believe this problem was ultimately caused by the one below.

June 21, 2005: 6:00 to approx 15:35. The nightly load of data from administrative systems was missing certain student data. There are some consistency checks, all of which unfortunately succeeded. Thus we cleared most of the student-specific information. A rebuild started around 12:50 pm and finished at 15:35. I've made the consistency checks more paranoid, to avoid a reoccurrence.
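The kind of consistency check described in the June 21 and June 22, 2005 entries can be sketched as follows. This is only an illustration, not the actual check: the function name `feed_looks_sane`, the one-record-per-line file layout, and the 10% drop threshold are all assumptions.

```python
import os

def feed_looks_sane(path, previous_count, max_drop=0.10):
    """Return (ok, reason) for a nightly feed file.

    Hypothetical check: reject a missing or zero-length file, and reject
    a file whose record count fell by more than max_drop compared with
    the previous load. Assumes one record per non-blank line.
    """
    if not os.path.exists(path):
        return False, "missing file"
    if os.path.getsize(path) == 0:
        return False, "zero-length file"
    with open(path, encoding="utf-8", errors="replace") as fh:
        count = sum(1 for line in fh if line.strip())
    if previous_count > 0 and count < previous_count * (1 - max_drop):
        return False, f"record count fell from {previous_count} to {count}"
    return True, f"{count} records"
```

When the check fails, the previous day's data is kept rather than cleared, which is the behavior the June 22 entry describes.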

May 20, 2005: 4:40 to 5:20, ldap.rutgers.edu was down. The automatic restart didn't work properly, leaving it in a hung state. The backup server was fine.

May 4, 2005: 19:40 to 20:00 system was down. I unintentionally rebooted it. There was a problem with the startup scripts, which I believe I have now fixed. (Otherwise downtime should have been more like 5 min.)

April 20, 2005: 10:19 to 1:44 or so the backup LDAP server was down. There was about a day's warning, so this isn't really a failure.

February 6, 2005: After midnight, I restarted all components, to install new SSL certificates. This should have resulted in 30 sec downtime. Unfortunately the calendar turned out not to accept the certificates. I ended up restarting it several times. Total downtime was still only a couple of minutes.

February 2 - 3, 2005: The web server ran out of memory on Feb 3. As a result, the unified interface had intermittent failures about 9 - 10:15 am. A few isolated failures occurred starting around 5pm Feb 2. The system restarted the web server at 10:15, after which things were OK. I have returned to restarting the web server every morning at 3am. That should prevent the problem, although it will make it more difficult to gather the data necessary for Sun to assess the problem.

January 3, 2005: Did an emergency restart of the web server because in moving to the new server I hadn't installed php.ini. The admin interface didn't work. Downtime 30 sec for UWC at 10:45 am.

January 2, 2005: Did 2 emergency restarts of the web server because the UWC appeared to be hung, 14:33 and 16:41. Downtime about 30 sec just for admin tool and UWC. Logins to the UWC were appearing to hang. In fact they didn't quite; they would work after about 2 min timeout. The log was complaining that an internal connection from tie.rutgers.edu didn't recognize the client certificate. Changing rulink and rulink2 from the tie address to 128.6.76.209 in /etc/hosts seems to have fixed it. I have no idea why, except that this is how /etc/hosts was set up on suit. I'll document this in the setup documentation.

December 18, 2004: Did an emergency restart of the web server at midnight. The web server has been getting into a loop. From dump analysis it looks like a problem in UWC (the new user interface). The problem comes on slowly, so for the moment I'm restarting the web server at 3am every day. That produces about 40 sec of downtime for the admin tool and UWC.

November 26, 2004 - December 6 (transition to new software): System was down for two days after Thanksgiving for a move to version 6. That was announced in advance. There have been issues since that, which we're continuing to work on. Those generally didn't result in downtime, so they aren't documented here. However the calendar was down Sunday, December 5, for about 30 min as part of debugging a problem, and a couple of restarts were done after 6pm December 6 for debugging. The new user interface did not work Monday December 6, although all other services were OK.

October 4, 2004: The directory server for rulink was restarted automatically. As far as I know, no users saw any problem.

September 7, 2004 and August 31, 2004: The directory server for rulink was down briefly. This would have caused logins for mail and calendar to fail. On Sept 7, downtime was from 10:10 to 10:52. On August 31, it was 15:03 to 15:19. The reason for the crash is not yet clear. Since we're about to move to a major new release, it probably doesn't make sense to diagnose issues with the directory server right now. However normally this would be handled by a job that runs every 5 min and checks system status. It appears that this job had exited abnormally, leaving a lock, so it was not functional. That has been fixed. If there's another failure, it should be fixed automatically within 5 min.
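One common way to keep a leftover lock from disabling a periodic job is to record the owner's process ID in the lock file and reclaim the lock when that process no longer exists. The sketch below illustrates that idea only; it is not the fix that was actually applied. The function names are hypothetical, and it assumes a Unix system where `os.kill(pid, 0)` tests for a process without signalling it.

```python
import os

def lock_holder_alive(path):
    """Return True if the PID recorded in the lock file is still running."""
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return False  # unreadable or empty lock: treated as stale here
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

def acquire_lock(path):
    """Take the lock, reclaiming it if the previous owner has died."""
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
    try:
        fd = os.open(path, flags)
    except FileExistsError:
        if lock_holder_alive(path):
            return False
        os.unlink(path)  # stale lock left by an abnormal exit
        try:
            fd = os.open(path, flags)
        except FileExistsError:
            return False  # another job reclaimed it first
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))
    return True

def release_lock(path):
    """Remove the lock at the end of a normal run."""
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
```

Treating an unreadable lock as stale leaves a small window between creating the file and writing the PID; a production version would need to close that gap.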

August 31, 2004: ldap.rutgers.edu was down for a 10 sec period every 5 min, from 14:56 to 15:46. This was due to an administrator error. While 10 sec downtime is barely noticeable in itself, it caused more serious problems for Radius, whose backup strategy was still not finally set up.

July 21, 2004: The calendar was down for about 5 min around 13:20 (although users already logged in probably saw only 1 min of downtime). I am bringing up the next release of the mail and calendar. Although I was installing in different directories on a different machine, the installation process did me the favor of removing the old (production) version. Restoration from backup was easy. It didn't touch configuration files or user data, so there should be no observable symptoms other than a few minutes downtime. Mail was not touched.

July 20, 2004: The web page claimed there was a failure, but it wasn't real. The 5 min check wasn't properly interlocked with the nightly restart, so it reported a failure while the directory server was being restarted.
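One way to interlock a periodic check with a scheduled restart is for the restart job to touch a marker file before it begins, and for the check to report maintenance instead of failure while that marker is recent. The following is a sketch of that idea under stated assumptions; the marker path, the 30 minute window, and the function names are all hypothetical.

```python
import os
import time

def in_maintenance(marker_path, max_age_seconds=1800):
    """True if the restart job touched the marker within the window."""
    try:
        age = time.time() - os.path.getmtime(marker_path)
    except OSError:
        return False  # no marker: no restart in progress
    return age <= max_age_seconds

def report_status(service_up, marker_path, max_age_seconds=1800):
    """Classify the result of one 5 min check."""
    if service_up:
        return "ok"
    if in_maintenance(marker_path, max_age_seconds):
        return "maintenance"  # expected downtime, not a failure
    return "failure"
```

The age limit means a marker that the restart job forgot to remove cannot hide a real failure indefinitely.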

June 7, 2004: rulink calendar crashed at 12:06. The autorestart process had not gotten it back up by 12:18. On looking at it, I thought the database check that is done during restart had hung. It now appears that in one of the patches, the behavior of the database checker changed. It now requires that part of the calendar system has to be up. I have fixed the restart process to handle this, and (on the backup server) verified that it works properly for both a normal calserv crash and one that has database corruption.

June 1 - 7, 2004: Ldap.rutgers.edu (no effect on rulink): Moved to a new version of the directory server on ldap.rutgers.edu. This version crashed twice a day, June 1 - 4, and once June 7. The autorestart process restarted it, so downtime was about 5 min in each case. The new version did two things: (1) it added a feature needed by OIT, and (2) it was a major rewrite, to clean up the code and to move from out of date API calls to the current equivalents. I had hoped that moving to current APIs might deal with a memory leak, and in fact there is some evidence that it did. However the system was unacceptably unstable. I conjecture that it was (2) that caused the problems. Thus as of June 8 I have put up a version based on the last stable code, with a minor change to add the minimal new features needed.

May 14, 2004: Mail and calendar were slow and/or not working from 10:30 to 11:03. It appears that the directory server was in a progressively worsening state, with response getting slower. It was fixed at 11:03 when it finally failed the automatic check and the scripts restarted it. I've checked everything, and the pieces all seem to be working properly.

May 8, 2004: Took calendar down for a restart, because notifications weren't working. It now appears that there's an interaction between mail and calendar notifications, so that taking down the mail system took down notifications. Because I had to forcibly kill the system, a scan was needed. Hence it was down from 21:28 to 21:50.

April 8, 2004: Took calendar down for a restart, because notifications weren't working. In retrospect, there's a good chance restarting runotify alone would have worked. The stop didn't work, so I had to forcibly kill the calendar system. That raises the danger of a database inconsistency, so I had to run csdb check. That takes about 12 min, so the calendar system was down for 13 min or so at about 7:30 am.

March 23, 2004: Mail was slow during the morning. A mail loop caused the anti-virus appliances to be very slow. All mail into and out of RULink passes through those appliances. There was no problem on RULink itself. This was fixed by early afternoon.

February 12, 2004. Restarted mail around 16:00 and again around 17:46. The SSL certificate was expired. I had installed the new one but not restarted imap. Downtime was just a few seconds, so most users won't have seen it. However some users saw error messages about expired certificates. Generally they will have clicked "continue".

NOTE: I've now found the cause of the periodic directory server restarts. There's a problem with the code that results in a hang every month or so. Until it is fixed, the hang can be prevented by restarting the servers every week during scheduled downtime.

February 10, 2004. Automatic checker restarted the rulink directory server. This runs every 5 min, so services may have been interrupted for up to 5 min. [This log combines information about the rulink directory server with ldap.rutgers.edu. Thus it may not be obvious that the last failure on rulink was October 26, 2003.]

January 27, 2004. LDAP service on ldap.rutgers.edu was down from 14:49 to 14:54. There is no indication why the server was down. It was restarted by the automatic checking process. The backup was in operation at the time, so processes configured to use the backup would have seen no problem.

December 12, 2003. ldap.rutgers.edu was not responding to queries from 13:27 to 14:23. The process was running but non-responsive. There is a script that tests for the system being down. However it did not detect a situation where the system is running but not responding. I will modify it to do actual queries, and restart if there is no response. The backup was in operation during this period, so services that use the backup would have been OK. At the moment it appears that only a couple of clients move to the backup.
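The difference between "the process exists" and "the process answers" can be sketched as below. This is an illustration, not the script in use: `query` stands for whatever real request the check would make (for example an LDAP search), `restart` for the site's restart command, and both names and the timeout value are assumptions.

```python
import threading

def is_responsive(query, timeout=10.0):
    """Run a real query and report whether it completed in time.

    A hung server never returns, so the query runs in a daemon thread
    and is given up on after `timeout` seconds.
    """
    outcome = {}

    def run():
        try:
            query()
            outcome["ok"] = True
        except Exception:
            outcome["ok"] = False  # an error reply also counts as a failure

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout)
    # If the thread is still blocked, "ok" was never set.
    return outcome.get("ok", False)

def check_and_restart(query, restart, timeout=10.0):
    """Restart the service when a real query gets no answer."""
    if is_responsive(query, timeout):
        return True
    restart()
    return False
```

A check that only looks for the process would have passed during the December 12 incident; a check shaped like this one fails as soon as the query times out.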

November 24, 2003. System was taken down from noon to 12:10pm to fix hardware. This was announced the previous week, so I don't consider it a failure, but I'm noting it here in case any users are curious why the system was down.

October 26, 2003. Automatic checker restarted the directory server at 15:16:09. Presumably it had died within the previous 5 minutes. No indications of any user-visible symptoms, though I'd still be interested in knowing why it happened. From the logs it looks like the mail system generated temporary failure codes (as it should), so a couple of messages received during this time would have been kept at the sender's site and retransmitted. I.e. no mail should have been lost.

October 9, 2003. Mail addressed to users at moltar.rutgers.edu had problems this afternoon, starting at 14:17 and ending around 16:30, due to an uncoordinated change to DNS data. Moltar is not a normal hostname for rulink, but some people who participated in early testing might still use addresses with moltar in them. Those addresses should still work, although generally it's best to use @rutgers.edu rather than @moltar.rutgers.edu.

October 8, 2003. Mail delivery was delayed 11:03 to 12:24. A process had been started without the proper setup for our configuration. This caused it to get into a loop, holding up delivery of mail. IMAP and Web mail continued to work fine. However new incoming mail was delayed. Fixing it required a mail restart. Depending upon your software, you might have had to login again around 12:30, although most software other than MS Outlook does this automatically.

(We use a Network Appliances file server. A special environment variable is needed for all mail processes, in order to avoid getting them into a loop. That variable was set for all normal processes, but I had not realized that some processes get started in a different way. I'm fairly certain that this environment variable is now set for all processes.)

October 1, 2003. The directory server was down from 21:36 to 21:57. This wouldn't have interrupted people currently logged in, but would have prevented new logins. There's no obvious way to know what happened. However I've now fixed things so that a timed process checks all processes every 5 minutes. In the future it will restart processes in cases like this automatically.

September 15, 2003. Web mail interface was down from 8:15 to 9:05. (The delay was caused by the fact that no one notified me it was down.) Unable to restart just the web mail interface, so had to restart all of mail. That caused about 30 sec of downtime in imap and pop, but web mail did come back. The web mail process crashed, in the Personal Addressbook code.

August 6, 2003. Calendar was down from 11:42 to 13:15. The calendar application crashed. Problem has been reported to Sun. We know enough about why it happened that I have been able to make changes that will prevent it from occurring again. This appears to be the same problem that occurred July 23-24.

July 23-24, 2003. At 10am July 23, one of three calendar processes crashed. The system was still operational, but calendar operations were slowed. This was fixed at 11:37 am July 24 by restarting the calendar system. That caused an outage of about 2 minutes. There's a tool that checks the system every 5 minutes. It has been updated to detect this situation and warn the staff. Had it been in place, we would have restarted the calendar late at night July 23.
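Detecting a partially failed service comes down to comparing the number of running server processes with the number expected. The sketch below shows one possible shape for such a check. The function names are hypothetical, the process name is a parameter rather than a known value, and `ps -e -o comm=` is assumed to be available on the host.

```python
import os
import subprocess

def running_commands():
    """Return the command name of every process, one per list entry."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

def count_named(commands, name):
    """Count entries whose basename matches the given process name."""
    return sum(1 for c in commands if os.path.basename(c.strip()) == name)

def classify(running, expected):
    """Distinguish a full outage from a service limping along."""
    if running >= expected:
        return "ok"
    if running == 0:
        return "down"      # restart automatically
    return "degraded"      # still serving users: warn the staff
```

With three processes expected, `classify(count_named(running_commands(), name), 3)` would have returned "degraded" on the morning of July 23.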

July 1, 2003. Parts of the mail software were down for about 2 min each for the secure services changeover. The changeover was announced, but no specific time was given, so this could be construed as unannounced downtime.

May 30, 2003. System was unusably slow 3:46 to 5:23. Staff error: While fixing a problem, one step had unexpected consequences. The system was technically up, but response from the directory was slow enough that many operations timed out, making large portions of the system unusable.

May 26, 2003. System was down about 12:30 to 3pm. Failure of two disk drives caused the Netapp file server to take itself offline for manual recovery. The Netapp can recover from a single drive failure. For safety reasons, if a second failure occurs while it is rebuilding the file system from the first, the system takes the volume offline. Recovery was slower than it might have been because this is the first time this has happened to us. We've been using Netapps for three years with no other serious problems.

May 16, 2003. Operation moved from pilot to production hardware. The current system should be regarded as essentially a new system as of this date. Somewhat reduced stability would not be surprising for the next couple of weeks, since both hardware and setup problems often show up in new systems at a higher rate.


For more information, contact rulink-support@rutgers.edu
© 2007 Rutgers, The State University of New Jersey. All rights reserved.

 
