Tax Season Turbulence

This is a quick story about the importance of IBM Content Manager OnDemand (CMOD) performance tuning: when the server is down and customers are blowing up the helpdesk phones, you’re WAY too late.

My phone beeped with a text message on a Tuesday morning.  A customer I’d worked for two years prior asked me to give them a call.  When I asked how I could help, the response was panic.  “Our CMOD admin is on vacation on the other side of the world, and the server has been up and down all last week, and it’s getting worse.  Is there any way you could help us?”  It was a weird coincidence: I was headed to the train station in less than an hour, and their office was literally across the street.  I packed my bags and zipped over.

It was the typical “war-room” scenario.  There were a dozen people in an 8-seat meeting room.  Three people on their cell phones, and several people talking on top of each other on the speakerphone.  Graphs with red bars projected onto the screen.  Two people I’d worked closely with on my last project were there, and they brought me up to speed.  The problem was straightforward — it was tax season, and thousands of people were fetching their tax forms online.  Documents were being retrieved without issue — but queries were taking upwards of 40 seconds.  When there are 100 people searching every minute, it doesn’t take many 40-second searches to add up to an unresponsive server.
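To put rough numbers on that last point, here is a back-of-the-envelope sketch using Little's law (average queries in flight = arrival rate × time per query).  The 100 searches per minute and the 40 seconds come from the story above; the 0.05-second "healthy" query time is my own assumption, added only for comparison.

```python
def concurrent_queries(arrivals_per_minute: float, seconds_per_query: float) -> float:
    """Little's law: average in-flight queries = arrival rate * time per query."""
    arrivals_per_second = arrivals_per_minute / 60.0
    return arrivals_per_second * seconds_per_query


# Figures from the story: 100 searches a minute, 40 seconds each.
slow = concurrent_queries(100, 40)

# Assumed figure for a healthy, indexed query: 50 milliseconds.
fast = concurrent_queries(100, 0.05)

print(f"slow queries in flight: {slow:.1f}")   # 66.7
print(f"fast queries in flight: {fast:.3f}")   # 0.083
```

In other words, at 40 seconds per search the database is juggling roughly 67 queries at any given moment, instead of a small fraction of one.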

We checked the OnDemand System Log to look for the Application Groups that were performing the worst.  There were three.  Then we looked more closely at those query records — and paid special attention to the fields that were being searched for.  Their applications usually searched for a year’s worth of data, and they used one of two fields to find individual documents.
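For anyone who wants to do the same kind of triage with a script, here is a minimal sketch of the idea.  It assumes the query records have already been pulled out of the System Log into simple tuples; the record layout, the Application Group names, the field names, and the timings below are all made up for illustration and are not the real System Log format or the customer's data.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical, hand-made records: (application group, elapsed seconds, fields searched).
records = [
    ("TAX_FORMS", 40.0, ("account_no", "tax_year")),
    ("TAX_FORMS", 44.0, ("customer_id", "tax_year")),
    ("STATEMENTS", 0.5, ("account_no",)),
    ("STATEMENTS", 1.5, ("statement_date",)),
    ("NOTICES", 30.0, ("account_no", "tax_year")),
]


def slowest_groups(records, top=3):
    """Rank application groups by average query time, slowest first."""
    by_group = defaultdict(list)
    for group, seconds, _fields in records:
        by_group[group].append(seconds)
    averages = {group: mean(times) for group, times in by_group.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top]


def field_usage(records, group):
    """Count how often each field appears in searches against one group."""
    counts = Counter()
    for rec_group, _seconds, fields in records:
        if rec_group == group:
            counts.update(fields)
    return counts.most_common()


print(slowest_groups(records, top=2))
print(field_usage(records, "TAX_FORMS"))
```

The first list tells you where to look, and the second tells you which fields are worth checking for an index.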

Next stop, the IBM CMOD Admin Client…  We checked the configuration of the Application Groups to see which fields were being indexed.  It turned out that NEITHER of the two most popular fields for finding documents was being indexed by the database, which meant that the database engine was repeatedly searching MILLIONS of documents for the overwhelming majority of searches.  We needed to index the fields, but doing so would make the server even MORE busy, and reject hundreds or thousands of users while the indexes were being built.  We sent the message to management, and waited about 15 minutes for them to give us permission to make the change in production.

While we were waiting, we checked the other CMOD Application Groups, and it was the same story:  The fields that were the MOST important for fast searches had no database index, meaning the server was reading millions and millions of rows from the database, frantically searching for each user’s documents.
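If you want to see the difference an index makes without touching a production server, here is a small sketch using Python's built-in sqlite3 module.  SQLite is only a stand-in here, not the database behind the customer's CMOD server, and the table, column, and index names are invented for the example.  The point is just to show the query plan changing from a full table scan to an index lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table standing in for an application group's index data.
conn.execute(
    "CREATE TABLE tax_docs ("
    "doc_id INTEGER PRIMARY KEY, account_no TEXT, tax_year INTEGER)"
)
conn.executemany(
    "INSERT INTO tax_docs (account_no, tax_year) VALUES (?, ?)",
    ((f"ACCT{i:07d}", 2015 + i % 5) for i in range(50_000)),
)


def plan(sql, params):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan text


query = "SELECT doc_id FROM tax_docs WHERE account_no = ?"

# No index on account_no yet: every row has to be examined.
before = plan(query, ("ACCT0012345",))

# Add the index on the field users actually search by.
conn.execute("CREATE INDEX idx_account_no ON tax_docs (account_no)")
after = plan(query, ("ACCT0012345",))

print("before:", before)
print("after: ", after)
```

The exact wording depends on the SQLite version, but the "before" plan should report a SCAN of the whole table, and the "after" plan a SEARCH that uses idx_account_no.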

When management gave us permission to go ahead, we made the change, and I started the ‘stopwatch’ on my phone.  It took 4 minutes, and yes, the chart full of red bars looked even more angry the entire time.  But 4 minutes later, the database caught up — and suddenly the bars on the chart were 50% red, and 50% green.  Management was happy, and gave us permission to make more changes.  With the database doing less work, adding the next index took only 1 minute.  And the next was 22 seconds.  Each time we made the database faster, there was more bandwidth for us to add other indexes more quickly.

Within 4 hours, the crisis was unofficially over.  99% of the queries we sampled were being served in milliseconds, not tens of seconds.  The angry, hot graphs on the conference room wall were a nice, cool green.  People who had been in that crowded room for days packed up and left quietly.

Then I walked over to the train station, and caught the next train out…  and wrote this post.  🙂