Repercussions of Being Leading Edge

Only a couple of days before go-live on an ISIM upgrade, I ran into a bug introduced in what was then IBM’s latest fixpack.  Unfortunately for my team, we had to move to that fixpack because it resolved some other issues we were facing.  Still, I knew that being on the leading edge carried a big risk.

For those curious about the technical details, using IdentityPolicy.userIDExists within an Identity Policy is broken in ISIM 6.0.0.3 (FP3).  Rumor has it that a code change intended to improve performance is to blame.  The userIDExists function now searches for user IDs only on accounts owned by the identity in question.  It is supposed to search for user IDs on all accounts, so that no two accounts end up with the same user ID.  As a result, duplicate user IDs were created within the system, which would have been a security nightmare had it not been discovered before the upgrade went into production.
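To make the difference concrete, here is a toy model of the two search behaviors.  This is not ISIM code and does not use the ISIM API; the Account class, the sample data, and every function name are made up for illustration.  It only shows why an existence check scoped to one owner lets a duplicate slip through.

```python
# Toy model of the behavior described above; NOT ISIM code.
# The Account class, sample data, and function names are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    owner: str    # identity that owns the account
    user_id: str  # user ID on the managed service


ACCOUNTS = [
    Account(owner="alice", user_id="asmith"),
    Account(owner="bob", user_id="bjones"),
]


def user_id_exists_global(user_id, accounts):
    """Expected behavior: match the user ID on any account."""
    return any(a.user_id == user_id for a in accounts)


def user_id_exists_owner_only(user_id, owner, accounts):
    """Reported FP3 behavior: match only on accounts owned by `owner`."""
    return any(a.user_id == user_id and a.owner == owner for a in accounts)


def generate_user_id(base, exists):
    """Append a counter to `base` until `exists` reports the candidate as free."""
    candidate, n = base, 1
    while exists(candidate):
        candidate = f"{base}{n}"
        n += 1
    return candidate


# A new identity "amy" whose preferred ID collides with alice's "asmith".
good = generate_user_id(
    "asmith", lambda uid: user_id_exists_global(uid, ACCOUNTS))
bad = generate_user_id(
    "asmith", lambda uid: user_id_exists_owner_only(uid, "amy", ACCOUNTS))

print(good)  # asmith1 (collision detected, counter appended)
print(bad)   # asmith  (collision missed, duplicate user ID issued)
```

With the global check the generator steps past the taken ID.  With the owner-scoped check it hands out an ID that already belongs to someone else, which is the failure I saw.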

Because the fixpack was so new, not many of IBM’s customers had upgraded.  This meant I was left to discover the new bug myself.  After two days of troubleshooting, including combing through literally millions of lines of debug output, I was finally able to show IBM exactly what had broken.  They immediately confirmed the issue and have since opened a ticket on their side to have the code corrected.  Unfortunately this caused the emergency brake to be pulled on our scheduled upgrade, which ultimately makes me (the integrator) and my team look bad.

Many years ago, I encountered a similar issue on a Sonicwall firewall with IDS/IPS services enabled.  I had the IPS services configured to download the latest updates as soon as they were available.  In the middle of a business day, Internet browsing suddenly stopped working for everyone.  The IPS began sending me thousands of alerts, one for each time a user attempted to access a webpage.  Sonicwall had released a bad rule that matched all normal HTTP traffic.  It took me only a moment to disable the rule.  Still, I was thankful I was sitting at my desk and not on the beach at the time.

There is a big risk in being first when it comes to upgrades (patches and major versions alike).  Think about that every time your Windows workstation applies the latest patches just released by Microsoft.  If something is wrong with those patches, you’ll be one of the first to discover the problem, and that could cause you downtime.  This may not matter much on a home computer, but in a business it could cause significant downtime, especially if hundreds of workstations and/or servers are involved.

When it comes to patching, I recommend using some sort of patch management tool such as WSUS, SMS, RHEL Satellite Server, or one of the countless other tools designed specifically for your environment.  These tools allow you to schedule updates at your business’s pace.  It is best to create three primary groups (alpha, beta, and production) and schedule updates at a different time for each group.  If you can’t afford one of these solutions, there are countless examples of how to set up free “home grown” solutions that do the same job.

Place a few of your most tech-savvy administrators in the alpha group.  Always release updates to this group first, and wait at least half a day before releasing updates to the beta group.  The beta group should consist of a couple of tech-savvy users from each major department within your organization.  Never release updates to production until they have been tested by beta users for at least 24 hours, preferably 48 hours.  Always communicate with each group when you are releasing updates, and make it clear to your alpha and beta users that you need them to test and provide feedback by a certain time.
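The timing rules above can be sketched as a small schedule calculator.  This is only an illustration, not part of any patch management tool.  The function and constant names are made up, "half a day" is assumed to mean 12 hours, and the start date is an arbitrary example.

```python
# Sketch of the alpha -> beta -> production timing described above.
# Names are hypothetical; "half a day" is assumed to mean 12 hours.
from datetime import datetime, timedelta

MIN_ALPHA_SOAK = timedelta(hours=12)  # wait before releasing to beta
MIN_BETA_SOAK = timedelta(hours=24)   # minimum beta testing time
PREFERRED_BETA_SOAK = timedelta(hours=48)


def rollout_schedule(alpha_release,
                     alpha_soak=MIN_ALPHA_SOAK,
                     beta_soak=PREFERRED_BETA_SOAK):
    """Return the earliest release time for each group."""
    if alpha_soak < MIN_ALPHA_SOAK:
        raise ValueError("alpha group needs at least half a day")
    if beta_soak < MIN_BETA_SOAK:
        raise ValueError("beta group needs at least 24 hours")
    beta_release = alpha_release + alpha_soak
    production_release = beta_release + beta_soak
    return {
        "alpha": alpha_release,
        "beta": beta_release,
        "production": production_release,
    }


# Example: patches go to the alpha group at 09:00 on an arbitrary date.
schedule = rollout_schedule(datetime(2024, 1, 8, 9, 0))
for group, when in schedule.items():
    print(f"{group:<10} {when:%Y-%m-%d %H:%M}")
```

The dates it produces are also the feedback deadlines to communicate to each group: alpha users need to report back before the beta release time, and beta users before the production release time.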

This process must be tweaked slightly when applying patches to enterprise systems that multiple users access.  In these environments, you should duplicate the system so that there is an alpha, a beta, and a production instance (aka development, testing, and production).  The users in your alpha and beta groups should have access to the alpha and beta instances so that they can perform testing for you.  Unfortunately, because production data will seldom exist on development and testing systems, you will have to assume some risk when you go to production with any patch.

In the end, by allowing only your alpha users to be on the leading edge, you will reduce the impact of bad patches across your organization.  While bad patches don’t happen often, they do happen, and they do cause major headaches, and the blame ultimately points back to you for releasing them.