Reach for the common denominator
Problem solving, how to get better at it, and how to teach it remain concerns of mine.
In looking at problem solving heuristics in IT the other day, I mentioned that one important rule for me is that when several problems start up at once, odds are they are related. So one of my approaches is to investigate the most transparent of the problems, and see if that leads to a general solution. But I realized while helping a client solve a problem the other day that there is a second approach I use all the time that also stems from this rule. When several problems crop up at once, look for possible common denominators and test those possibilities.
The real life scenario: At one of our YMCA clients, things seemed to be going from bad to worse over a period of several weeks. Problem one: the serial-to-ethernet converters that let all the door access-controls communicate with the server kept crashing, and would have to be reset frequently. Since this system is slated for replacement, no one worried about it too much.
A few weeks later, the web site started periodically losing its connection to the database. Some calls were placed to the hosting service to see what was up. No real results came of this.
Then a third problem started: the credit card software was timing out on a regular basis. So the users started looking into upgrading to the latest version of that program.
Three problems, three attempted solutions. What if the problems had a common cause? What might it be? Well, all of these diverse processes are driven by services on one central box - these things could all happen if the server were periodically losing its network connection. We checked the event log on the network, and indeed found that the door-control software was reporting frequent loss of network connectivity. Our hypothesis looked even better.
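The "common denominator" step can be sketched as a simple set intersection: list what each failing system depends on, then see what they all share. This is only an illustration. The dependency names below are my assumptions for the sake of the example, not an inventory of the actual site, and the function name is hypothetical.

```python
# Hypothetical dependency map: the components each failing system relies on.
# These names are illustrative assumptions, not the real site's inventory.
dependencies = {
    "door access control": {
        "serial-to-ethernet converters", "central server", "local network",
    },
    "web site database link": {
        "hosting service", "central server", "local network",
    },
    "credit card software": {
        "card software version", "central server", "local network",
    },
}


def common_denominators(deps):
    """Return the components shared by every failing system."""
    sets = list(deps.values())
    if not sets:
        return set()
    return set.intersection(*sets)


suspects = common_denominators(dependencies)
print(sorted(suspects))  # -> ['central server', 'local network']
```

Each item that survives the intersection is a hypothesis to test, not a verdict. Note that in this sketch more than one candidate survives.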
But here I fell victim to the bias of "availability". All of the tools my company developed are resident on that one server - so I assumed that the problem was located there -- it's the box I'm most familiar with at that site. So we checked the obvious things - the cable and the port on the switch -- and the less obvious, looking for a rogue process running on the box, but everything seemed fine. "It's a network problem," in my mind, became synonymous with "This server is messed up," even though we could not find anything wrong with it.
To the rescue came their network guy, who took a broader view of the network. He spent some time looking at their network traffic and found the firewall was being swamped by messages from a random PC that had possibly been taken over by a virus or botnet scam. That machine was removed from the net, and all the problems went away.
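I don't know exactly which tools he used, but the general idea of taking the broader view can be sketched as tallying traffic by source and looking for one machine that dwarfs the rest. The records below are made-up sample data standing in for a firewall log or packet capture, and the function name is hypothetical.

```python
from collections import Counter

# Made-up (source, destination) pairs standing in for firewall log entries.
# The addresses are placeholders, not data from the real site.
records = (
    [("192.168.1.57", "203.0.113.5")] * 40    # one machine flooding the firewall
    + [("192.168.1.10", "203.0.113.9")] * 3   # ordinary traffic
    + [("192.168.1.22", "203.0.113.9")] * 2
)


def top_talkers(records, n=3):
    """Count messages per source address and return the n busiest sources."""
    counts = Counter(src for src, _dst in records)
    return counts.most_common(n)


print(top_talkers(records, 1))  # -> [('192.168.1.57', 40)]
```

A source that accounts for most of the traffic is a candidate to pull off the network and watch whether the symptoms stop.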
So here's another rule of thumb - it really does help to get additional minds on the problem -- someone else will see past your blind spots.