• SAN Best Practices
January 17th, 2012

Solving Intermittent Slowdowns – The Hard Way or the Easy Way

Jim Bahn

Solving intermittent slowdown problems is a challenge.  Actual SAN problems are often intermittent and difficult to reproduce, and unfortunately there is no one “cause” for the issue.  In addition, for the impact to manifest itself, all of the contributing issues in the SAN have to occur at the same time, in the same sequence.

Reported by server tools like iostat or perfmon, slowdowns are attributed to “slow I/O”, which is like saying that your cough is causing your cold.  Slow I/O is just a symptom, and it can be caused by any number of things, including, as surprising as it seems, poorly configured servers or applications.  However, the SAN is usually the first to get blamed.

The typical scenario goes something like this … an application begins to demonstrate intermittent slow behavior, and the server or application team does some basic investigation, discovering that some metric, like IOPS or MB/s, has degraded.  The assumption: that the problem is with the I/O subsystem – the SAN.  The issue is reported and the SAN team then spends hours poring over the SAN configuration (zoning, mapping, masking, etc.) before resorting to the wait-and-see method, i.e., waiting for the problem to reoccur.  While waiting, the SAN team has to monitor the full I/O path, with the right instrumentation tools.  Then, when the problem finally reoccurs, the existing tools often only report the same thing that was already known … that IOPS or MB/s dropped.

Next, the team places a call in to the SAN component vendors, who each ask for several things:

  • A detailed description of the problem
  • An estimate of when the problem occurred
  • Log file dumps from each proprietary monitoring tool (taken when the problem reoccurs yet again)
  • Confirmation that the timestamps from the component service processors and from the servers are noted
  • GetConfig from affected servers
  • Diagrams of the infrastructure
  • For UNIX hosts, output from iostat and sar
  • For Windows hosts, Performance Monitor output for all PhysicalDisk objects
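
The timestamp item in the list above matters because each component keeps its own clock. As a minimal sketch of what the vendor has to do with those logs, the Python below shifts events from several sources onto one reference clock and then sorts them. The component names, clock offsets, and log messages are all hypothetical and chosen only for illustration; they do not come from any real tool or log format.

```python
from datetime import datetime, timedelta

# Hypothetical offsets: how far each component's clock runs AHEAD of the
# reference clock (a negative value means it runs behind).
CLOCK_OFFSETS = {
    "server": timedelta(seconds=0),
    "switch": timedelta(seconds=42),
    "array": timedelta(seconds=-15),
}


def to_reference_time(source, local_timestamp):
    """Convert a component's local timestamp to the reference clock."""
    return local_timestamp - CLOCK_OFFSETS[source]


def merge_events(events):
    """Return (reference_time, source, message) tuples in true time order.

    `events` is a list of (source, local_timestamp, message) tuples.
    """
    corrected = [
        (to_reference_time(source, ts), source, message)
        for source, ts, message in events
    ]
    return sorted(corrected)


# Hypothetical events, stamped with each component's own clock.
EVENTS = [
    ("array", datetime(2012, 1, 17, 9, 59, 44), "write cache nearly full"),
    ("server", datetime(2012, 1, 17, 10, 0, 5), "I/O latency rises"),
    ("switch", datetime(2012, 1, 17, 10, 0, 44), "port congestion reported"),
]

for when, source, message in merge_events(EVENTS):
    print(when.time(), source, message)
```

Sorted by raw timestamps, the switch event appears to come last, after the server saw latency rise. Once the assumed offsets are applied, it lands before the server event, which changes the cause-and-effect reading entirely.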

Keep in mind that sometimes it’s easy to reproduce a problem, but often it is not, and we know of several instances when the problem occurred on a monthly basis. With luck, the problem reoccurs soon and the logs capture all the events in a granular-enough order, with everything time-stamped, so that someone back at the vendor’s HQ can guesstimate the cause and effect of the slowdown. Following this, there is usually another call for more metrics, with more care taken to provide the time differences of the various component clocks. This is done because without proper sequencing of events, it’s very hard to know what to blame (which is why your SAN tools, which only provide 5- to 15-minute averages, don’t work). Then, you wait for another reoccurrence.
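
To see why coarse averages hide the problem, consider a rough sketch in Python. The latency figures are invented for illustration: five minutes of per-second samples at a steady 5 ms, with a single 10-second burst at 200 ms. The helper `window_averages` is a hypothetical name, not part of any monitoring product.

```python
from statistics import mean


def window_averages(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [
        mean(samples[i:i + window])
        for i in range(0, len(samples), window)
    ]


# Hypothetical per-second latency in ms: 300 seconds at 5 ms,
# with a 10-second burst at 200 ms starting at second 120.
latency_ms = [5.0] * 300
latency_ms[120:130] = [200.0] * 10

five_minute = window_averages(latency_ms, 300)
one_minute = window_averages(latency_ms, 60)

print(max(latency_ms))  # 200.0
print(five_minute)      # [11.5]
print(one_minute)       # [5.0, 5.0, 37.5, 5.0, 5.0]
```

In this made-up data set, a burst that is 40 times the normal latency shows up as 11.5 ms in the 5-minute average, which is easy to dismiss. The 1-minute averages at least point to the minute in which it happened, although they still understate the peak.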

In the meantime, one vendor points to another vendor’s components that are known to cause problems.  And no one on the server or application side is doing their own investigation because the SAN is typically assumed to be guilty.  Eventually, the customer demands that all the vendors come onsite to arm-wrestle each other.  Finally, someone suggests buying extra links or ports or HBAs or disks, or upgrading hardware to bigger servers, new storage arrays with SSDs, or completely refreshing your SAN to newer models, because they claim that the problem is related to the infrastructure not being able to cope with the performance demands. So more hardware is procured, and sometimes the problem goes away for a few weeks or maybe even months. Or it doesn’t. Ouch.

After all the terror described above, believe it or not, there’s another way to handle this. VirtualWisdom’s historical trend reports allow the IT staff to diagnose reported problems without waiting for the problem to occur again.  Fire up VirtualWisdom, move the dashboard slider back to when the problem occurred, and view the playback in time-correlated 1-minute increments.  VirtualWisdom then leads the admins to where the bottlenecks occurred. No log file parsing, no calls to vendors, no finger pointing, and no big purchase orders for equipment you don’t need.  If you’d like to see a 2-minute video of how this works, click here.