Why IPM? Avoiding the Crash

By Brett Lightfoot, Regional Director, APAC

In the high-tech world of supercars, I am sure many people have heard of the “most expensive car crash”, in which eight Ferraris, one Lamborghini Diablo, a Nissan GT-R Skyline, three Mercedes-Benzes and one Toyota Prius were involved in a 14-car pile-up in Japan. The repair bill alone was estimated at $4,000,000, before counting the cost of road crews, emergency workers and so on.

[Image: wreckage clean-up after the most expensive car crash]

Multiple stories have been put forward about cause and effect, ranging from the driver of a single high-performance car losing control to the poor Prius driver simply being the spanner in the works while every other car was running at 90mph.

This got me thinking: how does this relate to your data center?

A data center exists to help deliver a reliable and stable customer experience for your brand. But it is only as strong as its weakest link. To eliminate the weakest link or bottleneck, you must be able to identify, control and correct deviations from expected behaviour.

Across the application delivery chain from user to storage, monitoring, correlating and taking the right corrective action has never been more complex. It is far beyond any unassisted human ability, especially when changes such as new VMs, network paths or storage can be spun up and down quickly without anyone understanding the consequences. That is potentially a car wreck waiting to happen.

How about legacy monitoring tools?

The many Application Performance Monitoring (APM) products have matured to the point that user experience is widely understood. There are also many monitoring tools, often packaged by vendors, that look adequately at the vendor's own data center component; a few tools that look at several different aspects; and a number of log file analysers.

Is that enough to find weak data center infrastructure links?

Not so much. First, you can’t assume that infrastructure pieces work in isolation. As an example, CPU or disk cache utilisation could be running high, but is it causing slow response times? It could simply mean that you are running at high density, and it could be a complete red herring when your application runs slower than expected. Second, even if you run a log file analyser in ‘verbose’ mode (not recommended by most vendors), it assumes you know what you are looking for. Otherwise you may have to resort to averaging and sampling, which can mask the real behaviour and the outliers. And imagine how difficult it is to look at two or more sets of logs, even graphically, and try to understand what relates to what.
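To make the point about averaging and sampling concrete, here is a minimal Python sketch. The latency figures are invented for illustration and do not come from any real environment or product; the function name `percentile` is likewise just a helper written for this example.

```python
import statistics

# Hypothetical per-I/O latencies in milliseconds:
# 980 fast I/Os followed by a burst of 20 very slow ones.
latencies_ms = [1.0] * 980 + [250.0] * 20


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


mean_ms = statistics.mean(latencies_ms)    # the average hides the burst
p99_ms = percentile(latencies_ms, 99)      # the tail exposes it
sampled = latencies_ms[::100]              # keep only every 100th measurement
sampled_worst_ms = max(sampled)            # this sample never sees a slow I/O

print(f"mean={mean_ms:.2f} ms, p99={p99_ms} ms, worst in sample={sampled_worst_ms} ms")
```

In this made-up data the average sits under 6 ms and the one-in-a-hundred sample reports nothing worse than 1 ms, while the 99th percentile is 250 ms. The slow I/Os are exactly what the user feels, and exactly what the summary view loses.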

From a data center viewpoint, SAN and NAS environments represent huge complexity. It’s not just about reading and writing data, whether it’s file, block or object, over a specific protocol, but about how your compute, network and storage interact. Each has its own special characteristics based on the profile of the application workload. For instance, common issues that are not visible to APMs or most monitoring tools include queue depths not properly set for the workload, noisy neighbours unexpectedly consuming and congesting resource capacity (remember the Prius), physical layer issues such as Class 3 discards, imbalances in workload from initiator to Target/LUN … and the list goes on.
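As one illustration of the last item, a workload imbalance between initiator and target paths can be spotted by comparing each path's share of traffic with an even split. The Python sketch below is only that, a sketch: the path names, the I/O counts and the 50% tolerance are all assumptions made up for this example, and it is not how VirtualWisdom or any other product does it.

```python
# Hypothetical I/O counts per (initiator, target) path over one interval.
path_iops = {
    ("hba0", "array_port_a"): 9200,
    ("hba0", "array_port_b"): 400,
    ("hba1", "array_port_a"): 350,
    ("hba1", "array_port_b"): 50,
}


def overloaded_paths(iops_by_path, tolerance=0.5):
    """Return paths carrying more than (1 + tolerance) times an even share of the I/O."""
    total = sum(iops_by_path.values())
    if total == 0:
        return []  # no traffic, so nothing can be imbalanced
    fair_share = 1 / len(iops_by_path)
    limit = fair_share * (1 + tolerance)
    return [path for path, iops in iops_by_path.items() if iops / total > limit]


print(overloaded_paths(path_iops))
```

With these invented numbers, one path carries 92% of the I/O against an even share of 25%, so it is the only one flagged. A per-device utilisation view could look healthy on every component and still miss this.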

At Virtual Instruments, we address these issues.  Our VirtualWisdom platform:

  1. Is always on, with no sampling, no masking of issues. It’s out-of-band, real-time, with no agents to add to resource contention
  2. Complements APMs by focusing on how the infrastructure is tuned to support application performance
  3. Associates the underlying infrastructure to applications through dynamic topologies and correlation
  4. Provides analytics on how to tune the infrastructure to avoid the road crash, using built-in machine learning based on hundreds of man-years of experience, with answers on what you should be doing while “on the road”
  5. Avoids ‘performance’ over-provisioning, which consumes valuable financial resources without anyone knowing whether it delivers any benefit or meaningful risk reduction

Virtual Instruments is vital to our customers because it brings all of this together in an area where massive resources have already been spent. Typically, our customers start with a Health Assessment to find out what they don’t know and how they may be able to do things better, quicker and more cost-effectively. This has helped more than a few of them avoid our “I am on fire” Emergency Service engagement, AKA the ‘crash’.