May 1st, 2012
Jim Bahn
People often ask, “Are there any best practices that the troubleshooting experts recommend?” I asked a couple of our top services guys for their recommendations, and I’m sharing them with all of you today:
- Don’t stop looking just because you’ve removed the symptom, because if you do, you’re likely to see the same problems later. Sure, to alleviate the immediate problem, you may have to remove users or applications that are less critical, perhaps stop backups, and remove other potential bottlenecks. While this may fix the immediate problem, it often stops the underlying cause from being discovered.
- Use “real” real-time monitoring for alerts that get you in front of the issues before the application users feel the pain.
- Sometimes you have to broaden your approach beyond what the user is reporting. If you stop there, you will often miss larger issues that may affect other, slightly less latency-sensitive apps.
- As a first step for triage, try to isolate whether the cause is on the server or the SAN. Comparing your baseline Exchange Completion Time (ECT) with the ECT during the slowdown will tell you immediately where to start, and where to stop looking. Your vendors will appreciate it, too.
- Try to find the finest granularity in your historical reporting to see which event preceded another, for cause and effect. A one-minute interval is often not sufficiently granular.
- Look at your historical I/O patterns, busy times of day, multipath configurations, queue depth settings, top talkers, etc. to gain a profile of behavior. Then compare with your healthy baseline, and rule out things that haven’t changed. You might find six things that appear to be going wrong, but if only one of them seems to have occurred when the problem was reported, you can focus on that one immediately. Later on, you can go back to look at the others.
- When changes are made to fix the incident, you should get immediate feedback. Without an immediate response, customers often take one of two approaches: (1) they delay or stagger fixes until they can determine the effect of each one, or (2) they make all changes at the same time, and are then left wondering which change fixed the problem.
- Lastly, ask for help sooner rather than later. We’ve heard of problems dragging on for months, vendors getting kicked out of accounts, and literally millions of dollars wasted on adding expensive hardware. Waiting days or weeks to find the root cause of a problem is unacceptable. Bring in a performance pro.
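As a rough illustration of the triage and baseline-comparison tips above, here is a minimal Python sketch. It assumes that a clear rise in ECT over baseline points toward the SAN or storage side, and that an unchanged ECT points back at the server. The function names, the metric names, and the 25% tolerance are all hypothetical; they are not taken from any product.

```python
def triage_by_ect(baseline_ect_ms, slowdown_ect_ms, tolerance=0.25):
    """Suggest where to start looking, based on Exchange Completion Time.

    tolerance is a hypothetical threshold: the fractional rise over
    baseline that we choose to treat as significant.
    """
    if baseline_ect_ms <= 0:
        raise ValueError("baseline ECT must be positive")
    rise = (slowdown_ect_ms - baseline_ect_ms) / baseline_ect_ms
    # Assumption: ECT well above baseline implicates the SAN/storage side;
    # otherwise the server is the better place to start.
    return "SAN/storage" if rise > tolerance else "server"


def changed_metrics(baseline, current, tolerance=0.25):
    """Return only the metrics that moved by more than tolerance,
    so everything that has not changed can be ruled out."""
    changed = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None:
            continue  # metric not collected during the incident
        if base == 0:
            if now != 0:
                changed[name] = (base, now)
            continue
        if abs(now - base) / abs(base) > tolerance:
            changed[name] = (base, now)
    return changed


if __name__ == "__main__":
    print(triage_by_ect(5.0, 20.0))
    print(changed_metrics({"queue_depth": 32, "iops": 10000},
                          {"queue_depth": 32, "iops": 18000}))
```

In practice the tolerance would come from how much your healthy baseline normally varies, not from a fixed number.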
April 20th, 2012
Jim Bahn
While it’s generally accepted that SAN storage utilization is low, only a few industry luminaries, such as John Toigo, have talked about the severe underutilization of Fibre Channel (FC) SAN fabrics. The challenge, of course, is that few IT shops have actually instrumented their SANs to enable accurate measurements of fabric utilization. Instead, 100% of enterprise applications get the bandwidth that perhaps only 5% of the applications need, wasting CAPEX.
In dealing with several dozen large organizations, we have found that nearly all FC storage networks are seriously over-provisioned, with average utilization rates well below 10%. Here’s a VirtualWisdom dashboard widget (below) that shows the most heavily utilized storage ports on two storage arrays, taken from an F500 customer. The figures refer to “% utilization.”

Beyond the obvious unnecessary expense, the reality is that with such low utilization rates, simply building in more SAN hardware to address performance and availability challenges does nothing more than add complexity and increase risk. With VirtualWisdom, you can consolidate your ports, or avoid buying new ones, and track the net effect on your application latency to the millisecond. The dashboard widgets below show the “before” and “after” latency figures that resulted from the configuration changes to this SAN, using VirtualWisdom. They demonstrate a negligible effect.

Latency “before”

Latency “after”
Our most successful customers have tripled utilization and have been able to reduce future storage port purchases by 50% or more, saving $100K to $300K per new storage array.
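To make the utilization arithmetic concrete, here is a small Python sketch. The nominal per-direction data rates in the table are assumptions (roughly 100 MB/sec per Gb of FC link speed), the function names are made up for illustration, and the example port counts are invented rather than taken from the customer above.

```python
import math

# Assumed nominal per-direction data rates (MB/sec) by FC link speed (Gb).
# Check the real figures for your own hardware.
NOMINAL_MB_PER_SEC = {2: 200, 4: 400, 8: 800}


def port_utilization_pct(throughput_mb_s, link_gb):
    """Average utilization of one port as a percentage of nominal capacity."""
    return 100.0 * throughput_mb_s / NOMINAL_MB_PER_SEC[link_gb]


def ports_needed(total_mb_s, link_gb, target_pct):
    """Smallest port count that keeps average utilization at or below target_pct."""
    usable_per_port = NOMINAL_MB_PER_SEC[link_gb] * target_pct / 100.0
    return max(1, math.ceil(total_mb_s / usable_per_port))


if __name__ == "__main__":
    # Hypothetical example: 16 ports at 8 Gb, each averaging 40 MB/sec.
    print(port_utilization_pct(40, 8))      # 5.0
    print(ports_needed(16 * 40, 8, 15))     # 6
```

In this made-up case, sixteen ports running at 5% could be served by six ports at a 15% target. Averages hide peaks, so any real consolidation should be checked against measured latency, as described above.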
For a more detailed discussion of SAN over-provisioning, click here, or check out this ten-minute video discussing this issue and over-tiering.
April 13th, 2012
Jim Bahn
It’s no secret that many enterprise mission-critical IT implementations depend on SAP. In 2008, the Standish Group estimated the average cost of ERP downtime at $888K per hour. If you’re an SAP user, you probably have some idea of your cost of downtime.
What’s surprising to me is that often companies still rely on massive over-provisioning to handle the database growth and ensure that their infrastructure can meet the level of performance and availability required for informal or formal Service Level Agreements. On one level, it’s understandable, because the stakes are so high. But we’re starting to see a trend towards better instrumentation and monitoring, because, while the stakes are high, so are the costs.
The truth is, the performance of SAP is usually not bottlenecked by server-side issues, but rather by I/O issues. Unfortunately, most of today’s monitoring solutions, including the best known APM solutions, have a tough time correlating your applications with your infrastructure. The “link” between the application and the infrastructure is often inferred, or is so high level that deriving actual cause and effect is still a guessing game.
Many of our largest customers de-risk their SAP applications using VirtualWisdom to directly correlate the infrastructure latency to their application instances. In this simple dashboard widget (below), an application owner tracks, in real time, the application latency, in milliseconds, caused by the SAN infrastructure.

With this level of tracking and correlation, many of the largest SAP and VirtualWisdom customers have successfully de-risked their growing, mission-critical SAP deployments.
To hear our Director of Solutions Consulting Alex D’Anna discuss this issue in more detail, I encourage you to attend his 35-minute On-Demand webcast.
March 22nd, 2012
Jim Bahn
We get a lot of questions about VDI (virtual desktop infrastructure, or interface). By now, the benefits of VDI are pretty well understood. Despite the benefits and potential OPEX and CAPEX savings, businesses are still averse to its adoption, due to the common problem called the “boot storm.”
Boot storms are severe slowdowns that occur when a large number of end users log into their systems at the same time, typically in the morning when everyone starts work. This causes intense, concentrated storage I/O, leaving users with virtual desktops so slow that they can become almost unusable. To solve this issue, many vendors suggest investing in expensive SSDs. So much for saving money … one of the big reasons for VDI in the first place.
We’ve found that insight into the SAN fabric and the end-to-end I/O profiles of your VDI deployment can help you ensure adequate desktop performance, even during peak times, by balancing out the load and eliminating any possible physical layer issues.
VDI servers have special I/O profiles, but like all other application servers, they need a single view of the entire infrastructure for performance monitoring and analysis. In this example dashboard, the administrator can see performance metrics as well as physical layer metrics, which together offer a way to watch for indications of performance problems in the VDI environment.
The custom VirtualWisdom dashboard below shows an end-to-end view of a VDI deployment that incorporates a view of the SAN network. On the left-hand side, we have a view of the throughput and demand of the physical servers, enabling us to immediately identify and correct any imbalances that may exist. In the center, we have metrics that highlight any potential physical layer issues or problems that may be occurring from HBA to switch port to storage port. This allows us to proactively eliminate any potential I/O slowdowns. On the right-hand side, we have a view of the storage infrastructure and how the demand from VDI is affecting the storage ports. This allows us to balance out the I/O load across the correct storage ports, identifying and eliminating any congestion or slowdowns.

This customized VDI infrastructure dashboard also enables us to monitor the centralized desktop backups, ensuring that these are not only successful and timely but also do not affect the rest of the company’s production environment.
Furthermore, with outsourcers and many companies having international staff, boot storms can occur at many different times of the day. Using VirtualWisdom’s unique playback facility, it’s easy to historically trend such throughput and I/O profiles to enable a safe, stable and cost-effective VDI investment and deployment.
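The idea of spotting load imbalance across storage ports can be sketched in a few lines of Python. This is only an illustration: the port names, the throughput numbers, and the 1.5x threshold are hypothetical, and this is not how the dashboard itself computes anything.

```python
from statistics import mean


def imbalance_ratio(port_mb_s):
    """Ratio of the busiest port's throughput to the average across ports."""
    values = list(port_mb_s.values())
    avg = mean(values)
    if avg == 0:
        return 1.0  # no traffic at all, so nothing is out of balance
    return max(values) / avg


def hot_ports(port_mb_s, factor=1.5):
    """Ports carrying more than `factor` times the average load."""
    avg = mean(list(port_mb_s.values()))
    return sorted(name for name, v in port_mb_s.items()
                  if avg > 0 and v > factor * avg)


if __name__ == "__main__":
    # Hypothetical throughput per storage port during a boot storm (MB/sec).
    sample = {"A": 300, "B": 100, "C": 100, "D": 100}
    print(imbalance_ratio(sample))  # 2.0
    print(hot_ports(sample))        # ['A']
```

Running the same check over samples from different times of day would show whether an imbalance only appears during login peaks.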
March 14th, 2012
Jim Bahn
People often don’t understand why their performance monitors don’t help to either predict or find performance problems. Well, the answer to that could take a book, but a simple first step is understanding what IOPS is telling you, and why, in an FC SAN, you need to look at frames per second.
I/Os per second, or IOPS, is commonly recognized as a standard measurement of performance, whether to measure a storage array’s back-end drives or the performance of the SAN. IOPS varies with a number of factors, including a system’s balance of read and write operations; whether the traffic is sequential, random, or mixed; the storage drivers; the OS background operations; and even the I/O block size.
Block size is usually determined by the application, with different applications using different block sizes for various circumstances. For example, Oracle will typically use block sizes of 2 KB or 4 KB for online transaction processing, and larger block sizes of 8 KB, 16 KB, or 32 KB, for decision support system workload environments. Exchange 2007 may use an 8 KB block size, SQL may use a minimum of 8 KB, and SAP may use 64 KB, or even more.
In addition, when IOPS is used as a measurement of performance, it’s standard practice to look at throughput (that is to say, MB/sec) as well, because the two have different impacts on performance. For example, an application with only 100 MB/sec of throughput but 20,000 IOPS may not cause bandwidth issues, but with so many small commands, the storage array is put under significant pressure, as its front-end and back-end processors have an immense workload to deal with. Alternatively, if an application has a low number of IOPS but significant throughput, such as long sustained reads, then the pressure falls on the bandwidth of the SAN links. Even with this relationship understood, MB/sec and IOPS are still insufficient measures of performance if you don’t take frames per second into consideration.
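The relationship between IOPS and throughput is simple arithmetic, and a short Python sketch makes the example above concrete. It assumes 1 MB = 1024 KB; the function names and the 500 IOPS / 256 KB workload are just for illustration.

```python
def avg_io_size_kb(throughput_mb_s, iops):
    """Average I/O size implied by a given throughput and IOPS figure."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return throughput_mb_s * 1024 / iops


def throughput_mb_s(iops, block_kb):
    """Throughput produced by a given IOPS rate at a fixed block size."""
    return iops * block_kb / 1024


if __name__ == "__main__":
    # The example from the text: 100 MB/sec at 20,000 IOPS.
    print(avg_io_size_kb(100, 20000))   # about 5 KB per I/O
    # A hypothetical large-block workload: few I/Os, lots of bandwidth.
    print(throughput_mb_s(500, 256))    # 125.0 MB/sec
```

The first case is many small commands, which load the array’s processors; the second is few large ones, which load the links.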
Why is this? Let’s look at the FC frame. A standard FC frame has a data payload of approximately 2 KB. So if an application has an 8 KB I/O block size, it needs 4 FC frames to carry that data; in this instance, one I/O is 4 frames. Looking at IOPS alone is therefore not sufficient to get a true picture of utilization, because I/O sizes differ widely between applications, ranging from 2 KB to as much as 256 KB.
Looking at a metric such as the ratio of MB/sec to frames/sec, as displayed in this VirtualWisdom dashboard widget, we get a better picture and understanding of the environment and its performance. On this graph, the line should never fall below 0.2 on the y-axis, the level that corresponds to the 2 KB data payload.
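The frames-per-I/O arithmetic can be written out as a small Python sketch. It uses the approximate 2 KB payload from the text as a round number and counts data frames only; command, response, and other control frames are ignored, so real frame counts will be somewhat higher. The names are illustrative.

```python
import math

# Approximate data payload per FC frame, using the round 2 KB figure
# from the text (an approximation, not an exact protocol constant).
FRAME_PAYLOAD_BYTES = 2 * 1024


def frames_per_io(block_bytes, payload_bytes=FRAME_PAYLOAD_BYTES):
    """Number of data frames needed to carry one I/O of the given size."""
    return math.ceil(block_bytes / payload_bytes)


def data_frames_per_sec(iops, block_bytes):
    """Data frames per second generated by a workload with a fixed block size."""
    return iops * frames_per_io(block_bytes)


if __name__ == "__main__":
    print(frames_per_io(8 * 1024))               # 4
    print(frames_per_io(256 * 1024))             # 128
    print(data_frames_per_sec(20000, 8 * 1024))  # 80000
```

The same IOPS figure can therefore mean very different frame rates, depending on the block size.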

If the ratio falls below this, say at the 0.1 level, as in the widget below, we know that data is not being passed efficiently despite the throughput being maintained, as measured in MB/sec.

This enables you to proactively identify if there are a number of management frames being passed instead of data, as they are busily reporting on the physical device errors that are occurring.
Without taking frames per second into consideration and having an insight into this ratio to MB/s, it’s easy to believe that everything is OK and that data is being passed efficiently, since you see lots of traffic. However, in actuality, all you might be seeing are management frames reporting a problem. By ignoring frames per second, you run the risk of needlessly prolonging troubleshooting and increasing OPEX costs, simply by failing to identify the root cause of the performance degradation of your critical applications.
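Here is a minimal Python sketch of the ratio check described above. The text does not spell out how the widget scales its y-axis, so this sketch works in KB of payload per frame instead. The expected 2 KB payload comes from the text; the 0.75 cut-off and the function names are hypothetical.

```python
def avg_payload_kb_per_frame(throughput_mb_s, frames_per_sec):
    """Average data carried per frame, in KB (assuming 1 MB = 1024 KB)."""
    if frames_per_sec <= 0:
        raise ValueError("frames_per_sec must be positive")
    return throughput_mb_s * 1024 / frames_per_sec


def looks_inefficient(throughput_mb_s, frames_per_sec,
                      expected_kb=2.0, floor_fraction=0.75):
    """Flag traffic whose average payload per frame is well below expected.

    floor_fraction is a hypothetical cut-off; tune it against your own
    healthy baseline.
    """
    payload = avg_payload_kb_per_frame(throughput_mb_s, frames_per_sec)
    return payload < expected_kb * floor_fraction


if __name__ == "__main__":
    # Same 100 MB/sec of throughput, very different frame counts.
    print(looks_inefficient(100, 51200))    # False: about 2 KB per frame
    print(looks_inefficient(100, 102400))   # True: about 1 KB per frame
```

In the second case throughput looks unchanged, but twice as many frames are needed to deliver it, which is the pattern to investigate.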
For a more complete explanation, and an example of how this applies to identifying slow-draining devices, check out this short video.