• SAN Best Practices
February 7th, 2012

Find Application Performance Bottlenecks – Before Buying SSD

Jim Bahn

When your enterprise application user cries for more performance, the easiest route is often simply to buy more SAN hardware. Even if it doesn’t help, it shows action and commitment. Sometimes, the storage vendor will recommend SSDs. People look to SSDs for several reasons: faster response time, and higher IOPS with a smaller footprint and lower power consumption. Your vendor might suggest SSDs as a way to resolve performance issues by creating a higher storage tier. However, many very serious performance issues we have seen were caused by configuration issues such as queue depth settings and fan-in/fan-out ratios, or by bad cables or dirty connections. Many of the bottlenecks we see are a function of the type of company. With smaller, fast-growing companies, we often find bottlenecks in the Inter-Switch Links (ISLs). This is because the final size and scale of the SAN (just prior to a technology refresh) is not known when they start planning the ISL bandwidths. Larger companies often have bottlenecks at either the HBA or the storage ports, but have usually gone through at least one technology refresh, and have learned to properly allocate ISL bandwidth. Or, they over-allocate so they never run into that bottleneck again.

Before deploying expensive SSDs, our recommended best practice is to verify your current infrastructure with a product like VirtualWisdom, and make sure that current performance issues are actually caused by limitations in the storage and not by bad configuration, physical layer problems, or host or application issues. As one of our other bloggers pointed out recently, during a Health Check service at a large telecom company, we discovered some damaged cables that, if replaced, would fix most of the performance problems in their data warehouse. The customer had spent over a million dollars on new hardware to shave latency by 10 milliseconds before he realized that his problem was as simple as bad cables. Talk about throwing money at the problem!

February 2nd, 2012

Finding Application Performance Bottlenecks – Queue Depths

Jim Bahn

It has been well understood for quite some time that when optimizing application performance in large enterprise environments, the biggest boosts come from the I/O subsystem – the SAN. Application latency can come from CPUs and memory, but optimizing these may only buy you microseconds. I/O is measured in milliseconds, and that’s where your biggest bottlenecks are, and your biggest opportunities. But “I/O subsystem” means much more than just disk seek time. For example, over the past few years, we have noticed that customers very often set their HBA queue depths too high, and we’ve yet to find someone who sets them too low. For an example of the effect of queue depth settings, check out our January 4, 2012 blog on the subject. Since this is a “best practices” blog, and as we specialize in I/O monitoring and measurement, we wanted to share our thoughts. Before you spend a wad of CAPEX on SSD memory or storage ports, or anything else for that matter, assume that your HBA queue depth settings are too high, and test for better settings. And as we showed you in the January 4th blog, VirtualWisdom is a great way to see, in real time, the effect of your changes.

Since queue depth is per LUN, you have to take the server configuration into account. The optimal queue depth for a server depends on the number of disks behind its LUNs, and the drive configuration. For instance, Enterprise Flash Drive configurations require a large number of queued I/Os to get maximum throughput from the drives. A LUN with 200 drives will need significantly more queued workload than a LUN with 5 drives to keep every drive busy. It’s impossible to recommend a single queue depth policy for an entire environment, and each instance must be evaluated separately, but as stated above, the vast majority of customers we work with set them too high. It’s safe to assume that if you selectively lower your queue depths, you’ll reduce the I/O latency of many of your applications without spending a dime on additional hardware.
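One rough way to sanity-check a queue depth setting is to compare the worst-case number of I/Os that hosts could have outstanding against what a single storage port can queue. The sketch below is illustrative arithmetic only: the port queue limit of 2048 and the host and LUN counts are hypothetical values, not figures from any particular array, and the function names are ours.

```python
def max_outstanding_ios(hosts: int, luns_per_host: int, lun_queue_depth: int) -> int:
    """Worst-case number of I/Os that could be queued at one storage port."""
    return hosts * luns_per_host * lun_queue_depth


def max_safe_queue_depth(port_queue_limit: int, hosts: int, luns_per_host: int) -> int:
    """Largest per-LUN queue depth that cannot overrun the port queue."""
    return port_queue_limit // (hosts * luns_per_host)


# Hypothetical environment: 20 hosts, each with 8 LUNs behind one storage port.
PORT_QUEUE_LIMIT = 2048  # assumed value; check your array documentation

demand = max_outstanding_ios(hosts=20, luns_per_host=8, lun_queue_depth=32)
print(demand, demand > PORT_QUEUE_LIMIT)               # 5120 True
print(max_safe_queue_depth(PORT_QUEUE_LIMIT, 20, 8))   # 12
```

A result like this is only a starting point for testing. As noted above, the right setting also depends on the drives behind each LUN, so measure the effect of any change.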

January 27th, 2012

Finding a slow draining device

Ron Lee

In troubleshooting SAN problems, we hear the term “slow-draining device” a lot, as it can cover a lot of issues. I would like to show how we can find a slow-draining device when the underlying problem is the loss of Buffer-to-Buffer Credit. We will start with a switch-level view of the data, then look at things in more detail using our VirtualWisdom platform.

Our scenario starts out with a trouble report of frequent LinkReset events on a link, as shown in the VirtualWisdom Dashboard here:

Since VirtualWisdom records the events in your infrastructure, it is very easy for you to pull up the data at the time of the trouble report. You see the series of small green diamonds indicating LinkReset events. With every LinkReset event, the system also detected Class3 Discards. This indicates that the switch could not forward frames to the intended destination and dropped those frames without notification to the sender or receiver. This situation places extra load on the fabric as the device attempts to recover from the missing frames or, worse, data corruption can occur due to missing data. At this point, we are looking at data at the one-hour summary level. If we need to see more detail, we can switch to a five-minute summary, which looks like:

The next thing to look at in this scenario is the VirtualWisdom ProbeFC8 Event Trend dashboard, which shows:

This dashboard is most interesting in that it shows there are no physical fiber layer problems. There are no CRC errors, Loss of Signal, Code Violations, etc. So we can quickly rule out any physical fiber problems.

Still looking at the five-minute summary level, we then decide to look at the Buffer-to-Buffer Credit parameter to see if there is any problem. So far, we have not seen any activity that correlates with our LinkReset problem. We open that dashboard and see:

This display is typical of many SAN monitoring and performance tools. It shows the data with a five-minute sample. At the bottom, we see our LinkReset events, but the Buffer-to-Buffer Credit data shows no pattern or correlation with them. However, since we are using the hardware-based VirtualWisdom SAN Performance Probe, we can go down to a one-minute summary. When we look at the same time frame using one-minute summaries, we see:

Here the correlation stands out very clearly. At the top, we see the value of the Buffer-to-Buffer credit parameter. As it decays low enough, the system responds with a LinkReset. We see this pattern happening over and over again. Now that we have this information, we can report the problem to operations and it can be fixed.

So let’s review what I just went through. We started with a report of a series of LinkReset events. We were able to easily find the problem in the data recorded by VirtualWisdom. We were able to quickly rule out any physical fiber problems. A look at the data at the typical five-minute summary still did not show any problems. However, when we looked at data at a one-minute summary, we saw the correlation between Buffer-to-Buffer credit loss and the LinkReset. This allowed us to report the specific problem to operations for resolution.
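To see why the summary interval matters, here is a minimal sketch using made-up numbers. It assumes the summary is a simple average of the samples in each window, which may not be how any given tool computes it; the credit values and the threshold are hypothetical.

```python
# Synthetic one-minute samples of remaining buffer-to-buffer credits (made up).
one_minute = [16, 16, 15, 2, 16,
              16, 16, 16, 16, 16,
              16, 1, 16, 16, 16]

THRESHOLD = 4  # hypothetical "credits nearly exhausted" level


def summarize(samples, window):
    """Average consecutive samples into fixed-size windows."""
    chunks = [samples[i:i + window] for i in range(0, len(samples), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


five_minute = summarize(one_minute, 5)
dips = [minute for minute, value in enumerate(one_minute) if value <= THRESHOLD]

print(five_minute)                                   # [13.0, 16.0, 13.0]
print([v for v in five_minute if v <= THRESHOLD])    # []
print(dips)                                          # [3, 11]
```

In the five-minute view, nothing crosses the threshold, while the one-minute view shows two distinct dips that can be lined up against the LinkReset events.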

 

January 17th, 2012

Solving Intermittent Slowdowns – The Hard Way or the Easy Way

Jim Bahn

Solving intermittent slowdown problems is a challenge. Oftentimes, actual SAN problems are intermittent and difficult to reproduce, and unfortunately there is no one “cause” for the issue. In addition, for the impact to manifest itself, several contributing issues in the SAN often have to occur at the same time, in the same sequence.

Reported by server tools like iostat or perfmon, slowdowns are attributed to “slow I/O”, which is like saying that your cough is causing your cold.  Slow I/O is just a symptom, and it can be caused by any number of things, including, as surprising as it seems, poorly configured servers or applications.  However, the SAN is usually the first to get blamed.

The typical scenario goes something like this … an application begins to demonstrate intermittent slow behavior, and the server or application team does some basic investigation, discovering that some metric, like IOPS or MB/s, has degraded. The assumption: that the problem is with the I/O subsystem – the SAN. The issue is reported and the SAN team then spends hours poring over the SAN configuration (zoning, mapping, masking, etc.) before resorting to the wait-and-see method, i.e., waiting for the problem to reoccur. In addition, the SAN team has to simultaneously monitor the full I/O path, with the right instrumentation tools. Then, when the problem finally reoccurs, the existing tools often only report the same thing that was already known … that IOPS or MB/s dropped.

Next, the team places a call in to the SAN component vendors, who each ask for several things:

  • A detailed description of the problem
  • An estimate of when the problem occurred
  • Log file dumps from each proprietary monitoring tool (taken when the problem reoccurs yet again)
  • Confirmation that the timestamps from the component service processors and from the servers are noted
  • GetConfig from affected servers
  • Diagrams of the infrastructure
  • For UNIX hosts, output from iostat and sar
  • For Windows hosts, output from performance monitor data of all physical disk objects

Keep in mind that sometimes it’s easy to reproduce a problem, but often it is not, and we know of several instances when the problem occurred only on a monthly basis. With luck, the problem reoccurs soon and the logs capture all the events in a granular-enough order, with everything time-stamped, so that someone back at the vendor’s HQ can guesstimate the cause and effect of the slowdown. Following this, there is usually another call for more metrics, with more care taken in providing the time differences between the various component clocks. This is done because without proper sequencing of events, it’s very hard to know what to blame (which is why your SAN tools, which only provide 5- to 15-minute averages, don’t work). Then, you wait for another reoccurrence.
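To illustrate why the clock differences matter, here is a small sketch that shifts log events from several components onto one reference clock before ordering them. The component names, offsets, timestamps, and messages are all hypothetical; real logs would need parsing first.

```python
from datetime import datetime, timedelta

# Hypothetical offsets: how far each component's clock runs ahead of the reference clock.
CLOCK_OFFSET = {
    "server": timedelta(seconds=0),
    "switch": timedelta(seconds=42),
    "array": timedelta(seconds=-15),
}

# Hypothetical log entries: (component, local timestamp, message).
events = [
    ("server", datetime(2012, 1, 17, 10, 0, 30), "I/O latency spike"),
    ("switch", datetime(2012, 1, 17, 10, 1, 5), "link reset"),
    ("array", datetime(2012, 1, 17, 10, 0, 10), "port queue full"),
]


def normalize(events, offsets):
    """Shift each event onto the reference clock and sort chronologically."""
    corrected = [(ts - offsets[src], src, msg) for src, ts, msg in events]
    return sorted(corrected)


for ts, src, msg in normalize(events, CLOCK_OFFSET):
    print(ts.isoformat(), src, msg)
```

Sorted by raw local timestamps, the array event appears first and the switch event last. After correction the switch event comes first, which changes the apparent cause and effect.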

In the meantime, one vendor points to another vendor’s components that are known to cause problems. And no one on the server or application side is doing their own investigation because the SAN is typically assumed to be guilty. Eventually, the customer demands that all the vendors come onsite to arm-wrestle each other. Finally, someone suggests buying extra links or ports or HBAs or disks, or upgrading hardware to bigger servers, new storage arrays with SSDs, or completely refreshing your SAN to newer models, because they claim that the problem is related to the infrastructure not being able to cope with the performance demands. So more hardware is procured, and sometimes the problem goes away for a few weeks or maybe even months. Or it doesn’t. Ouch.

After all the terror described above, believe it or not, there’s another way to handle this. VirtualWisdom’s historical trend reports allow the IT staff to diagnose reported problems without waiting for the problem to occur again. Fire up VirtualWisdom, move the dashboard slider back to when the problem occurred, and view the playback in time-correlated 1-minute increments. VirtualWisdom then leads the admins to where the bottlenecks occurred. No log file parsing, no calls to vendors, no finger pointing, and no big purchase orders for equipment you don’t need. If you’d like to see a 2-minute video of how this works, click here.

January 12th, 2012

Avoiding the Virtual CPU Dilemma – Overprovisioning vCPU to pCPU ratios

Archie Hendryx

2011 was a year in which, despite the economic constraints, everything Big was seemingly good: Big Data, Big Clouds, Big VMs, etc. Caught up in the industry’s lust for excess, 2011 was also the year I lost count of how many ‘Big’ Production VMs with overprovisioned resources I witnessed. More often than not this was a typical reaction from System Admins trying to alleviate their fears of potential performance problems on important VMs. It was the year I began to hear justifications such as “yes we are overprovisioning our production VMs... but apart from the cost savings, overallocating our available underlying resources to a VM isn’t a bad thing, in fact it allows it to be scalable”. Despite this, 2011 was also the year I lost count of the number of times I had to point out that overprovisioning a VM sometimes does lead to performance problems – specifically when dealing with Virtual CPUs.

VMware refers to CPU as pCPU and vCPU. pCPU or ‘physical’ CPU in its simplest terms refers to a physical CPU core, i.e. a physical hardware execution context (HEC), if hyper-threading is unavailable or disabled. If hyper-threading has been enabled, then a pCPU would constitute a logical CPU. This is because hyper-threading enables a single processor core to act like two processors, i.e. logical processors. So for example, if an ESX 8-core server has hyper-threading enabled, it would have 16 threads that appear as 16 logical processors, and that would constitute 16 pCPUs.
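The arithmetic above can be sketched in a few lines. The function names and the list of VM sizes are hypothetical and only serve to illustrate how a host's overall vCPU to pCPU ratio follows from the pCPU count.

```python
def logical_pcpus(physical_cores: int, hyperthreading: bool) -> int:
    """pCPUs as described above: two logical CPUs per core with hyper-threading."""
    return physical_cores * 2 if hyperthreading else physical_cores


def vcpu_to_pcpu_ratio(vcpus_per_vm, physical_cores: int, hyperthreading: bool) -> float:
    """Total vCPUs allocated on a host divided by its pCPUs."""
    return sum(vcpus_per_vm) / logical_pcpus(physical_cores, hyperthreading)


print(logical_pcpus(8, hyperthreading=True))    # 16
print(logical_pcpus(8, hyperthreading=False))   # 8

# Hypothetical host: seven VMs with these vCPU counts on the 8-core server.
vms = [4, 4, 8, 2, 2, 4, 8]
print(vcpu_to_pcpu_ratio(vms, 8, hyperthreading=True))   # 2.0
```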

As for a virtual CPU (vCPU) this refers to a virtual machine’s virtual processor and can be thought of in the same vein as the CPU in a traditional physical server. vCPUs run on pCPUs and by default, virtual machines are allocated one vCPU each. However, VMware have an add-on software module named Virtual SMP (symmetric multi-processing) that allows virtual machines to have access to more than one CPU and hence be allocated more than one vCPU. The great advantage of this is that virtualized multi-threaded applications can now be deployed on multi vCPU VMs to support their numerous processes. So instead of being constrained to a single vCPU, SMP enables an application to use multiple processors to execute multiple tasks concurrently, consequently increasing throughput. So with such a feature and all the excitement of being ‘Big’ it was easily assumed by many that taking advantage of such a feature by provisioning additional vCPUs could only ever be beneficial – but if only it was that simple.

The typical examples I faced entailed performance problems that were being blamed on either the Storage or the SAN and not on CPU constraints, especially as overall CPU utilization for the ESX server that hosted the VMs would be reported as low. Using Virtual Instruments’ VirtualWisdom, I was able to quickly conclude that the problem was not at all related to the SAN or Storage but to the hosts themselves. By being able to historically trend and correlate the vCenter, SAN and Storage metrics of the problematic VMs on a single dashboard, it was apparent that the high number of vCPUs assigned to each VM was the cause. This was indicated by a high reading of what is termed the ‘CPU Ready’ metric.

To elaborate, CPU Ready is a metric that measures the amount of time a VM is ready to run but is waiting for a pCPU, i.e. how long a vCPU has to wait for an available core when it has work to perform. So while it’s possible that CPU utilization may not be reported as high, if the CPU Ready metric is high then your performance problem is most likely related to CPU. In the instances that I saw, this was caused by customers assigning four vCPUs, and in some cases eight, to each Virtual Machine. So why was this happening?
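vCenter performance charts report CPU Ready as a summation in milliseconds per sample interval, so it is usually converted to a percentage before being judged. The sketch below follows the commonly cited conversion (ready time divided by the interval length); the 20-second default is the real-time chart interval, and dividing by the vCPU count to get a per-vCPU figure is an assumption on our part about how you want to normalize it. The function name and sample values are hypothetical.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (ms) into a percentage of the sample interval.

    interval_s: length of the chart sample interval in seconds (20 s for real-time charts).
    vcpus: set to the VM's vCPU count for a per-vCPU average (our assumption).
    """
    return ready_ms * 100 / (interval_s * 1000 * vcpus)


# Hypothetical sample: 2000 ms of ready time in a 20-second real-time interval.
print(cpu_ready_percent(2000))            # 10.0
print(cpu_ready_percent(2000, vcpus=4))   # 2.5
```

What counts as a "high" percentage depends on the workload, so treat any threshold you apply to this number as a judgment call and not a fixed rule.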

Well, firstly, the hardware and its physical CPU resource are still shared. Coupled with this, the ESX Server itself also requires CPU to process storage requests, network traffic, etc. Then add the situation that sadly most organizations still suffer from the ‘silo syndrome’, and hence there still isn’t a clear dialogue between the System Admin and the Application owner. The consequence is that VMs get multiple vCPUs regardless of the application: while multiple vCPUs are great for workloads that support parallelization, they do nothing for applications that don’t have built-in multi-threaded structures. And because a VM with 4 vCPUs requires the ESX server to wait for 4 pCPUs to become available, on a particularly busy ESX server with other VMs this could take significantly longer than if the VM in question only had a single vCPU.

To explain this further let’s take an example of a four pCPU host that has four VMs, three with 1 vCPU and one with 4 vCPUs. At best only the three single vCPU VMs can be scheduled concurrently. In such an instance the 4 vCPU VM would have to wait for all four pCPUs to be idle. In this example the excess vCPUs actually impose scheduling constraints and consequently degrade the VM’s overall performance, typically indicated by low CPU utilization but a high CPU Ready figure. With the ESX server scheduling and prioritising workloads according to what it deems most efficient to run, the consequence is that smaller VMs will tend to run on the pCPUs more frequently than the larger overprovisioned ones. So in this instance overprovisioning was in fact proving to be detrimental to performance as opposed to beneficial. Now in more recent versions of vSphere the scheduling of different vCPUs and de-scheduling of idle vCPUs is not as contentious as it used to be. Despite this, the VMKernel still has to manage every vCPU, a complete waste if the VM’s application doesn’t use them!
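The four-pCPU example can be expressed as a toy model. It assumes strict co-scheduling, where a VM runs only when every one of its vCPUs can be placed on an idle pCPU at the same moment. That is a deliberate simplification of the behavior described above and not a model of the actual ESX scheduler, which in recent vSphere versions is more relaxed.

```python
HOST_PCPUS = 4


def runnable(vm_vcpus: int, idle_pcpus: int) -> bool:
    """Strict co-scheduling: all of a VM's vCPUs need an idle pCPU at once."""
    return vm_vcpus <= idle_pcpus


# Three single-vCPU VMs are currently running on the four-pCPU host.
running = [1, 1, 1]
idle = HOST_PCPUS - sum(running)

print(idle)                       # 1
print(runnable(4, idle))          # False: the 4-vCPU VM has to wait
print(runnable(1, idle))          # True: another 1-vCPU VM could still run
print(runnable(4, HOST_PCPUS))    # True: only when all four pCPUs are idle
```

Under this model the large VM accumulates ready time whenever any smaller VM is on a core, which matches the pattern of low CPU utilization with a high CPU Ready figure.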

To ensure your vCPU to pCPU ratio is at its optimal level and that you reap the benefits of this great feature, there are some straightforward considerations to make. Firstly, there needs to be dialogue between the silos to fully understand the application’s workload prior to VM resource allocation. In the case of applications where the workload may not be known, it’s key not to overprovision virtual CPUs but rather to start with a single vCPU and add more as and when necessary. Having a monitoring platform that can historically trend the performance and workloads of such VMs is also highly beneficial in determining such factors. As mentioned earlier, CPU Ready is a key metric to consider as well as CPU utilization. Correlating this with Memory and Network statistics, as well as SAN I/O and Disk I/O metrics, enables you to proactively avoid any bottlenecks, correctly size your VMs and hence avoid overprovisioning. This can also be extended to considering how many VMs you allocate to an ESX Server and to ensuring that its physical CPU resources are sufficient to meet the needs of your VMs. As businesses’ key applications become virtualized, it’s imperative that the correct vCPU to pCPU ratio is allocated, whether they are old legacy single-threaded workloads or new multi-threaded workloads. In this instance size isn’t always everything; it’s what you do with your CPU that counts.