SAN Best Practices
May 1st, 2012

SAN Troubleshooting Best Practices

Jim Bahn

People often ask, “Are there any best practices that the troubleshooting experts recommend?” I asked a couple of our top services guys for their recommendations, and I’m sharing them with all of you today:

  1. Don’t stop looking just because you’ve removed the symptom, because if you do, you’re likely to see the same problems later. Sure, to alleviate the immediate problem, you may have to remove users or applications that are less critical, perhaps stop backups, and remove other potential bottlenecks. While this may fix the immediate problem, it often stops the underlying cause from being discovered.
  2. Use “real” real-time monitoring for alerts that get you in front of the issues before the application users feel the pain.
  3. Sometimes you have to broaden your approach beyond what the user is reporting. If you stop there, you will often miss larger issues that may affect other, slightly less latency-sensitive apps.
  4. As a first step for triage, try to isolate whether the cause is on the server or the SAN. Comparing your baseline Exchange Completion Time (ECT) with the ECT during the slowdown will tell you immediately where to start and where to stop looking. Your vendors will appreciate it, too.
  5. Try to find the finest granularity in your historical reporting to see which event preceded another, for cause and effect. A one-minute interval is often not sufficiently granular.
  6. Look at your historical I/O patterns, busy times of day, multipath configurations, queue depth settings, top talkers, etc. to build a profile of behavior. Then compare it with your healthy baseline, and rule out things that haven’t changed. You might find six things that appear to be going wrong, but if only one of them seems to have occurred when the problem was reported, you can focus on that one immediately. Later on, you can go back to look at the others.
  7. When changes are made to fix the incident, you should get immediate feedback. Without an immediate response, customers often take one of two approaches: 1) they delay or stagger fixes until they can determine the effect of each one, or 2) they make all changes at the same time, and are then left wondering which change fixed the problem.
  8. Lastly, ask for help sooner rather than later. We’ve heard of problems dragging on for months, vendors getting kicked out of accounts, and literally millions of dollars wasted on adding expensive hardware. Waiting days or weeks to find the root cause of a problem is unacceptable. Bring in a performance pro.
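
To make tips 4 and 6 a little more concrete, here is a minimal Python sketch of the comparison against a healthy baseline. Everything in it is an assumption for illustration: the function names, the metric names, and the thresholds are hypothetical, and it does not represent how VirtualWisdom or any other tool does this. It also assumes ECT is measured in the fabric, so an ECT well above baseline suggests starting on the SAN side, while an unchanged ECT suggests starting on the server.

```python
def triage_side(baseline_ect_ms, incident_ect_ms, ratio_threshold=1.5):
    """Suggest where to start looking, based on Exchange Completion Time.

    Assumption: ECT is measured in the fabric. A clear rise over the
    baseline points at the SAN/storage side; an ECT close to baseline
    points back at the server. The 1.5x threshold is arbitrary.
    """
    if baseline_ect_ms <= 0:
        raise ValueError("baseline ECT must be positive")
    ratio = incident_ect_ms / baseline_ect_ms
    return "san" if ratio >= ratio_threshold else "server"


def changed_metrics(baseline, incident, tolerance=0.25):
    """Return the names of metrics that moved more than `tolerance`.

    `baseline` and `incident` are dicts of metric name -> value.
    Metrics missing from the incident snapshot, or with a zero
    baseline, are skipped rather than guessed at.
    """
    changed = []
    for name, base in baseline.items():
        if name not in incident or base == 0:
            continue
        relative_change = abs(incident[name] - base) / abs(base)
        if relative_change > tolerance:
            changed.append(name)
    return sorted(changed)


if __name__ == "__main__":
    # Hypothetical snapshots: healthy baseline vs. the reported slowdown.
    healthy = {"queue_depth": 32, "iops": 1000, "ect_ms": 4.0}
    slowdown = {"queue_depth": 32, "iops": 1100, "ect_ms": 12.0}
    print(triage_side(healthy["ect_ms"], slowdown["ect_ms"]))
    print(changed_metrics(healthy, slowdown))
```

Anything that does not show up in the changed list can be set aside for later, which is the point of ruling out what hasn’t changed.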

 

 

April 20th, 2012

Controlling Over-Provisioning of Your Storage Ports

Jim Bahn

While it’s generally accepted that SAN storage utilization is low, only a few industry luminaries, such as Jon Toigo, have talked about the severe underutilization of Fibre Channel (FC) SAN fabrics.  The challenge, of course, is that few IT shops have actually instrumented their SANs to enable accurate measurements of fabric utilization.  Instead, 100% of enterprise applications get the bandwidth that perhaps only 5% of the applications need, wasting CAPEX.

In dealing with several dozen large organizations, we have found that nearly all FC storage networks are seriously over-provisioned, with average utilization rates well below 10%.  Here’s a VirtualWisdom dashboard widget (below) that shows the most heavily utilized storage ports on two storage arrays, taken from a Fortune 500 customer.  The figures refer to “% utilization.”
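
If you want to sanity-check a “% utilization” figure against your own counters, the arithmetic is simple. The Python sketch below is only an illustration: the function names and the sample throughput values are hypothetical, and the 800 MB/s capacity assumes the nominal per-direction throughput of an 8 Gb/s FC link, so substitute whatever matches your ports.

```python
def port_utilization_pct(throughput_mb_s, link_capacity_mb_s):
    """Utilization of one port as a percentage of its link capacity."""
    if link_capacity_mb_s <= 0:
        raise ValueError("link capacity must be positive")
    return 100.0 * throughput_mb_s / link_capacity_mb_s


def average_utilization_pct(samples_mb_s, link_capacity_mb_s):
    """Mean utilization over a series of throughput samples."""
    if not samples_mb_s:
        raise ValueError("need at least one throughput sample")
    total = sum(port_utilization_pct(s, link_capacity_mb_s)
                for s in samples_mb_s)
    return total / len(samples_mb_s)


if __name__ == "__main__":
    # Assumption: 8 Gb/s FC port, nominally 800 MB/s in each direction.
    capacity_mb_s = 800.0
    # Hypothetical throughput samples for one storage port, in MB/s.
    samples = [40.0, 60.0, 80.0]
    print(average_utilization_pct(samples, capacity_mb_s))
```

Averages hide peaks, so it is worth looking at the busiest interval for each port as well as the mean before consolidating anything.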

Beyond the obvious unnecessary expense, the reality is that with such low utilization rates, simply building in more SAN hardware to address performance and availability challenges does nothing more than add complexity and increase risk.  With VirtualWisdom, you can consolidate your ports, or avoid buying new ones, and track the net effect on your application latency to the millisecond.  The dashboard widgets below show the “before” and “after” latency figures that resulted from the configuration changes to this SAN, using VirtualWisdom.  They demonstrate a negligible effect.

Latency “before”

Latency “after”

Our most successful customers have tripled utilization and have been able to reduce future storage port purchases by 50% or more, saving $100 – $300K per new storage array.

For a more detailed discussion of SAN over-provisioning, click here, or check out this ten-minute video discussing this issue and over-tiering.

April 17th, 2012

Eager Attendees Ready to Learn During Hands-On-Lab Sessions at Spring SNW 2012

Ron Lee

At the spring Storage Networking World (SNW) show in Dallas, I had the pleasure of teaching the hands-on lab session for VirtualWisdom with Andrew Benrey, VI Solutions Consultant, and we had a fantastic response to our “Storage Implications for Server Virtualization” session. We co-presented with Avere and HP 3PAR, and during the two-hour session we covered how to use VirtualWisdom to administer and optimize a Fibre Channel SAN, NAS optimization with the Avere appliance, and thin provisioning and reclamation with the HP 3PAR arrays.

The lab exercises covered all areas of SAN administration. The first exercise looked at how we discover and report physical layer errors. We then looked at queue depth performance, imbalanced paths, and detection of slow-draining devices using buffer-to-buffer credits. In the last exercise, we reviewed a VMware infrastructure showing the virtual machines, Fibre Channel fabric and SCSI performance.

I found it interesting that for most of the lab sessions, many students picked the VirtualWisdom lab to start with. I believe that with the demand for proactive SAN management, more and more people are finding out about the benefits of VirtualWisdom, and they came to the hands-on lab to see for themselves. Judging by the attendance numbers, our lab was sold out for most sessions. Our most popular session had a sign-up list of 52 for 20 seats.  During the six sessions we conducted, we were able to meet and talk with almost 500 attendees in depth about the need for tools like VirtualWisdom and the advantages this platform offers for SAN teams working in a virtualized environment.  Attendees liked the ability to quickly walk through the infrastructure from the ESXi server down to the storage array and spot the anomalies. The ability to go back in time was also important to them. Several customers were in the lab as part of their product evaluation.

Those of you who have seen VirtualWisdom understand how rich our user interface can be. For the lab exercises, I specifically divided up exercises so that the lab attendees had a much simpler and more easily understood interface in which to work. This turned out well as very few of the attendees needed additional help in working with the Dashboard interface.

Storage Networking World Hands-On Lab Infrastructure

April 13th, 2012

De-risking SAP Performance and Availability

Jim Bahn

It’s no secret that many mission-critical enterprise IT implementations depend on SAP.  In 2008, the Standish Group estimated the average cost of ERP downtime at $888K per hour. If you’re an SAP user, you probably have some idea of your cost of downtime.

What’s surprising to me is that often companies still rely on massive over-provisioning to handle the database growth and ensure that their infrastructure can meet the level of performance and availability required for informal or formal Service Level Agreements.  On one level, it’s understandable, because the stakes are so high.  But we’re starting to see a trend towards better instrumentation and monitoring, because, while the stakes are high, so are the costs.

The truth is, the performance of SAP is usually not bottlenecked by server-side issues, but rather by I/O issues.  Unfortunately, most of today’s monitoring solutions, including the best known APM solutions, have a tough time correlating your applications with your infrastructure.  The “link” between the application and the infrastructure is often inferred, or is so high level that deriving actual cause and effect is still a guessing game.

Many of our largest customers de-risk their SAP applications using VirtualWisdom to directly correlate the infrastructure latency to their application instances.  In this simple dashboard widget (below), an application owner tracks, in real time, the application latency, in milliseconds, caused by the SAN infrastructure.

With this level of tracking and correlation, many of the largest SAP and VirtualWisdom customers have successfully de-risked their growing, mission-critical SAP deployments.

To hear our Director of Solutions Consulting Alex D’Anna discuss this issue in more detail, I encourage you to attend his 35-minute On-Demand webcast.

April 11th, 2012

Spring 2012: Storage Networking World

Ron Lee

It was great to be at the Storage Networking World (SNW) show in Dallas last week. We saw more customers sending people from both the operations and the architecture/planning groups. It’s important for operations and architecture/planning to work together on SAN infrastructure, so it was good to see this and to hear some of the attendees remark that they were hired to bridge the gap between these groups.

In a panel of CIOs at medium to large companies, all agreed that staffing remains a huge issue.  No one is getting new headcount, yet the number of new technologies they have to work with continues to grow.  Some saw a solution in cross-training IT staff.  One CIO is creating “pods” where architects and planners work closely with operations.  Everyone agreed that even though training and cross-training staff often results in “poaching,” it was still worth it to have a better-trained staff.  At Virtual Instruments, we agree with this trend and see cross-domain expertise taking on a more important role. VirtualWisdom, for instance, is designed for use by everyone in the infrastructure, from the DBAs and server admins to the fabric and storage admins.

Stew Carless, Virtual Instruments Solutions Architect, held a well-attended session on “Exploiting Storage Performance Metrics to Optimize Storage Management Processes.”  In the session, Stew talked about how using the right instrumentation can go a long way towards eliminating the guessing game that often accompanies provisioning decisions.

Over at the Hands-on-Lab, Andrew Benrey and I led the Virtual Instruments part of the “Storage Implications for Server Virtualization” session. We had a full house for most of the sessions and we were pleased that many of the lab attendees were familiar with Virtual Instruments before they participated in the lab.

In a real-time illustration of managing the unexpected: The big news at the show came from the U.S. weather service, when a series of tornados ripped through the Dallas area to the east and west of the hotel. The SNW staff and the hotel did an excellent job of gathering everyone on the expo floor and sharing updates on what was happening. After a two-hour interruption, the SNW staff did a great job of getting the conference back underway. The expo exhibitors enjoyed the two hours of a captive audience!

With a couple of exceptions, many of the big vendors weren’t at SNW, which we see as a positive trend.  People come to these events to learn about new things, and frankly, the newest things come from the newest, smallest vendors.  At SNW, the floor was full of smaller, newer vendors who may not have direct sales forces that can blanket continents, but whose fresh thinking and new approaches provided valuable insights for the SAN community.  I didn’t hear one end user complain that their favorite big vendor wasn’t there.

The next Storage Networking World show will be in Santa Clara this October. We are looking forward to meeting everyone again and catching up on what’s going on.