• SAN Best Practices
April 17th, 2012

Eager Attendees Ready to Learn During Hands-On-Lab Sessions at Spring SNW 2012

Ron Lee

At the spring Storage Networking World (SNW) show in Dallas, I had the pleasure of teaching the hands-on lab session for VirtualWisdom with Andrew Benrey, VI Solutions Consultant, and we had a fantastic response to our “Storage Implications for Server Virtualization” session. We co-presented with Avere and HP 3PAR. During the two-hour session, we covered how to use VirtualWisdom to administer and optimize a Fibre Channel SAN, NAS optimization with the Avere appliance, and thin provisioning and reclamation with the HP 3PAR arrays.

The lab exercises covered all areas of SAN administration. The first exercise looked at how we discover and report physical layer errors. We then looked at queue depth performance, imbalanced paths, and detection of slow-draining devices using buffer-to-buffer credits. In the last exercise, we reviewed a VMware infrastructure showing the virtual machines, Fibre Channel fabric, and SCSI performance.

I found it interesting that in most of the lab sessions, many students picked the VirtualWisdom lab to start with. I believe that with the demand for proactive SAN management, more and more people are finding out about the benefits of VirtualWisdom and came to the hands-on lab to see for themselves. Looking at the attendance numbers, our lab was sold out for most sessions. Our most popular session had a sign-up list of 52 for 20 seats. During the six sessions we conducted, we were able to meet and talk with almost 500 attendees in depth about the need for tools like VirtualWisdom and the advantages this platform offers for SAN teams working in a virtualized environment. Attendees liked the ability to quickly walk through the infrastructure from the ESXi server down to the storage array and spot the anomalies. The ability to go back in time was also important to them. Several customers were in the lab as part of their product evaluation.

Those of you who have seen VirtualWisdom understand how rich our user interface can be. For the lab exercises, I specifically divided up exercises so that the lab attendees had a much simpler and more easily understood interface in which to work. This turned out well as very few of the attendees needed additional help in working with the Dashboard interface.

Storage Network World Hands-On Lab Infrastructure

April 11th, 2012

Spring 2012: Storage Networking World

Ron Lee

It was great to be at the Storage Networking World (SNW) show in Dallas last week. We saw more customers sending people from both the operations and the architecture/planning groups. It’s important for operations and architecture/planning to work together on SAN infrastructure, so it was good to see this and to hear some of the attendees remark that they were hired to bridge the gap between these groups.

In a panel of CIOs at medium to large companies, all agreed that staffing remains a huge issue. No one is getting new headcount, yet the number of new technologies they have to work with continues to grow. Some saw a solution in cross-training IT staff. One CIO is creating “pods” where architects and planners work closely with operations. Everyone agreed that even though training and cross-training staff often results in “poaching,” it was still worth it to have a better-trained staff. At Virtual Instruments, we agree with this trend and see cross-domain expertise taking on a more important role. VirtualWisdom, for instance, is designed for use by everyone in the infrastructure, from the DBAs and server admins to the fabric and storage admins.

Stew Carless, Virtual Instruments Solutions Architect, held a well-attended session on “Exploiting Storage Performance Metrics to Optimize Storage Management Processes.” In the session, Stew talked about how using the right instrumentation can go a long way towards eliminating the guesswork that often accompanies provisioning decisions.

Over at the Hands-on-Lab, Andrew Benrey and I led the Virtual Instruments part of the “Storage Implications for Server Virtualization” session. We had a full house for most of the sessions and we were pleased that many of the lab attendees were familiar with Virtual Instruments before they participated in the lab.

In a real-time illustration of managing the unexpected, the big news at the show came from the U.S. weather service when a series of tornadoes ripped through the Dallas area to the east and west of the hotel. The SNW staff and the hotel did an excellent job of gathering everyone on the expo floor and sharing updates on what was happening. After a two-hour interruption, the SNW staff did a great job of getting the conference back underway. The expo exhibitors enjoyed two hours with a captive audience!

With a couple of exceptions, the big vendors weren’t at SNW, which we see as a positive trend. People come to these events to learn about new things, and frankly, the newest things come from the newest, smallest vendors. At SNW, the floor was full of smaller, newer vendors who may not have direct sales forces that can blanket continents, but whose fresh insights and new approaches were valuable to the SAN community. I didn’t hear one end user complain that their favorite big vendor wasn’t there.

The next Storage Networking World show will be in Santa Clara this October. We are looking forward to meeting everyone again and catching up on what’s going on.

 

 

March 29th, 2012

A True Story from Virtual Instruments’ Lab: You Need the Global View

Ron Lee

In our lab here at Virtual Instruments, we run a good-sized VMware infrastructure, and of course, we use VirtualWisdom to monitor the performance of our lab systems.

Following our own best practices, when we first assembled our lab configuration, we recorded our performance and set alerts accordingly. We checked all our Fibre Channel links and they were free of physical errors. Overall, we were pretty satisfied, and for several months things ran just fine.

Then one day, we started getting alerts that our write exchange completion times were spiking in the 200-300ms range, from a baseline value of less than 20ms. Similarly, our read exchange completion times were jumping into the 100ms range, against a baseline of less than 10ms. We saw the peaks on the read and write exchange times trend higher as time went on, so we thought we were headed for an outage. We reviewed all our changes, logs, and any info we had. We couldn’t figure out which problems accounted for these slowdowns.

While all this was happening, we received no complaints from the system users, who are system analysts reviewing customer databases for issues. We knew that if something was wrong for them, we would get complaints. We had a silent and potentially deadly problem on our hands.

After we verified that our switches, cables and connections were fine, we approached our array vendor. They reviewed their logs on our storage ports and things looked fine. The “aha!” moment came when they started to review the overall array performance. Since VirtualWisdom records the time of each slowdown, it was very easy for the array vendor to look at what was happening. It turns out that our array has dual controllers — we use one controller and our engineering group uses the other. During the times of the slowdowns, the engineering group was running stress tests. The other controller was running at 80% of capacity and our controller was experiencing a large number of cache misses, which resulted in the slowdowns.

So, what can you learn from all of this? First, when things are initially assembled or are running well, you must baseline your configuration. Unless you know what things are like when systems are running well, you have no idea where to look. If we did not have a baseline of our configuration, we never would have noticed that the read and write exchange completion times were spiking. Second, by establishing a baseline and leveraging the VirtualWisdom platform, we were able to find and clear the problem before there was ever an outage or complaint. Yes, we don’t get credit for outage avoidance, but it is a lot less stressful for you. Our analysts are doing revenue-generating work, so if they go down, there is a lot of excitement. The last takeaway is that when something happens, it happens for a reason. While everything looked fine to us at the lab level, there were issues occurring one level above that affected us. So back to my comment about the global view: when you are having problems that don’t make sense, as we were in our lab configuration, start looking around and see whether you are overlooking the fact that you are part of a larger infrastructure.
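The baselining idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how VirtualWisdom sets its alerts: the sample values, the function names, and the "mean plus three standard deviations" threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def build_baseline(samples_ms):
    """Summarize known-good latency samples as (mean, standard deviation)."""
    return mean(samples_ms), stdev(samples_ms)

def find_spikes(samples_ms, baseline, sigmas=3.0):
    """Return (index, value) pairs that exceed mean + sigmas * stdev."""
    avg, sd = baseline
    threshold = avg + sigmas * sd
    return [(i, v) for i, v in enumerate(samples_ms) if v > threshold]

# Hypothetical write exchange completion times (ms) from a healthy period,
# all under the 20 ms baseline mentioned in the story.
healthy_writes = [12, 15, 14, 18, 13, 16, 17, 14]
baseline = build_baseline(healthy_writes)

# Hypothetical later samples; two of them sit in the 200-300 ms range.
later_writes = [15, 14, 240, 16, 290, 13]
spikes = find_spikes(later_writes, baseline)
print(spikes)
```

Without the healthy-period samples there is no threshold to compare against, which is the point of recording a baseline while things are running well.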

 

March 9th, 2012

Physical Layer Errors in Your Fibre Channel SAN

Ron Lee

As the termite commercials say, what you can’t see can hurt you. Your SAN looks fine: no open trouble tickets and no complaints. The interesting thing about physical layer errors is that they may not cause a link failure but can still degrade link performance. If you have a good-sized SAN, it is very hard to spot these degradations. Also, depending on where the degradation is occurring, problems can show up in one path, in multiple paths, or randomly as traffic shifts from fabric to fabric. These small slowdowns start adding up and are the cause of lots of aggravation for datacenter managers. We have seen scenarios where one bad cable, one failing SFP, or one unconnected port can severely affect the performance of an application.

So where do physical errors come from? VirtualWisdom recognizes four types of physical layer errors:

  • CRC
  • Code Violations
  • Loss of Signal
  • Loss of Sync

These errors commonly have one or more of the following causes:

  • Bad or damaged cables
  • Dirty optics
  • Bad SFP

These kinds of problems can happen to brand-new, fresh out-of-the-bag parts. In an upcoming post, I will talk about contamination and dirt, but you have to remember that none of these components are built in a clean room and contamination is not a big design consideration on the part of the vendors.

Without a tool like VirtualWisdom, you will have to inspect the user interface on all of your switches. There are counters that show the count for physical errors. You will need to collect this data and keep track of how they are trending for each port. Not a lot of fun and really not practical in today’s datacenters.
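To give a feel for the manual bookkeeping involved, here is a minimal Python sketch that compares two snapshots of cumulative per-port error counters and reports which ones grew. The port names, counter names, and snapshot layout are hypothetical; how you actually read the counters depends on your switch vendor's interface.

```python
def error_deltas(previous, current):
    """Return per-port increases in each error counter between two snapshots.

    Counters that stayed flat or went down (for example after a counter
    reset) are ignored in this simple sketch.
    """
    deltas = {}
    for port, counters in current.items():
        before = previous.get(port, {})
        grown = {}
        for name, count in counters.items():
            increase = count - before.get(name, 0)
            if increase > 0:
                grown[name] = increase
        if grown:
            deltas[port] = grown
    return deltas

# Hypothetical cumulative counters collected at two points in time.
snapshot_before = {
    "port1": {"crc": 0, "code_violations": 2},
    "port2": {"crc": 5, "code_violations": 0},
}
snapshot_after = {
    "port1": {"crc": 0, "code_violations": 2},
    "port2": {"crc": 9, "code_violations": 1},
    "port3": {"loss_of_sync": 3},
}

deltas = error_deltas(snapshot_before, snapshot_after)
print(deltas)
```

Repeating this for every port on every switch, and keeping the history so you can see a trend, is exactly the chore described above.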

VirtualWisdom keeps track of these events and records them over time, so we can create a plot like this:

VirtualWisdom Dashboard

Also, since we are collecting the data, we can set alerts to notify you and to trigger actions. If you get a run of physical errors, we can start a recording so you can see what is going on when they happen. This would be above and beyond the one-minute, five-minute and one-hour summaries we are recording already.
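As a rough illustration of what "a run of physical errors" could mean as an alert condition, the Python sketch below looks for several consecutive intervals with a nonzero error count. The three-interval run length, the function name, and the sample data are assumptions for the example, not VirtualWisdom's actual alert logic.

```python
def first_error_run(error_counts, run_length=3):
    """Return the index where the first run of `run_length` consecutive
    nonzero intervals starts, or None if there is no such run."""
    streak = 0
    for i, count in enumerate(error_counts):
        streak = streak + 1 if count > 0 else 0
        if streak == run_length:
            return i - run_length + 1
    return None

# Hypothetical CRC error counts per one-minute interval on a single port.
crc_per_minute = [0, 1, 0, 0, 2, 3, 1, 0]
start = first_error_run(crc_per_minute)
if start is not None:
    print(f"Run of errors began at minute {start}; start a detailed recording.")
```

An isolated error at minute 1 does not trigger anything here; only the sustained run does.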

Cleaning up all your physical errors is the first step in improving your SAN reliability and performance. We all understand too well how a datacenter can grow organically, rack by rack, and accumulate issues. The same goes for new datacenters. It is good practice to clear out all your physical errors during bring-up and before you are under configuration management. Stay tuned for my next post, where I will take a closer look at the tools available for keeping things clean and reducing physical layer errors. These tools are must-haves for your datacenter maintenance kit.

January 27th, 2012

Finding a Slow-Draining Device

Ron Lee

In troubleshooting SAN problems, we hear the term “slow-draining device” a lot, as it can cover a lot of issues. I would like to show how we can find a slow-draining device when the underlying problem is the loss of Buffer-to-Buffer Credit. We will start with a switch-level view of the data, then look at things in more detail using our VirtualWisdom platform.

Our scenario starts out with a trouble report of frequent LinkReset events on a link, as shown in the VirtualWisdom Dashboard here:

Since VirtualWisdom records the events in your infrastructure, it is very easy for you to pull up the data at the time of the trouble report. You see the series of small green diamonds indicating LinkReset events. With every LinkReset event, the system also detected Class3 Discards. This indicates that the switch could not forward frames to the intended destination and dropped those frames without notifying the sender or receiver. This situation places extra load on the fabric as the device attempts to recover from the missing frames, or worse, data corruption can occur due to missing data. At this point, we are looking at data at the one-hour summary level. If we need to see more detail, we can switch to a five-minute summary, which looks like:

The next thing to look at in this scenario is the VirtualWisdom ProbeFC8 Event Trend dashboard, which shows:

This dashboard is most interesting in that it shows there are no physical layer problems. There are no CRC errors, Loss of Signal, Code Violations, etc., so we can quickly rule out any physical fiber problems.

Still looking at the five-minute summary level, we then decide to look at the Buffer-to-Buffer Credit parameter to see if there is any problem. So far, we have not seen any activity that correlates with our LinkReset problem. We open that dashboard and see:

This display is typical of many SAN monitoring and performance tools. It shows the data with a five-minute sample. At the bottom, we see our LinkReset events, but there is no visible pattern in the credit data that correlates with them. However, since we are using the hardware-based VirtualWisdom SAN Performance Probe, we can go down to a one-minute summary. When we look at the same time frame using one-minute summaries, we see:

Here the correlation stands out very clearly. At the top, we see the value of the Buffer-to-Buffer credit parameter. As it decays low enough, the system responds with a LinkReset. We see this pattern happening over and over again. Now that we have this information, we can report the problem to operations and it can be fixed.
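Why a coarser summary can hide this pattern is easy to show with a small Python sketch. The numbers are made up, and treating the five-minute summary as a simple average of one-minute values is an assumption for illustration; it is not a description of how VirtualWisdom or any particular tool computes its summaries.

```python
def summarize(samples, window, reducer):
    """Collapse `samples` into consecutive windows using `reducer`."""
    return [reducer(samples[i:i + window])
            for i in range(0, len(samples), window)]

def average(values):
    return sum(values) / len(values)

# Hypothetical one-minute readings of available buffer-to-buffer credits.
# The credit level collapses briefly at minutes 3, 7, and 14.
credits_per_minute = [8, 8, 8, 1, 8,
                      8, 8, 0, 8, 8,
                      8, 8, 8, 8, 2]

five_minute_average = summarize(credits_per_minute, 5, average)
five_minute_minimum = summarize(credits_per_minute, 5, min)

print(five_minute_average)  # the brief collapses are smoothed away
print(five_minute_minimum)  # the collapses are still visible
```

In this toy data the averaged five-minute values all stay between 6 and 7, so nothing lines up with a LinkReset, while the one-minute readings (or a per-window minimum) show the credit level dropping to nearly zero each time.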

So let’s review what I just went through. We started with a report of a series of LinkReset events. We were able to easily find the problem in the data recorded by VirtualWisdom. We were able to quickly rule out any physical fiber problems. A look at the data at the typical five-minute summary still did not show any problems. However, when we looked at data at a one-minute summary, we saw the correlation between Buffer-to-Buffer credit loss and the LinkReset. This allowed us to report the specific problem to operations for resolution.