Blogs Virtual Instruments Blogs
A True Story from Virtual Instruments’ Lab: You Need the Global View
In our lab here at Virtual Instruments, we run a good size VMware infrastructure, and of course, we use VirtualWisdom to monitor the performance of our lab systems.
Following our own best practices, when we first assembled our lab configuration, we recorded our performance and set alerts accordingly. We checked all our fiber channel links and they were free of physical errors. Overall, we were pretty satisfied, and for several months things ran just fine.
Then one day, we started getting alerts that our write exchange completion times were spiking in the 200-300ms range, from a baseline value of less than 20ms. Similarly, our read exchange completion times were jumping into the 100ms range, against a baseline of less than 10ms. We saw the peaks on the read and write exchange times trend higher as time went on, so we thought we were headed for an outage. We reviewed all our changes, logs, and any info we had. We couldn’t figure out which problems accounted for these slowdowns.
While all this was happening, we received no complaints from the system users — system analysts that review customer databases for issues. We knew that if something was wrong, we would get complaints. We had a silent and future deadly problem happening.
After we verified that our switches, cables and connections were fine, we approached our array vendor. They reviewed their logs on our storage ports and things looked fine. The “aha!” moment came when they started to review the overall array performance. Since VirtualWisdom records the time of each slowdown, it was very easy for the array vendor to look at what was happening. It turns out that our array has dual controllers — we use one controller and our engineering group uses the other. During the times of the slowdowns, the engineering group was running stress tests. The other controller was running at 80% of capacity and our controller was experiencing a large number of cache misses, which resulted in the slowdowns.
So, what can you learn from all of this? First is that when things are initially assembled or are running well, you must baseline your configuration. Unless you know what things are like when systems are running well, you have no idea of where to look. If we did not have a baseline of our configuration, we never would have noticed that the read and write exchange completion times were spiking. Second, by establishing a baseline and leveraging the VirtualWisdom platform, we were able to find and clear the problem before there was ever an outage or complaint. Yes, we don’t get credit for outage avoidance, but it is a lot less stressful for you. Our analysts are doing revenue-generating work, so if they go down, there is a lot of excitement. The last takeaway is that when something happens, it happens for a reason. While everything looked fine to us at the lab level, there were issues occurring one level above that affected us. So back to my comment about the global view. When you are having problems that don’t make sense like we were having in our lab config. Start looking around and see if you are overlooking the fact that you are part of a larger infrastructure.