This global company is a major provider of financial services and investment resources that helps individuals and institutions meet their financial objectives. In addition to 300+ mutual funds, the company offers discount brokerage services, retirement services, estate planning, wealth management, securities execution and clearance, and life insurance, and it continually expands its services to meet customer needs.
The company had built out a large set of servers, Fibre Channel SANs, and Tier 1 storage to set up a flexible environment for virtualizing physical servers. With VMware ESX, expanding business applications would be faster to market and lower in cost than under the traditional model of one application per physical server.
As the number of virtualized servers increased, the IT infrastructure and connected Tier 1 storage exhibited serious problems, up to and including full production outages that required reboots of storage, servers, or both. The problems were severe enough that non-production test and development servers had to be shut down every Sunday evening, and at other times when problems appeared, to prevent congestion and retries. Eventually, all production applications were moved to other environments to mitigate the business impact of the SAN problems.
For eight months, the application, VMware server, and storage IT teams, along with personnel from their storage and server vendors, worked through a litany of problems, all of which affected system-wide IT infrastructure performance.
The financial services company found that its point-solution monitoring tools for VMware, Fibre Channel switches, and storage arrays could not provide the information needed to resolve a number of these problems. Each tool offered only a limited view of the environment, and none combined comprehensive, system-wide visibility with definitive insight into the Fibre Channel communications network. The company’s IT personnel realized that they needed to see the ‘end-to-end’ I/O between the servers and the storage arrays to understand the impact that adding VMware virtual machines was having throughout the IT infrastructure, and on the SAN specifically.
The company identified Virtual Instruments’ VirtualWisdom as a possible solution to its IT infrastructure and SAN performance monitoring problems. Because the problems centered on VMware ESX servers, the Virtual Instruments ProbeVM was essential: it provided I/O information for each virtual machine and ESX server. I/O utilization data from the virtual machines and ESX servers could then be correlated with I/O information from the SAN in a single view to optimize performance and accelerate problem resolution.
The VirtualWisdom Infrastructure Performance Management Platform was installed and monitored by Virtual Instruments Professional Services. Monitoring the VMware ESX virtual machines and servers together with the SAN revealed a number of issues that were affecting performance and availability.
VirtualWisdom showed that part of the latency issue was caused by incorrect queue depth settings. During configuration, the queue depth had been set too high on the storage array Fibre Channel ports, which increased latency for various applications. Virtual Instruments Professional Services worked with the company’s storage administration personnel to develop ‘best practices’ for queue depth settings on the in-house storage.
VirtualWisdom showed that this change did indeed improve latency in the SAN environment without impacting the overall latency as seen by the servers. Nevertheless, it also showed that the queue depth settings by themselves did not resolve the problem for all applications in the environment, nor did they affect the reservation conflict storms seen when VMotion was applied to a SAN under load.
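The link between queue depth and latency described above can be illustrated with Little's Law, which relates the average number of outstanding I/Os to throughput and response time. The following is a minimal sketch, not output from VirtualWisdom: the function name and the 20,000 IOPS port ceiling are hypothetical values chosen only for illustration.

```python
def avg_latency_ms(outstanding_ios: int, iops: float) -> float:
    """Little's Law: average latency = outstanding I/Os / throughput."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return outstanding_ios / iops * 1000.0


# Hypothetical array port that saturates at 20,000 IOPS (assumed figure).
PORT_MAX_IOPS = 20_000

# Once the port is saturated, a deeper queue cannot raise throughput,
# so every additional outstanding I/O only adds waiting time.
for queue_depth in (32, 64, 128, 256):
    latency = avg_latency_ms(queue_depth, PORT_MAX_IOPS)
    print(f"queue depth {queue_depth:>3}: ~{latency:.1f} ms average latency")
```

Under these assumed numbers, each doubling of the queue depth on a saturated port doubles the average latency, which is consistent with the observation that overly high settings increased latency for applications.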
VirtualWisdom demonstrated that server demand was not evenly distributed across the storage array controller ports. On an array with 32 ports, two ports were at or near the upper limit of utilization. A handful of ports carried moderate traffic, while the remainder had little or no traffic load to speak of (less than 3% on average). The assessment was that, with a proper layout, the same performance could easily be achieved with only half as many storage ports.
The customer plans to use VirtualWisdom going forward to lay out the load evenly across the storage in order to maximize utilization, optimize performance, and avoid unnecessary purchases of additional IT infrastructure components.
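The kind of port imbalance described above can be summarized with a few lines of analysis. The sketch below uses made-up utilization figures that only mimic the pattern in the case study (two hot ports, a handful of moderate ones, the rest nearly idle). The thresholds, port names, and the 50% target utilization are assumptions, not values reported by the customer or by VirtualWisdom.

```python
import math


def classify_ports(utilization_pct, hot=80.0, idle=3.0):
    """Split ports into hot (>= hot %) and idle (< idle %) groups."""
    hot_ports = [p for p, u in utilization_pct.items() if u >= hot]
    idle_ports = [p for p, u in utilization_pct.items() if u < idle]
    return hot_ports, idle_ports


def ports_needed(utilization_pct, target_pct=50.0):
    """Ports required if the same total load were spread evenly
    with each port running at target_pct utilization."""
    total_load = sum(utilization_pct.values())
    return math.ceil(total_load / target_pct)


# Hypothetical 32-port array: 2 hot, 5 moderate, 25 nearly idle.
ports = {}
for i in range(32):
    if i < 2:
        ports[f"port{i:02d}"] = 95.0
    elif i < 7:
        ports[f"port{i:02d}"] = 40.0
    else:
        ports[f"port{i:02d}"] = 2.0

hot_ports, idle_ports = classify_ports(ports)
print("hot ports:", hot_ports)
print("idle ports:", len(idle_ports))
print("ports needed at 50% target:", ports_needed(ports))
```

With these illustrative numbers the total load fits on far fewer than 32 ports, which is the same style of reasoning behind the assessment that half as many ports would suffice.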
One storage vendor’s working theory was that an incompatibility between the switches was causing traffic to be dropped and was the source of the poor performance. VirtualWisdom was able to prove conclusively that no frames were being dropped. The physical layer was healthy and error free, and all communication was completing successfully despite the performance issues. This enabled the customer to focus on the real root cause of the performance issues rather than rely on the theories put forth by competing storage vendors.
VirtualWisdom demonstrated that during periods of higher load, such as virus scans and backups, VMotion and DRS can cause reservation storms that lead to unacceptable levels of latency and to failures. More appropriate queue depth settings were used to mitigate the impact of these events. Care is needed so that these settings don’t degrade overall performance when VMotion is not actively moving a VM. VirtualWisdom helped the customer select settings that yielded good performance while minimizing the impact of VMotion. Even with proper settings, monitoring with alerts is required to ensure that multiple clusters accessing the same storage don’t cause outages while simply trying to optimize server memory.
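As a rough illustration of the alerting idea above, the sketch below flags sustained latency breaches rather than single spikes. It is not how VirtualWisdom implements alerts. The function name, the 20 ms threshold, the three-sample window, and the sample data are all hypothetical.

```python
def sustained_breaches(samples, threshold, min_consecutive=3):
    """Return the start index of each run in which at least
    min_consecutive consecutive samples exceed the threshold."""
    alerts = []
    run = 0
    for i, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        # Fire once per run, at the moment it becomes long enough.
        if run == min_consecutive:
            alerts.append(i - min_consecutive + 1)
    return alerts


# Hypothetical per-interval latency readings in milliseconds.
latency_ms = [4, 5, 22, 25, 31, 28, 6, 5, 24, 7]
print(sustained_breaches(latency_ms, threshold=20))
```

Requiring several consecutive breaches is one simple way to avoid alerting on a brief spike while still catching a prolonged event such as a reservation storm during a backup window.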
The VirtualWisdom Infrastructure Performance Management Platform, along with VI’s Professional Services, helped this global financial services company pinpoint the sources of its performance issues. This shortened the mean time to resolution, which helped move VMware ESX virtual machines and servers into full production with optimal use. It also had the benefit of increasing server consolidation ratios. Moving forward, the company’s application, storage, and VMware IT groups have the confidence that they can meet new business requirements in a timely and cost-effective fashion.