VirtualWisdom and the Queue Solver analytic

By Ravi Prakash, Product Manager

Imagine that it is Friday evening and a crowd is gathering outside a bar in Manhattan. To ensure that all patrons feel comfortable, the bar would like to keep its guests at a ratio of 50% female to 50% male. Whenever guests leave the bar, the bouncer must admit the right proportion of women to men from the queue milling outside. The bouncer performs this task while also ensuring that the city fire code limit of 100 occupants in the bar is not violated.

The analogy carries over to an application conversation between a host and a storage array. The application, running on an operating system in a server equipped with a Fibre Channel Host Bus Adapter (HBA), communicates via the SAN fabric (comprising Brocade or Cisco switches) with the LUN in the block storage array. The HBA is one of many bouncers in the storage area network and must ensure that the number of requests going from the server to the storage array does not overwhelm the array. The human bouncer uses his reasoning ability to determine how many guests can queue outside the door with a good probability of entering the bar that evening. In the same manner, the HBA relies on a setting called queue depth to decide how many outstanding (unanswered) SCSI commands or I/O requests can be in the pipeline for every LUN exposed by the storage array.
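As a rough illustration of what a per-LUN queue depth does, the following Python sketch models the gate that the bouncer analogy describes. The class name `LunQueue` and its methods are hypothetical and exist only for this example; it is a toy model, not how any particular HBA driver is written.

```python
class LunQueue:
    """Toy model of a per-LUN queue depth limit (hypothetical, for illustration)."""

    def __init__(self, queue_depth: int) -> None:
        if queue_depth < 1:
            raise ValueError("queue_depth must be at least 1")
        self.queue_depth = queue_depth
        self.outstanding = 0  # commands sent but not yet answered

    def try_submit(self) -> bool:
        """Admit one more command if the LUN is below its queue depth."""
        if self.outstanding >= self.queue_depth:
            return False  # the "bouncer" holds the request at the door
        self.outstanding += 1
        return True

    def complete(self) -> None:
        """Record that the array answered one outstanding command."""
        if self.outstanding == 0:
            raise RuntimeError("no outstanding command to complete")
        self.outstanding -= 1


lun0 = LunQueue(queue_depth=32)
admitted = sum(lun0.try_submit() for _ in range(40))  # offer 40 commands at once
print(admitted)           # 32 admitted, 8 held back
lun0.complete()           # one response arrives from the array
print(lun0.try_submit())  # True: a slot has opened up
```

With a queue depth of 32, the 33rd command has to wait until an earlier one is answered, just as a guest waits outside until someone leaves the bar.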

You may wonder: why can’t I just set an optimal value for queue depth in the application server’s operating system and be done with it? Consider an example that illustrates why setting queue depth manually is a big challenge in itself!

[Figure: LUN and storage array diagram]

In the example shown above, server1 runs an application and has 2 Fibre Channel HBAs, and there are 4 paths from server1 to LUN 0 via 4 ports on the storage array. If each conversation from an application is defined as an Initiator Target LUN (ITL), then you have 4 ITLs, and the operating system (Windows, Linux, etc.) on your application server might recommend a default queue depth of 32 per ITL. For the 4 ITLs from server1 to LUN 0, you now have up to 4 × 32 = 128 outstanding I/Os. This default recommendation for queue depth doesn’t factor in the following variables:

  • What if instead of 2 HBAs you had 4 HBAs per server or instead of just LUN 0 you had access to 100 LUNs on the storage array?
  • What if, instead of spinning disk in your storage array, you had SSD-based all-flash arrays that can process I/O requests in microseconds?
  • What if your application server is connected to more than one vendor’s networked storage? DellEMC may recommend a queue depth of 256 for running SQL Server on XtremIO storage.  What if you have spinning disk-based storage from another vendor who recommends a different value?
  • What if instead of one application server you had 100 application servers?

Queue depth in the application server OS doesn’t take these factors into account! Adding more HBAs or more LUNs creates a multiplicative effect on the number of outstanding I/Os. You cannot increase the number of outstanding I/Os indefinitely, as a storage port on your array may have a limit of 2048 outstanding I/Os. When that limit is reached, the array sends a queue full message in a SCSI packet to the server. This causes your application server to wait a while before retrying, so that it does not trigger another queue full message. Your application users experience this as increased latency (response times), and your business begins to be impacted!
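The multiplicative effect can be made concrete with a little arithmetic. The Python sketch below is only an illustration: the `Host` class, the `worst_case_per_port` helper, and the assumption that ITLs are spread evenly across storage ports are all hypothetical simplifications, while the queue depth of 32 and the port limit of 2048 are the figures used in this post.

```python
import math
from dataclasses import dataclass
from typing import Sequence

PORT_LIMIT = 2048  # example per-port limit on outstanding I/Os from the text


@dataclass
class Host:
    """Hypothetical description of one application server."""
    paths_per_lun: int  # initiator-target paths to each LUN
    luns: int           # LUNs visible to this server
    queue_depth: int    # outstanding I/Os allowed per ITL

    def itl_count(self) -> int:
        return self.paths_per_lun * self.luns

    def max_outstanding(self) -> int:
        return self.itl_count() * self.queue_depth


def worst_case_per_port(hosts: Sequence[Host], storage_ports: int) -> int:
    """Worst-case outstanding I/Os per storage port.

    Assumes (for illustration only) that ITLs are spread evenly across ports.
    """
    total = sum(h.max_outstanding() for h in hosts)
    return math.ceil(total / storage_ports)


server1 = Host(paths_per_lun=4, luns=1, queue_depth=32)
print(server1.max_outstanding())            # 4 ITLs x 32 = 128

# The "what if" case: 100 servers, each seeing 100 LUNs over 4 paths.
fleet = [Host(paths_per_lun=4, luns=100, queue_depth=32) for _ in range(100)]
load = worst_case_per_port(fleet, storage_ports=4)
print(load, load > PORT_LIMIT)              # 320000 True
```

Under these assumptions a single server stays far below the port limit, while the fleet exceeds it by two orders of magnitude, which is why a per-host default cannot be right for every environment.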

Our infrastructure monitoring platform, VirtualWisdom, has an analytic called Queue Solver that takes away the guesswork and determines the optimal queue depth for your environment, so that the storage array doesn’t waste precious resources and application users don’t perceive added latency.

[Figure: VirtualWisdom analytics dashboard]

Consider an example where a customer noticed an unacceptable response time of 15 msec for the application. The OS on the server had a default queue depth of 32, as shown in the screenshot below. Given an SLA that requires latency below 5 msec, Queue Solver completes its analysis and recommends that queue depth be set between 22 and 24. This may sound counter-intuitive! We normally assume that more items in the queue equates to higher efficiency. However, Queue Solver knows better, as it has factored in all the variables so you don’t have to. When you accept the recommendation and manually change the queue depth to 24, you’ll notice that latency has dropped below 5 msec. Queue Solver has taken away the need for your operations team to worry about the number of servers, the number of HBAs per server, the number of storage ports, the type of storage array (disk or all-flash), and the number of LUNs used by an application server; it gives you a recommendation on optimal queue depth based on its own thorough analysis.

[Figure: VirtualWisdom Queue Solver screenshot]

Queue Solver is just one of many analytics we’ve built into VirtualWisdom so that you can get to the root cause of complex infrastructure issues and have a path to remediation. We recognize that your operations teams are stretched and don’t have the luxury of becoming experts in the internals of SAN protocols, so we do most of the heavy lifting in our monitoring platform. In parting, I’d like to mention that the bar bouncer analogy used to explain Queue Solver came from one of our data scientists. Want to learn more about how VirtualWisdom can help identify infrastructure impact on your application SLA? Give us a call!