By Ryan Perkowski, Senior Product Manager
We at VI are often asked: what are common NAS problems, and how do you find them? I know that users are frustrated, simply because I used to be one. They have been forced to live with the pain of having problems escalated to them and never finding the root cause. They've tried reaching out for help, asking their vendors, asking their peers, asking the great Google admin in the cloud, but they can't seem to improve their stability. So now they are looking to third parties for help. I wanted to share the story of one such problem I ran across recently: a rogue client that one of our customers in the Midwestern US was wrestling with.
Our story starts with a user complaint. The help desk started seeing a pattern in the phone calls and opened a trouble ticket with the VM administrator. The VM admin saw that a datastore was experiencing slow response times, so the storage admin was called on the carpet. The storage admin fired up his favorite array tool, and it confirmed that the NAS filer was indeed performing poorly. He could see that his filer was running low on front-end resources, but he didn't know why.
Our intrepid storage admin was running VirtualWisdom and had recently installed ProbeNAS. This allowed him to see every client conversation on each port of his filer, and he could easily see that one of the clients stuck out like a sore thumb. It turned out that one client was swamping the NAS filer with metadata requests. Look at the graph below. Can you see the client that is not like the others?
The graph shows RPC calls in IOPS, broken down by client, with the legend scrubbed to remove customer IP addresses. Without going too deep into the weeds, this client was attempting to do file-level replication. The replication engine was looking at each file's modified date and assessing whether the file needed to be replicated. This approach worked fine when the filesystem held only 500k files. Over time, though, that filesystem grew, and it now holds 3 million files. Every time the scan was kicked off (originally once a minute), the client asked the array for the modified time of every file. Though VERY little data was being transmitted (VERY low throughput), the array was running out of front-end resources to handle all the requests.
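To make the pattern concrete, here is a minimal sketch of what a modified-date scan looks like from the client side. This is not the customer's replication engine, which I can't share; the function name `find_changed_files` and the tiny temporary directory are made up for illustration. The point is that every file costs one metadata lookup, whether or not it changed. On an NFS mount, each of those lookups generally becomes at least one attribute request to the filer unless the client's attribute cache can answer it.

```python
import os
import tempfile


def find_changed_files(root: str, since: float) -> tuple[list[str], int]:
    """Return (paths modified after `since`, number of stat calls issued).

    Hypothetical illustration of a file-level replication scan.
    """
    changed = []
    stat_calls = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # One metadata request per file, changed or not.
            mtime = os.stat(path).st_mtime
            stat_calls += 1
            if mtime > since:
                changed.append(path)
    return changed, stat_calls


# Tiny demo: five files, only one of which changed after the cutoff.
with tempfile.TemporaryDirectory() as demo_root:
    for i in range(5):
        demo_path = os.path.join(demo_root, f"file{i}.dat")
        with open(demo_path, "w") as handle:
            handle.write("x")
        # Set explicit timestamps so the demo is deterministic.
        stamp = 2000 if i == 0 else 1000
        os.utime(demo_path, (stamp, stamp))

    demo_changed, demo_calls = find_changed_files(demo_root, since=1500)
    print(f"{len(demo_changed)} file(s) to replicate, {demo_calls} stat calls")
```

Notice that the scan does the same amount of metadata work no matter how few files actually changed. That is why throughput can stay tiny while the request count climbs with the size of the filesystem.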
The solution wasn't adding more storage or upgrading the filers. It was simply reducing the frequency of these scans. Once the change was completed, the complaints disappeared, and I had a very happy storage admin on my hands.
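For a rough sense of why the scan interval matters so much, here is a back-of-the-envelope sketch. It assumes one metadata request per file per scan, which is a simplification (a real client may issue more or fewer, depending on the protocol and caching), and the 15-minute interval is a hypothetical value chosen for illustration, not the setting the customer ended up with. The file counts and the one-minute interval come from the story above.

```python
def metadata_ops_per_second(
    file_count: int,
    scan_interval_s: float,
    requests_per_file: float = 1.0,  # assumption: one request per file per scan
) -> float:
    """Average metadata request rate generated by a periodic full scan."""
    return file_count * requests_per_file / scan_interval_s


scenarios = [
    ("500k files, scan every minute", 500_000, 60),
    ("3M files, scan every minute", 3_000_000, 60),
    ("3M files, scan every 15 min (hypothetical)", 3_000_000, 15 * 60),
]

for label, files, interval_s in scenarios:
    rate = metadata_ops_per_second(files, interval_s)
    print(f"{label}: about {rate:,.0f} requests/s on average")
```

Under these assumptions the average request rate scales linearly with the file count and inversely with the interval, so stretching the interval is the cheapest lever available.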