When troubleshooting performance issues on a NetApp storage system, Perfstat is a very useful utility. There are other ways to get performance statistics, but they are not quite as detailed.
A perfstat file can be rather daunting at first glance. Perfstat files get very large, very fast, and may scare away the faint of heart. But if you spend a little time looking them over, they start to make sense.
For the purposes of this article, I am going to focus on three areas that will help you pinpoint the performance problem you may be experiencing on your SAN: Disk, CPU, and Network.
Collecting Perfstats
Step one in this process is to actually collect a perfstat while the problem is occurring. To get started, you will need to download the perfstat executable from now.netapp.com. If you need help finding it on the site, let me know and I will try to post a direct link. After you have downloaded the perfstat executable, you can start collecting your performance information by typing something like:
perfstat -f [hostname] -t 5 -i 1,6 -l [username]:[password] > C:\perfstat.out
For the command above to work, you will need to fill in the hostname, username, and password; I have encased each of them in brackets. The above example will take six 5-minute perfstats, waiting one minute in between each one. You can, of course, change the numbers as desired; however, try to keep the number following “-t” fairly low. I recently heard from NetApp support that too long an iteration can end up producing skewed results.
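For example, with hypothetical values filled in (a controller named filer01 and a made-up admin account), the command would look something like:
perfstat -f filer01 -t 5 -i 1,6 -l admin:MyPassword > C:\perfstat.out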
After you have obtained your perfstat, you can open it with a text editor. I prefer something like TextPad, which can be downloaded at www.textpad.com, since it can open these perfstat files almost instantly.
The perfstat is broken into various sections such as:
- CPU Statistics
- Network Statistics
- Aggregate Statistics
- etc…
Again, I feel the three areas I named above are the most basic things to watch for, so that is what I will focus on. To find the CPU Statistics, simply do a search for “CPU Statistics” in your text editor. If you are using Notepad or WordPad, press CTRL + F to perform a search; in TextPad you can search by pressing F5.
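If you prefer the command line, you can also list every place a section heading appears without opening the file at all. On Windows, for example (the path is just an example):
findstr /n /C:"CPU Statistics" C:\perfstat.out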
CPU Statistics
After you perform your search, you will be in the CPU Statistics section. If you scroll up a little, you will see a line labeled Start Time, which gives the time and date at which the sample was taken. NetApp records these times in GMT, so you will have to make the appropriate adjustment for your time zone. Example: for Pacific Standard Time you would subtract 8 hours.
When determining whether you are having CPU-related issues, pay attention to the idle time line. The idle time is the percentage of time the CPU is sitting there not doing anything; if this is a high number, you can discount the CPU as the culprit for your slowness. A value of 95 would indicate the CPU was running at only 5% utilization when the sample was taken.
Just because this one sample shows low CPU usage does not mean you are not having any CPU performance issues. Remember, the command we issued to run the perfstat earlier was:
perfstat -f [hostname] -t 5 -i 1,6 -l [username]:[password] > C:\perfstat.out
The -i stands for iterations. Notice we have 1,6: this means pause for one minute, then repeat six times. So if we look through our perfstat, we will find six CPU Statistics sections. If we suspect CPU performance issues, we will want to look at each of these before we rule out the CPU.
Network Statistics
When you suspect your network may be the culprit, search for “Network Interface Statistics” within your perfstat file. In that section, you will see the following column headings:
“iface side bytes packets multicasts errors collisions pkt drops “
Each of the columns explained:
iface – The network adapter
side – Whether it is sending or receiving data
bytes – How many bytes/second are being sent or received
packets – How many packets/second are being sent or received
multicasts – How many multicast packets/second are being sent or received
errors – Are there any errors?
collisions – Are there any collisions?
pkt drops – Are any packets being dropped?
A couple of things to look at right off the bat: errors and pkt drops are bad. If you are seeing anything other than zero in those columns, you may want to look into it.
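If you want a quick way to scan those rows, here is a minimal sketch in Python. It assumes you have pasted the Network Interface Statistics rows into a separate file (netstats.txt is just a made-up name) and that each data row has eight whitespace-separated fields in the column order listed above; adjust it if your perfstat formats the rows differently.

# Minimal sketch: flag interfaces reporting errors or packet drops.
def looks_nonzero(value):
    try:
        return float(value) != 0
    except ValueError:
        return True  # can't parse it, so flag it for a closer look

with open("netstats.txt") as f:            # made-up filename
    for line in f:
        fields = line.split()
        if len(fields) != 8:
            continue                       # skip headings and blank lines
        iface, side, _bytes, _pkts, _mcast, errors, collisions, drops = fields
        if looks_nonzero(errors) or looks_nonzero(drops):
            print(f"{iface} ({side}): errors={errors}, pkt drops={drops}")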
To troubleshoot other network problems, you may need to know a little about the hardware and configuration you are dealing with. Example: a FAS2020 has only two 1 Gbps network ports per controller. The theoretical capacity of a 1 Gbps network port is 125,000,000 bytes per second. If you see anything close to this in the bytes column, you are probably experiencing a network bottleneck, unless you know you have 10 Gbps network ports.
Under the iface column, anything with “vif” in the name is a Virtual Interface. These virtual interfaces can complicate things a little: a vif can be in failover mode or load-balanced mode. If the vif is in failover mode and you have two 1 Gbps network ports, the maximum throughput of the vif is still 1 Gbps, or 125,000,000 bytes/second. If the ports are load balanced, the maximum throughput is 2 Gbps. Keep this in mind when analyzing the perfstat.
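As a quick sanity check, you can turn a bytes-per-second reading into a percentage of the link. Here is a minimal example using the 125,000,000 bytes/second figure above; the 110,000,000 reading is just an invented sample value.

ONE_GBPS_BYTES = 125_000_000   # ~1 Gbps expressed in bytes per second

def link_utilization(bytes_per_sec, ports=1):
    # ports=1 for a single link or a failover vif,
    # ports=2 for a two-port load-balanced vif
    return 100.0 * bytes_per_sec / (ONE_GBPS_BYTES * ports)

print(link_utilization(110_000_000))     # ~88% of a single 1 Gbps port
print(link_utilization(110_000_000, 2))  # ~44% of a two-port load-balanced vif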
Disk Bottlenecks
Disk contention is a very common cause of performance problems on a SAN; my guess is that it is the most common bottleneck. To get to the disk section of a perfstat, do a search for “Disk Statistics”.
When you are looking at the disk statistics, you will see the following column headings:
disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
Underneath the column headings, you will see another line showing the name of the aggregate in question. This is important if you have more than one aggregate per controller, which most people will.
The main columns we want to focus on are the ut% and xfers columns. The ut% column is the percentage of time the given disk is busy. The xfers column is how many transfers per second, or IOPS, that particular disk is handling.
If a single disk has a high reading, that is not necessarily a bad thing. You want to focus on the aggregate as a whole, and look at the approximate averages.
A SATA drive can support approximately 40 IOPS before it starts to see performance degradation, whereas a 15,000 RPM Fibre Channel drive can support about 300 IOPS before it starts to experience degraded performance. If you don’t know what type of drives you have, just focus on the ut% column, which, again, is the percentage of time the drive is busy. As long as these numbers are under 50%, you are probably fine; anything over 50% and you will probably see a performance hit.
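If you would like to automate that eyeball check, here is a minimal sketch that applies the rough numbers above. It assumes you have saved the Disk Statistics rows to a file (diskstats.txt is a made-up name) and that ut% and xfers are the second and third whitespace-separated fields, as in the header shown earlier.

# Minimal sketch using the rough thresholds from this article.
BUSY_LIMIT = 50.0        # ut% above this suggests a busy disk
IOPS_LIMIT = 300.0       # ~300 for a 15K Fibre Channel drive; use ~40 for SATA

busy, iops = [], []
with open("diskstats.txt") as f:          # made-up filename
    for line in f:
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            ut = float(fields[1].rstrip("%"))
            xfers = float(fields[2])
        except ValueError:
            continue                      # header or non-data line
        busy.append(ut)
        iops.append(xfers)
        if ut > BUSY_LIMIT or xfers > IOPS_LIMIT:
            print(f"{fields[0]}: ut%={ut}, xfers={xfers}")

if busy:
    print(f"average ut%: {sum(busy)/len(busy):.1f}, "
          f"average xfers: {sum(iops)/len(iops):.1f}")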
Conclusion
Perfstat is a very useful tool, and I have only covered the basics of using it. Unfortunately, perfstats can be a little difficult to read, but once you figure them out, they will make sense.
Perfstat Analysis Utility
To simplify the process of interpreting perfstat files, I have written a utility that breaks them down into more manageable chunks. To use it, simply copy the utility into the same directory as your perfstat file, run it, enter the name of your controller, enter the filename of your perfstat, and you are done.
After the utility runs, you will have three extra files in your perfstat directory: one for CPU statistics, one for network statistics, and one for disk statistics. If there is anything else you would like this utility to do, let me know and I will see what I can do to expand on it. Note that this utility only works with NetApp 7-Mode.
There is a link to this utility at the bottom of the page.
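For anyone curious about the general idea, here is a rough sketch of that kind of split (it is not the utility itself). It looks for the three headings discussed in this article; the blank-line rule for deciding where each section ends is an assumption you may need to adjust for your own files.

# Rough illustration: split a perfstat into the three sections covered above.
HEADINGS = {
    "CPU Statistics": "cpu_stats.txt",
    "Network Interface Statistics": "network_stats.txt",
    "Disk Statistics": "disk_stats.txt",
}

outputs = {name: open(name, "w") for name in HEADINGS.values()}
current = None

with open("perfstat.out") as f:           # perfstat.out is just an example name
    for line in f:
        for heading, outname in HEADINGS.items():
            if heading in line:
                current = outputs[outname]
                break
        if current is not None:
            current.write(line)
            if not line.strip():          # blank line: assume the section ended
                current = None

for handle in outputs.values():
    handle.close()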