Tuesday, March 26, 2013

Why Is the Application Slow?

We've all encountered situations where an application is slow and the network gets blamed. I've been having some fun working with our own Terry Slattery on a consulting engagement to determine why six specific applications are slow. He's come up with some good insights into the applications at this particular site, and we've been talking about some of the reasons why applications might be slow. Yes, it might be the network. It also might be the application, particularly if the application writer or toolkit is oblivious to what it is doing in network terms.

I started brainstorming a list of things that could make an application slow, broken out by whether the cause is an application problem or a network problem. Some of these are items Terry touched upon in his recent blogs. I considered blogging about them individually or in small groups, then decided a checklist of things to consider might be more useful.

Please add your own favorite application slowness causes as comments to this blog!

Application Causes of Slowness

  • Many round trips (times the round trip time -- you can't change the speed of light)
  • DB matching row by row vs. a paged or streaming approach (a common cause of many round trips; see the sketch after this list)
  • Reading many files on a CIFS or NFS drive (CIFS can be slow, and directory recursion is round-trip intensive)
  • Opening many TCP connections 
  • Pulling a lot of data across the network unnecessarily, e.g. fat client or server-based join rather than DB-based join or stored procedure 
  • Synchronous replication with latency and locking 
  • Making many Active Directory, LDAP, or DNS calls (uncached)
  • Overloaded / slow AD, LDAP, or DNS server
  • Broadcast/multicast Altiris image distribution: poorly planned groups can clobber your WAN
  • High traffic between different locations -- lack of location awareness, or incautious VMotion to lighten the main datacenter load
  • Massive numbers of Unix scripting shell invocations
  • Server performance 
  • Lack of RAM or Windows handle exhaustion on the server
  • Resource locks, lock contention
  • Application that does reverse DNS logging rather than IP logging, coupled with use of NAT (see Solving ASA Slowness).
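
On the round-trip point (the row-by-row item above), here is a minimal sketch of the difference between per-row lookups and one set-based query. It uses Python's built-in sqlite3 with a made-up orders/customers schema purely for illustration; against a real networked database, each query in the loop is one more round trip, so 10,000 rows at 40 ms RTT is over six minutes of pure latency.

    import sqlite3

    conn = sqlite3.connect("example.db")  # stand-in for a remote DB connection

    def order_counts_chatty(conn):
        """One query per row: N+1 round trips. Fine locally, painful over a WAN."""
        totals = {}
        for order_id, customer_id in conn.execute("SELECT id, customer_id FROM orders"):
            # Each iteration costs another round trip to the DB server.
            name = conn.execute(
                "SELECT name FROM customers WHERE id = ?", (customer_id,)
            ).fetchone()[0]
            totals[name] = totals.get(name, 0) + 1
        return totals

    def order_counts_batched(conn):
        """Let the DB do the join: one query, one round trip."""
        cur = conn.execute(
            "SELECT c.name, COUNT(*) FROM orders o"
            " JOIN customers c ON c.id = o.customer_id GROUP BY c.name"
        )
        return dict(cur.fetchall())
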
In general, the thing that makes troubleshooting all this challenging is getting good information about the application. It helps to know where the chatty ("ping-pong") traffic occurs, where the massive data transfers occur (if any), or where lots of CIFS/NFS files get accessed. For that matter, it helps to know which other servers the main application talks to, and roughly why it is talking to each of them. One could hope that application documentation would cover that. I have yet to see it do so. 

Network Side of Things 

If you don't have comprehensive monitoring of all network devices, servers, and links, you're flying with your eyes closed. Pervasive monitoring and a proactive stance are the best way to avoid heavy research when the network is blamed for application slowness. As an article I saw put it: MTTR depends on MTTI, and MTTB = 0. That is, Mean Time to Blame (the network) is zero seconds, and until you establish Innocence (Mean Time to Innocence, MTTI), the real repair effort often doesn't start. The better shops I've worked with have both network and server people doing research in parallel, which speeds up problem resolution should the application be at fault.
Here are some things to look out for on the network side:
  • Sharing WAN or MAN links with Internet traffic and no QoS de-prioritizing the non-business Internet traffic
  • Bufferbloat (see Terry's blog Application Analysis Using TCP Retransmissions, Part 2)
  • Client side buffer tuning / lack of tuning causing poor TCP throughput
  • Link congestion (covert over-subscription / micro-bursts) 
  • Retransmissions = symptom of congestion, most visible on the server side 
  • Overruns / oversubscription of ASICs, backplane, device L2/L3 switching performance, etc. 
  • Server to network problems, e.g. duplex mismatch 
  • Poorly deployed / overloaded Packeteer
  • Inappropriate QoS / policing / shaping
  • Wrongly sized MTU and the resulting fragmentation
A couple of those need more explanation.
End-system MTU, TCP buffers, and TCP parameters can all be tuned. There is a lot of advice and mis-advice on the web about TCP buffer tuning. Increasing buffers on end systems can help, especially on high bandwidth-delay paths. See also bufferbloat (above). For an introduction to the topic, see http://en.wikipedia.org/wiki/TCP_tuning. The national supercomputer sites used to have good information on this, and sites dealing with massive high-speed and international scientific data transfers probably still do.
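
As a small illustration (a sketch, not a recommendation -- appropriate sizes depend on your bandwidth-delay product, and modern stacks autotune), here is how a Python application could request larger socket buffers before connecting. The host name and sizes are made up:

    import socket

    # Rough sizing rule: buffer >= bandwidth x round-trip time.
    # Example: 100 Mbps x 80 ms RTT = about 1 MB in flight.
    BUF_SIZE = 1 * 1024 * 1024  # assumed value for illustration

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the OS for bigger buffers; it may clamp the request to kernel
    # limits (e.g. net.core.rmem_max / wmem_max on Linux).
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
    s.connect(("example.com", 80))  # hypothetical endpoint

    # Check what the OS actually granted (Linux reports double the request).
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))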

Retransmissions: they are hard to see in the network itself. Windows will report retransmissions per second, which I consider fairly useless (what's normal? what's high? You can't tell unless you have stored history). But retransmissions per second divided by packets (segments) per second gives retransmissions per segment transmitted, which can easily be turned into a percentage -- something you can threshold across a number of servers without setting a different threshold for each one. If the TCP MIB is supported, it will tell you segments retransmitted and total segments sent, which are all you need. I like to look at retransmissions because they tell me whether something bad is happening -- they cover the case where my normal reporting misses drops or other counters, e.g. internal drops due to crypto capacity being exceeded.
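If you want to do that math yourself on a Linux server, the two counters the TCP MIB exposes (tcpOutSegs, tcpRetransSegs) also appear in /proc/net/snmp. A quick sketch, sampling over an interval so you measure the current rate rather than the since-boot average:

    import time

    def tcp_counters():
        """Read cumulative OutSegs and RetransSegs from /proc/net/snmp (Linux)."""
        with open("/proc/net/snmp") as f:
            header, values = [line.split() for line in f if line.startswith("Tcp:")]
        stats = dict(zip(header[1:], map(int, values[1:])))
        return stats["OutSegs"], stats["RetransSegs"]

    out1, retrans1 = tcp_counters()
    time.sleep(60)  # sample interval; match your polling period
    out2, retrans2 = tcp_counters()

    sent = out2 - out1
    pct = 100.0 * (retrans2 - retrans1) / sent if sent else 0.0
    print("retransmitted %.2f%% of %d segments" % (pct, sent))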

Packeteer: I've had limited and unhappy experience with them. My impression is that they can easily introduce an additional problem source that is hard to troubleshoot, partly due to poor documentation about what the various types of QoS policies actually do under the hood. I was once told, "we don't document that -- take the class or bring in a consultant." Not a good reply; they might as well have said "our documentation lacks detail." I don't want to see each flow; that's playing whack-a-mole. I've seen Packeteers easily overwhelmed by auto-discovery of too many flows. Then there's having to forklift-upgrade them when you upgrade the WAN link speed. I've also seen a Packeteer deployed with network-based crypto, where we had to figure out how to account for crypto headers, and also needed to turn on LLQ in the Cisco router to complement the Packeteer VoIP QoS. I ended up feeling it would have been a whole lot simpler to just do the QoS on the routers.

QoS: my favorite source of Cisco IOS QoS confusion is the different units for the priority and bandwidth commands (Kbps) versus shaping and policing (bps). Also, the burst parameter for shape is in bits, while for police it is in bytes. The Nexus lets you specify the units, reducing by one the number of easy ways to get it wrong. If you shape thinking the units are Kbps, you are allowing 1/1000th of the traffic you intended. That'll really slow things down!
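
To make the trap concrete, here is a hedged IOS policy-map sketch (class and policy names invented) where the same intended 10 Mbps has to be written in different units depending on the command:

    policy-map WAN-OUT
     class VOICE
      priority 10000          ! priority/bandwidth take Kbps: 10,000 Kbps = 10 Mbps
     class BULK
      shape average 10000000  ! shape takes bps: "shape average 10000" would
                              ! be 10 Kbps -- 1/1000th of what you meant
     class class-default
      police 10000000 312500  ! police rate in bps; the burst is in bytes
                              ! (for shape, burst values are in bits)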

Real-World Examples

Some real-world "war stories" might amuse.
I still recall, from a while ago, the site that had to do a quick datacenter core and access switch upgrade after the application folks upgraded Lotus Notes and performance was terrible. I heard that 2-3 weeks later someone realized the new version made far more LDAP calls and was killing the LDAP server, impacting other applications as well. My conclusion: instrument the heck out of your LDAP, AD, and DNS servers and the links to them -- they have datacenter-wide impact.
There's also the government agency that was doing nightly data rollups to HQ, which were taking 23 hours and a climbing number of minutes to complete. Obviously that wasn't going to work much longer. It turned out the backup was a Unix shell script with nested loops that ran gzip and an FTP transfer one file at a time. I heard that just tarring the lowest-level directory, gzipping that, and FTPing the single compressed file got the rollup down to something like an hour and change, purely by reducing the number of individual shell invocations. The person consulting on this noted there were other optimizations available ('expr' sub-shell invocations, etc.), moved some constants out of loops, and wrote a small C program to do something in the middle of the remaining nested loops more efficiently -- and got the task down to 7 minutes, as I recall. Moral of the story: if the application is slow and does something over and over, that repeated operation is where efficiency matters most.
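
For flavor, here is a rough sketch of the before/after pattern in Python (paths, host name, and ftplib as stand-ins for whatever the site actually used). The language doesn't matter; the point is paying the per-file setup cost once instead of thousands of times:

    import os
    import tarfile
    from ftplib import FTP

    SRC_DIR = "/data/rollup"    # hypothetical source directory
    HOST = "hq.example.com"     # hypothetical HQ server

    def rollup_per_file():
        """Slow pattern: compress and transfer every file individually.
        Each file pays compression startup plus FTP protocol overhead."""
        ftp = FTP(HOST)
        ftp.login()
        for name in os.listdir(SRC_DIR):
            archive = os.path.join("/tmp", name + ".tar.gz")
            with tarfile.open(archive, "w:gz") as tf:
                tf.add(os.path.join(SRC_DIR, name))
            with open(archive, "rb") as f:
                ftp.storbinary("STOR " + os.path.basename(archive), f)
        ftp.quit()

    def rollup_one_archive():
        """Fast pattern: one archive, one compression pass, one transfer."""
        with tarfile.open("/tmp/rollup.tar.gz", "w:gz") as tf:
            tf.add(SRC_DIR)
        ftp = FTP(HOST)
        ftp.login()
        with open("/tmp/rollup.tar.gz", "rb") as f:
            ftp.storbinary("STOR rollup.tar.gz", f)
        ftp.quit()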
