Over the years, whenever a performance problem comes up, there is always some speculation that it’s a network issue and not the database (can’t be us!). I always ask a few quick questions to see if there is a reason to pursue the network angle:
- Is the problem affecting multiple database servers?
- Is the problem affecting multiple databases and/or multiple applications?
- Is the problem experienced in a particular geographic location?
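As a rough sketch, those three questions can be written down as a tiny triage helper. The names `Symptoms` and `network_worth_pursuing` are made up for illustration, and treating any single "yes" as enough reason to look at the network is my assumption about how to combine the answers, not a hard rule.

```python
from dataclasses import dataclass


@dataclass
class Symptoms:
    """Answers to the three quick questions (hypothetical field names)."""
    multiple_db_servers: bool          # affecting multiple database servers?
    multiple_databases_or_apps: bool   # affecting multiple databases and/or applications?
    single_geographic_location: bool   # experienced in one particular location?


def network_worth_pursuing(s: Symptoms) -> bool:
    """Return True if any answer hints at a shared cause such as the network.

    Assumption: one "yes" is enough to justify asking the network team.
    """
    return (
        s.multiple_db_servers
        or s.multiple_databases_or_apps
        or s.single_geographic_location
    )


if __name__ == "__main__":
    # One server, one application, no geographic pattern: start with the database.
    one_slow_server = Symptoms(False, False, False)
    print(network_worth_pursuing(one_slow_server))
```

A "no" on all three does not rule the network out; it only says where to start looking.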
I won’t say it’s never the network, but usually when it’s the network everything is slow or down. My rule of thumb: check, ask the network team to check, but assume it’s a database or application issue.
But.
I worked with a client where everything was running fine. The server in question had its IP changed to meet some security requirements. It came back up fine and all seemed OK, except that jobs were taking 3x to 5x as long. Nothing changed but the IP. How could that cause a problem? Seems like the network, doesn’t it?
The network team swore it wasn’t them. No way could changing an IP affect performance. A more likely culprit, given the reason for the change, was the firewall. The firewall team swore it was not them. The database team went back to look again and saw nothing wrong on the server. They changed the IP back to the old one, and performance was fine.
So what do you do now? No one sees a cause, but clearly something is wrong.
They flipped to the new IP again, and performance dropped immediately. I still thought the firewall had to be the problem. I’m not a firewall guy, so I was pushing for details (what kind of rules are running, which rules are getting hit, etc.), looking for clues. Finally, with some arm twisting, I had the firewall taken offline, removing it from the equation. Performance still bad. Firewall team mad. And yes, the database team was still sad.
So we went back to the network team. We had proved that it ran fine with the old IP and miserably with the new one. After agreeing to look again, reluctantly, because it made no sense, they found the problem. The new segment had packet inspection enabled; the old segment did not. Inspecting the large volume of data being transferred was maxing out the switch, and that was the bottleneck. They turned it off and, presto, all was well again.
So for once it was the network. I’ll probably never see that root cause again, but now I know to ask about it, just in case.