I was stung by this again today. Two virtual servers on the same hardware trying to communicate with each other. One running apache2 & php, and the other a MySQL database server. This happened just after a server reboot.
CPU, disk I/O and memory usage were all minimal on both servers and on the host, and yet the LAMP application was performing like a dog with one leg.
Fortunately I’ve seen this before, and (after banging my head on my keyboard a few times) I recognised the problem. Xen networking.
Basically, when the operating system sends a packet over the network, it computes a checksum of the data so the recipient can tell whether it arrived intact. On a physical machine, the operating system offloads this work to the network card, whose chipset can handle it. In a virtual machine, however, there is no physical card to hand the checksumming off to: the network card is just an abstraction in software, so the offload is terribly inefficient.
This problem only affects two virtual machines on the same hardware talking over the virtual network. [correction: Actually, it affects all network traffic, it’s just more noticeable between two adjacent VMs] So it is not usually a problem, but when the application server needs to talk to its database server and both happen to be on the same hardware, it makes a huge performance difference.
To fix, use ethtool. First, check the settings (do this on each domU):
# ethtool -k eth0
Offload parameters for eth0:
Cannot get device rx csum settings: Operation not supported
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: off
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
TX checksumming is on. Turn it off:
# ethtool -K eth0 tx off
And verify the result:
# ethtool -k eth0
Offload parameters for eth0:
Cannot get device rx csum settings: Operation not supported
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
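If you have more than a couple of domUs to check, reading that output by eye gets old. Here is a rough sketch of a helper that pulls out just the tx-checksumming line. The function names `tx_csum_state` and `check_tx_csum` are my own invention for illustration, and the parsing assumes the `ethtool -k` output looks like the listings above.

```shell
#!/bin/sh
# Read `ethtool -k` output on stdin and print the tx-checksumming
# state ("on" or "off"). Prints nothing if the line is missing.
tx_csum_state() {
    awk -F': *' '/^[ \t]*tx-checksumming:/ { split($2, w, " "); print w[1]; exit }'
}

# Report the state for one interface (hypothetical helper).
# $1 is the interface name, e.g. eth0.
check_tx_csum() {
    state=$(ethtool -k "$1" 2>/dev/null | tx_csum_state)
    echo "$1: tx-checksumming ${state:-unknown}"
}
```

Running `check_tx_csum eth0` on each domU should then give a one-line answer instead of the full listing.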
I had figured this out and fixed it before, but the change was lost when I rebooted the server, because settings made with ethtool do not survive a reboot. I’ve now fixed it permanently in /etc/rc.local (you could also do this in a post-up network script, although rc.local will run after networking is up anyway).
I’ve already put this into the configuration management system (we’re using Puppet; I’ll make that the topic of another post), but these are some old VMs that are not yet automated.
So - fixed. And after a hardware upgrade, the servers are now humming along even better than before.
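For reference, here is a minimal sketch of what such an rc.local addition might look like. This is not a copy of my actual file: the function name `disable_tx_csum` and the `ETHTOOL` override are assumptions added for illustration, and `eth0` is just the interface from the examples above.

```shell
#!/bin/sh
# Sketch of an /etc/rc.local addition (illustrative, not the exact lines used).
# ETHTOOL can be overridden, e.g. ETHTOOL=echo, to preview the commands.
ETHTOOL="${ETHTOOL:-ethtool}"

# Turn off TX checksum offload on every interface given as an argument.
# A failure on one interface is reported but does not stop the loop.
disable_tx_csum() {
    for iface in "$@"; do
        "$ETHTOOL" -K "$iface" tx off || echo "warning: could not change $iface" >&2
    done
}

# In rc.local, follow the function with one call per domU, for example:
#   disable_tx_csum eth0
# The post-up alternative would be a line such as
#   post-up ethtool -K eth0 tx off
# in the interface's stanza, assuming a Debian-style /etc/network/interfaces.
```

Setting `ETHTOOL=echo` before calling the function prints the commands instead of running them, which is a cheap way to check the script before the next reboot.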