A while back Kendal Van Dyke (b | l | t) asked me a great question regarding the ideal VMware vSphere networking configuration for a SQL Server 2012 AlwaysOn Availability Group configuration. That’s a great topic for a blog post, so let’s go!
Normally, a fairly stock setup can work without any major issues, but in systems with heavier activity, these tips can prevent or fix networking problems that can lead to system instability.
This section deals primarily with VMware’s vSphere 5.1 hypervisor, but the overwhelming majority of these items apply not only to other vSphere versions, but to other hypervisors as well.
First, ensure that every physical network adapter that you are using is configured in a redundant manner. This should go without saying, but it is pretty amazing how often I have to correct this for others. If you are short on physical adapters, consider creating more port groups that correspond to new VLANs so you can keep the number of physical NICs low.
Next, ensure that your virtual network port groups are set to automatically fail over AND fail back. You might have reasons for not doing this in your environment, but I treat this as the rule unless a compelling reason against it is presented.
Now, at the heart of the AlwaysOn Availability Group is a Windows Server Failover Cluster (WSFC). AlwaysOn Availability Groups use this for the quorum and the virtual IP address manager. Therefore, the network architecture is critical to get right up front. I usually configure a heartbeat network interface with a second NIC for all of the servers participating in this environment. It is not required, but it can keep background noise from interrupting cluster traffic.
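As a sketch of enforcing that separation (the network names here are placeholders for whatever your environment uses), the cluster network roles can be set with the Failover Clustering PowerShell cmdlets:

```shell
# Windows PowerShell on any cluster node. "Heartbeat" and "Production" are
# placeholder network names -- substitute your own.
# Role 1 = cluster (heartbeat) traffic only; Role 3 = cluster and client traffic.
(Get-ClusterNetwork "Heartbeat").Role = 1
(Get-ClusterNetwork "Production").Role = 3

# Review the resulting roles:
Get-ClusterNetwork | Format-Table Name, Role, Address
```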
I want a dedicated VLAN (or as close to it as the environment allows) to further isolate the traffic that is so critical to the health and functionality of cluster communication.
I always validate the performance of all network paths. I use a precompiled version of iperf for Windows (available to download here) to run a series of client-server network tests to validate that the network is performing as expected.
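A minimal sketch of such a test, assuming iperf is unzipped on both nodes (host names are placeholders):

```shell
# On one node (e.g., the primary replica), start iperf listening:
iperf -s -w 512k

# From another node (e.g., a secondary replica), run a multi-stream test:
# -P 4 = four parallel streams, -t 30 = run for 30 seconds, -i 5 = report every 5s
iperf -c sqlag-node1 -w 512k -P 4 -t 30 -i 5
```

Run the test in both directions and across every network path (production, heartbeat, backup) so a misconfigured uplink cannot hide behind a healthy one.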
If 10GbE networking is present, consider using jumbo frames (MTU 9000) to increase the raw throughput performance. This is not a blanket recommendation. I have found environments that benefited by enabling it, but have found others that did not. Talk with your network administrators about how difficult it is for them to configure jumbo frames, and test end-to-end jumbo frame changes to see how your environment responds.
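As a sketch of what that end-to-end change looks like (vSwitch1 and vmk1 are placeholders; the physical switch ports must be raised as well):

```shell
# On each ESXi host: raise the MTU on the vSwitch and the relevant VMkernel port.
esxcli network vswitch standard set -v vSwitch1 -m 9000
esxcli network ip interface set -i vmk1 -m 9000

# From a Windows guest: verify end-to-end with a non-fragmentable ping of
# 8972 bytes (9000 minus 28 bytes of IP/ICMP headers):
ping -f -l 8972 sqlag-node2
```

If the ping reports that the packet needs to be fragmented, some device in the path is still at the default MTU and the change is not yet complete.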
For VMware-based Windows servers, all network adapters should be set to use the VMXNET3 network driver. Period. It outperforms the other types, and I consider this a blanket recommendation for all Windows VMs on VMware.
Ask your VMware administrator if multi-NIC vMotion is configured for the hosts where the SQL Server virtual machines reside. This feature allows for the fastest possible vMotions, which means quicker migration times for the VMware administrators. More details on how to enable this feature can be found at the VMware blogs.
Also, if you find that SQL Server performance degrades for a short time during a vMotion operation, it could be suffering from a VMware feature called “Stun During Page Send” (SDPS), which momentarily slows the VM to reduce the rate of memory change so that the vMotion can complete. If you find that the cluster fails over unexpectedly during vMotion cutovers, you might need to disable this feature. You can do so by following this VMware KB link, but make sure to test the change before applying it to production servers.
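As a rough sketch only, the KB describes a host-level advanced setting; the option name shown here is my recollection, so confirm it against the KB for your vSphere version before touching anything:

```shell
# Sketch only -- verify the exact option name and supported procedure in the
# VMware KB before running this on production hosts.
# 0 disables SDPS on this host:
esxcfg-advcfg -s 0 /Migrate/SdpsEnabled

# Check the current value:
esxcfg-advcfg -g /Migrate/SdpsEnabled
```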
You might also need to increase thresholds within WSFC itself to tolerate the latencies introduced by various vSphere operations if they take longer than the defaults allow. The two values are CrossSubnetDelay and CrossSubnetThreshold. More details on changing these two values can be found here.
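A minimal sketch of the change with PowerShell on a cluster node; the numbers here are purely illustrative, not recommendations:

```shell
# Windows PowerShell on any cluster node. Values are illustrative --
# tune them to your environment and test before production use.
(Get-Cluster).CrossSubnetDelay = 2000      # milliseconds between heartbeats
(Get-Cluster).CrossSubnetThreshold = 10    # missed heartbeats before failover

# Review all heartbeat-related cluster settings:
Get-Cluster | Format-List *Subnet*
```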
Finally, always monitor your virtual and physical NIC bandwidth consumption. Bad things can happen if these are overloaded. Even though a virtual NIC might be relatively lightly utilized, the physical NIC might be overloaded and no error log within Windows will show you this state. This is especially true of blade chassis, where the network adapters presented to the host can be shared with other blades in the same chassis. Check PerfMon to see if counters exist to monitor these levels, and if not, get reports from your infrastructure administrators on the ongoing state of these uplinks.
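As a sketch of monitoring both layers (counter paths are the standard Windows ones; esxtop requires host access):

```shell
# From the Windows guest: sample virtual NIC throughput every 15 seconds.
typeperf "\Network Interface(*)\Bytes Total/sec" -si 15

# On the ESXi host: esxtop's network view (press "n" once it starts) shows
# per-physical-NIC usage that the guest cannot see.
esxtop
```

Comparing the two views is what exposes the scenario described above: a lightly loaded vNIC sitting on top of a saturated physical uplink.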
I keep a few tweaks in reserve in case I need them. Network IO Control (NetIOC) can be used to prioritize one VM’s traffic over other VMs’ (reference documentation, page 34). Another trick is to utilize SplitRX mode (reference documentation, pages 35-36). Another is to disable virtual interrupt coalescing (reference documentation, page 36).
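The last two tweaks are per-VM advanced settings, applied through the vSphere Client or the VMX file. A sketch of the entries, per the referenced performance guide (ethernet0 is a placeholder for the adapter in question):

```shell
# VMX file entries (or vSphere Client advanced configuration parameters).
# ethernet0 is a placeholder -- match it to the VMXNET3 adapter you intend to tune.

# Enable SplitRX mode on the adapter:
ethernet0.emuRxMode = "1"

# Disable virtual interrupt coalescing for the adapter:
ethernet0.coalescingScheme = "disabled"
```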
Now, these tweaks are rarely necessary, but if your traffic is substantial enough that additional tuning is required, they could help your environment remain stable under heavy load.
Virtualizing your business-critical workloads is a lot of fun! Hopefully these tips can help you on your journey to complete datacenter virtualization. Thanks Kendal!