TCP MSS & IP MTU considerations when using DMVPN

Paranoid disclaimer

I make no claims about the accuracy of this post. This information was gathered by reading Cisco documentation and testing in a lab environment. These settings were eventually deployed to a production environment, and they worked in the situation I was in. You are responsible in every way for confirming my information if you happen to use it in a real-world scenario.

This being my first post, I don’t really have any CLI output to show. I’ll work on editing this at some point in the near future, to include some CLI output from a lab environment.

A while back I was working an issue where we were seeing high CPU usage on our DMVPN routers, and generally poor performance for the users at the WAN sites which were connected to the hub site via DMVPN.

After a bit of investigation I found additional issues.

  • The fragmentation counters on the tunnel interfaces were very high.
  • The anti-replay protection was dropping a lot of packets (or fragments of those packets) because they were arriving more than 1024 packets out of order, which put them outside the anti-replay window.
  • Stress testing the circuits (using IPERF with multiple threads) would literally crush the spoke router at the site I was testing (99% CPU usage, and sometimes EIGRP would flap as well). The circuits were never saturated in my tests either; the router was just maxed out.

So I started working on figuring out why we were seeing these issues. After some research and some tinkering in the lab I found the following.

  1. The IP MTU we had configured on the tunnel interfaces on the hub and spoke routers was too high. (It was set to 1490).
  2. The TCP MSS value was set too high (it was set to the maximum it could be, 1460).
  3. There was also a misconfiguration in the IPsec transform set that made it so we were doing both ESP and AH. Basically every Security Association was set up twice, once with each method, which increased overhead.

Crypto

This was an easy issue to fix. I just needed to modify the IPsec transform set to only build an ESP SA instead of both an ESP and an AH SA.

This is the incorrect transform set:

crypto ipsec transform-set DMVPN ah-sha-hmac esp-aes 256 esp-sha-hmac

This is the corrected transform set:

crypto ipsec transform-set DMVPN esp-aes 256 esp-sha-hmac

With that out of the way, it was time to look at the next issue: the fragmentation.

Fragmentation

You should read this document from Cisco if you want to know the full details of what I’m going to try and summarize below.

http://www.cisco.com/c/en/us/support/docs/ip/generic-routing-encapsulation-gre/25885-pmtud-ipfrag.html

Basically what was happening is the TCP MSS was negotiating to the maximum allowed 1460, and the IP MTU on the tunnel interfaces was set at 1490. This was causing fragmentation for large packets.

To understand why, let’s look at the overhead involved with doing GRE, and encryption.

Bytes of overhead in our scenario

Let’s look at this purely from a TCP perspective. In addition to the list below, there are 20 bytes of overhead for the TCP header and another 20 bytes for the original IP header.

Additional overhead:

  • GRE: 28 bytes
    1. IP header used for GRE: 20 bytes
    2. GRE header: 4 bytes
    3. GRE tunnel key: 4 bytes
  • ESP: 38 to 53 bytes
    1. ESP header and trailer: 10 to 25 bytes
      1. SPI: 4 bytes
      2. Sequence number: 4 bytes
      3. Pad: up to 15 bytes
      4. Pad length: 1 byte
      5. Next Header: 1 byte
    2. AES-256 IV: 16 bytes
    3. ESP-SHA authentication data (ICV): 12 bytes

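Adding everything up (the 40 bytes of TCP and IP headers, the worst-case figures above, and the extra 20-byte IP header that IPsec tunnel mode adds, which comes up again below) we get up to 141 bytes of overhead on each segment of payload.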
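
A rough tally of how these figures can combine, assuming the 141-byte total counts worst-case ESP padding plus the extra 20-byte IP header from IPsec tunnel mode:

TCP header + original IP header:  20 + 20 = 40
GRE (IP header + GRE header + tunnel key):  20 + 4 + 4 = 28
ESP with no padding:  4 + 4 + 0 + 1 + 1 + 16 + 12 = 38
ESP with maximum padding:  4 + 4 + 15 + 1 + 1 + 16 + 12 = 53
IPsec tunnel-mode IP header:  20

Best case:  40 + 28 + 38 + 20 = 126 bytes
Worst case:  40 + 28 + 53 + 20 = 141 bytes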

If the TCP MSS is set to default (1460) and the IP MTU on the tunnel interface is 1490 we run into issues because the tunnel interface (before encryption) will accept packets up to 1490 bytes in size.

However, the tunnel needs to add the GRE header, in this case an additional 28 bytes. The original packet is already at 1500 bytes (1460 payload, 20 bytes TCP, 20 bytes IP), which is over the tunnel's 1490-byte IP MTU before GRE is even added. So the tunnel interface needs to fragment the original packet into two packets just to fit the GRE header on.

After that happens, each fragment is sent to the crypto process, which will add 56 bytes (plus another 20 because of tunnel mode) of overhead to each one. The larger of the two fragments (from earlier) will, once again, be over the IP MTU on the physical interface (1500 bytes). So the encrypted fragment is actually fragmented again. We now have three fragments for the original packet.

The fragment that was fragmented again after encryption will need to be buffered and reassembled before decryption on the other end, while the smaller of the two original fragments will be received and decrypted immediately.
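
To put rough numbers on the steps above, here is a sketch using this post's round overhead figures. It assumes fragment data is cut on an 8-byte boundary, as IP fragmentation requires; the exact sizes on a real router will differ with padding and platform behavior:

Original packet:  20 IP + 20 TCP + 1460 payload = 1500 bytes (over the 1490 tunnel IP MTU)
Fragment 1:  20 IP + 1464 data = 1484 bytes
Fragment 2:  20 IP + 16 data = 36 bytes

After GRE (+28):  1484 + 28 = 1512 bytes,  36 + 28 = 64 bytes
After IPsec (+56 + 20):  1512 + 76 = 1588 bytes,  64 + 76 = 140 bytes

1588 bytes is over the 1500-byte physical MTU, so that packet is fragmented again, giving three packets on the wire for one original segment.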

Look at this picture from the Cisco documentation for a good explanation of everything that happens.
Also once again, I recommend reading this.
http://www.cisco.com/c/en/us/support/docs/ip/generic-routing-encapsulation-gre/25885-pmtud-ipfrag.html

[Figure: fragmentation diagram from the Cisco document linked above]

Fragmentation Fix

The fix was fairly straightforward. I just needed to figure out all of our overhead, and then adjust the TCP MSS and the IP MTU of the tunnel interfaces to a low enough size that, after GRE and IPsec overhead, the final packet was still less than or equal to 1500 bytes (the physical interface MTU).

  • Set the IP MTU of the tunnel interface down to 1410.
  • Set the TCP MSS down to 1360 (I know that 1410 minus 40 equals 1370, but I went with 1360 to account for any extra TCP options in the TCP header).
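
As a minimal sketch, the two settings above would look something like this on a tunnel interface. Tunnel0 is a placeholder interface name, not taken from the original configuration, and the same values would go on both the hub and spoke tunnels:

! ip mtu: largest IP packet the tunnel accepts before GRE and IPsec are added
! ip tcp adjust-mss: rewrites the MSS in TCP SYNs crossing the tunnel
interface Tunnel0
 ip mtu 1410
 ip tcp adjust-mss 1360

The MSS adjustment only affects TCP sessions. Other traffic still depends on the IP MTU (and on path MTU discovery or fragmentation) to fit.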

With these settings the final packet won’t ever need to be fragmented. The expected size of the final packet should be around 1484 bytes. It would be possible to tune this a little more to get right up to 1500 or closer to it, but then there wouldn’t be any buffer room for additional TCP options and the like.

  1. Original packet 1400 bytes
    1. Original payload: 1360
    2. TCP header: 20
    3. IP header: 20
  2. GRE header: 28 bytes
  3. IPSEC overhead: 56 bytes
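
One way to sanity-check the result from a spoke, sketched with standard IOS exec commands. The address 10.0.0.1 (standing in for a host reached through the tunnel) and Tunnel0 are placeholders, and no output is shown because it varies by platform and release:

! confirm the IP MTU the tunnel interface is actually using
show ip interface Tunnel0
! a 1400-byte ping with the DF bit set should get through unfragmented
ping 10.0.0.1 size 1400 df-bit
! check whether the fragmentation counters are still climbing
show ip traffic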

After these changes were made on the hub and spoke routers that made up the DMVPN network, performance increased, CPU usage dropped and leveled out, and fragmentation counters were not incrementing very much. Here is some real data I gathered using IPERF showing the change in performance.

[Figure: IPERF performance results]