ECMP and Load Sharing on IP Fragments

Here is how IP fragmentation works:

A router receives an IP packet that is larger than the MTU of its egress interface.
The router divides the packet into smaller fragments, each small enough to fit within the MTU of the egress interface. The data in every fragment except the last must be a multiple of 8 bytes long.
The router sets the “more fragments” (MF) flag in the IP header of every fragment except the last one and sends the fragments over the network.
When the fragments reach their destination, the receiving host reassembles the original packet by concatenating the data payloads of the fragments in the order given by their fragment offsets.
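The steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a 20-byte IP header without options, and a hypothetical `fragment` helper that is not part of any library), not what a router actually runs:

```python
def fragment(payload: bytes, mtu: int, header_len: int = 20):
    """Split an IP payload into (offset_in_8_byte_units, more_fragments, data)."""
    # Each fragment must fit in the MTU together with its IP header, and all
    # fragments except the last must carry a multiple of 8 bytes.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    for start in range(0, len(payload), max_data):
        chunk = payload[start:start + max_data]
        more = start + max_data < len(payload)  # MF is clear only on the last one
        fragments.append((start // 8, more, chunk))
    return fragments


# A 4000-byte payload leaving an interface with a 1500-byte MTU
frags = fragment(bytes(4000), mtu=1500)
for offset, more, data in frags:
    print(f"offset={offset} MF={int(more)} len={len(data)}")
```

With these numbers, the first two fragments each carry 1480 bytes of data and the last one carries the remaining 1040 bytes.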

Note: There is often confusion between IP fragmentation and TCP segmentation. It is important to understand the differences between the two:

TCP segmentation happens at the transport layer: the sender divides the data stream into segments sized for the TCP connection, and each segment has its own TCP header.
IP fragmentation occurs at the IP layer and involves breaking up a packet that is too large for a link into smaller units that can be transmitted over the network.
IPv4 fragmentation can be performed by the sending host or by network devices in the path, such as routers and firewalls, while TCP segmentation is done only by the end hosts.

It is important to note that the TCP or UDP header is only present in the first fragment. Every fragment carries a complete IP header, including the “fragment offset” field, which indicates the position of the fragment’s data in the original packet (in units of 8 bytes). This information is used by the receiving host, or by a device that reassembles fragments in transit, to rebuild the original packet.
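As a rough illustration of this point, the sketch below builds a UDP datagram, splits it, and shows that only the fragment at offset 0 holds the real ports. The helper names (`udp_datagram`, `split`, `reassemble`) are made up for this example, and the UDP checksum is simply left at zero:

```python
import struct

def udp_datagram(sport: int, dport: int, data: bytes) -> bytes:
    # UDP header: source port, destination port, length, checksum (0 = not set)
    return struct.pack("!HHHH", sport, dport, 8 + len(data), 0) + data

def split(payload: bytes, max_data: int):
    # Returns (offset in 8-byte units, MF flag, data); max_data must be a multiple of 8
    return [(i // 8, i + max_data < len(payload), payload[i:i + max_data])
            for i in range(0, len(payload), max_data)]

def reassemble(fragments) -> bytes:
    # Order the fragments by offset, then concatenate their payloads
    return b"".join(data for _, _, data in sorted(fragments, key=lambda f: f[0]))

original = udp_datagram(7234, 999, b"x" * 100)
frags = split(original, 40)

# The fragment at offset 0 starts with the UDP header
sport, dport = struct.unpack("!HH", frags[0][2][:4])
print("first fragment ports:", sport, dport)

# The same four bytes of a later fragment are just payload, not ports
print("second fragment 'ports':", struct.unpack("!HH", frags[1][2][:4]))

# Fragments may arrive out of order; the offsets restore the original
assert reassemble(list(reversed(frags))) == original
```

A device that reads “ports” from a non-first fragment is therefore reading arbitrary payload bytes.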

When ECMP hashing is used for load balancing, there may be cases where the first fragment (the only one that contains the protocol header) is routed to a different device (such as a server or NAT device) than the other fragments. This can prevent reassembly of the fragments and make NAT or inspection impossible.

Now, the interesting part: which path will a fragment follow in a load-balanced network?

We know that:

Only the first fragment contains the upper-layer headers (TCP and UDP ports, for instance). This can cause problems for IPS, firewalls, NAT and any other stateful device in the path, and also for reassembly.
To avoid polarization (packets being consistently sent over one path in a network while other paths remain underutilized), we configure the hashing algorithm to take Layer 4 information (TCP and UDP ports) into account.

Now, imagine we have multiple paths to the same destination and there is an IPS in the path. If the IPS does not see the first fragment, it may drop the remaining fragments, treating them as an attack.
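To make the problem concrete, here is a small sketch comparing a naive Layer 4 hash with a fragment-aware one. Everything in it is an assumption made for illustration: CRC32 stands in for the real (undocumented) hash function, and the packet dictionaries and function names are hypothetical:

```python
import zlib

PATHS = ["Router1", "Router2"]

def pick_path(*fields):
    # CRC32 is only a stand-in; real platforms use their own hash function
    key = "|".join(str(f) for f in fields).encode()
    return PATHS[zlib.crc32(key) % len(PATHS)]

def l4_hash(pkt):
    # Naive L4 hashing: uses whatever is found where the ports would be
    return pick_path(pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"])

def fragment_aware_hash(pkt):
    # Fall back to addresses only when the packet is any kind of fragment
    if pkt["mf"] or pkt["offset"] > 0:
        return pick_path(pkt["src"], pkt["dst"])
    return l4_hash(pkt)

base = {"src": "1.1.1.1", "dst": "2.2.2.222"}
fragments = [
    # First fragment: real UDP ports
    {**base, "mf": True, "offset": 0, "sport": 7234, "dport": 999},
    # Later fragments: the "ports" are really payload bytes
    {**base, "mf": True, "offset": 185, "sport": 30840, "dport": 30840},
    {**base, "mf": False, "offset": 370, "sport": 12336, "dport": 12336},
]

for f in fragments:
    print(f["offset"], l4_hash(f), fragment_aware_hash(f))
```

With the naive hash, nothing guarantees that the three fragments land on the same path. With the fragment-aware variant they always do, because the hash input is identical for every fragment of the packet.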

Fragment behaviour on ECMP paths on Cisco devices:

In this first lab, the CSR1000v router has load sharing enabled using L4 information:

Router(config)#ip cef load-sharing algorithm include-ports source destination

Router#show cef state | i per
include-ports source destination per-destination load sharing algorithm, id F6E776D8

From the CEF point of view, ECMP appears to be working as expected:

Router#sh ip cef 2.2.2.2
2.2.2.0/24
nexthop 10.1.1.1 GigabitEthernet1
nexthop 10.3.3.1 GigabitEthernet2
Router#sh ip cef 2.2.2.2 internal
2.2.2.0/24, epoch 2, flags [rnolbl, rlbls], RIB[B], refcnt 6, per-destination sharing
sources: RIB
feature space:
IPRM: 0x00018000
Broker: linked, distributed at 4th priority
ifnums:
GigabitEthernet1(7): 10.1.1.1
GigabitEthernet2(8): 10.3.3.1
path list 7FDD622B4BC8, 7 locks, per-destination, flags 0x269 [shble, rif, rcrsv, hwcn, bgp]
path 7FDD69B940D8, share 1/1, type recursive, for IPv4
recursive via 10.1.1.1[IPv4:Default], fib 7FDD0112C868, 1 terminal fib, v4:Default:10.1.1.1/32
path list 7FDD622B4DA8, 2 locks, per-destination, flags 0x49 [shble, rif, hwcn]
path 7FDD69B94418, share 1/1, type adjacency prefix, for IPv4
attached to GigabitEthernet1, IP adj out of GigabitEthernet1, addr 10.1.1.1 7FDD6A026158
path 7FDD69B941A8, share 1/1, type recursive, for IPv4
recursive via 10.3.3.1[IPv4:Default], fib 7FDD0112C968, 1 terminal fib, v4:Default:10.3.3.1/32
path list 7FDD622B4E48, 2 locks, per-destination, flags 0x49 [shble, rif, hwcn]
path 7FDD69B944E8, share 1/1, type adjacency prefix, for IPv4
attached to GigabitEthernet2, IP adj out of GigabitEthernet2, addr 10.3.3.1 7FDD6A026370
output chain:
loadinfo 80007FDCFDF08D68, per-session, 2 choices, flags 0003, 7 locks
flags [Per-session, for-rx-IPv4]
16 hash buckets
< 0 > IP adj out of GigabitEthernet1, addr 10.1.1.1 7FDD6A026158
< 1 > IP adj out of GigabitEthernet2, addr 10.3.3.1 7FDD6A026370
< 2 > IP adj out of GigabitEthernet1, addr 10.1.1.1 7FDD6A026158
< 3 > IP adj out of GigabitEthernet2, addr 10.3.3.1 7FDD6A026370
< 4 > IP adj out of GigabitEthernet1, addr 10.1.1.1 7FDD6A026158
< 5 > IP adj out of GigabitEthernet2, addr 10.3.3.1 7FDD6A026370
< 6 > IP adj out of GigabitEthernet1, addr 10.1.1.1 7FDD6A026158
< 7 > IP adj out of GigabitEthernet2, addr 10.3.3.1 7FDD6A026370
< 8 > IP adj out of GigabitEthernet1, addr 10.1.1.1 7FDD6A026158
< 9 > IP adj out of GigabitEthernet2, addr 10.3.3.1 7FDD6A026370
<10 > IP adj out of GigabitEthernet1, addr 10.1.1.1 7FDD6A026158
<11 > IP adj out of GigabitEthernet2, addr 10.3.3.1 7FDD6A026370
<12 > IP adj out of GigabitEthernet1, addr 10.1.1.1 7FDD6A026158
<13 > IP adj out of GigabitEthernet2, addr 10.3.3.1 7FDD6A026370
<14 > IP adj out of GigabitEthernet1, addr 10.1.1.1 7FDD6A026158
<15 > IP adj out of GigabitEthernet2, addr 10.3.3.1 7FDD6A026370
Subblocks:
None
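The “16 hash buckets” section of this output can be read as a lookup table: the hash result selects a bucket, and the bucket points to a next hop. The sketch below mimics that layout for two equal-cost paths. The hash itself (CRC32 here) and the helper names are assumptions for the example, not the actual CEF algorithm:

```python
import zlib

def build_buckets(next_hops, size=16):
    # Equal-cost paths are spread over the buckets in round-robin order
    return [next_hops[i % len(next_hops)] for i in range(size)]

buckets = build_buckets(["10.1.1.1 GigabitEthernet1", "10.3.3.1 GigabitEthernet2"])

def lookup(flow_key: bytes) -> str:
    # Stand-in hash; the low 4 bits select one of the 16 buckets
    return buckets[zlib.crc32(flow_key) & 0xF]

for index, next_hop in enumerate(buckets):
    print(f"<{index:2d} > {next_hop}")
```

With two paths of equal share, each next hop owns 8 of the 16 buckets, which matches the alternating pattern in the output above.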

In virtual environments, it can be difficult to emulate fragmentation as many virtual devices do not support it. One option is to generate IP fragmentation directly from the Linux testing machine using the “scapy” tool.

Using “scapy” from the Linux testing machine, we can send two packets that take different paths:

This packet is routed through Router1:

>>> send(IP(src="1.1.1.1",dst="2.2.2.2")/UDP(sport=7234,dport=123))

This packet is routed through Router2. The only difference is that the destination port (dport) was changed to ‘999’ instead of the previous ‘123’.

>>> send(IP(src="1.1.1.1",dst="2.2.2.2")/UDP(sport=7234,dport=999))

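Now, let’s generate the same kind of packet, this time with “flags=0x1” added, which sets the More Fragments (MF) bit and marks the packet as a fragment that will be followed by others. The destination is now 2.2.2.222, which is still inside 2.2.2.0/24, and the destination port varies from one packet to the next.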

>>> send(IP(src="1.1.1.1",dst="2.2.2.222",flags=0x1)/UDP(sport=7234,dport=999))
>>> send(IP(src="1.1.1.1",dst="2.2.2.222",flags=0x1)/UDP(sport=7234,dport=9979))
>>> send(IP(src="1.1.1.1",dst="2.2.2.222",flags=0x1)/UDP(sport=7234,dport=4498))
>>> send(IP(src="1.1.1.1",dst="2.2.2.222",flags=0x1)/UDP(sport=7234,dport=4489))
>>> send(IP(src="1.1.1.1",dst="2.2.2.222",flags=0x1)/UDP(sport=7234,dport=4429))
>>> send(IP(src="1.1.1.1",dst="2.2.2.222",flags=0x1)/UDP(sport=7234,dport=4499))
>>> send(IP(src="1.1.1.1",dst="2.2.2.222",flags=0x1)/UDP(sport=7234,dport=6799))
>>> send(IP(src="1.1.1.1",dst="2.2.2.222",flags=0x1)/UDP(sport=7234,dport=6799))
>>> send(IP(src="1.1.1.1",dst="2.2.2.222",flags=0x1)/UDP(sport=7234,dport=6799))
>>> send(IP(src="1.1.1.1",dst="2.2.2.222",flags=0x1)/UDP(sport=7234,dport=60001))

With “flags=0x1” set, all traffic now passes through Router1, whatever destination port is used.

No matter what variations were tried, the router always forwarded the traffic through Router1 when the More Fragments flag was set. This suggests that the Cisco algorithm falls back to Layer 3 hashing (source and destination addresses only) for fragments.

Now, let’s try generating the first, middle, and last fragments to ensure that Cisco’s behavior remains unchanged.

First fragment:
>>> send(IP(src="1.1.1.1", dst="2.2.2.222", flags=0x1, frag=0)/UDP(sport=7234, dport=999))

Middle fragment:
>>> send(IP(src="1.1.1.1", dst="2.2.2.222", flags=0x1, frag=1)/UDP(sport=7234, dport=999))

Last fragment:
>>> send(IP(src="1.1.1.1", dst="2.2.2.222", flags=0x0, frag=2)/UDP(sport=7234, dport=999))
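For readers who want to see what “flags” and “frag” become on the wire, the sketch below packs them into the 16-bit flags/fragment-offset word of the IPv4 header and classifies the result. The flag values match the ones used in the Scapy commands above (MF = 0x1, DF = 0x2); the function names are hypothetical:

```python
import struct

MF = 0x1  # More Fragments, same value as flags=0x1 in the commands above
DF = 0x2  # Don't Fragment

def pack_flags_frag(flags: int, frag: int) -> bytes:
    # 3 flag bits in the high bits, 13-bit offset (in 8-byte units) in the low bits
    return struct.pack("!H", ((flags & 0x7) << 13) | (frag & 0x1FFF))

def classify(word: bytes) -> str:
    (value,) = struct.unpack("!H", word)
    more = bool((value >> 13) & MF)
    offset = value & 0x1FFF
    if not more and offset == 0:
        return "not a fragment"
    if more and offset == 0:
        return "first fragment"
    return "middle fragment" if more else "last fragment"

for flags, frag in [(0x1, 0), (0x1, 1), (0x0, 2), (0x0, 0)]:
    word = pack_flags_frag(flags, frag)
    print(f"flags={flags:#x} frag={frag} -> {word.hex()} ({classify(word)})")
```

A device only needs these two fields to tell whether a packet is a fragment, so a forwarding plane can make that decision before it looks at any Layer 4 field.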

It is interesting that Cisco has taken fragments into consideration in its hashing. However, this type of detail is not publicly documented, and we were unable to find any references on the topic (if you are aware of any, please leave a comment).

You can download the pcap and the lab (eve-ng) here:

Note: To view individual fragments in a Wireshark capture, you need to uncheck the “Reassemble fragmented IPv4 datagrams” option in the IPv4 protocol preferences.

Fragment behaviour on L2 and L3 LAG paths on Cisco devices:

The same test was conducted on L2 and L3 LAG ports on Cisco devices, and the same behaviour was observed.

In summary:

It appears that the Cisco algorithm treats fragments, including the first fragment, as a special case and hashes them on Layer 3 information only, which avoids the problems described above. It is not clear whether other vendors, such as Arista, Juniper or Aruba, behave the same way.
- Please leave a comment on the post if you are aware of the behavior of other vendors.
Fragmentation should be avoided whenever possible, but it may be inevitable in some situations, particularly when using overlay technologies. It is important to plan and test for this behavior and to take it into account in the design process when possible.
This test was conducted on Cisco CSR1000v and XRv9K devices, with the same results on both.

Mehdi SFAR (CCDE 2021:3, CCIE #51583)
