Issue observed from received fragments #24

angus19 · 2021-04-14T10:42:00Z

Testing shows that nat46 mishandles upstream v4 fragments and downstream v6 fragments.

Upstream v4 fragments : address translation (v4->v6) is okay for the first fragment but not for the rest.

Downstream v6 fragments : the L4 pseudo-header checksum (v6->v4) is not correctly recalculated in the first fragment, so NAT44/DNAT does not take effect even after the fragments are reassembled.
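
The difference between the first fragment and the rest comes from where the L4 header sits: only the fragment at offset 0 carries ports and the L4 checksum. A minimal sketch of that classification for IPv4, assuming the flags/offset field has already been converted to host byte order (the names `frag_kind` and `ipv4_frag_kind` are hypothetical and not part of nat46):

```c
#include <stdint.h>

/* Bit layout of the IPv4 flags/fragment-offset field, host byte order. */
#define IP4_MF      0x2000u  /* "more fragments" flag */
#define IP4_OFFMASK 0x1fffu  /* fragment offset, in 8-byte units */

enum frag_kind {
    FRAG_NONE,   /* not fragmented: L4 header present */
    FRAG_FIRST,  /* offset 0 with MF set: L4 header present */
    FRAG_LATER   /* offset > 0: no L4 header, hence no ports */
};

/* Classify a packet from its frag_off field (already passed through ntohs). */
static enum frag_kind ipv4_frag_kind(uint16_t frag_off)
{
    if (frag_off & IP4_OFFMASK)
        return FRAG_LATER;
    return (frag_off & IP4_MF) ? FRAG_FIRST : FRAG_NONE;
}
```

Anything classified as `FRAG_LATER` has no port to derive a PSID from, which is the upstream half of this report.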

Thanks for your time.

ayourtch · 2021-04-14T12:20:35Z

I vaguely remember (it was 7 years ago) not paying too much attention to fragments, for two reasons:

1. Back then, IPv6 extension headers of any kind had a pretty miserable success rate.
2. Dealing with fragments cannot be done statelessly without loss of generality, so I'd say the best answer in that case is "use IPv6 transport". (We have to keep in mind that this little hack is intended just as a bolt-on for the transition.)

That said, maybe I was wrong on one or both counts, so I am happy to consider any patches that address the issue without introducing state....

angus19 · 2021-04-15T02:40:32Z

Thanks for replying.

BTW, do you think it is a good idea to reassemble fragments in nat46 before doing any v4->v6 or v6->v4 translation?

nat46 has a default MTU setting, currently 16K (16384): see nat46_netdev_setup.
I guess this was meant to keep fragments from reaching nat46 at all.

As I understand it, recent kernels are more likely to feed fragments to nat46 regardless of the 16K MTU.
That's why I wonder whether IPv4/IPv6 defragmentation is a must.

Thanks.

ayourtch · 2021-04-15T08:44:29Z

https://datatracker.ietf.org/doc/rfc7872/?include_text=1 discusses real-world measurements of IPv6 fragments, so for most practical purposes IPv6 fragments on the internet, in the context of this code, aren’t usable. Reassembling them at any point is a DoS vector; it would be nice to avoid doing that in the middle, hence the comment about the patch :-)

It also now comes back to me that when I tested this module it was in conjunction with iptables doing NAT, and they did perform the reassembly on the IPv4 side, IIRC. It was a while ago; maybe the situation has changed since.

If the reassembly is to be done, I wonder if it might make sense to have a separate module doing just that, with a similar approach to this module: just policy-route the fragments into an egress interface that does the reassembly and reinjects the packets back...

With fragments we kill the performance anyway, and this kind of approach feels quite clean architecturally, since that block can be reused elsewhere. (I am shooting from the hip, assuming this functionality is not there yet, but chances are pretty high that some existing code can do it.)

What do you think ?

angus19 · 2021-04-15T10:44:27Z

The extra block sounds good... even if my knowledge is limited :Q

Back to the issue itself.

Upstream v4 fragments : if the "psid" in xlate_map_v4_to_v6 is known for non-first v4 fragments, then the address translations are all fine.
Downstream v6 fragments : if "l3_infrag_payload_len" can be filled with the proper upper-layer packet length (from the original pseudo-header's info, possible?) before csum_ipv6_unmagic is called in nat46_ipv6_input for the first fragment, then the first translated v4 fragment should be fine, with a correctly remagicked/unmagicked L4 checksum.

I wonder if there is an alternative that avoids IPv4/IPv6 defragmentation and relies on the findings above instead.

Thanks.

ayourtch · 2021-04-15T11:16:45Z

Back to the issue itself.

1. Upstream v4 fragments : if the "psid" in xlate_map_v4_to_v6 is known for non-first v4 fragments, then the address translations are all fine.

The psid is either not required, because we are not sharing IPv4 addresses (i.e. psid is zero bits), or it is calculated from l4id, which is the port:

  if (!pl4id && psid_bits_len) {
    nat46debug(5, "xlate_map_v4_to_v6: l4id required for MAP domain %pI4/%d (ea-len %d)", &rule->v4_pref, rule->v4_pref_len, rule->ea_len);
    return 0;
  }

  /* zero out the IPv6 address */
  memset(pipv6, 0, 16);

  psid = (ntohs(l4id) >> (16 - psid_bits_len - rule->psid_offset)) & (0xffff >> (16 - psid_bits_len));

So if we are sharing the IPv4 address we need to get this L4 info from somewhere.
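
To make the shift-and-mask in the excerpt concrete, here is a standalone sketch of the same expression. The function name `psid_from_port` and the bounds check are illustrative assumptions, not nat46 code; the port is taken in host byte order, as the excerpt does after `ntohs()`.

```c
#include <stdint.h>

/*
 * Extract the PSID from a port, mirroring the shift-and-mask in the excerpt.
 * The port layout is: psid_offset bits | psid_bits_len bits of PSID | rest.
 */
static uint16_t psid_from_port(uint16_t port, unsigned psid_bits_len,
                               unsigned psid_offset)
{
    /* No PSID at all, or a layout that does not fit into 16 bits. */
    if (psid_bits_len == 0 || psid_bits_len + psid_offset > 16)
        return 0;

    /* Shift the PSID down to bit 0, then mask off the offset bits above it. */
    return (uint16_t)((port >> (16 - psid_bits_len - psid_offset)) &
                      (0xffffu >> (16 - psid_bits_len)));
}
```

For example, with a 6-bit offset and an 8-bit PSID, port 0x1234 (binary 000100 10001101 00) yields PSID 0x8D. A non-first fragment has no port to feed into this function, which is the whole problem.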

The only somewhat-stateless approach I can see is a "big enough" table, indexed by a hash of (src_ip, dst_ip) and containing (timestamp, psid) pairs. The timestamp is refreshed whenever a first fragment passes by; non-initial fragments then blindly use the psid of the matching entry if its timestamp is not too old, and hope for the best...

But this is a pretty nasty approach, in that it turns a very simple-to-debug failure mode, "no fragments for you, kthxbye", into something that will work in the lab, and maybe on small production workloads, but will intermittently break at high loads. I don't want to make the life of the support folks miserable :-)
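
For illustration only, the table described above could look roughly like the following userspace sketch. Every name, the table size, the freshness window and the hash mix are assumptions made up for this example; none of it exists in nat46. Time is passed in explicitly so the behavior is easy to follow.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAG_CACHE_SLOTS   1024u  /* hypothetical "big enough" size */
#define FRAG_CACHE_MAX_AGE 2u     /* hypothetical freshness window, seconds */

struct frag_cache_entry {
    uint32_t timestamp;  /* when a first fragment last hit this slot */
    uint16_t psid;       /* PSID learned from that first fragment */
    bool     used;
};

struct frag_cache {
    struct frag_cache_entry slot[FRAG_CACHE_SLOTS];
};

/* Map (src_ip, dst_ip) to a slot; an arbitrary multiplicative mix. */
static uint32_t frag_cache_index(uint32_t src_ip, uint32_t dst_ip)
{
    uint32_t h = src_ip * 2654435761u;
    h ^= dst_ip + 0x9e3779b9u + (h << 6) + (h >> 2);
    return h % FRAG_CACHE_SLOTS;
}

/* Called for a first fragment: remember its PSID and refresh the timestamp. */
static void frag_cache_note_first(struct frag_cache *c, uint32_t src_ip,
                                  uint32_t dst_ip, uint16_t psid, uint32_t now)
{
    struct frag_cache_entry *e = &c->slot[frag_cache_index(src_ip, dst_ip)];
    e->timestamp = now;
    e->psid = psid;
    e->used = true;
}

/* Called for a non-initial fragment: reuse the PSID if the slot is fresh. */
static bool frag_cache_lookup(const struct frag_cache *c, uint32_t src_ip,
                              uint32_t dst_ip, uint32_t now, uint16_t *psid)
{
    const struct frag_cache_entry *e =
        &c->slot[frag_cache_index(src_ip, dst_ip)];
    if (!e->used || now - e->timestamp > FRAG_CACHE_MAX_AGE)
        return false;
    *psid = e->psid;  /* blind reuse: a colliding flow would get this too */
    return true;
}
```

The sketch also shows the weakness: the slot stores no key, so two address pairs that hash to the same slot silently share a PSID, and fragments that arrive out of order or after the window simply miss.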

2. Downstream v6 fragments : if "l3_infrag_payload_len" can be filled with the proper upper-layer packet length (from the original pseudo-header's info, possible?) before csum_ipv6_unmagic is called in nat46_ipv6_input for the first fragment, then the first translated v4 fragment should be fine, with a correctly remagicked/unmagicked L4 checksum.

Yeah, so it is about the same problem here (needing some info from the first fragment) with the same possible solution and the same caveats...

What do you think ?

ejordangottlieb · 2021-04-15T22:37:43Z

@angus19 @ayourtch this thread seems to address an exclusively stateless NAT46, with no consideration for a use case with a preceding stateful NAT44 function (MAP-T CE). A defrag is required before the NAT44 function (PSID oriented when IPv4 is shared), after which the packet is NAT46-translated and then IPv6-fragmented by the Linux kernel if the outgoing MTU dictates. My take is that we must always have defrag support and let NAT46 handle only unfragmented packets (as it currently does). The stateless use case is only useful for a BR or middlebox translator. I guess both cases could be supported by adding another operational mode.

The other thing I want to point out is that in the shared IPv4 case the fragment ID in the IPv6 fragment extension header is not guaranteed to get a value from the PSID range, as the kernel fragmentation function has no notion of the PSID. Drawing the fragment ID from the PSID range is desirable in the real world for regulatory reasons. @ayourtch would that be a heavy lift? I think it would require something outside of nat46, so it probably is.

What do you think?

ayourtch · 2021-04-15T23:02:28Z

@ejordangottlieb Yeah IIRC the iptables NAT was doing reassembly. And as I noted in the link to the RFC - IPv6 fragments don’t really work on the internet...

So, which use case exactly is being broken by the current “fragment-ignorant” behavior?

ejordangottlieb · 2021-04-15T23:38:12Z

@ayourtch agreed on Internet IPv6 fragments, and I realize you are fully aware of my next point. The MAP-T use case typically operates over a provider-managed network, and once traffic gets translated by the BR the fragment use case is actually IPv4 in nature. One simple use case is a large ping without the DF bit set. There are a reasonable number of IPv4 endpoints that will respond. My other concern is a poorly managed UDP (esp over / dtls / wireguard) DF-bit 0 based VPN installation where the operator does not factor in the nat46 payload impact.

To answer your question, there are two issues. One is a non-regulatory corner case where a fragment ID collision is possible between two shared IPv4 installations. On the regulatory/legal side, the BR must enforce a non-"fragment-ignorant" implementation so that activities such as abuse referrals (non-realtime) can be traced to the source. This prevents someone from forging a non-initial fragment (no port or identifier) with another user's PSID fragment identifier.

Cheers,
J

angus19 · 2021-04-16T06:32:03Z

After this round of discussion, defragmentation still feels like the non-nasty choice. It may not be perfect, but it is a practical solution for now. Thanks.

ayourtch · 2021-04-26T07:22:15Z

https://www.potaroo.net/ispcol/2021-04/v6frag.html - a new data point, which I thought you all might find interesting.

ejordangottlieb · 2021-04-27T15:22:27Z

@ayourtch I did find this data point interesting, but I think it addresses a different use case with different environmental considerations. While packets with the fragmentation header may see significant packet loss when traversing the Internet, this should not be the case on a well-managed network. Specifically, if you are going to provide MAP-T services, you need to ensure that all your devices between the CE and BR are certified for packets with the fragmentation header. The other item to consider is the challenge for the network operator of providing as much parity as possible between the native NAT44 service and MAP-T. So support for large DF-bit 0 pings and DF-bit 0 UDP VPN traffic, while uncommon, is still desirable.

Regards,

J

ayourtch · 2021-04-27T15:58:22Z

@ejordangottlieb - am I understanding you right that you are talking about the case where there are no fragments on the internet portion of the route, and there are fragments on the managed portion of the network?

ejordangottlieb · 2021-04-27T19:48:11Z

@ayourtch I am referring to fragments on both sides of the BR (V6 and V4). But the fragments on the Internet side are exclusively IPv4 fragments. Fragment forwarding on the V6 side of the BR is exclusive to the operator's network, which is "well managed" and can therefore be certified to support this traffic.

All that being said, as I think about the current CE solution behavior I have come up with another problem. It is a scenario where a series of IPv4 fragments sent from the UE to the MAP-T CE results in a post-defrag packet size greater than the MTU of the outgoing interface on the MAP-T CE. The NAT46 module will not see the fragments (due to netfilter defrag); it will perform the stateless translation with an IPv6 pseudo-header based transport-level checksum and pass the packet to the Linux kernel for IPv6 fragmentation.

The problem I envision is that a stateless BR implementation will not have a defrag function (so far I don't see anything in RFC7915 about this) and will translate to IPv4 fragments with a transport-layer checksum that is IPv6 pseudo-header based (assuming no checksum re-calc). In this scenario the transport checksum check will fail at the destination IPv4 node. One mitigation strategy is to do an incremental checksum update on the fragment that contains the checksum.
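
As a rough illustration of that mitigation (a sketch under stated assumptions, not what nat46 or any BR actually does): for TCP and UDP the protocol and L4 length contribute the same amount to the IPv6 and IPv4 pseudo-header sums, so only the address words need to be swapped out of the checksum. This can be done on the first fragment alone using the RFC 1624 update HC' = ~(~HC + ~m + m'). All function names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Add a buffer, read as big-endian 16-bit words, to a running sum. */
static uint32_t csum_add(uint32_t acc, const uint8_t *p, size_t len)
{
    while (len > 1) {
        acc += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)
        acc += (uint32_t)p[0] << 8;  /* odd trailing byte, zero padded */
    return acc;
}

/* Fold the carries back in to get a 16-bit one's-complement sum. */
static uint16_t csum_fold(uint32_t acc)
{
    while (acc >> 16)
        acc = (acc & 0xffffu) + (acc >> 16);
    return (uint16_t)acc;
}

/*
 * Turn an L4 checksum computed over an IPv6 pseudo-header into one valid
 * for the IPv4 pseudo-header, without touching the payload.
 * Note: for UDP a result of 0x0000 must be sent as 0xffff (RFC 768).
 */
static uint16_t csum_v6_to_v4(uint16_t old_csum,
                              const uint8_t v6_src[16], const uint8_t v6_dst[16],
                              const uint8_t v4_src[4], const uint8_t v4_dst[4])
{
    /* m: what the IPv6 addresses contributed; m': the IPv4 replacement. */
    uint16_t old_addr = csum_fold(csum_add(csum_add(0, v6_src, 16), v6_dst, 16));
    uint32_t new_addr = csum_add(csum_add(0, v4_src, 4), v4_dst, 4);

    uint32_t acc = (uint16_t)~old_csum;  /* ~HC */
    acc += (uint16_t)~old_addr;          /* + ~m */
    acc += new_addr;                     /* + m' */
    return (uint16_t)~csum_fold(acc);
}
```

Only the fragment at offset 0 carries the checksum field, so later fragments need no adjustment under this approach; it does not help ICMP, whose IPv4 checksum has no pseudo-header.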

Let me know if this all makes sense?

-J

ejordangottlieb · 2021-04-28T13:28:15Z

To clarify the above observation: nothing is problematic with the current NAT46 approach. The module should do the layer-4 checksum as stated above, using an IPv6 pseudo-header (for one, it is needed for IPv4-mapped IPv6 flows). Looking at RFC7915 again, the checksum handling is only stated implicitly. I also checked one particular BR implementation, and it performs the checksum calculation using the IPv4 pseudo-header when translating IPv6 packets with fragment extension headers to IPv4 fragments.

ayourtch · 2021-04-28T13:47:51Z

@ejordangottlieb - yes, this scenario will present the problem. I remember thinking about it. But, to that type of situation, I tend to bring up the old anecdote:

"doctor, it hurts when i do this"
"stop doing that".

What you describe is a valid potential case, but the amount of complexity, fragility, and attack surface that any fragment handling will add, in my view far far far exceeds any benefits that it brings.

Applications (IPsec, namely) have had solutions for not creating large fragmented packets in the first place, for at least a decade. If someone is reluctant to adopt these, I don't see why taking that burden off them is a sound idea, given the above tradeoffs.
