[pve-devel] BUG in vlan aware bridge

Josef Johansson josef at oderland.se
Thu Oct 14 07:40:47 CEST 2021


This is one of the commits related to include/net/ip.h...

I'd say someone should look over this and fix it :)
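For context, the refragmentation arithmetic this code is supposed to preserve is simple enough to sketch. The helper below is just an illustration of RFC 791 fragmentation, not kernel code; it reproduces the fragment sizes seen in the tcpdump output quoted further down (a 2028-byte reassembled IP packet split at MTU 1500 into 1500- and 548-byte fragments):

```python
def fragment_lengths(total_ip_len, mtu, ihl=20):
    """Return (fragment offset in bytes, fragment total IP length) pairs
    for an IPv4 packet of total_ip_len bytes sent over a link with the
    given MTU. Every fragment but the last carries a payload that is a
    multiple of 8 bytes, as required by the fragment-offset field."""
    payload = total_ip_len - ihl
    per_frag = (mtu - ihl) // 8 * 8  # usable payload per fragment
    frags = []
    off = 0
    while off < payload:
        chunk = min(per_frag, payload - off)
        frags.append((off, ihl + chunk))
        off += chunk
    return frags

# The 2028-byte packet seen on vmbr0, refragmented for an MTU-1500 link:
print(fragment_lengths(2028, 1500))  # [(0, 1500), (1480, 548)]
```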

commit 93fdd47e52f3f869a437319db9da1ea409acc07e
Author: Herbert Xu <herbert at gondor.apana.org.au>
Date:   Sun Oct 5 12:00:22 2014 +0800

    bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
    
    As we may defragment the packet in IPv4 PRE_ROUTING and refragment
    it after POST_ROUTING we should save the value of frag_max_size.
    
    This is still very wrong as the bridge is supposed to leave the
    packets intact, meaning that the right thing to do is to use the
    original frag_list for fragmentation.
    
    Unfortunately we don't currently guarantee that the frag_list is
    left untouched throughout netfilter so until this changes this is
    the best we can do.
    
    There is also a spot in FORWARD where it appears that we can
    forward a packet without going through fragmentation, mark it
    so that we can fix it later.
    
    Signed-off-by: Herbert Xu <herbert at gondor.apana.org.au>
    Signed-off-by: David S. Miller <davem at davemloft.net>

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index a615264cf01a..4063898cf8aa 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -404,6 +404,7 @@ static int br_nf_pre_routing_finish_bridge(struct sk_buff *skb)
                              ETH_HLEN-ETH_ALEN);
             /* tell br_dev_xmit to continue with forwarding */
             nf_bridge->mask |= BRNF_BRIDGED_DNAT;
+            /* FIXME Need to refragment */
             ret = neigh->output(neigh, skb);
         }
         neigh_release(neigh);
@@ -459,6 +460,10 @@ static int br_nf_pre_routing_finish(struct sk_buff *skb)
     struct nf_bridge_info *nf_bridge = skb->nf_bridge;
     struct rtable *rt;
     int err;
+    int frag_max_size;
+
+    frag_max_size = IPCB(skb)->frag_max_size;
+    BR_INPUT_SKB_CB(skb)->frag_max_size = frag_max_size;
 
     if (nf_bridge->mask & BRNF_PKT_TYPE) {
         skb->pkt_type = PACKET_OTHERHOST;
@@ -863,13 +868,19 @@ static unsigned int br_nf_forward_arp(const struct nf_hook_ops *ops,
 static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 {
     int ret;
+    int frag_max_size;
 
+    /* This is wrong! We should preserve the original fragment
+     * boundaries by preserving frag_list rather than refragmenting.
+     */
     if (skb->protocol == htons(ETH_P_IP) &&
         skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
         !skb_is_gso(skb)) {
+        frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
         if (br_parse_ip_options(skb))
             /* Drop invalid packet */
             return NF_DROP;
+        IPCB(skb)->frag_max_size = frag_max_size;
         ret = ip_fragment(skb, br_dev_queue_push_xmit);
     } else
         ret = br_dev_queue_push_xmit(skb);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index b6c04cbcfdc5..2398369c6dda 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -305,10 +305,14 @@ struct net_bridge
 
 struct br_input_skb_cb {
     struct net_device *brdev;
+
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
     int igmp;
     int mrouters_only;
 #endif
+
+    u16 frag_max_size;
+
 #ifdef CONFIG_BRIDGE_VLAN_FILTERING
     bool vlan_filtered;
 #endif
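Until that is fixed properly, the knobs discussed further down in this thread are the bridge-netfilter sysctls and per-host conntrack bypass. A sketch of those workarounds (the address is a placeholder, and note that in my tests below the NOTRACK rules did not actually help):

```shell
# Disable the IPv4 netfilter hook for bridged traffic entirely
# (this also disables PVE firewall filtering on the bridge):
sysctl -w net.bridge.bridge-nf-call-iptables=0

# Or keep the hook but let bridge netfilter also fragment
# VLAN-tagged traffic (reported below as only partially effective):
sysctl -w net.bridge.bridge-nf-filter-vlan-tagged=1

# Or bypass conntrack (and hence defragmentation) for specific
# hosts; 192.0.2.1 is a placeholder address:
iptables -t raw -I PREROUTING -s 192.0.2.1 -j NOTRACK
```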

Kind regards
Josef Johansson

On 10/14/21 07:14, Josef Johansson wrote:
> Hi,
>
> I did some more digging, searching for 'bridge-nf-call-iptables
> fragmentation'.
>
> Found these forum posts:
>
> https://forum.proxmox.com/threads/net-bridge-bridge-nf-call-iptables-and-friends.64766/
>
> https://forum.proxmox.com/threads/linux-bridge-reassemble-fragmented-packets.96432/
>
> And this patch, which seems like they at least TRIED to get it fixed ;)
>
> https://lists.linuxfoundation.org/pipermail/bridge/2019-August/012185.html
>
> Kind regards
> Josef Johansson
>
> On 10/13/21 16:32, VELARTIS Philipp Dürhammer wrote:
>> If you stop the pve-firewall service and echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables (i.e. stop the netfilter hook),
>> then it works for me also with tagged tap devices and a vlan-aware bridge. I think it is a kernel bug.
>> What I don't understand is why more people are not reporting it...
>>
>>
>> -----Original Message-----
>> From: Josef Johansson <josef at oderland.se> 
>> Sent: Wednesday, 13 October 2021 16:19
>> To: VELARTIS Philipp Dürhammer <p.duerhammer at velartis.at>; 'pve-devel at lists.proxmox.com' <pve-devel at lists.proxmox.com>
>> Subject: Re: AW: [pve-devel] BUG in vlan aware bridge
>>
>> Hi,
>>
>> I can confirm that s > 12000 does not work in either setup:
>>
>> size      | tap (untagged, MTU 1500) -> vlan-aware bridge (MTU 9000) -> bond (MTU 9000) | tap (tagged, MTU 1500) -> vlan-aware bridge (MTU 9000) -> bond (MTU 9000)
>> s > 12000 | doesn't work | doesn't work
>> s > 8000  | works        | doesn't work
>>
>>
>> The traffic (one defragmented packet) is simply dropped between the bridge and the tap. I tried my NOTRACK rules and they didn't have any effect.
>>
>>
>> Either my Mellanox cards have a bug here or the kernel does. I don't think this is a normal case.
>>
>> Kind regards
>> Josef Johansson
>>
>> On 10/13/21 15:53, VELARTIS Philipp Dürhammer wrote:
>>> And what happens if you use a packet size > 9000? This should still 
>>> work... (because it gets fragmented)
>>>
>>> -----Original Message-----
>>> From: pve-devel <pve-devel-bounces at lists.proxmox.com> On behalf of 
>>> Josef Johansson
>>> Sent: Wednesday, 13 October 2021 13:37
>>> To: pve-devel at lists.proxmox.com
>>> Subject: Re: [pve-devel] BUG in vlan aware bridge
>>>
>>> Hi,
>>>
>>> AFAIK it's netfilter that does the defragmenting, so that it can firewall.
>>>
>>> If you specify
>>>
>>> iptables -t raw -I PREROUTING -s 77.244.240.131 -j NOTRACK
>>>
>>> iptables -t raw -I PREROUTING -s 37.16.72.52 -j NOTRACK
>>>
>>> you should be able to make it ignore your packets.
>>>
>>>
>>> As a datapoint, I could ping fine from an MTU 1500 host, over MTU 9000 vlan-aware bridges with firewalls, to another MTU 1500 host.
>>>
>>> As you would expect, the packet is defragmented over the MTU 9000 links and fragmented again at the MTU 1500 devices.
>>>
>>> Kind regards
>>> Josef Johansson
>>>
>>> On 10/13/21 11:22, VELARTIS Philipp Dürhammer wrote:
>>>> HI,
>>>>
>>>>
>>>> Yes, I think it has nothing to do with the bonds but with the vlan-aware bridge interface.
>>>>
>>>> I see this with ping -s 1500
>>>>
>>>> On tap interface: 
>>>> 11:19:35.141414 62:47:e0:fe:f9:31 > 54:e0:32:27:6e:50, ethertype IPv4 (0x0800), length 1514: (tos 0x0, ttl 64, id 39999, offset 0, flags [+], proto ICMP (1), length 1500)
>>>>     37.16.72.52 > 77.244.240.131: ICMP echo request, id 2182, seq 4, 
>>>> length 1480
>>>> 11:19:35.141430 62:47:e0:fe:f9:31 > 54:e0:32:27:6e:50, ethertype IPv4 (0x0800), length 562: (tos 0x0, ttl 64, id 39999, offset 1480, flags [none], proto ICMP (1), length 548)
>>>>     37.16.72.52 > 77.244.240.131: ip-proto-1
>>>>
>>>> On vmbr0:
>>>> 11:19:35.141442 62:47:e0:fe:f9:31 > 54:e0:32:27:6e:50, ethertype 802.1Q (0x8100), length 2046: vlan 350, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 39999, offset 0, flags [none], proto ICMP (1), length 2028)
>>>>     37.16.72.52 > 77.244.240.131: ICMP echo request, id 2182, seq 4, 
>>>> length 2008
>>>>
>>>> On bond0 its gone....
>>>>
>>>> But who is in charge of fragmenting the packets normally? The bridge itself? Netfilter?
>>>>
>>>> -----Original Message-----
>>>> From: pve-devel <pve-devel-bounces at lists.proxmox.com> On behalf of 
>>>> Stoyan Marinov
>>>> Sent: Wednesday, 13 October 2021 00:46
>>>> To: Proxmox VE development discussion <pve-devel at lists.proxmox.com>
>>>> Subject: Re: [pve-devel] BUG in vlan aware bridge
>>>>
>>>> OK, I have just verified it has nothing to do with bonds. I get the same behavior with a vlan-aware bridge, bridge-nf-call-iptables=1, and a regular eth0 being part of the bridge. Packets arrive fragmented on the tap, get reassembled by netfilter, and are then re-injected into the bridge reassembled (full size).
>>>>
>>>> I did have limited success by setting net.bridge.bridge-nf-filter-vlan-tagged to 1. Now packets seem to get fragmented on the way out and back in, but there are still issues:
>>>>
>>>> 1. I'm testing with ping -s 2000 (1500 mtu everywhere) to an external box. I do see reply packets arrive on the vm nic, but ping doesn't see them. Haven't analyzed much further.
>>>> 2. While watching with tcpdump (inside the vm) I notice "ip reassembly time exceeded" messages being generated by the vm.
>>>>
>>>> I'll try to investigate a bit further tomorrow.
>>>>
>>>>> On 12 Oct 2021, at 11:26 PM, Stoyan Marinov <stoyan at marinov.us> wrote:
>>>>>
>>>>> That's an interesting observation. Now that I think about it, it could be caused by the bonding and not the underlying device. When I tested this (about a year ago) I was using bonding on the mlx adapters and not on the intel ones.
>>>>>
>>>>>> On 12 Oct 2021, at 3:36 PM, VELARTIS Philipp Dürhammer <p.duerhammer at velartis.at> wrote:
>>>>>>
>>>>>> HI,
>>>>>>
>>>>>> we use HP servers with Intel cards or the standard HP NIC (I think 
>>>>>> also Intel)
>>>>>>
>>>>>> Also, I see that I made a mistake:
>>>>>>
>>>>>> Setup working:
>>>>>> tapX (UNtagged) <- -> vmbr0 <- - > bond0
>>>>>>
>>>>>> is correct. (Before, I had also written "tagged".)
>>>>>>
>>>>>> it should be :
>>>>>>
>>>>>> Setup not working:
>>>>>> tapX (tagged) <- -> vmbr0 <- - > bond0
>>>>>>
>>>>>> Setup working:
>>>>>> tapX (untagged) <- -> vmbr0 <- - > bond0
>>>>>>
>>>>>> Setup also working:
>>>>>> tapX < - - > vmbr0v350 < -- > bond0.350 < -- > bond0
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: pve-devel <pve-devel-bounces at lists.proxmox.com> On behalf of 
>>>>>> Stoyan Marinov
>>>>>> Sent: Tuesday, 12 October 2021 13:16
>>>>>> To: Proxmox VE development discussion <pve-devel at lists.proxmox.com>
>>>>>> Subject: Re: [pve-devel] BUG in vlan aware bridge
>>>>>>
>>>>>> I'm having the very same issue with Mellanox ethernet adapters. I don't see this behavior with Intel nics. What network cards do you have?
>>>>>>
>>>>>>> On 12 Oct 2021, at 1:48 PM, VELARTIS Philipp Dürhammer <p.duerhammer at velartis.at> wrote:
>>>>>>>
>>>>>>> HI,
>>>>>>>
>>>>>>> I have been playing around for days because we have strange packet losses.
>>>>>>> Finally I can report the following (Linux 5.11.22-4-pve, Proxmox 7, all devices MTU 1500):
>>>>>>>
>>>>>>> Packets with sizes > 1500 work well without a VLAN tag, but the moment they are tagged they are dropped by the bond device.
>>>>>>> Netfilter (bridge-nf-call-iptables set to 1) always reassembles the packets when they arrive at a bridge, but they don't get fragmented again if they are VLAN tagged, so the bond device drops them. If the bridge is NOT vlan-aware they do get fragmented again and it works well.
>>>>>>>
>>>>>>> Setup not working:
>>>>>>>
>>>>>>> tapX (tagged) <- -> vmbr0 <- - > bond0
>>>>>>>
>>>>>>> Setup working:
>>>>>>>
>>>>>>> tapX (tagged) <- -> vmbr0 <- - > bond0
>>>>>>>
>>>>>>> Setup also working:
>>>>>>>
>>>>>>> tapX < - - > vmbr0v350 < -- > bond0.350 < -- > bond0
>>>>>>>
>>>>>>> Have you got any idea where to search? I don't understand who is 
>>>>>>> in charge of fragmenting packets again if they get reassembled by 
>>>>>>> netfilter (and why it is not working with vlan-aware bridges).
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> pve-devel mailing list
>>>>>>> pve-devel at lists.proxmox.com
>>>>>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>>>>>>>



