[pve-devel] [PATCH access-control/cluster/docs/gui-tests/manager/network/proxmox{, -ve-rs, -perl-rs} v2 00/57] Add SDN Fabrics
Stefan Hanreich
s.hanreich at proxmox.com
Mon Apr 7 11:39:49 CEST 2025
On 4/7/25 10:53, Friedrich Weber wrote:
> On 04/04/2025 18:28, Gabriel Goller wrote:
>> This series allows the user to add fabrics such as OpenFabric and OSPF over
>> their clusters.
>>
>> This series relies on:
>> https://lore.proxmox.com/pve-devel/20250404135522.2603272-1-s.hanreich@proxmox.com/T/#mf4cf46c066d856cea819ac3e79d115a290f47466
>
> Thanks for the v2, I like this feature a lot!
>
> Unfortunately, one problem I noticed while testing this is that it may
> break pre-existing FRR configs (such as full-mesh Ceph clusters set up
> according to [1]) when making seemingly unrelated SDN changes. I already
> quickly discussed this with Stefan, posting here in case others have
> input as well.
>
> Steps to reproduce:
>
> - on PVE 8.3 (without these patches), set up Ceph full mesh with
> OpenFabric as described in [1], includes custom /etc/frr/frr.conf
> - also use some SDN feature, e.g. a VLAN zone with a Vnet
> - install patched packages, systemctl restart pveproxy pvedaemon
> - make a fabric-unrelated change in the SDN config, e.g. change tag of
> the VLAN zone Vnet
> - apply SDN config
>
> =>
> SDN stack writes out a nearly-empty /etc/frr/frr.conf on all nodes and
> thus takes down the full mesh:
>
> # cat /etc/frr/frr.conf
> frr version 10.2.1
> frr defaults datacenter
> hostname fabric159
> log syslog informational
> service integrated-vtysh-config
> !
> !
> line vty
>
> It seems to also disable the fabricd daemon in /etc/frr/daemons:
>
> # grep fabric /etc/frr/daemons
> fabricd=no
> fabricd_options="-A 127.0.0.1 --dummy_as_loopback"
> # vtysh -c 'show openfabric route'
> fabricd is not running
>
> It makes sense that one cannot use both our fabrics integration and
> custom FRR configs, but the above SDN config change is not related to
> fabrics, so we should probably avoid touching the frr.conf if possible.
> The wiki article [1] does warn that the full mesh doesn't work in
> combination with EVPN, but unfortunately doesn't mention an inherent
> incompatibility with the SDN stack as a whole.

For context: the initial issue here was that we previously did *not*
re-write the FRR configuration when you had an EVPN controller and
deleted it afterwards, so the FRR configuration lingered around after
the EVPN controller was gone.

That's because FRR config writing was bound to the EVPN controller: if
you didn't have one, the configuration wouldn't get written at all. In
my refactoring of the FRR config generation, I changed this to always
write the FRR config. That was intended to fix the bug mentioned above.

The mitigation I see is:

Read the previous running configuration before applying the new one.
Then, if the previous configuration contained any FRR-related entities
*or* the new configuration contains FRR-related entities: regenerate the
FRR config, otherwise leave it as is. That would restore the previous
behavior and should fix this regression.

The only thing that would then change compared to before is that if you
*only* had an IS-IS and/or BGP controller before (which did not generate
any FRR configuration without an EVPN controller), reapplying with any
of those in your configuration will now overwrite the full-mesh
configuration, since those now trigger an FRR configuration write too.

We could further restrict it to specific FRR types (EVPN controller and
fabrics, I'd say), but that would re-introduce the behavior mentioned
above where EVPN, BGP and IS-IS routers linger around when deleting an
EVPN controller (and having no fabrics).
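
To make the proposed check concrete, here is a rough sketch of the
decision logic. It is written in Python purely for illustration and is
not taken from the SDN code. The config layout (a dict of entity id to
a dict with a "type" key), the type names and both function names are
assumptions made up for this example.

```python
# Hypothetical type names; the real SDN config may use other identifiers.
FRR_TYPES = frozenset({"evpn", "bgp", "isis", "fabric"})


def has_frr_entities(config, frr_types=FRR_TYPES):
    """Return True if any entity in the config has an FRR-related type."""
    return any(entity.get("type") in frr_types for entity in config.values())


def should_write_frr_config(previous, new, frr_types=FRR_TYPES):
    """Regenerate frr.conf only if the old *or* the new config involves FRR."""
    return has_frr_entities(previous, frr_types) or has_frr_entities(new, frr_types)


# Only a VLAN zone/Vnet before and after: a custom frr.conf is left alone.
vlan_only = {"vnet0": {"type": "vlan"}}
print(should_write_frr_config(vlan_only, vlan_only))

# EVPN controller deleted: the config is rewritten, so nothing lingers.
with_evpn = {"ctrl0": {"type": "evpn"}, "vnet0": {"type": "vlan"}}
print(should_write_frr_config(with_evpn, vlan_only))

# BGP-only setup: rewritten with the full type set, skipped when the
# check is restricted to EVPN controllers and fabrics.
bgp_only = {"ctrl1": {"type": "bgp"}}
print(should_write_frr_config(bgp_only, bgp_only))
print(should_write_frr_config(bgp_only, bgp_only, frozenset({"evpn", "fabric"})))
```

The last two calls show the trade-off described above: the full type
set overwrites a custom config for BGP-only or IS-IS-only setups, while
the restricted set leaves it untouched at the cost of not cleaning up
after those controller types.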