[pve-devel] [PATCH cluster/docs/manager/network/proxmox{, -ve-rs, -firewall, -perl-rs} 00/52] Add SDN Fabrics

Friedrich Weber f.weber at proxmox.com
Thu Apr 3 15:44:42 CEST 2025


On 03/04/2025 12:21, Gabriel Goller wrote:
> On 03.04.2025 10:30, Friedrich Weber wrote:
>> On 28/03/2025 18:12, Gabriel Goller wrote:
>>> This series allows the user to add fabrics such as OpenFabric and OSPF
>>> over their clusters.
>>>
>>> Overview
>>> ========
>>>
>>> This series allows the user to create routed networks ('fabrics') across
>>> their clusters, which can be used as the underlay network for an EVPN
>>> cluster, or for creating Ceph full mesh clusters easily.
>>>
>>> This patch series adds the initial support for two routing protocols:
>>> * OpenFabric
>>> * OSPF
>>
>> I tested a bit with packages Gabriel built for me (thanks!),
>> both OSPF and OpenFabric, and also set up a Ceph full mesh over
>> OpenFabric.
>> Overall it looked quite smooth! I didn't notice huge issues, but have
>> some minor points below:
>>
>> - I think the error message when frr+frr-pythontools are not installed
>> looked a bit scary. It's on me for not reading the docs, but still,
>> might be nice to have a friendlier error message in that case :)
> 
> Umm which message exactly do you mean? If I uninstall frr and
> frr-pythontools, I get:
> 
>     WARN: missing /usr/lib/frr/frr-reload.py. Please install frr-pythontools package

On a fresh installation without frr + frr-pythontools, I get the
following on srvreload:

> TASK ERROR: can't open '/etc/frr/daemons' - No such file or directory

Same if I `apt purge frr frr-pythontools` -- I guess because this one
actually removes /etc/frr.

Admittedly that's not very scary after all and somewhat
self-explanatory, but still not as nice as the error message you quote.
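
To make the difference between the two failure modes concrete, here is a
minimal sketch of a pre-flight check that would turn both into the same kind
of hint. It is only an illustration, not what the series does: the function
names are hypothetical, and the mapping of /etc/frr/daemons to the frr
package is an assumption (the frr-reload.py mapping comes from the warning
quoted above).

```python
from pathlib import Path

# Marker files taken from the two messages in this thread. Which package
# ships /etc/frr/daemons is an assumption made for this sketch.
REQUIRED_FILES = {
    "etc/frr/daemons": "frr",
    "usr/lib/frr/frr-reload.py": "frr-pythontools",
}


def missing_frr_packages(root="/"):
    """Return the packages whose marker file is absent below `root`."""
    base = Path(root)
    return sorted(
        {pkg for rel, pkg in REQUIRED_FILES.items() if not (base / rel).is_file()}
    )


def friendly_message(root="/"):
    """Build a one-line hint, or return None when nothing is missing."""
    missing = missing_frr_packages(root)
    if not missing:
        return None
    return "FRR is not fully installed, please install: " + " ".join(missing)
```

The `root` parameter only exists so the check can be tried against a scratch
directory instead of the live system.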

>> - having already added one node, and then adding another using the "Add
>> Node" dialog, it has happened multiple times that I kept "Node" at the
>> default first node (which I already had defined) while I thought I was
>> configuring the second one, and only noticed when I submitted and got
>> "node already exists". And then, when I change the "Node" to the correct
>> one, I lost my form input :) I understand that we need to reload when
>> changing "Node" (the other node might have other interfaces), but to
>> avoid the above, maybe the dialog could preselect a node that is not yet
>> defined?
> 
> Yep, this is already on our todo-list. Should be as simple as passing
> an array of already configured nodes down to the NodeEdit component and
> then disallow them in the pveNodeSelector using 'disallowNodes'.

OK, thanks :)

>> - when removing a fabric, the IP addresses defined on the interfaces
>> remain until the next reboot. I guess the reason is that ifupdown2
>> doesn't remove IP addresses when the corresponding stanza vanishes. Not
>> sure if this can be easily fixed -- if not, maybe this would be worth a
>> note in the docs?
> 
> Umm, I think `ifreload -a` should remove all the addresses? At least it
> works on my machine :)
> 
> But I'll check again.

I took a closer look -- seems I can only reproduce this if
/etc/network/interfaces contains an empty `iface INTERFACE inet manual`
stanza for the interface. Without such a stanza, the IP address is
removed correctly.
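
For anyone who wants to check whether a node is affected, a rough sketch for
spotting such empty stanzas follows. It assumes that stanza options are
indented (which is conventional, but not something ifupdown2 enforces), and
the function name is made up for this example.

```python
import re

# Matches only the `inet manual` form mentioned above.
IFACE_RE = re.compile(r"^iface\s+(\S+)\s+inet\s+manual\s*$")


def empty_manual_stanzas(text):
    """Return interface names whose `inet manual` stanza has no option lines."""
    names = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = IFACE_RE.match(line)
        if not match:
            continue
        has_options = False
        for nxt in lines[i + 1:]:
            stripped = nxt.strip()
            # Blank lines and comments do not decide anything, keep looking.
            if not stripped or stripped.startswith("#"):
                continue
            # Assumption: an indented line is an option of this stanza,
            # an unindented one starts the next stanza.
            has_options = nxt[0].isspace()
            break
        if not has_options:
            names.append(match.group(1))
    return names
```

It can be fed the contents of /etc/network/interfaces, e.g.
`empty_manual_stanzas(open("/etc/network/interfaces").read())`.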

>> - regarding the hello/csnp intervals: it would be nice to mention what the
>> default values are. Also, probably not relevant for this patch series, but
>> wanted to mention anyway: For running a Ceph full mesh over a fabric,
>> one probably wants to set relatively low values here (as our wiki guide
>> does [3])? If there is a guide in the future for setting up Ceph full mesh
>> over fabric, would be nice if the guide would mention that.
> 
> Yep, fixed this. Added the default values in the docs for v2.

Thanks!

>> - when I remove hello interval+multiplier and the csnp via the GUI, I get
>> the following warning in the journal:
>>
>>> Apr 03 10:20:50 fabric159 pveproxy[9244]: Use of uninitialized value $id in concatenation (.) or string at /usr/share/perl5/PVE/API2/Network/SDN/Fabrics.pm line 330.
>>> Apr 03 10:21:02 fabric159 pveproxy[9246]: Use of uninitialized value $id in concatenation (.) or string at /usr/share/perl5/PVE/API2/Network/SDN/Fabrics.pm line 330.
>>> Apr 03 10:21:02 fabric159 pveproxy[9246]: Use of uninitialized value $id in concatenation (.) or string at /usr/share/perl5/PVE/API2/Network/SDN/Fabrics.pm line 330.
> 
> I don't think this is related to the hello-interval and multiplier
> values. AFAICT this is because of the permissions, which are completely
> overhauled in v2.

OK, I see -- I can try to test this again in v2.

>> - after setting up an OSPF fabric in a 3-node full mesh, I couldn't ping
>> the loopback addresses until I rebooted all nodes. I've attached the
>> task logs of the srvreloads and the ospf.cfg below [1]. After a reboot,
>> the pings work fine. Could it be because an OSPF with the same area
>> existed previously?
> 
> How long did you wait? Sometimes they take a while to converge, usually
> OSPF more than OpenFabric. Could also be that some routes are cached/not
> removed properly. Could you also paste the frr.conf if you still have
> the cluster (`cat /etc/frr/frr.conf`)? Also, can you reproduce this? Does
> a `systemctl restart frr` fix it as well?

I just tried it again and it seems to be reproducible: I set up OSPF on a
fresh full-mesh 3-node cluster and waited 10 minutes after the srvreload,
but the routes didn't come up. I've attached the frr.conf files [1].
After `systemctl restart frr`, the routes came up in a minute.

I also have a snapshot of the cluster pre-reboot, if you want to take a
look at it.

>> - probably a user error, but: after setting up an OpenFabric fabric and
>> rebooting, the routes didn't come up automatically. My openfabric.cfg is
>> in [2]. systemctl status frr shows the following:
>>
>>> Apr 03 10:02:20 fabric159 systemd[1]: Started frr.service - FRRouting.
>>> Apr 03 10:02:21 fabric159 fabricd[699]: [NBV6R-CM3PT] OpenFabric: Needed to resync LSPDB using CSNP!
>>> Apr 03 10:03:48 fabric159 fabricd[699]: [QBAZ6-3YZR3] OpenFabric: Could not find two T0 routers
>>
>>> Apr 03 10:02:23 fabric160 systemd[1]: Started frr.service - FRRouting.
>>> Apr 03 10:02:24 fabric160 fabricd[674]: [MZS0T-YRAMC] OpenFabric: Initial synchronization on ens19 complete.
>>> Apr 03 10:03:48 fabric160 fabricd[674]: [QBAZ6-3YZR3] OpenFabric: Could not find two T0 routers
>>
>>> Apr 03 10:02:19 fabric161 systemd[1]: Started frr.service - FRRouting.
>>> Apr 03 10:02:21 fabric161 fabricd[681]: [MZS0T-YRAMC] OpenFabric: Initial synchronization on ens20 complete.
>>> Apr 03 10:03:48 fabric161 fabricd[681]: [QBAZ6-3YZR3] OpenFabric: Could not find two T0 routers
>>
>> Maybe I'm just too impatient, but restarting frr and waiting for ~30
>> seconds fixes it.
> 
> Yeah, as I said, sometimes converging takes a while, especially when
> older routes are around. The logs are just warnings that this isn't a
> proper "spine-leaf" topology and the IS-IS tier couldn't be determined;
> this shouldn't change anything though.
> 
> Will look into it though.
> 

OK -- let me know if I should test this again.

One more thing I just noticed now: After installing the packages, it
seems like the directory /etc/pve/sdn/fabrics isn't created and creating
a new fabric in the GUI fails with

> add sdn fabric failed: unable to open file '/etc/pve/sdn/fabrics/ospf.cfg.tmp.9220' - No such file or directory (500)

But a manual `systemctl restart pveproxy pvedaemon` seems to create it.

[1]
frr.conf on fabric159:

frr version 10.2.1
frr defaults datacenter
hostname fabric159
log syslog informational
service integrated-vtysh-config
!
router ospf
 ospf router-id 172.16.0.159
exit
!
interface dummy_12345
 ip ospf area 12345
 ip ospf passive
exit
!
interface ens19
 ip ospf area 12345
exit
!
interface ens20
 ip ospf area 12345
exit
!
access-list ospf_12345_ips permit 172.16.0.0/24
!
route-map ospf permit 100
 match ip address ospf_12345_ips
 set src 172.16.0.159
exit
!
ip protocol ospf route-map ospf
!
line vty

frr.conf on fabric160:

frr version 10.2.1
frr defaults datacenter
hostname fabric160
log syslog informational
service integrated-vtysh-config
!
router ospf
 ospf router-id 172.16.0.160
exit
!
interface dummy_12345
 ip ospf area 12345
 ip ospf passive
exit
!
interface ens19
 ip ospf area 12345
exit
!
interface ens20
 ip ospf area 12345
exit
!
access-list ospf_12345_ips permit 172.16.0.0/24
!
route-map ospf permit 100
 match ip address ospf_12345_ips
 set src 172.16.0.160
exit
!
ip protocol ospf route-map ospf
!
line vty

frr.conf on fabric161:

frr version 10.2.1
frr defaults datacenter
hostname fabric161
log syslog informational
service integrated-vtysh-config
!
router ospf
 ospf router-id 172.16.0.161
exit
!
interface dummy_12345
 ip ospf area 12345
 ip ospf passive
exit
!
interface ens19
 ip ospf area 12345
exit
!
interface ens20
 ip ospf area 12345
exit
!
access-list ospf_12345_ips permit 172.16.0.0/24
!
route-map ospf permit 100
 match ip address ospf_12345_ips
 set src 172.16.0.161
exit
!
ip protocol ospf route-map ospf
!
line vty
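
As a quick sanity check on dumps like the three above, the per-interface OSPF
areas can be extracted and compared across nodes. The following is only a
sketch for the `interface ... / ip ospf area ... / exit` layout shown here,
not a general frr.conf parser, and the function names are hypothetical.

```python
def ospf_areas(conf):
    """Map each interface name to the OSPF area configured on it."""
    areas = {}
    current = None
    for raw in conf.splitlines():
        line = raw.strip()
        if line.startswith("interface "):
            current = line.split()[1]
        elif line == "exit":
            # Leaving the current block (interface, router, route-map, ...).
            current = None
        elif current and line.startswith("ip ospf area "):
            areas[current] = line.split()[3]
    return areas


def same_areas(confs):
    """True if every config assigns the same area to the same interfaces."""
    parsed = [ospf_areas(conf) for conf in confs]
    return all(entry == parsed[0] for entry in parsed)
```

For the three configs in [1], each one should yield area 12345 on
dummy_12345, ens19 and ens20, so a mismatch in area assignment is probably
not what keeps the routes from coming up.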



