[pve-devel] [WIP v2 cluster/network/manager/qemu-server/container 00/10] Add support for DHCP servers to SDN

Stefan Lendl s.lendl at proxmox.com
Fri Oct 27 14:26:02 CEST 2023

Thomas Lamprecht <t.lamprecht at proxmox.com> writes:

> Am 23/10/2023 um 14:40 schrieb Stefan Lendl:
>> I am currently working on the SDN feature.  This is an initial review of
>> the patch series and I am trying to make a strong case against ephemeral
>> DHCP IP reservation.
> Stefan Hanreich's reply to the cover letter already mentions upserts, those
> will avoid basically all problems while allowing for some dynamic changes.

I totally agree with upserts and my patches add this functionality.

>> The current state of the patch series invokes the IPAM on every VM/CT
>> start/stop to add or remove the IP from the IPAM.
>> This triggers the dnsmasq config generation on the specific host with
>> only the MAC/IP mapping of that particular host.
>> From reading the discussion of the v1 patch series I understand this
>> approach tries to implement the ephemeral IP reservation strategy. From
>> off-list conversations with Stefan Hanreich, I agree that having
>> ephemeral IP reservation coordinated by the IPAM requires us to
>> re-implement DHCP functionality in the IPAM and heavily rely on syncing
>> between the different services.
>> To maintain reliable sync we need to hook into many different places
>> where the IPAM needs to be queried.  Any issues with the implementation
>> may lead to IPAM and DHCP local config state running out of sync, causing
>> network issues such as duplicate IPs.
> The same is true for permanent reservations, wherever that reservation is
> saved needs to be in sync with IPAM, e.g., also on backup restore (into a
> new env), if subnets change their configured CIDRs, ...

Yes, agreed, but it's arguably fewer states and situations that need to be

The current implementation had a different state per node and depended
on the online/offline state of the guest.

It is currently not allowed to change the CIDR of a subnet.

>> Furthermore, every interaction with the IPAM requires a cluster-wide
>> lock on the IPAM. Having a central cluster-wide lock on every VM
>> start/stop/migrate will significantly limit parallel operations.  Event
>> starting two VMs in parallel will be limited by this central lock. At
>> boot trying to start many VMs (ideally as much in parallel as possible)
>> is limited by the central IPAM lock even further.
> Cluster wide locks are relatively cheap, especially if one avoids having
> a long critical section, i.e., query IPAM while still unlocked, then
> read and update the state locked, if the newly received IP is already
> in there then simply give up lock again and repeat.
> We also have a cluster-wide lock for starting HA guests, to set the
> wanted ha-resource state, that is no issue at all, you can start/stop
> many orders of magnitude more VMs than any HW/Storage could cope with.
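
For illustration only (not taken from the patch series): the short critical
section described above could look roughly like the sketch below. The names
`cluster_lock`, `reserved`, `propose_ip` and `reserve` are hypothetical, and
a `threading.Lock` merely stands in for a real cluster-wide lock.

```python
import threading

# Assumption: a local lock standing in for a cluster-wide lock.
cluster_lock = threading.Lock()
# Assumption: shared IPAM state, mapping IP -> MAC.
reserved = {}


def propose_ip(pool, taken):
    """Return the first IP in pool not in the (possibly stale) snapshot."""
    for ip in pool:
        if ip not in taken:
            return ip
    return None


def reserve(mac, pool, max_tries=10):
    """Reserve an IP for mac, holding the lock only for check-and-update."""
    for _ in range(max_tries):
        # Query while unlocked: may be slow, and the result may be stale.
        candidate = propose_ip(pool, set(reserved))
        if candidate is None:
            raise RuntimeError("pool exhausted")
        # Short critical section: re-check the candidate and update state.
        with cluster_lock:
            if candidate not in reserved:
                reserved[candidate] = mac
                return candidate
        # The candidate was taken in the meantime: drop the lock and retry.
    raise RuntimeError("too many conflicts")
```

The lock is held only for one lookup and one write, so a slow IPAM query
never blocks other callers.
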
>> I argue that we shall not support ephemeral IPs altogether.
>> The alternative is to make all IPAM reservations persistent.
>> Using persistent IPs only reduces the interactions of VM/CTs with the
>> IPAM to a minimum of NIC joining a subnet and NIC leaving a subnet. I am
>> deliberately not referring to VMs because a VM may be part of multiple
>> VNets or even multiple times in the same VNet (regardless if that is
>> sensible).
> Yeah, talking about vNICs / veth's is the better term here, guests are
> only indirectly relevant.
>> Cases the IPAM needs to be involved:
>> - NIC with DHCP enabled VNet is added to VM config
>> - NIC with DHCP enabled VNet is removed from VM config
>> - NIC is assigned to another Bridge
>>   can be treated as individual leave + join events
> and:
> - subnet config is changed
> - vNIC changes from SDN-DHCP managed to manual, or vice versa
>   Albeit that can almost be treated like vNet leave/join though
>> Cases that are explicitly not covered but may be added if desired:
>> - Manually assign an IP address on a NIC
>>   will not be automatically visible in the IPAM
> This sounds like you want to save the state in the VM config, which I'm
> rather skeptical about, and would try hard to avoid. We also would need
> to differentiate between bridges that are part of DHCP-managed SDN and others,
> as else a user could set some IP but nothing would happen.

I am sorry, my explanation was not clear here. I do not want to store the IP
inside the VM config.  I agree that this would not be ideal.  If a user
configures an IP from inside the VM, we have no way of tracking that IP.

For now, every added vNIC gets an IP from the IPAM, and if the guest is
configured to use DHCP, it will get this IP from the DHCP server.

If the user decides to manually configure the IP, they will have to
reserve it in the IPAM, and mark the IP as "manual".
This will prevent the IPAM from allocating the IP again and keep the
IP/MAC mapping even if the VM is destroyed.

This is not implemented yet, but sketched out with Mira off-list.
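
As a rough sketch of the intended behaviour (this is not the actual
implementation; the class `Ipam` and its methods `join`, `leave` and
`reserve_manual` are hypothetical names, and gateway handling is left out):

```python
import ipaddress


class Ipam:
    """Toy persistent IPAM for one subnet: MAC -> (IP, manual flag)."""

    def __init__(self, cidr):
        self.hosts = list(ipaddress.ip_network(cidr).hosts())
        self.by_mac = {}

    def join(self, mac):
        """Upsert: return the existing IP for mac, else allocate a free one."""
        if mac in self.by_mac:
            return self.by_mac[mac][0]
        used = {ip for ip, _ in self.by_mac.values()}
        for ip in self.hosts:
            if ip not in used:
                self.by_mac[mac] = (ip, False)
                return ip
        raise RuntimeError("subnet exhausted")

    def reserve_manual(self, mac, ip):
        """Record a user-configured IP so it is never handed out again."""
        ip = ipaddress.ip_address(ip)
        for other_mac, (other_ip, _) in self.by_mac.items():
            if other_ip == ip and other_mac != mac:
                raise ValueError("IP already in use")
        self.by_mac[mac] = (ip, True)

    def leave(self, mac):
        """Free the mapping when the vNIC leaves, unless marked manual."""
        entry = self.by_mac.get(mac)
        if entry is not None and not entry[1]:
            del self.by_mac[mac]
```

Calling `join` twice for the same MAC returns the same IP, and a "manual"
entry survives `leave`, so it stays reserved even after the VM is destroyed.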

>> - Manually change the MAC on a NIC
>>   don't do that > you are on your own.
> FWIW, a clone is such a change, and we have to support that, otherwise
> the MAC field needs to get some warning hints or even become read-only
> in the UI.
>>   Not handled > change in IPAM manually
>> Once an IP is reserved via IPAM, the dnsmasq config can be generated
>> stateless and idempotent from the pve IPAM and is identical on all nodes
>> regardless if a VM/CT actually resides on that node or is running or
>> stopped.  This is especially useful for VM migration because the IP
>> stays consistent without special consideration.
> That should be orthogonal to the feature set, if we have all the info
> saved somewhere else
> But this also speaks against having it in the VM config, as that would
> mean that every node needs to parse every guests' config periodically,
> which is way worse than some cluster lock and breaks with our base
> axiom that guests are owned by their current node, and only by that,
> and a node should not really alter behavior dependent on some "foreign"
> guest.
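
To illustrate the stateless generation from the IPAM mentioned above, here
is a minimal sketch. The function name `render_dhcp_hosts` and the input
format (a MAC -> IP mapping) are assumptions for this example; `dhcp-host=`
is the dnsmasq option for static MAC/IP assignments.

```python
def render_dhcp_hosts(reservations):
    """Render dnsmasq dhcp-host lines from a MAC -> IP mapping.

    The output depends only on the mapping, not on which node runs this or
    whether a guest is running, so every node produces identical text.
    """
    # Normalize and sort so the result is independent of input order.
    entries = sorted((mac.lower(), str(ip)) for mac, ip in reservations.items())
    return "".join(f"dhcp-host={mac},{ip}\n" for mac, ip in entries)
```

Running it repeatedly on the same IPAM content gives byte-identical output,
so a node can compare against the existing file and skip reloading dnsmasq
when nothing changed.
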
>> Snapshot/revert, backup/restore, suspend/hibernate/resume cases are
>> automatically covered because the IP will already be reserved for that
>> MAC.
> Not really, restore to another setup is broken, one could resume the
> VM after having changed CIDRs of a subnet, making that broken too, ...
>> If the admin wants to change the IP of a VM, this can be done via the
>> IPAM API/UI which will have to be implemented separately.
> Providing Overrides can be fine, but IMO that all should be still in
> the SDN state, not per-VM one, and ideally use a common API.
>> A limitation of this approach vs dynamic IP reservation is that the IP
>> range on the subnet needs to be large enough to hold all IPs of all,
>> even stopped, VMs in that subnet. This is in contrast to default DHCP
>> functionality where only the number of actively running VMs is limited.
>> It should be enough to mention this in the docs.
> In production setups it should not matter _that_ much, but it might
> be a bit of a PITA if one has a few "archived" VMs or the like, but
> that alone would
>> I will further review the code and try to implement the aforementioned
>> approach.
> You can naturally experiment, but I'd also try the upsert proposal from
> Stefan H., as IMO that sounds like a good balance.
