[pve-devel] [PATCH manager v4 3/6] added Config for Shared Filesystem Directories

Thu May 4 12:21:53 CEST 2023

Am 04/05/2023 um 10:57 schrieb Dominik Csapak:
> On 5/4/23 10:42, Thomas Lamprecht wrote:
>> Am 04/05/2023 um 10:31 schrieb Dominik Csapak:
>>> On 5/4/23 10:13, Thomas Lamprecht wrote:
>>>> Am 03/05/2023 um 13:26 schrieb Dominik Csapak:
>>>>> just a short comment since this series overlaps a bit with my
>>>>> cluster resource mapping series (i plan on sending a v4 soon).
>>>>>
>>>>> i'd prefer to have the configuration endpoints for mapping bundled in a subdirectory
>>>>> so instead of /nodes/<node>/dirs/ i'd put it in /nodes/<node>/mapping/dirs/
>>>>> (or /nodes/<node>/map/dirs )
>>>>>
>>>>> @thomas, @fabian, any other input on that?
>>>>>
>>>>
>>>> huh? aren't mappings per definition cluster wide i.e. /cluster/resource-map/<mapping-id>
>>>> than then allows to add/update the mapping of a resource on a specific node?
>>>> A node specific path makes no sense to me, at max it would if adding/removing a mapping
>>>> is completely decoupled from adding/removing/updaing entries to it – but that seems
>>>> convoluted from an usage POV and easy to get out of sync with the actual mapping list.
>>>
>>> in markus series the mapping are only ever per node, so each node has it's
>>> own dir mapping
>>
>> Every resource maping is always per node, so that's not really changing anything.
>>
> 
> i meant there is no cluster aspect in the current version of markus series at all

only if you lock down the vm hard to be kept at the node its mapping was configured.

> 
>> Rather what about migrations? Would be simpler from migration and ACL POV to have
>> it cluster wide,
> 
> sure, but how we get the info during migration is just an implementation detail and
> for that shouldn't matter where the configs/api endpoints are

no it really isn't an implementation detail how ACLs govern access to mappings, which
in turn dictactes where the ACL object paths should be, which then again ahs implications
for where the API endpoints and configs should be – node level isn't the one.

>>>
>>> in my series, the actual config was cluster-wide, but the api endpoint to configure
>>> them were sitting in the node path (e.g. /node/<nodes>/hardware-map/pci/*)
>>  > Please no.
> 
> was that way since the very first version, would have been nice to get that feedback
> earlier
> 

IIRC I always talked about this being on cluster level, especially w.r.t. to making it
more general resource mapping, maybe I didn't looked to closely and was thrown off
by the series name, i.e., "add *cluster-wide* hardware device mapping" and on the ACL
path discussion (which IIRC never had any node-specific param in there, and thus only
work if it's either a clusterwide config or if ID uniquness is guaranteed by the pmxcfs
like we do for VMIDs)

Anyhow, sorry, I'll try to check for such things more closely earlier in the future.

>>>
>>> we *could* put them into the cluster path api, but we'd need to send a node parameter
>>> along and forward it there anyway, so that wouldn't really make a difference
>>
>> no need for that, see above.
> 
> there is a need for the node parameter, because you always need to know for which node the
> mapping is anyway ;) or did you mean the 'forward' part?

yes, I certainly meant the forward part – otherwise my "Every resource mapping is always
per node" sentence and mentioned the node selector above would be really odd. A
node:local-id map always needs to exist, I figured that's the core thing this series wants
to solve, but as it's not a hard requirement to forward every request (and a huge PITA
to get in sync if done that way, or at least would reaquire using a cfs_domain_lock),
so no (hard) need for proxying.

>>>
>>> for reading the mappings, that could be done there, but in my series in the gui at least,
>>> i have to make a call to each node to get the current state of the mapping
>>> (e.g. if the pci device is still there)
>>
>> For now not ideal but ok, in the future I'd rather go in the direction of broadcasting
>> some types of HW resources via pmxcfs KV and then this isn't an issue anymore.
> 
> yeah can be done, but the question is if we want to broadcast the whole pci/usb list
> from all nodes? (i don't believe that scales well?)

how is one or two dozen of PCI IDs + some meta info, broadcasted normally once per boot
(or statd start) per node a scaling problem? Especially compared to the amount of data we
broadcast otherwise it's rather nelligible.

> 
>>
>>>
>>> if a mapping exists (globally) is not interesting most of the time, we only need to know
>>> if it exists at a specific node
>>
>> that's looking at it backwards, the user and ACL only care for global mapings, how
>> the code implements that is then, well an implementation detail.
> 
> ACL yes, the user must configure the mapping on a vm (which sits on a specific node)
> and there the mapping must exist on that node
> 

Where they also don't care for local-nodeness, they just select a mapping (no need to hard
error on the mapping not being usable on the current node, VM can be migrated before the
start – albeit it would certainly improve UX if it was visually hinted); otherwise they
could've just used a local ID directly.

>>
>>>
>>> also, after seeing markus' patches, i also leaned more in the direction of splitting
>>> my global mapping config into a config per type+node (so node1/usb.conf, node1/pci.conf,
>>
>> no, please no forest^W jungle of config trees :/
>>
>> A /etc/pve/resource-map/<type>.conf must be enough, even a /etc/pve/resource-map.conf
>> should be tbh., but I could imagine that splitting per resource type makes some (schema)
>> things a bit easier and reduce some bug potential, so not to hard feelings on having one
>> cluster-wide per type; but really not more.
>>
> 
> sure a single config is ok for me (only per type would be weird, since resuing the

Not sure if I'd find that flat out weird, depends on the whole design and planned future 
direction and if it's OK that the ID used for an PCI mapping can never be reused for some
directory or other relatively unrelated mapping. FWIW, <type> here could also be the higher
level resource type (group), i.e., hardware.cfg for PCI/USB, one for local file system stuff
like directories.

FWIW, section config have a disadvantage if very different config types are mixed, as the
schema is then also shown as sum of all possible types in API/man page (even storage.cfg,
while handling similar things, is already very confusing); possibly solvable on api/manpage
doc generation though; just mentioning it as I had that in my mind already a few times
and as we should the improve sometime, and expanding use for wildly different types it
might be better to do it sooner than later (but not directly related to this).

> sectionconfig would only ever have single type and we'd have to encode the
> nodename somehow)

Ideally we wouldn't have to encode the nodename, as doing so would make especially using
a single config weird (i.e., it would make the <id>+<node> the sectionID from a config POV
but not from a usage POV, which would then be weird for a e.g., "pci: foo:nodeA" and a
"usb: foo:nodeB".

It should be tried to have a single config entry per ID, i.e. very handwavy:

pci: id
    comment foo
    allow-multi-function 1
    other-map-wide-e.g-enforcement-setting-a 1
    ...
    map nodeA:3e:00.0,nodeB:0a:00.0

Or better, with the long contemplated array support (which not necesarrily you'd need to
spearhead)

pci: id
    ...
    map node=A,id=3e:00.0
    map node=B,id=3e:01.0

I read (but not in extended detail) "[PATCH common v3 2/3] add PVE/HardwareMap" and above
is probably not expressive enough on it's own, but otoh, the formats used in the patch
seem a bit to flexible (i.e., makes not much sense to allow mdev on one map entry of one
node but not on the one from another node – in practice that can hardly work for VMs I
guess and would limit any future live-migration caps?). Also why's the ID that split up
there?

I'd rather try to keep the map to the minimal ID info and some general restrictions (e.g.,
if some special use is allowed), but other details that only apply on using the mapped
device in a guest would still be handled there (as long as the mapping allows it).

Sorry for the a bit higher level and slightly hand-wavy description, if you list specific
problems and required things I can try to formalize the config format I'd envision to then
be also useful for other resource mappings a bit more.

And just to bring it at the "table", in theory we _maybe_ could split naming a node device
and the cluster-wide mapping, i.e., the mapping would be, well, cluster-wide and naming
devices plus some extra "how can/should it be used" would then be node-specific (possible
not even in /etc/pve) – not saying I'd prefer that, but that's something that could be
explored before finalizing on a design; I don't want to rush this as we have to support
something big and certainly poplular as this basically forever, and while config migrations
can be done somehow it's basically always a pita.