[pve-devel] [RFC cluster 1/1] add generic data broadcast interface

Thomas Lamprecht t.lamprecht at proxmox.com
Fri Apr 26 16:53:38 CEST 2019


Am 4/26/19 um 12:08 PM schrieb Dominik Csapak:
> On 4/26/19 10:43 AM, Thomas Lamprecht wrote:
>> Am 4/26/19 um 8:21 AM schrieb Dominik Csapak:
>>> similar to how we handle the cluster wide tasklist and rrd data,
>>> have an interface that can sync data across the cluster
>>>
>>> this data is only transient and will not be written to disk
>>>
>>> we can use this for a number of things, e.g. getting the locks of the
>>> guests clusterwide, listing ceph services across the cluster, etc.
>>
>> How? One needs to write it out on every lock, and if it's only temporary it
>> does not seem really that great for things like locks? Also, you start to
>> broadcast everything twice to all nodes; the lock info is already there in the
>> memdb store, no need to maintain a parallel, heuristic copy of it for that..
>>
> 
> locks were just an example, but i would have let pvestatd broadcast the locks a
> second time (where we already have the config parsed), just to show them in
> /cluster/resources. That is already not synchronous for remote nodes, so a
> little delay there does not change the general behavior, and every bit of
> information we have there about the other cluster members is read with a
> similar mechanism (just the rrd data instead of the custom lock one).

Hmm, I doubt that letting all nodes across the whole cluster broadcast
such _already known_ info, readily available in the memdb, every few
seconds is a good approach.. For things not already there it could be
OK, as said, and it's not just a little bit async - this approach may
well set a lock in the state, and if the change back is missed (node
temporarily non-quorate) the stale lock info just lingers around.
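
To make that concern concrete, with the subs from the patch itself (key name
and values made up):

    # node A writes its guest locks into the transient kv store
    PVE::Cluster::broadcast_data('guest-locks', { 100 => 'backup' });

    # if pvestatd then hangs, or node A briefly drops out of the quorate
    # partition, nothing cleans that entry up - every other node keeps
    # reading the stale value until node A broadcasts again:
    my $locks = PVE::Cluster::get_data('guest-locks');
    # $locks->{nodeA}->{100} may still claim 'backup' although the lock
    # in the guest config is long gone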


> 
>>
>>>
>>> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
>>> ---
>>>   data/PVE/Cluster.pm | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 47 insertions(+)
>>>
>>> diff --git a/data/PVE/Cluster.pm b/data/PVE/Cluster.pm
>>> index 5af11e6..ea36a0a 100644
>>> --- a/data/PVE/Cluster.pm
>>> +++ b/data/PVE/Cluster.pm
>>> @@ -540,6 +540,53 @@ sub get_nodelist {
>>>       return [ keys %$nodelist ];
>>>   }
>>>   +# best effort data store for cluster
>>> +# this data is gone if the pmxcfs is restarted, but only the local data,
>>
>> If one is restarted, or if all are? And who cleans up locks, or other data,
>> on a node failure (in whatever form)? Otherwise this _will_ have very
>> confusing outcomes...
> 
> the way pmxcfs handles the kv store is that, for the local node, it looks the
> key up in the local kvhash (which is empty after a restart), but for remote
> nodes it looks it up in the cluster-synchronized kvhash of that node
> 
> i was not sure whether this is intended or a bug, but i treated it as intended.
> the whole thing is also not really documented (imported from svn)

The code is the docs here, I'd say, else ask Dietmar.
One could invalidate the table on partition changes for leavers,
but for RRD this was probably not of use, same with ceph service
states.

> 
> so we have to either look up the local data separately,
> or make sure we broadcast it once before looking it up,
> or just live with the fact that after a restart the data is
> not current until we broadcast it for the first time

Yeah, for this we have the syncing mechanisms, where file updates
get resolved and thus the lock state is valid thereafter; the
status mechanism is just not viable for that.
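
For the use cases where the data is not file-backed anyway, a consumer could
at least make staleness visible, e.g. by putting a timestamp into the broadcast
value - a rough illustration only, field names made up and not part of the
patch:

    use PVE::Cluster;

    # broadcaster side (e.g. pvestatd), runs periodically on every node
    my $state = { services => { 'osd.3' => 'running' }, updated => time() };
    PVE::Cluster::broadcast_data('ceph-services', $state);

    # reader side: decide per node how old is too old
    my $all = PVE::Cluster::get_data('ceph-services');
    for my $node (keys %$all) {
        next if !defined($all->{$node});
        warn "stale data from $node\n"
            if (time() - $all->{$node}->{updated}) > 60;
    }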

> 
>>
>>> +# so we should not use this for very important data
>>> +sub broadcast_data {
>>
>> broadcast_node_kv ?
> 
> ok
> 
>>
>>> +    my ($key, $data) = @_;
>>> +
>>> +    my $size = length(encode_json($data));
>>> +    if ($size >= (32 * 1024)) {
>>> +    warn "data for '$key' too big\n";
>>> +    return;
>>> +    }
>>> +
>>> +    eval {
>>> +    &$ipcc_update_status($key, $data);
>>
>> so we just allow overwriting existing keys? how about a prefix like we have
>> for "rrd/"
> 
> yeah makes sense
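
I.e., roughly like this inside the two helpers (prefix name just an example),
mirroring what the RRD broadcast does with "rrd/":

    # writer: namespace the key so callers cannot clobber unrelated state
    $ipcc_update_status->("kv/$key", $data);

    # reader: use the same prefix when fetching per node
    my $raw = $ipcc_get_status->("kv/$key", $node);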
> 
>>
>>> +    };
>>> +
>>> +    warn $@ if $@;
>>> +}
>>> +
>>> +sub get_data {
>>
>> get_node_kv ?
> 
> ok
> 
>>
>>> +    my ($key, $nodename) = @_;
>>> +
>>> +    my $res = {};
>>> +    my $sub = sub {
>>
>> maybe a name which actually says something would help reading this ;-)
> 
> yeah, i guess sometimes i am a bit too lazy with names ;)
> 
>>
>>> +    my ($node) = @_;
>>> +    eval {
>>> +        my $raw = &$ipcc_get_status($key, $node);
>>
>> use new coderef call syntax, don't mix it in the same patch..
>> (above &$ vs below $sub->())
> 
> of course (copy-is-my-hobby)
> 
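
For reference, the difference in question:

    &$ipcc_get_status($key, $node);     # old coderef call style (used above)
    $ipcc_get_status->($key, $node);    # new style, use this one consistently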
>>
>>> +        my $data = decode_json($raw) if $raw;
>>> +        $res->{$node} = $data;
>>> +    };
>>> +    my $err = $@;
>>> +    syslog('err', $err) if $err;
>>> +    };
>>> +
>>> +    if ($nodename) {
>>> +    $sub->($nodename);
>>> +    } else {
>>> +    my $nodelist = get_nodelist();
>>> +
>>> +    foreach my $node (@$nodelist) {
>>> +        $sub->($node);
>>> +    }
>>
>> so this is not arbitrary data indexed by key, but a value store indexed by
>> ($key, $node), i.e., the same key on different nodes can hold different data.
>> That can be OK, but it needs to be noted, e.g., through more specific method
>> names
> 
> ok, i just wanted to not lose the node info without becoming too specific
> about the structure of the data, but yes, a different name makes sense. maybe
> we can have multiple versions of this with different structures? only if we
> need it, of course

value == data in my comment, I did not mean to reduce it to a string;
it was just a note to make it clear to any possible user that the
key alone _does not_ identify the data, but the tuple of nodename
and key does.
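
So with the renamed sub a caller always gets a per-node hash back, e.g. (node
names and payload made up):

    my $res = PVE::Cluster::get_node_kv('ceph-services');
    # $res = {
    #     node1 => { 'mon.0' => 'running' },
    #     node2 => { 'osd.3' => 'stopped' },
    # };
    my $services_on_node1 = $res->{node1};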

> 
>>
>> not yet really sure about this, or better said, I'm sure I don't want this for
>> locks, but for things like ceph service states it could be OK.
> 
> yeah i am not sure about this either, hence i sent it as RFC ;)
> i would prefer it for locks over the config parsing in the api call though

Hmm, I'm not sure; this approach increases cluster traffic (corosync),
the other didn't, as it just used what's there already...

You just moved the overhead from loading through IPCC + parsing (read only)
onto the wire by writing states out. I'm not 100% sure that's better. You get:

* edge triggered,
* constant writes of information already known to the pmxcfs (at least with
  locks)
* possibility that data is completely out of date and wrong (statd hangs,
  restart but no sync, ...?)

against 

* (currently) slower with a lot of VMs, mostly due to the nature of getting
  all configs one after the other first (roughly sketched after this list)
* always correct, if quorate
* possibility to make fast by adding IPCC calls helping with getting data in
  bulk from VM configs (as you know I'm on this, and have a somewhat working
  prototype, but currently not too much time)
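
Roughly, the current read-only path looks like this (just a sketch, qemu guests
only for brevity, error handling omitted); the bulk IPCC call would replace the
per-guest read + parse:

    use PVE::Cluster;
    use PVE::QemuConfig;

    my $vmlist = PVE::Cluster::get_vmlist();
    my $locks = {};
    for my $vmid (keys %{$vmlist->{ids}}) {
        my $entry = $vmlist->{ids}->{$vmid};
        next if $entry->{type} ne 'qemu';    # containers left out here
        # one IPCC read + config parse per guest
        my $conf = eval { PVE::QemuConfig->load_config($vmid, $entry->{node}) };
        $locks->{$vmid} = $conf->{lock} if $conf && $conf->{lock};
    }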


So can we agree that this won't be the exchanger for locks, other config, or
pmxcfs-file-backed information, but let's work on it in the direction of things
like the ceph service state broadcast (for which this seems like a really nice
and simple solution, reusing what we have) and possible (ephemeral)
notifications, and the like, not already handled by files from pmxcfs?
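
For the ceph case the usage would then boil down to something like this (key
and value layout made up):

    # pvestatd, on every node, broadcasts its local service states
    PVE::Cluster::broadcast_node_kv('ceph-services', {
        'mon.0' => { state => 'running' },
        'osd.3' => { state => 'stopped' },
    });

    # any node's API call can then assemble the cluster-wide view
    my $per_node = PVE::Cluster::get_node_kv('ceph-services');
    # $per_node->{$nodename} is that node's last broadcast hash (or undef)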

> 
>>
>>> +    }
>>> +
>>> +    return $res;
>>> +}
>>> +
>>>   # $data must be a chronological descending ordered array of tasks
>>>   sub broadcast_tasklist {
>>>       my ($data) = @_;
>>>
>>
> 
> 
