[pve-devel] [RFC cluster 1/1] add generic data broadcast interface

Dominik Csapak d.csapak at proxmox.com
Fri Apr 26 12:08:24 CEST 2019


On 4/26/19 10:43 AM, Thomas Lamprecht wrote:
> Am 4/26/19 um 8:21 AM schrieb Dominik Csapak:
>> similar to how we handle the cluster wide tasklist and rrd data,
>> have an interface that can sync data across the cluster
>>
>> this data is only transient and will not be written to disk
>>
>> we can use this for a number of things, e.g. getting the locks of the
>> guests clusterwide, listing ceph services across the cluster, etc.
> 
> How? One needs to write it on lock, and if it's only temporary it does not
> seem really that great for things like locks? Also you start to broadcast
> everything twice to all; the lock info is already there in the memdb store,
> no need for some parallel heuristic for that..
> 

locks were just an example, but i would have let pvestatd broadcast 
the locks a second time (where we already have the config parsed), just 
so we can show them in /cluster/resources. that is already not 
synchronous for remote nodes, so a little extra delay there does not 
change the general behavior, and every bit of information we have
there about the other cluster members is read with a similar mechanism
(just the rrd data instead of the custom lock one)
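roughly, what i had in mind for the pvestatd side (just a sketch; 
$vmlist and $get_guest_config are placeholders for the data pvestatd 
already has at hand, and the name is the one suggested below):

    # collect the currently held guest locks from the already parsed configs
    my $locks = {};
    foreach my $vmid (keys %{$vmlist->{ids}}) {
	my $conf = $get_guest_config->($vmid); # config is already parsed in pvestatd
	$locks->{$vmid} = $conf->{lock} if defined($conf->{lock});
    }
    # broadcast them with the interface from this patch
    PVE::Cluster::broadcast_node_kv('guest-locks', $locks);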

> 
>>
>> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
>> ---
>>   data/PVE/Cluster.pm | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 47 insertions(+)
>>
>> diff --git a/data/PVE/Cluster.pm b/data/PVE/Cluster.pm
>> index 5af11e6..ea36a0a 100644
>> --- a/data/PVE/Cluster.pm
>> +++ b/data/PVE/Cluster.pm
>> @@ -540,6 +540,53 @@ sub get_nodelist {
>>       return [ keys %$nodelist ];
>>   }
>>   
>> +# best effort data store for cluster
>> +# this data is gone if the pmxcfs is restarted, but only the local data,
> 
> if one is restarted or if all are? Else, who cleans up locks, or other data,
> on node failure (in whatever way it happens)? Otherwise this _will_ have very
> confusing outcomes...

the way pmxcfs handles the kv store is that for the local node it looks 
up the local kvhash (which is empty after a restart), but for remote
nodes it looks them up in the cluster-synchronized kvhash of that node

i was not sure if this is intended or a bug, but i treated it as 
intended. the whole thing is also not really documented (imported from svn)

so we have to either look up the local data separately,
or make sure we broadcast it once before looking it up,
or just live with the fact that after a restart the data is
not current until we broadcast it the first time
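the first option would look roughly like this on the reader side (again 
just a sketch; $local_node and $local_kv_cache are placeholders for 
however we would cache the last broadcast value locally):

    my $res = get_node_kv($key);
    if (!defined($res->{$local_node})) {
	# the local kvhash is empty after a pmxcfs restart, so fall back
	# to a locally cached copy of what we broadcast last
	$res->{$local_node} = $local_kv_cache->{$key};
    }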

> 
>> +# so we should not use this for very important data
>> +sub broadcast_data {
> 
> broadcast_node_kv ?

ok

> 
>> +    my ($key, $data) = @_;
>> +
>> +    my $size = length(encode_json($data));
>> +    if ($size >= (32 * 1024)) {
>> +	warn "data for '$key' too big\n";
>> +	return;
>> +    }
>> +
>> +    eval {
>> +	&$ipcc_update_status($key, $data);
> 
> so we just allow overwriting existing keys? how about a prefix like we have
> for "rrd/"

yeah makes sense
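i.e. in broadcast_node_kv simply prepending something like (exact 
prefix name up for debate):

    &$ipcc_update_status("kv/$key", $data);

and stripping it again on the get side.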

> 
>> +    };
>> +
>> +    warn $@ if $@;
>> +}
>> +
>> +sub get_data {
> 
> get_node_kv ?

ok

> 
>> +    my ($key, $nodename) = @_;
>> +
>> +    my $res = {};
>> +    my $sub = sub {
> 
> maybe a name which actually says something would help reading this ;-)

yeah, i guess sometimes i am a bit too lazy with names ;)

> 
>> +	my ($node) = @_;
>> +	eval {
>> +	    my $raw = &$ipcc_get_status($key, $node);
> 
> use new coderef call syntax, don't mix it in the same patch..
> (above &$ vs below $sub->())

of course (copy-is-my-hobby)
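will change it to use the new syntax consistently, i.e.:

    my $raw = $ipcc_get_status->($key, $node);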

> 
>> +	    my $data = decode_json($raw) if $raw;
>> +	    $res->{$node} = $data;
>> +	};
>> +	my $err = $@;
>> +	syslog('err', $err) if $err;
>> +    };
>> +
>> +    if ($nodename) {
>> +	$sub->($nodename);
>> +    } else {
>> +	my $nodelist = get_nodelist();
>> +
>> +	foreach my $node (@$nodelist) {
>> +	    $sub->($node);
>> +	}
> 
> so this is not arbitrary data indexed by key, but a value store indexed with
> ($key, $node), i.e., same key on different nodes can hold different data, can
> be OK, but this needs to be noted, i.e., through more specific method names

ok, i just wanted to not lose the node info without becoming too 
specific about the structure of the data, but yes, a different name makes 
sense. maybe we can have multiple versions of this with different 
structures? only if we need it of course
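with the more specific name, a caller would then get something like 
(the key 'ceph-services' is just an example here):

    my $res = PVE::Cluster::get_node_kv('ceph-services');
    # $res = {
    #     node1 => { ... services of node1 ... },
    #     node2 => { ... services of node2 ... },
    # };

so the per-node indexing is at least visible in the result structure.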

> 
> not yet really sure about this, or better said, I'm sure I don't want this for
> locks, but for things like ceph service states it could be OK.

yeah i am not sure about this either, hence i sent it as RFC ;)
i would prefer it for locks over parsing the configs in the api call though

> 
>> +    }
>> +
>> +    return $res;
>> +}
>> +
>>   # $data must be a chronological descending ordered array of tasks
>>   sub broadcast_tasklist {
>>       my ($data) = @_;
>>
> 




