[pve-devel] [PATCH manager 1/4] ceph: add perf data cache helpers

Dominik Csapak d.csapak at proxmox.com
Wed Nov 6 16:10:42 CET 2019


On 11/6/19 3:49 PM, Thomas Lamprecht wrote:
> On 11/6/19 8:36 AM, Dominik Csapak wrote:
>> On 11/5/19 6:33 PM, Thomas Lamprecht wrote:
>>> On 11/5/19 1:51 PM, Dominik Csapak wrote:
>>>> +    my $entry = {
>>>> +        time => $time,
>>>> +        ops_r => $pgmap->{read_op_per_sec},
>>>> +        ops_w => $pgmap->{write_op_per_sec},
>>>> +        bytes_r => $pgmap->{read_bytes_sec},
>>>> +        bytes_w => $pgmap->{write_bytes_sec},
>>>
>>> can all be undefined if no activity is on the ceph cluster
>>
>> that's not that tragic, the GUI can handle this, but yes,
>> we do not have to add undef values to the entry in the first place
> 
> I'd rather set it to 0 if not defined, as this is the implicit value
> it has then. Else you have undefined datapoints in graphs..

ok
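
something like this then (untested sketch, using the defined-or
operator to default missing counters to 0):

    my $entry = {
        time => $time,
        ops_r => $pgmap->{read_op_per_sec} // 0,
        ops_w => $pgmap->{write_op_per_sec} // 0,
        bytes_r => $pgmap->{read_bytes_sec} // 0,
        bytes_w => $pgmap->{write_bytes_sec} // 0,
    };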

> 
>>>> +    push @$perf_cache, $entry;
>>>> +
>>>> +    # remove entries older than $keep_seconds
>>>> +    $perf_cache = [grep { $time - $_->{time} < $keep_seconds } @$perf_cache ];
>>>
>>> why not just keep N entries, and remove all those which are outside
>>> of that length restriction, something like (untested):
>>>
>>> my $max_cache_depth = 512;
>>>
>>> my $cache_depth = scalar @$cache;
>>> if ($cache_depth > $max_cache_depth) {
>>>     splice(@$cache, 0, $cache_depth - $max_cache_depth);
>>> }
>>>
>>> probably faster, and limits this to what we actually want: a maximal
>>> entry count. As each entry has the time saved anyway, a client can
>>> still cut off after a certain time, if desired.
>>
>> yeah, makes sense, although I would not use such a high number; 32-64
>> would probably be enough.
> 
> You already have the data, and the overhead for keeping this is roughly
> 5 * 8 bytes, so with a bit of overhead ~50 bytes, times 512 is ~25 KB, which
> is the last of our memory problems with perl... Having longer data windows
> can be useful when investigating.. Maybe it could still be worthwhile to
> do this with RRD...

the way my patch works would mean a higher overhead, since we
JSON-encode the hash and store it as a string, which means a single
datapoint is '{"time":123123123,"ops_r":1234123, [..]'
but yes, if we have a special ipcc call for this, we can optimize here
and only send the necessary data
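
e.g. something like this (just a sketch; the field order and separator
are arbitrary here, and the receiving side would have to split
accordingly):

    # sketch: pack each datapoint as one CSV line instead of a JSON
    # object, so the hash keys are not repeated for every entry
    my $data = join("\n", map {
        join(',', $_->{time}, $_->{ops_r}, $_->{ops_w}, $_->{bytes_r}, $_->{bytes_w})
    } @$perf_cache);
    PVE::Cluster::broadcast_node_kv("ceph-perf", $data);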

> 
>>
>>>
>>>> +
>>>> +    my $data = encode_json($perf_cache);
>>>> +    PVE::Cluster::broadcast_node_kv("ceph-perf", $data);
>>>
>>> not sure if we want this, I mean those are pve ceph-server stats and so
>>> they're the same on the whole cluster.. So we're uselessly broadcasting
>>> this (nodecount - 1) times..
>>>
>>> A lockless design to have only one sender per cluster would be to just let
>>> the node with the highest node-id from a quorate partition broadcast this
>>> in a cluster-wide status hashtable?
>>> If we can broadcast, we're quorate, and thus we can trust the membership
>>> info, and even for the short race possibility where a node with higher
>>> priority becomes quorate, we or it will effectively just overwrite other,
>>> also valid, values in this hash.
>>
>> mhmm, I am not so sure about this, for various reasons:
>>
>> * I did not want to touch pve-cluster/pmxcfs for this and wanted to
>> reuse the existing interface. could we reuse the ipcc calls for a
>> cluster-wide kvstore? AFAIU, if we omit the nodename in
>> 'ipcc_get_status' we get the cluster-wide content?
> 
> No, it's all always per node. A cluster wide, not per-node, status hashtable
> needs to be added, but that would be useful in general.

sure, makes sense
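
something like this could work (hypothetical interface, the names are
just placeholders for illustration):

    # hypothetical counterpart to the existing per-node kv store
    PVE::Cluster::broadcast_cluster_kv("ceph-perf", $data); # one entry cluster-wide
    my $raw = PVE::Cluster::get_cluster_kv("ceph-perf");    # readable on any node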

> 
>>
>> * letting only one node broadcast is a bit problematic, since we cannot
>> be sure that any given node is able to reach the ceph portion of the
>> cluster, since not all nodes have to be part of it (I have seen such
>> setups in the wild, although strange). also, since the ceph cluster is
>> mostly independent of the pve-cluster, we cannot be sure that a quorate
>> pve cluster member can also reach the quorate portion of the ceph
>> monitors. letting all nodes broadcast their data would eliminate
>> such gaps in the data
> 
> librados is still installed and ceph.conf is still available, so that's
> not true, or is it?

yes, librados and the ceph config are available, but that does not mean
the cluster is designed so that all nodes can reach the monitor nodes...
e.g.:

5 nodes, with node0-node2 being ceph nodes, node3 a 'compute' node, and
node4 a node in the same cluster that shares only the 'pve cluster
network' with the others, not the ceph or VM network.. this node will
never be able to reach the ceph monitors...

> The issue of "not reaching a quorate ceph monitor" is also unrelated,
> and, if valid at all, should be solved transparently in librados
> (reconnect to others)

not really, since we cannot guarantee that the quorate partition
of the pve cluster has anything to do with the ceph network

e.g. if the ceph network is on a completely different switch
and the node with the highest id (or whichever node is chosen to
transmit that data) has a broken cable there...
(I know, not redundant, but still). all nodes can be happily quorate
from the perspective of pve, but that one node is not able
to connect to the ceph portion of the cluster at all...

> 
>>
>> * having multiple nodes query it distributes the load between
>> the nodes, especially when considering my comment on patch 2/4,
>> where I suggested that we reduce the amount of updates here; and
>> since the pvestatd loops are not synchronized, we get more datapoints
>> with fewer rados calls per node
> 
> makes no sense, you multiply the (cluster traffic) load between nodes,
> not reduce it.. All nodes producing cluster traffic for this is NAK'd by me.

I am not really up to speed on the network traffic the current
corosync/pmxcfs versions produce, but I would imagine that if we have
1 node syncing m datapoints, it should be roughly the same as
n nodes syncing m/n datapoints each? we could scale that with the
number of nodes, for example...
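
e.g. with n = 4 nodes and m = 12 datapoints per interval, one node
broadcasting all 12 should move roughly the same payload as each of the
4 nodes broadcasting 3, ignoring the per-message overhead (numbers just
for illustration)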

> 
>>
>>>
>>>
>>>> +}
>>>> +
>>>> +sub get_cached_perf_data {
>>>
>>>> +
>>>> +    # only get new data if the already cached one is older than 10 seconds
>>>> +    if (scalar(@$perf_cache) > 0 && (time() - $perf_cache->[-1]->{time}) < 10) {
>>>> +        return $perf_cache;
>>>> +    }
>>>> +
>>>> +    my $raw = PVE::Cluster::get_node_kv("ceph-perf");
>>>> +
>>>> +    my $res = [];
>>>> +    my $times = {};
>>>> +
>>>> +    for my $host (keys %$raw) {
>>>
>>> why multi-host? Those stats are the same ceph-cluster-wide, AFAICT,
>>> distributed through the MgrStatMonitor PAXOS child class. E.g., I never
>>> saw different values when I executed the following command cluster-wide
>>> at the same time:
>>>
>>> perl -we 'use PVE::RPCEnvironment; use PVE::RADOS;
>>> PVE::RPCEnvironment->setup_default_cli_env();
>>> my $rados = PVE::RADOS->new();
>>> my $stat = $rados->mon_command({ prefix => "status" })->{pgmap};
>>> print "w/r: $stat->{write_bytes_sec}/$stat->{read_bytes_sec}\n";'
>>>
>>
>> see my above comment, the update calls are (by chance) not done at the same time
> 
> becomes obsolete once this is done once per cluster; also, I normally
> don't want to have guaranteed-unpredictable time intervals in this
> sampling.

I still see a problem with selecting one node as the source of truth
(for the above reasons), and in every scenario we will have (at least
sometimes) uneven intervals (think of pvestatd updates that take
longer, network congestion, nodes leaving/entering the quorate
partition, etc.)

also, the intervals are not unpredictable (besides my point above),
they are just not evenly spaced...

> 
>>
>>>> +        my $tmp = eval { decode_json($raw->{$host}) };
>>>> +        for my $entry (@$tmp) {
>>>> +            my $etime = $entry->{time};
>>>> +            push @$res, $entry if !$times->{$etime}++;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    $perf_cache = [sort { $a->{time} <=> $b->{time} } @$res];
>>>> +    return $perf_cache;
>>>> +}
>>>> +
>>>>    1;
>>>>