[pve-devel] [PATCH manager 1/4] ceph: add perf data cache helpers

Thomas Lamprecht t.lamprecht at proxmox.com
Wed Nov 6 15:49:02 CET 2019


On 11/6/19 8:36 AM, Dominik Csapak wrote:
> On 11/5/19 6:33 PM, Thomas Lamprecht wrote:
>> On 11/5/19 1:51 PM, Dominik Csapak wrote:
>>> +    my $entry = {
>>> +        time => $time,
>>> +        ops_r => $pgmap->{read_op_per_sec},
>>> +        ops_w => $pgmap->{write_op_per_sec},
>>> +        bytes_r => $pgmap->{read_bytes_sec},
>>> +        bytes_w => $pgmap->{write_bytes_sec},
>>
>> can all be undefined if no activity is on the ceph cluster
> 
> that's not that tragic, the GUI can handle this, but yes,
> we do not have to add undef values to the entry in the first place

I'd rather set it to 0 if not defined, as this is the implicit value
it has then. Otherwise you get undefined datapoints in graphs..
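Defaulting the missing counters at cache time is a one-liner with the defined-or operator; a minimal standalone sketch (the `$pgmap` hash here is a hand-built stand-in for the real pgmap status, which on an idle cluster may lack the `*_per_sec`/`*_sec` keys entirely):

```perl
use strict;
use warnings;

# Stand-in for the pgmap hash from the Ceph status call; on an idle
# cluster the rate keys can be missing.
my $pgmap = { read_op_per_sec => 42 }; # write side undefined

my $time = time();
my $entry = {
    time => $time,
    # defined-or: fall back to 0 so graphs get a real datapoint
    ops_r   => $pgmap->{read_op_per_sec}  // 0,
    ops_w   => $pgmap->{write_op_per_sec} // 0,
    bytes_r => $pgmap->{read_bytes_sec}   // 0,
    bytes_w => $pgmap->{write_bytes_sec}  // 0,
};
```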

>>> +    push @$perf_cache, $entry;
>>> +
>>> +    # remove entries older than $keep_seconds
>>> +    $perf_cache = [grep { $time - $_->{time} < $keep_seconds } @$perf_cache ];
>>
>> why not just keep N entries and remove all those which are outside
>> of that length restriction, something like (untested):
>>
>> my $max_cache_depth = 512;
>>
>> my $cache_depth = scalar @$cache;
>> if ($cache_depth > $max_cache_depth) {
>>      splice(@$cache, 0, $cache_depth - $max_cache_depth);
>> }
>>
>> probably faster and limits this on what we actually want, maximal
>> entry count. As each entry has the time saved anyway a client can
>> still cut-off after a certain time, if desired.
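The count-capped trim sketched above (with the shadowed `my $cache_depth` declaration corrected to `$max_cache_depth`) can be exercised standalone:

```perl
use strict;
use warnings;

# Keep at most $max_cache_depth entries, dropping the oldest from the
# front of the array.
sub trim_cache {
    my ($cache, $max_cache_depth) = @_;
    my $cache_depth = scalar @$cache;
    if ($cache_depth > $max_cache_depth) {
        splice(@$cache, 0, $cache_depth - $max_cache_depth);
    }
    return $cache;
}

# 600 fake entries with increasing timestamps
my $cache = [ map { { time => $_ } } 1 .. 600 ];
trim_cache($cache, 512);
```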
> 
> yeah, makes sense, although I would not use such a high number; 32-64
> would probably be enough.

You already have the data, and the overhead for keeping an entry is roughly
5 * 8 bytes, so with a bit of overhead ~50 bytes; times 512 that is ~25 KB,
which is the least of our memory problems with perl... Having longer data
windows can be useful when investigating.. Maybe it would still be worth
doing this with RRD...

> 
>>
>>> +
>>> +    my $data = encode_json($perf_cache);
>>> +    PVE::Cluster::broadcast_node_kv("ceph-perf", $data);
>>
>> not sure if we want this, I mean those are pve ceph-server stats and so
>> they're the same on the whole cluster.. So we're uselessly broadcasting
>> this (nodecount - 1) times..
>>
>> A lockless design to have only one sender per node would be to just let
>> the node with the highest node-id from a quorate partition broadcast this
>> in a cluster wide status hashtable?
>> If we can broadcast we're quorate, and thus we can trust the membership
>> info, and even for the short race possibility where a node with higher
>> priority becomes quorate we or it will effectively just overwrite other,
>> also valid values, to this hash.
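The proposed election reduces to a pure function over the quorate membership; the membership list itself would come from pmxcfs (the lookup helper is left out here, it is not part of the existing interface):

```perl
use strict;
use warnings;

# Decide whether the local node should broadcast: only the node with
# the highest node-id among the quorate members does. The short race
# where two nodes both qualify is harmless, they just overwrite each
# other with equally valid data.
sub i_am_broadcaster {
    my ($own_node_id, $quorate_node_ids) = @_;
    return 0 if !@$quorate_node_ids;    # not quorate: no broadcast at all
    my ($highest) = sort { $b <=> $a } @$quorate_node_ids;
    return $own_node_id == $highest ? 1 : 0;
}
```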
> 
> mhmm i am not so sure about this for various reasons:
> 
> * I did not want to touch pve-cluster/pmxcfs for this and instead reuse the existing interface. could we reuse the ipcc calls for a cluster-wide kv store? AFAIU, if we omit the nodename in 'ipcc_get_status' we
> get the cluster-wide content?

No, it's all always per node. A cluster wide, not per-node, status hashtable
needs to be added, but that would be useful in general.

> 
> * letting only one node broadcast is a bit problematic, since we cannot
> be sure that any given node is able to reach the ceph portion of the
> cluster, since not all nodes have to be part of it (I have seen such
> setups in the wild, although strange). also, since the ceph cluster is
> mostly independent of the pve-cluster, we cannot be sure that a quorate
> pve cluster member can also reach the quorate portion of the ceph
> monitors. letting all nodes broadcast their data would eliminate
> such gaps in the data

librados is still installed and ceph.conf still available, so that's not
true, or is it?
The issue of "not reaching a quorate ceph monitor" is also unrelated
and, if valid at all, should be solved transparently in librados
(reconnect to others)

> 
> * having multiple nodes query it distributes the load between
> the nodes, especially when considering my comment on patch 2/4,
> where I suggested that we reduce the amount of updates here; and
> since the pvestatd loops are not synchronized, we get more datapoints
> with fewer rados calls per node

That makes no sense, you multiply the (cluster traffic) load between nodes,
not reduce it.. All nodes producing cluster traffic for this is NAK'd by me.

> 
>>
>>
>>> +}
>>> +
>>> +sub get_cached_perf_data {
>>
>>> +
>>> +    # only get new data if the already cached one is older than 10 seconds
>>> +    if (scalar(@$perf_cache) > 0 && (time() - $perf_cache->[-1]->{time}) < 10) {
>>> +        return $perf_cache;
>>> +    }
>>> +
>>> +    my $raw = PVE::Cluster::get_node_kv("ceph-perf");
>>> +
>>> +    my $res = [];
>>> +    my $times = {};
>>> +
>>> +    for my $host (keys %$raw) {
>>
>> why multi host? Those stats are the same ceph-clusterwide, AFAICT, distributed
>> through MgrStatMonitor PAXOS child class. E.g., I never saw different values if
>> I executed the following command cluster wide at the same time:
>>
>> perl -we 'use PVE::RADOS; use PVE::RPCEnvironment;
>> PVE::RPCEnvironment->setup_default_cli_env();
>> my $rados = PVE::RADOS->new();
>> my $stat = $rados->mon_command({ prefix => "status" })->{pgmap};
>> print "w/r: $stat->{write_bytes_sec}/$stat->{read_bytes_sec}\n";'
>>
> 
> see my above comment, the update calls are (by chance) not done at the same time

That becomes obsolete once this is done only once per cluster; also, I
normally don't want guaranteed-unpredictable time intervals in this sampling.

> 
>>> +        my $tmp = eval { decode_json($raw->{$host}) };
>>> +        for my $entry (@$tmp) {
>>> +            my $etime = $entry->{time};
>>> +            push @$res, $entry if !$times->{$etime}++;
>>> +        }
>>> +    }
>>> +
>>> +    $perf_cache = [sort { $a->{time} <=> $b->{time} } @$res];
>>> +    return $perf_cache;
>>> +}
>>> +
>>>   1;
>>>
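The merge in get_cached_perf_data boils down to: union the per-host entry lists, keep the first entry seen per timestamp, and sort ascending by time. A standalone sketch of just that logic (plain hashes in place of the broadcast kv data, no PVE dependencies):

```perl
use strict;
use warnings;

# Merge per-host entry lists, deduplicating on the entry timestamp and
# returning the result sorted by time ascending.
sub merge_perf_entries {
    my (@lists) = @_;
    my $res   = [];
    my $times = {};
    for my $list (@lists) {
        for my $entry (@$list) {
            # keep only the first entry seen for a given timestamp
            push @$res, $entry if !$times->{ $entry->{time} }++;
        }
    }
    return [ sort { $a->{time} <=> $b->{time} } @$res ];
}

# two hosts with one overlapping timestamp (20)
my $host1 = [ { time => 10, ops_r => 1 }, { time => 20, ops_r => 2 } ];
my $host2 = [ { time => 20, ops_r => 9 }, { time => 5,  ops_r => 0 } ];
my $merged = merge_perf_entries($host1, $host2);
```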
