[pve-devel] [PATCH manager 1/4] ceph: add perf data cache helpers

Dominik Csapak d.csapak at proxmox.com
Wed Nov 6 08:36:36 CET 2019


On 11/5/19 6:33 PM, Thomas Lamprecht wrote:
> On 11/5/19 1:51 PM, Dominik Csapak wrote:
>> add a helper to cache the ceph performance data inside pmxcfs
>> with broadcast_node_kv, and also a helper to read it out
>>
>> merge the data from all nodes that sent performance data
>>
>> the '$perf_cache' variable actually serves two purposes,
>> the writer (will be pvestatd) uses it to broadcast only its values,
>> and the reader (pvedaemon) uses it to cache all nodes data
>>
>> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
>> ---
>> merging the data on read seems like a good idea, since we have the data
>> and it makes little sense to throw any away, but I noticed some
>> weird glitches when the pvestatd update calls are near each other, since
>> the timestamps are then e.g. ..01, ..02, ..03, ..11, ..12, etc.
>>
>> this looks slightly weird on the ExtJS charts, since you have some
>> clustered data points at some timestamps but not others...
>>
>> we could of course always use only the data from the current node;
>> that way we would have a more consistent interval
>>
>>   PVE/Ceph/Services.pm | 54 ++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 54 insertions(+)
>>
>> diff --git a/PVE/Ceph/Services.pm b/PVE/Ceph/Services.pm
>> index 45eb6c3f..70cb9b9a 100644
>> --- a/PVE/Ceph/Services.pm
>> +++ b/PVE/Ceph/Services.pm
>> @@ -360,4 +360,58 @@ sub destroy_mgr {
>>       return undef;
>>   }
>>   
>> +my $perf_cache = [];
> 
> rather pass that to the method as parameter, so that the caller has it in scope,
> not the module..

Makes sense; of course pvestatd/pvedaemon then have to keep that in the
API module for it to be useful.
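A minimal sketch of that shape, with the caller owning the cache arrayref and passing it in (sub placement and details here are hypothetical, not the patch's final code):

```perl
use strict;
use warnings;

# Sketch only: the caller (e.g. pvestatd's update loop) owns the cache
# arrayref and passes it in; real code would also fill the entry from
# the pgmap stats as in the patch below.
sub cache_perf_data {
    my ($cache, $keep_seconds) = @_;
    $keep_seconds //= 5 * 60; # 5 minutes

    my $time = time();
    push @$cache, { time => $time };

    # prune in place so the caller's reference stays valid
    @$cache = grep { $time - $_->{time} < $keep_seconds } @$cache;

    return $cache;
}
```

Pruning with `@$cache = grep ...` instead of assigning a fresh arrayref keeps the caller's reference valid between calls.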

> 
>> +
>> +sub cache_perf_data {
>> +    my ($keep_seconds) = @_;
>> +
>> +    $keep_seconds //= 5*60; # 5 minutes
>> +
>> +    my $rados = PVE::RADOS->new();
> 
> wouldn't it make sense to cache this? Else we fork on each call of this[0],
> which doesn't seem too performant; fork isn't particularly cheap..
> 
> [0]: https://git.proxmox.com/?p=librados2-perl.git;a=blob;f=PVE/RADOS.pm;h=11af8a69ee1f3564538c856464908810cfa24eec;hb=HEAD#l124
> 

I thought about this, but if we do, any change to the Ceph configuration
(e.g. monitor IP changes/additions/deletions) will make the rados call
fail, since we do not reconnect automatically. We would have to handle
that in the caller, but yes, it could be worth it (it raises its own
questions, such as how often we retry before giving up).
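One way to sketch that caller-side handling, with the connection factory injected so the retry logic stays testable; the single-retry policy and all names are assumptions, not the patch's code:

```perl
use strict;
use warnings;

my $cached_conn;

# $connect would be sub { PVE::RADOS->new() } in real code (assumption);
# reconnect exactly once on failure, then let the error propagate.
sub mon_command_cached {
    my ($connect, $cmd) = @_;

    $cached_conn //= $connect->();

    my $res = eval { $cached_conn->mon_command($cmd) };
    if ($@) {
        # connection may be stale after monitor changes: reconnect once
        $cached_conn = $connect->();
        $res = $cached_conn->mon_command($cmd);
    }
    return $res;
}
```

Injecting `$connect` also keeps the "how often do we retry" policy in one place if it later grows beyond a single attempt.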

>> +    my $status = $rados->mon_command({ prefix => "status" });
> 
> quite a bit of information, but the mgr(s) caches it anyway, and there
> seems to be no easy way to get just the PGDigest::print_summary info over
> librados2 directly, at least I found none, albeit the API docs seem to
> be pretty lacking.. So for now I'd say this is OK

That was my conclusion as well; a small comment would have made sense here...

> 
>> +
>> +    my $pgmap = $status->{pgmap};
>> +
>> +    my $time = time();
> 
> it's a bit a shame that the iostats in pgmap have no timestamp.. :/
> 
>> +    my $entry = {
>> +	time => $time,
>> +	ops_r => $pgmap->{read_op_per_sec},
>> +	ops_w => $pgmap->{write_op_per_sec},
>> +	bytes_r => $pgmap->{read_bytes_sec},
>> +	bytes_w => $pgmap->{write_bytes_sec},
> 
> can all be undefined if no activity is on the ceph cluster

That's not too tragic, the GUI can handle this, but yes, we do not have
to add undef values to the entry in the first place.
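That could look like only copying defined fields into the entry; the field map mirrors the patch, while the helper name is made up:

```perl
use strict;
use warnings;

my %pgmap_fields = (
    ops_r   => 'read_op_per_sec',
    ops_w   => 'write_op_per_sec',
    bytes_r => 'read_bytes_sec',
    bytes_w => 'write_bytes_sec',
);

# build the entry from the fields that are actually present, so an idle
# cluster does not produce undef values in the broadcast JSON
sub make_perf_entry {
    my ($pgmap, $time) = @_;
    my $entry = { time => $time };
    for my $key (keys %pgmap_fields) {
        my $val = $pgmap->{ $pgmap_fields{$key} };
        $entry->{$key} = $val if defined $val;
    }
    return $entry;
}
```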

> 
>> +    };
>> +
>> +    push @$perf_cache, $entry;
>> +
>> +    # remove entries older than $keep_seconds
>> +    $perf_cache = [grep { $time - $_->{time} < $keep_seconds } @$perf_cache ];
> 
> why not just keep N entries, and remove all those which are  outside
> of that length restriction, something like (untested):
> 
> my $max_cache_depth = 512;
> 
> my $cache_depth = scalar @$cache;
> if ($cache_depth > $max_cache_depth) {
>      splice(@$cache, 0, $cache_depth - $max_cache_depth);
> }
> 
> probably faster and limits this on what we actually want, maximal
> entry count. As each entry has the time saved anyway a client can
> still cut-off after a certain time, if desired.

Yeah, makes sense, although I would not use such a high number; 32-64
would probably be enough.
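With the lower depth suggested here, the pruning step could look like this (helper name made up; 64 is the upper end of the range mentioned above):

```perl
use strict;
use warnings;

my $max_cache_depth = 64;

# keep at most $max_cache_depth entries, dropping the oldest from the front
sub prune_cache {
    my ($cache) = @_;
    my $depth = scalar @$cache;
    splice(@$cache, 0, $depth - $max_cache_depth) if $depth > $max_cache_depth;
    return $cache;
}
```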

> 
>> +
>> +    my $data = encode_json($perf_cache);
>> +    PVE::Cluster::broadcast_node_kv("ceph-perf", $data);
> 
> not sure if we want this, I mean those are pve ceph-server stats and so
> they're the same on the whole cluster.. So we're uselessly broadcasting
> this (nodecount - 1) times..
> 
> A lockless design to have only one sender per node would be to just let
> the node with the highest node-id from a quorate partition broadcast this
> in a cluster wide status hashtable?
> If we can broadcast we're quorate, and thus we can trust the membership
> info, and even for the short race possibility where a node with higher
> priority becomes quorate we or it will effectively just overwrite other,
> also valid values, to this hash.

Mhmm, I am not so sure about this, for various reasons:

* I did not want to touch pve-cluster/pmxcfs for this and instead reuse
the existing interface. Could we reuse the ipcc calls for a cluster-wide
kvstore? AFAIU, if we omit the nodename in 'ipcc_get_status' we
get the cluster-wide content?

* Letting only one node broadcast is a bit problematic, since we cannot
be sure that any particular node can reach the Ceph portion of the cluster,
as not all nodes have to be part of it (I have seen such setups in
the wild, although strange). Also, since the Ceph cluster is mostly
independent of the PVE cluster, we cannot be sure that a quorate
PVE cluster member can also reach the quorate portion of the Ceph
monitors. Letting all nodes broadcast their data would eliminate
such gaps in the data.

* Having multiple nodes query it distributes the load between
the nodes, especially when considering my comment on patch 2/4,
where I suggested that we reduce the number of updates; since
the pvestatd loops are not synchronized, we get more datapoints
with fewer rados calls per node.
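The last point can be illustrated with a toy calculation: two unsynchronized loops polling every 30 seconds merge into a series with roughly 15-second spacing (a sketch for illustration only, not code from the patch):

```perl
use strict;
use warnings;

# merge the sample times of several nodes polling with the same
# interval but different loop offsets, deduplicating collisions
sub merged_sample_times {
    my ($interval, $offsets, $count) = @_;
    my %seen;
    my @times = grep { !$seen{$_}++ }
        map { my $off = $_; map { $off + $_ * $interval } 0 .. $count - 1 } @$offsets;
    return [ sort { $a <=> $b } @times ];
}
```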

> 
> 
>> +}
>> +
>> +sub get_cached_perf_data {
> 
>> +
>> +    # only get new data if the already cached one is older than 10 seconds
>> +    if (scalar(@$perf_cache) > 0 && (time() - $perf_cache->[-1]->{time}) < 10) {
>> +	return $perf_cache;
>> +    }
>> +
>> +    my $raw = PVE::Cluster::get_node_kv("ceph-perf");
>> +
>> +    my $res = [];
>> +    my $times = {};
>> +
>> +    for my $host (keys %$raw) {
> 
> why multi host? Those stats are the same ceph-clusterwide, AFAICT, distributed
> through MgrStatMonitor PAXOS child class. E.g., I never saw different values if
> I executed the following command cluster wide at the same time:
> 
> perl -we 'use PVE::RADOS; PVE::RPCEnvironment->setup_default_cli_env();
> my $rados = PVE::RADOS->new();
> my $stat = $rados->mon_command({ prefix => "status" })->{pgmap};
> print "w/r: $stat->{write_bytes_sec}/$stat->{read_bytes_sec}\n";'
> 

See my comment above: the update calls are (by chance) not done at the
same time.

>> +	my $tmp = eval { decode_json($raw->{$host}) };
>> +	for my $entry (@$tmp) {
>> +	    my $etime = $entry->{time};
>> +	    push @$res, $entry if !$times->{$etime}++;
>> +	}
>> +    }
>> +
>> +    $perf_cache = [sort { $a->{time} <=> $b->{time} } @$res];
>> +    return $perf_cache;
>> +}
>> +
>>   1;
>>
> 

More information about the pve-devel mailing list