[pve-devel] [PATCH manager 1/4] ceph: add perf data cache helpers
Thomas Lamprecht
t.lamprecht at proxmox.com
Tue Nov 5 18:33:56 CET 2019
On 11/5/19 1:51 PM, Dominik Csapak wrote:
> add a helper to cache the ceph performance data inside pmxcfs
> with broadcast_node_kv, and also a helper to read it out
>
> merge the data from all nodes that sent performance data
>
> the '$perf_cache' variable actually serves two purposes:
> the writer (which will be pvestatd) uses it to broadcast only its own values,
> and the reader (pvedaemon) uses it to cache all nodes' data
>
> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
> ---
> merging the data on read seems like a good idea, since we have the data
> and it makes little sense to throw it away, but I noticed some
> weird glitches when the pvestatd update calls are close to each other, since
> the timestamps are then e.g. ..01, ..02, ..03, ..11, ..12, etc.
>
> this looks slightly weird on the extjs charts since you have some
> clustered data points at some timestamps but not at others...
>
> we could of course always use only the data from the current node;
> that way we would have a more consistent interval
>
> PVE/Ceph/Services.pm | 54 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 54 insertions(+)
>
> diff --git a/PVE/Ceph/Services.pm b/PVE/Ceph/Services.pm
> index 45eb6c3f..70cb9b9a 100644
> --- a/PVE/Ceph/Services.pm
> +++ b/PVE/Ceph/Services.pm
> @@ -360,4 +360,58 @@ sub destroy_mgr {
> return undef;
> }
>
> +my $perf_cache = [];
rather pass that to the method as a parameter, so that the caller has it in scope,
not the module..
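e.g., something like this (untested, just to illustrate what I mean):
# in the caller (e.g. pvestatd), so the cache lives in its scope
my $perf_cache = [];
PVE::Ceph::Services::cache_perf_data($perf_cache);

# and in Services.pm
sub cache_perf_data {
    my ($cache, $keep_seconds) = @_;
    ...
}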
> +
> +sub cache_perf_data {
> + my ($keep_seconds) = @_;
> +
> + $keep_seconds //= 5*60; # 5 minutes
> +
> + my $rados = PVE::RADOS->new();
wouldn't it make sense to cache this? Else we fork on each call of this [0],
which doesn't seem too performant, fork isn't particularly cheap..
[0]: https://git.proxmox.com/?p=librados2-perl.git;a=blob;f=PVE/RADOS.pm;h=11af8a69ee1f3564538c856464908810cfa24eec;hb=HEAD#l124
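something like this could avoid the re-connect on every call (untested, and
assuming the connection object stays valid between pvestatd runs):
my $rados; # reused between calls instead of a fresh connect (and fork) each time

sub cache_perf_data {
    my ($keep_seconds) = @_;

    $keep_seconds //= 5*60; # 5 minutes

    $rados //= PVE::RADOS->new();
    my $status = $rados->mon_command({ prefix => "status" });
    ...
}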
> + my $status = $rados->mon_command({ prefix => "status" });
quite a bit of information, but the mgr(s) caches it anyway, and there
seems to be no easy way to get just the PGDigest::print_summary info over
librados2 directly, at least I found none, albeit the API docs seem to
be pretty lacking.. So for now I'd say this is OK
> +
> + my $pgmap = $status->{pgmap};
> +
> + my $time = time();
it's a bit of a shame that the iostats in pgmap have no timestamp.. :/
> + my $entry = {
> + time => $time,
> + ops_r => $pgmap->{read_op_per_sec},
> + ops_w => $pgmap->{write_op_per_sec},
> + bytes_r => $pgmap->{read_bytes_sec},
> + bytes_w => $pgmap->{write_bytes_sec},
these can all be undefined if there is no activity on the ceph cluster
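maybe just default them, e.g. (untested, and assuming 0 is an acceptable
fallback for the charts):
my $entry = {
    time => $time,
    ops_r => $pgmap->{read_op_per_sec} // 0,
    ops_w => $pgmap->{write_op_per_sec} // 0,
    bytes_r => $pgmap->{read_bytes_sec} // 0,
    bytes_w => $pgmap->{write_bytes_sec} // 0,
};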
> + };
> +
> + push @$perf_cache, $entry;
> +
> + # remove entries older than $keep_seconds
> + $perf_cache = [grep { $time - $_->{time} < $keep_seconds } @$perf_cache ];
why not just keep N entries, and remove all those which are outside
of that length restriction, something like (untested):
my $max_cache_depth = 512;
my $cache_depth = scalar(@$cache);
if ($cache_depth > $max_cache_depth) {
    splice(@$cache, 0, $cache_depth - $max_cache_depth);
}
probably faster, and it limits this to what we actually want: a maximal
entry count. As each entry has its time saved anyway, a client can
still cut off after a certain time, if desired.
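i.e., a reader could still do something like (untested):
my $cutoff = time() - $keep_seconds;
my $recent = [ grep { $_->{time} >= $cutoff } @$cache ];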
> +
> + my $data = encode_json($perf_cache);
> + PVE::Cluster::broadcast_node_kv("ceph-perf", $data);
not sure if we want this, I mean those are the pve ceph-server stats and so
they're the same on the whole cluster.. So we're uselessly broadcasting
this (nodecount - 1) times..
A lockless design with only one sender node would be to just let
the node with the highest node-id from a quorate partition broadcast this
in a cluster-wide status hashtable.
If we can broadcast we're quorate, and thus we can trust the membership
info; even for the short race window where a node with higher priority
becomes quorate, we or it will effectively just overwrite other,
also valid, values in this hash.
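the sender-selection part could look roughly like this (untested sketch, I did
not double-check the exact PVE::Cluster::get_members() field names):
my $members = PVE::Cluster::get_members();
# NOTE: assumes each member entry has 'online' and 'id', not verified
my ($highest) = sort { $members->{$b}->{id} <=> $members->{$a}->{id} }
    grep { $members->{$_}->{online} } keys %$members;

if (defined($highest) && $highest eq PVE::INotify::nodename()) {
    PVE::Cluster::broadcast_node_kv("ceph-perf", $data);
}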
> +}
> +
> +sub get_cached_perf_data {
> +
> + # only get new data if the already cached one is older than 10 seconds
> + if (scalar(@$perf_cache) > 0 && (time() - $perf_cache->[-1]->{time}) < 10) {
> + return $perf_cache;
> + }
> +
> + my $raw = PVE::Cluster::get_node_kv("ceph-perf");
> +
> + my $res = [];
> + my $times = {};
> +
> + for my $host (keys %$raw) {
why multi-host? Those stats are the same ceph-cluster-wide, AFAICT, distributed
through the MgrStatMonitor PAXOS child class. E.g., I never saw different values
when I executed the following command cluster-wide at the same time:

perl -we 'use PVE::RADOS; use PVE::RPCEnvironment;
    PVE::RPCEnvironment->setup_default_cli_env();
    my $rados = PVE::RADOS->new();
    my $stat = $rados->mon_command({ prefix => "status" })->{pgmap};
    print "w/r: $stat->{write_bytes_sec}/$stat->{read_bytes_sec}\n";'
> + my $tmp = eval { decode_json($raw->{$host}) };
> + for my $entry (@$tmp) {
> + my $etime = $entry->{time};
> + push @$res, $entry if !$times->{$etime}++;
> + }
> + }
> +
> + $perf_cache = [sort { $a->{time} <=> $b->{time} } @$res];
> + return $perf_cache;
> +}
> +
> 1;
>