[pve-devel] [PATCH manager 1/4] ceph: add perf data cache helpers

Thomas Lamprecht t.lamprecht at proxmox.com
Wed Nov 6 16:37:35 CET 2019


On 11/6/19 4:10 PM, Dominik Csapak wrote:
> On 11/6/19 3:49 PM, Thomas Lamprecht wrote:
>> On 11/6/19 8:36 AM, Dominik Csapak wrote:
>>> On 11/5/19 6:33 PM, Thomas Lamprecht wrote:
>>>> On 11/5/19 1:51 PM, Dominik Csapak wrote:
>>>>> +
>>>>> +    my $data = encode_json($perf_cache);
>>>>> +    PVE::Cluster::broadcast_node_kv("ceph-perf", $data);
>>>>
>>>> not sure if we want this, I mean those are pve ceph-server stats and so
>>>> they're the same on the whole cluster.. So we're uselessly broadcasting
>>>> this (nodecount - 1) times..
>>>>
>>>> A lockless design to have only one sender per node would be to just let
>>>> the node with the highest node-id from a quorate partition broadcast this
>>>> in a cluster wide status hashtable?
>>>> If we can broadcast we're quorate, and thus we can trust the membership
>>>> info, and even in the short race window where a node with higher
>>>> priority becomes quorate, we or it will effectively just overwrite other,
>>>> also valid, values in this hash.
>>>
>>> mhmm i am not so sure about this for various reasons:
>>>
>>> * i did not want to touch pve-cluster/pmxcfs for this but rather reuse the existing interface. could we reuse the ipcc calls for a clusterwide kvstore? afaiu if we omit the nodename in 'ipcc_get_status' we
>>> get the clusterwide content?
>>
>> No, it's all always per node. A cluster wide, not per-node, status hashtable
>> needs to be added, but that would be useful in general.
> 
> sure, makes sense
> 
>>
>>>
>>> * letting only one node broadcast is a bit problematic, since we cannot
>>> be sure that any given node is able to reach the ceph portion of the cluster,
>>> since not all nodes have to be part of it (i have seen such setups in
>>> the wild, although strange). also since the ceph cluster is mostly
>>> independent of the pve-cluster, we cannot be sure that a quorate
>>> pve cluster member can also reach the quorate portion of the ceph
>>> monitors. letting all nodes broadcast their data would eliminate
>>> such gaps in data then.
>>
>> librados is still installed, ceph.conf still available so that's not
>> true, or?
> 
> yes librados and ceph config is available, but that does not mean the
> cluster is designed so that all nodes can reach the monitor nodes...
> e.g.:
> 
> 5 nodes with node0-node2 ceph nodes, node3 a 'compute' node, and
> node4 is a node in the same cluster, but shares only the 'pve cluster
> network' with the others, not the ceph or vm network.. this node will
> never be able to reach the ceph monitors...
> 

You know which nodes host ceph. You can even limit this to the monitor
nodes, and do the same there (lowest or highest node-id sends; if none
of those are quorate the monitors probably aren't either, and even if
they are, it simply does not hurt)
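
Roughly what I have in mind, as an untested sketch - get_quorate_monitor_nodes
is a hypothetical helper here that would return the names of the quorate
cluster members which also run a monitor, and I sort by node name just for
brevity (sorting by node-id works the same):

use PVE::INotify;

# decide whether the local node is the designated sender, i.e. the
# monitor node sorting highest within the quorate partition
sub i_am_designated_sender {
    my $nodename = PVE::INotify::nodename();

    # hypothetical helper, see above
    my @mon_nodes = get_quorate_monitor_nodes();
    return 0 if !scalar(@mon_nodes);

    my ($highest) = sort { $b cmp $a } @mon_nodes;
    return $nodename eq $highest;
}

# in the pvestatd update loop then simply:
# broadcast_ceph_perf_data() if i_am_designated_sender();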

>> The issue of "not reaching a quorate ceph monitor" is also unrelated,
>> and, if valid at all, should be solved transparently in librados
>> (reconnect to others)
> 
> not really, since we cannot guarantee that the quorate partition
> of the pve cluster has anything to do with the ceph network
> 

we can eliminate this to the point where it's not realistically happening,
this is a trade-off, sure, but I'd rather accept that in an edge case
than be suboptimal always

> e.g. if the ceph network is on a completely different switch
> and the node with the highest id (or some different node chosen to transmit that data) has a broken cable there...
> (i know not redundant but still). all nodes can be happily quorate
> from the perspective of pve, but that one node is not able
> to connect to the ceph portion of the cluster at all...

yeah and then what? That you do not see the IOstats is the least of
your problems in such a case, and if you use a solution like I proposed
above (monitors are the senders) this is completely moot anyway, as it's
just a local connection..

I'd rather have local-only SHM caches, all fed from the monitor nodes
only, than pmxcfs full-cluster spam.

>>>
>>> * having multiple nodes query it distributes the load between
>>> the nodes, especially when considering my comment on patch 2/4
>>> where i suggested that we reduce the amount of updates here and
>>> since the pvestatd loops are not synchronized, we get more datapoints
>>> with fewer rados calls per node
>>
>> makes no sense, you multiply the (cluster traffic) load between nodes, not
>> reduce it.. All nodes producing cluster traffic for this is NAK'd by me.
> 
> i am not really up to speed on the network traffic the current
> corosync/pmxcfs versions produce, but i would imagine if we have
> 1 node syncing m datapoints, it should be roughly the same as
> n nodes syncing m/n datapoints? we could scale that with the number of nodes for example...

what? if each node syncs whatever bytes, all nodes send and receive that many,
so you get (n-1) * (n-1) * (payload bytes + overhead), where the overhead with
crypto and all is not exactly zero. A single sender means one (n-1) term less,
i.e. O(n^2) vs. O(n) ..
Plus, with the monitor nodes as senders one saves "n - (monitor_count)" status
sends too.
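
To put rough numbers on that formula (payload size just picked for
illustration): with n = 16 nodes, a 1 KiB payload per interval and every node
broadcasting you move roughly 15 * 15 * 1 KiB = 225 KiB of status traffic per
interval, with a single designated sender it drops to 15 * 1 KiB = 15 KiB, and
if only, say, the 3 monitor nodes are even eligible as senders the sender-side
work stays constant while the cluster grows.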

>>
>>>
>>>>
>>>>
>>>>> +}
>>>>> +
>>>>> +sub get_cached_perf_data {
>>>>
>>>>> +
>>>>> +    # only get new data if the already cached one is older than 10 seconds
>>>>> +    if (scalar(@$perf_cache) > 0 && (time() - $perf_cache->[-1]->{time}) < 10) {
>>>>> +        return $perf_cache;
>>>>> +    }
>>>>> +
>>>>> +    my $raw = PVE::Cluster::get_node_kv("ceph-perf");
>>>>> +
>>>>> +    my $res = [];
>>>>> +    my $times = {};
>>>>> +
>>>>> +    for my $host (keys %$raw) {
>>>>
>>>> why multi host? Those stats are the same ceph-clusterwide, AFAICT, distributed
>>>> through MgrStatMonitor PAXOS child class. E.g., I never saw different values if
>>>> I executed the following command cluster wide at the same time:
>>>>
>>>> perl -we 'use PVE::RADOS; PVE::RPCEnvironment->setup_default_cli_env();
>>>> my $rados = PVE::RADOS->new();
>>>> my $stat = $rados->mon_command({ prefix => "status" })->{pgmap};
>>>> print "w/r: $stat->{write_bytes_sec}/$stat->{read_bytes_sec}\n";'
>>>>
>>>
>>> see my above comment, the update calls are (by chance) not done at the same time
>>
>> becomes obsolete once this is done once per cluster; also I normally don't
>> want to have guaranteed-unpredictable time intervals in this sampling.
> 
> i still see a problem with selecting one node as the source of truth
> (for the above reasons), and in every scenario we will have (at least sometimes) uneven intervals (think pvestatd updates that take longer, network congestion, nodes leaving/entering the quorate partition, etc.)

you can still have that if you do this per node, so I don't see your
point.

> 
> also the intervals are not unpredictable (besides my point above)
> they are just not evenly spaced...

Didn't you just say "uneven intervals"? So how are they not
guaranteed unpredictable if, e.g., 16 nodes all send that stuff..
That's for sure not equidistant - a single node having control over
this is as good as it can get.




