[pve-devel] [PATCH manager 1/4] ceph: add perf data cache helpers

Thomas Lamprecht t.lamprecht at proxmox.com
Fri Nov 8 07:53:57 CET 2019


On 11/7/19 8:49 AM, Dominik Csapak wrote:
>>> yes, librados and the ceph config are available, but that does not mean the
>>> cluster is designed so that all nodes can reach the monitor nodes...
>>> e.g.:
>>>
>>> 5 nodes with node0-node2 being ceph nodes, node3 a 'compute' node, and
>>> node4 a node in the same cluster which shares only the 'pve cluster
>>> network' with the others, not the ceph or vm network.. this node will
>>> never be able to reach the ceph monitors...
>>>
>>
>> You know which nodes host ceph. You can even limit this to monitor
>> nodes and do the same there (lowest or highest node-id sends; if none
>> of those are quorate the monitor probably isn't either, and even if it
>> is, it simply does not hurt)
>>
> 
> this makes sense and is very doable, something like:
> 
> if (quorate && monitor_node) {
>     // get list of quorate nodes
> 
>     // get list of monitor nodes
>     
>     // check if i am the lowest/highest quorate monitor node
>     // if yes, collect/broadcast data
> }
> 
> the trade-off remains (this could happen to be a pve-quorate node
> where the monitor is not quorate), but in a 'normal' cluster
> with 3 monitors the chance would be 1/3 (which is ok) and
> converges to 1/2 for very many monitors (unlikely)

What? You don't have a probability of 0.33 or 0.5 for a fallout...
Systems reliability and statistics don't work like that.

> 
> but yes, as you say below, in such a scenario the admin has
> different problems ;)
> 
>>>>>
>>>>> * having multiple nodes query it distributes the load between
>>>>> the nodes, especially when considering my comment on patch 2/4,
>>>>> where I suggested that we reduce the amount of updates here, and
>>>>> since the pvestatd loops are not synchronized, we get more datapoints
>>>>> with fewer rados calls per node
>>>>
>>>> makes no sense, you multiply the (cluster traffic) load between nodes, not
>>>> reduce it.. All nodes producing cluster traffic for this is NAK'd by me.
>>>
>>> I am not really up to speed on the network traffic the current
>>> corosync/pmxcfs versions produce, but I would imagine that if we have
>>> 1 node syncing m datapoints, it should be roughly the same as
>>> n nodes syncing m/n datapoints? We could scale that with the number of
>>> nodes, for example...
>>
>> What? If each node syncs whatever bytes, all nodes send and receive that many,
>> so you get (n-1) * (n-1) * (payload bytes + overhead), where the overhead with
>> crypto and all is not exactly zero. A single sender means one (n-1) term less,
>> i.e. O(n^2) vs. O(n)..
>> Plus, with only the monitor nodes as senders one saves "n - (monitor_count)"
>> status sends too.
>>
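To put rough numbers on that: in a 16-node cluster with every node
broadcasting, that is 15 * 15 = 225 (payload + overhead) transmissions per
status interval; with a single sender it is just 15. The quadratic term is
what hurts once clusters get bigger.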
> 
> I obviously did not convey my idea very well...
> yes, as my patch currently is, you are right that we have
> (n-1)*(n-1)*size network traffic
> 
> but what I further tried to propose was a mechanism by which
> each node only sends the data on every nth iteration of the loop
> (where n could be fixed or even dependent on the (quorate) node count)
> 
> so that each node only sends 1/n datapoints per pvestatd loop (on average)

More complicated and with more drawbacks, I do not see how that helps...
So if some of the nodes are offline, rebooting, or whatnot, we just
get no statistics for some points??

Also, you would need to coordinate the time when this starts between all of
them and adapt pvestatd scheduling to that.. Really not sure about this.

> 
>>>>>
>>>>> see my comment above: the update calls are (by chance) not done at the same time
>>>>
>>>> becomes obsolete once this is done once per cluster; also, I normally don't
>>>> want to have guaranteed-unpredictable time intervals in this sampling.
>>>
>>> I still see a problem with selecting one node as the source of truth
>>> (for the above reasons), and in every scenario we will have (at least
>>> sometimes) uneven intervals (think pvestatd updates that take longer,
>>> network congestion, nodes leaving/entering the quorate partition, etc.)
>>
>> you can still have that if you do this per node, so I don't see your
>> point.
>>
>>>
>>> also, the intervals are not unpredictable (besides my point above);
>>> they are just not evenly spaced...
>>
>> Didn't you just say "not even intervals"? So how are they not
>> guaranteed unpredictable if, e.g., 16 nodes all send that stuff..
>> That's for sure not equidistant - a single node having control over
>> this is the best it can get.
>>
> 
> we mean the same thing, but express it differently ;)
> 
> the timestamps 1,2,9,11,12,19,21,22,29,... are not equidistant (what I
> meant by 'not evenly spaced'), but they are also not 'unpredictable'
> (what I understand as 'randomly spaced')

Scheduling is unpredictable; parts of it are even taken in as
entropy for the random subsystem, so I'd guess that this holds.

Now, that's for one system; adding clustering + network and doing
something like you proposed will have very strange effects.

E.g., one node hangs a bit, so the stats always miss the entries divisible
by its node id, then it gets through again, its data is injected, and the
graph can look completely different in an instant.. I'd like my graphs to
be monotonic...

> 
> in summary, we have two proposals with different trade-offs:
> 
> 1. have one node selected (in a sane/stable way) which is the only one that updates
>    pros: mostly regular updates of the data
>          we can select the most sensible node ourselves
>    con: if we select a 'bad' node, we have no data at all for as long as we select that node


Again, just go with the following rules:

* all nodes get a hash entry with - $nodeid as value
* drop nodes which are not quorate
* nodes which have a monitor get "total-node" (the total node count)
  subtracted from their $value
* take the one with the lowest value

This results in preferring monitor nodes, but also falls back to others
as a "you can only win" strategy if none of those are available.




