[pve-devel] [PATCH manager 1/4] ceph: add perf data cache helpers

Thu Nov 7 08:49:03 CET 2019

i only quoted partially, since most of it is already clear

>>
>> yes librados and ceph config is available, but that does not mean the
>> cluster is designed so that all nodes can reach the monitor nodes...
>> e.g.:
>>
>> 5 nodes with node0-node2 ceph nodes, node3 a 'compute' node, and
>> node4 is a node in the same cluster, but shares only the 'pve cluster
>> network' with the others, not the ceph or vm network.. this node will
>> never be able to reach the ceph monitors...
>>
> 
> You know which nodes hosts ceph. You even can limit this to monitor
> nodes, and do the same there (lowest or highest node-id sends, if none
> of those are quorate the monitor probably isn't either, and even if,
> it simply does not hurt)
> 

this makes sense and is very doable, something like:

if (quorate && monitor_node) {
	// get list of quorate nodes

	// get list of monitor nodes

	// check if i am the lowest/highest quorate monitor node
	// if yes, collect/broadcast data
}
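
to make the selection a bit more concrete, here is a minimal python sketch
(not actual pve-manager code, which is perl). the function name and the idea
of passing plain lists of integer node ids are assumptions made purely for
illustration:

```python
def should_broadcast(my_id, quorate_ids, monitor_ids):
    """Return True if my_id is the lowest node id that is both
    quorate (pve cluster) and a ceph monitor node.

    All three arguments are hypothetical inputs; how they would be
    obtained in pvestatd is not shown here.
    """
    # nodes that are in the quorate partition and also host a monitor
    candidates = sorted(set(quorate_ids) & set(monitor_ids))
    # nobody qualifies -> nobody collects/broadcasts
    if not candidates:
        return False
    # only the lowest qualifying node id sends
    return candidates[0] == my_id
```

with quorate nodes 1, 2, 3 and monitors on 0, 1, 2, only node 1 would
collect and broadcast; using the highest id instead just means picking
candidates[-1].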

the trade-off remains (the selected node could happen to be pve quorate
while its monitor is not quorate), but in a 'normal' cluster
with 3 monitors the chance would be 1/3 (which is ok) and it
converges toward 1/2 for very many monitors (unlikely)

but yes, as you say below, in such a scenario the admin has
other problems ;)

>>>>
>>>> * having multiple nodes query it, distributes the load between
>>>> the nodes, especially when considering my comment on patch 2/4
>>>> where i suggested that we reduce the amount of updates here and
>>>> since the pvestatd loops are not synchronized, we get more datapoints
>>>> with less rados calls per node
>>>
>>> makes no sense, you multiply the (cluster traffic) load between nodes not
>>> reduce it.. All nodes producing cluster traffic for this is NAK'd by me.
>>
>> i am not really up to speed about the network traffic the current
>> corosync/pmxcfs versions produce, but i would imagine if we have
>> 1 node syncing m datapoints, it should be roughly the same as
>> n nodes syncing m/n datapoints ? we could scale that with the number of
>> nodes for example...
> 
> what? if each nodes sync whatever bytes, all nodes send and receive that many,
> so you get (n-1) * (n-1) * (payload bytes + overhead) where overhead with crypto
> and all is not exactly zero. While a single sender means one (n-1) term less, i.e.
> O(n^2) vs. O(n) ..
> Plus, with the monitor nodes are senders one saves "n - (monitor_count)" status
> sends too.
> 

i obviously did not convey my idea very well...
yes, as my patch currently is, you are right that we have
(n-1)*(n-1)*size network traffic

but what i further tried to propose was a mechanism by which
each node only sends the data on every nth iteration of its loop
(where n could be fixed or even depend on the (quorate) node count)

so that each node only sends 1/n datapoints per pvestatd loop (on average)
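
as a rough sketch of that throttling idea (python, purely illustrative;
the function names and the local iteration counter are assumptions, not
existing pvestatd code):

```python
def should_send(iteration, period):
    """Send only on every `period`-th iteration of the local loop."""
    return iteration % period == 0


def average_sends_per_loop(node_count, period, loops):
    """Average number of broadcasts per loop over the whole cluster,
    assuming every node runs `loops` iterations with the same period."""
    sent = sum(
        1
        for _node in range(node_count)
        for iteration in range(loops)
        if should_send(iteration, period)
    )
    return sent / loops
```

if the period equals the node count (e.g. 5 nodes, period 5), this gives
one broadcast per loop cluster-wide on average, i.e. the same number of
senders per loop as with a single selected node.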

>>>>
>>>> see my above comment, the update calls are (by chance) not done at the same time
>>>
>>> becomes obsolete once this is once per cluster, also I normally don't
>>> want to have guaranteed-unpredictable time intervals in this sampling.
>>
>> i still see a problem with selecting one node as the source of truth
>> (for above reasons) and in every scenario, we will have (at least some
>> times) not even intervals (think pvestatd updates that take longer,
>> network congestion, nodes leaving/entering the quorate partition, etc.)
> 
> you can still have that if you do this per node, so I don't see your
> point.
> 
>>
>> also the intervals are not unpredictable (besides my point above)
>> they are just not evenly spaced...
> 
> Didn't you just said "not even intervals", so how are they not
> guaranteed unpredictable if, e.g., 16 nodes all sends that stuff..
> That's for sure not equidistant - a single node having control over
> this is as best it can get.
> 

we mean the same thing, but express it differently ;)

the timestamps 1,2,9,11,12,19,21,22,29,... are not equidistant (what i
meant by 'not evenly spaced') but they are also not 'unpredictable'
(which i understand as 'randomly spaced')
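
one way to read that sequence (my assumption for the example: three nodes
with a 10 second loop that started at offsets 1, 2 and 9) as a small
python sketch:

```python
def merged_timestamps(offsets, period, cycles):
    """Merge the send times of several nodes that all use the same
    period but started their loops at different offsets."""
    return sorted(
        offset + cycle * period
        for cycle in range(cycles)
        for offset in offsets
    )


timestamps = merged_timestamps([1, 2, 9], 10, 3)
# gaps between consecutive datapoints
gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
# timestamps -> [1, 2, 9, 11, 12, 19, 21, 22, 29]
# gaps       -> [1, 7, 2, 1, 7, 2, 1, 7]
```

the gaps differ, but they repeat in a fixed pattern as long as the
offsets stay the same, so the spacing is uneven yet not random.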

in summary, we have two proposals with different trade-offs:

1. have one node selected (in a sane/stable way) which is the only one
   that updates
    pros: mostly regular updates of the data
          we can select the most sensible node ourselves
    con: if we select a 'bad' node, we have no data at all, as long as
         that node stays selected

2. have each node try to broadcast in an interval dependent on the
   number of nodes (to avoid excessive traffic)
    pro: as long as most nodes have a connection to ceph, we get data
         (at least some of the time)
    cons: irregular updates of the data
          if ceph has a problem, it impacts pvestatd on more nodes

and as my proposal has more cons and fewer pros, i give up ;)
i am sending a reworked v2 soon(tm), but will first try to reuse
our 'rrd' mechanism to save the data, since at first glance it seems easier
to extend for this purpose than to add a completely new interface
(also the data is time series data, so putting this inside some kv store
seems wrong..., even if i sent it that way in the first place^^)

thanks for reviewing and pushing me in the right direction :)