[pve-devel] [PATCH pve-qemu] fix: 3865: backup restore human readable output
Thomas Lamprecht
t.lamprecht at proxmox.com
Fri Mar 25 18:59:00 CET 2022
On 25.03.22 13:39, Daniel Tschlatscher wrote:
> On 3/25/22 12:05, Thomas Lamprecht wrote:
>>>>> ++
>>>>> ++ if (delta != 0)
>>>>> ++ bps /= delta;
>>>> currently, if delta were zero you'd print a size unit as a throughput unit?
>>>>
>>> This is due to the underlying code. The function wrapping this part of the code is called multiple times every second (something like ~20 times/sec), and in the original code a print statement would only occur after the percentage of bytes transferred had increased by at least 1%.
>>> The time elapsed between percent increases could be less than a second. In this case the output simply prints how many bytes were transferred since the current second began. This means the value for throughput would, at the time of the print, always be lower than the real-world B/s, and only the last print in that second would represent the actual B/s.
>> This just doesn't make sense to me, the if clause enforces that it's called at
>> max every 2s (or the 100% edge case) already, and using wrong units is using wrong
>> units, you cannot really argue basic math/physics away...
>>
> The 100% edge case is the problem for the most part.
You could also ignore that in the loop but print a 100% done afterwards; rethinking this
makes the current approach of handling it look rather odd anyhow.
>
> It comes down to the inaccuracy of keeping time in seconds rather than something more granular. While it is true that in 0 time no actual data could be transferred, 0 in this case means anywhere from 0 to 0.999 seconds. Delta could be zero while the transferred bytes, for example, amount to 20 MB.
yes, as mentioned, using something more granular improves a lot of things...
>
> Just printing 0 would be confusing because *some* data was in fact transferred -> reflected in the bps variable. It could also give the user a false impression that their disk is slow or something similar if they see a lot of "0 B/s" entries (which was the actual reasoning here).
it isn't really reflected in bps, as it's just wrong to print a size as throughput;
finer granularity, or just not spamming the log at the 100% stage, would be a
nicer, more practical solution.
The delta 0 -> throughput 0 is then still the only valid way to handle it; the
minimal time resolution the program handles needs to be used as the metric, not the
physical (almost) infinitesimal one, and once that's ms or µs it's only going to
happen when throughput would be rounded down to 0 anyway.
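For illustration, a rough sketch of what I mean (hypothetical names, not the
actual patch code), using a millisecond-resolution monotonic clock and handling
the delta == 0 edge case explicitly:

    #include <stdint.h>
    #include <time.h>

    /* hypothetical helper: current monotonic time in milliseconds */
    static uint64_t now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }

    static double throughput_bps(uint64_t bytes_since_last, uint64_t last_ms)
    {
        uint64_t delta_ms = now_ms() - last_ms;

        /* no time passed at our resolution -> no meaningful rate, so
         * report 0 instead of passing off a byte count as a rate */
        if (delta_ms == 0)
            return 0.0;

        return (double)bytes_since_last * 1000.0 / delta_ms;
    }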
> The current implementation means that if delta is 0, bps = 20 MB, and in reality 0.33 seconds have elapsed, the program prints "20 MB/s". In actuality the expected rate of transfer would be "60 MB/s", though.
bps (bytes per second) cannot be MB, you really need to get your units straight,
as otherwise it's just confusing to me, but I assume you meant a delta of 20 MiB in
bytes restored since last time, and then 20 / 0.33 ≈ 60, so not seeing the issue.
>
> My thinking in not using milliseconds here was: even though I could calculate the transfer rate per second at the time of the print, it is still an approximation. Imagine the following scenario:
So, you're arguing that because a much more accurate solution is still not 100%
accurate (which is impossible anyway), let's go for the much less accurate one?
Sorry, but I'm not really following that logic and really not getting why you still
argue for it?! It should be clear that using millisecond (or lower) resolution
gives much more accurate results, and can *never* be worse than a coarser one.
> In the first 0.33 secs 20MB/s are transferred (=> print goes out with 60 MB/s)
Did you mean to write "In the first 0.33 s, 20 MiB are transferred"? As otherwise we'd
be talking about a second order derivative, i.e., throughput acceleration.
> In the next 0.33 secs only 10MB/s are transferred
> In the last 0.33 secs only 0.5MB/s are transferred
> In this case "60MB/s" was printed but 30.5MB have been transferred.
I don't get the point/example...
>
> The current implementation might be severely understating the actual throughput, but will never show a value for B/s that is higher than the actual throughput per second. The unaccounted-for bytes would then be reflected in the following output.
exactly, but making the interval that the error "leaks" into 1000, or even a
million, times smaller reduces the difference between reality and actual output
quite a lot.
> The millisecond approach would have both the same and the opposite problem, e.g. if delta is only a few milliseconds and the throughput in that time is a lot higher than average, then the printed B/s value could be highly exaggerated.
no, it does not have the opposite problem, that cannot really be, as it's the
same thing, just more accurate; it's still not perfect, and for very, very
high bandwidths (100s of GiB/s) the error may even still be relevant with
per-ms sampling, but then it'd still be 1000 times better on average.
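To put a number on that (a hypothetical standalone demo, not patch code), take
the 20 MiB in 0.33 s example from above:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t bytes = 20ULL * 1024 * 1024;  /* 20 MiB in this interval */
        double wall_s = 0.33;                  /* actual elapsed wall time */

        uint64_t delta_s  = (uint64_t)wall_s;           /* truncates to 0 */
        uint64_t delta_ms = (uint64_t)(wall_s * 1000);  /* 330 */

        /* with seconds the delta is 0, so one can only print a raw byte
         * count mislabeled as a rate, or a misleading 0 B/s */
        printf("1 s  resolution: delta = %llu -> unusable\n",
               (unsigned long long)delta_s);

        /* with milliseconds the real ~60 MiB/s is recovered */
        printf("1 ms resolution: %.1f MiB/s\n",
               (double)bytes * 1000 / delta_ms / (1024 * 1024));
        return 0;
    }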
>
> I let the current implementation be because, in practice, at the end of the backup restore there are usually a lot of sparse bytes (I hope I am using this term correctly). This means that the restore could go from 56% to 100% in less than a second. Simply writing 0 here would probably confuse the user more than anything, because the value suddenly jumped by e.g. 6 GiB with 0 B/s.
This cannot be generalized in practice; on a long-running production VM, holes due
to sparseness can be there or not, and may not all be accumulated at the end.
Different OSes and the filesystems they use also have very different impacts.
>
> Still, I do concede that using milliseconds just makes more sense because it's the more "proper" way to implement something like this and especially because of the unit. In the current implementation it would be something like B/last-output.
> Also, it seems to me that the rate of throughput plummeting or skyrocketing in the span of less than a second is rather unlikely; the millisecond method would statistically just be more correct.
How's that unlikely? Modern HW and its caches are really not deterministic at all;
e.g., QLC SSDs with a few tens of GiB of SLC and some GiB of DDR4 cache can
show very stark jumps in throughput, from 6 Gbps (maxing out the SATA link speed) to
100 Mbps (maxing out QLC). In addition to that, the IO scheduler handling multiple
processes submitting IO concurrently doesn't make it any more stable; then add the
effects of filesystems or a possible ZFS RAID, ... on top. Also, simply that another
process is done with IO can make a lot of capacity available in mere microseconds.
There's a reason some counters/timing bits of the IO paths add to the kernel's RNG
entropy pool.
Further, note that the base assumption of why I went into this wasn't even erratic IO,
but rather CPU scheduling leading to very different actual interval durations.
That could even make a stable IO throughput flow display as completely erratic,
even if it actually wasn't. I.e., using such a low accuracy just adds way too much
distortion in any direction.
> I will change it in v3.
>
>>> Nonetheless, I later adapted this code to behave like the backup create function does, which updates every 3 seconds, eliminating this problem entirely. This is therefore dead code anyway.
>>>
>> later where?
>>
>> Actual improvement: track duration in milliseconds or less, and still catch the edge
>> case where delta is zero by just setting throughput to zero then (as no data can
>> transfer in zero time).
>>
> Later in the sense of time. I rewrote the if statement above to only execute if delta is greater than 2 seconds (before, it updated with every percent increase).
Ah OK, in a future patch - I've encountered "later" quite often in the sense of "later in
this patch or series", so yeah, it can be clear enough & probably just confusion on my side.
> _>> __"eliminating this problem entirely"_ is, I retract, not true though because I forgot about the check whether progress is at 100%, in which case delta could still be 0.
In that case you either skip reporting that round and/or move the 100% reporting out of
the loop entirely, which could be the nicer approach in general anyway, and with the
reporting factored out into its own function it's also not much code overhead.
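Sketched out (hypothetical shape and helper names, not the actual code), something
like:

    #include <stdint.h>
    #include <stdio.h>

    /* hypothetical stand-ins for the real restore machinery */
    uint64_t now_ms(void);
    int restore_done(void);
    uint64_t restore_next_chunk(void);
    int current_percent(void);

    static void report_progress(int percent, uint64_t bytes, uint64_t delta_ms)
    {
        /* delta_ms may be 0 for the final call; report 0 B/s then
         * rather than a raw byte count */
        double bps = delta_ms ? (double)bytes * 1000.0 / delta_ms : 0.0;
        fprintf(stderr, "progress %3d%% (%.2f MiB/s)\n",
                percent, bps / (1024.0 * 1024.0));
    }

    void restore_loop(void)
    {
        uint64_t bytes_since_last = 0, last_ms = now_ms();

        while (!restore_done()) {
            bytes_since_last += restore_next_chunk();
            uint64_t delta_ms = now_ms() - last_ms;

            if (delta_ms >= 2000) {  /* print at most every 2 seconds */
                report_progress(current_percent(), bytes_since_last, delta_ms);
                bytes_since_last = 0;
                last_ms += delta_ms;
            }
        }
        /* single 100% line, emitted once, outside the loop */
        report_progress(100, bytes_since_last, now_ms() - last_ms);
    }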