[pbs-devel] [PATCH widget-toolkit] fix #4961: disks: increase timeout to 1min

Thomas Lamprecht t.lamprecht at proxmox.com
Tue Sep 26 15:42:08 CEST 2023


On 26/09/2023 at 10:15, Dominik Csapak wrote:
> On 9/26/23 09:55, Fabian Grünbichler wrote:
>> On September 21, 2023 10:21 am, Gabriel Goller wrote:
>>> Increase the timeout of the API call to `../nodes/localhost/disks/list`
>>> to 1min. With a lot of disks, the default timeout of 30sec is not
>>> enough.
>>>
>>
>> that alone won't be enough, since we also have a 30s timeout for
>> proxying requests in a PVE cluster (and this component is used there).
> 
> it would actually be enough to fix that exact bug, which is for pbs,
> where we don't have the 30s proxy limit, but you're right that it's
> not a generic fix across both products
> 
> imho we could still apply this here (maybe with a short test to
> see that it doesn't make the situation worse for pve) so it's fixed
> for pbs at least, and we can still properly fix it for pve later

I'd rather avoid that; 30s is already long and might make some users
just close the dialogue or whatever.

And the general "lightweight-task" mechanism that Fabian proposed could
be nice for some things, but is IMO the wrong approach here too.

Here, with the end result being to show a simple disk list, I just
refuse to believe that this has to take that long; rather, the API call
has some serious architectural problems, which, among other things, can
mean a mix of:

1. too much data is read that is not relevant most of the time; we
   could avoid querying the data with the worst expensive/useful ratio
   and make it available only on a per-disk info dialogue.

2. some info is gathered in a convoluted way (e.g., we already have the
   lsblk info, which knows if a disk is a "zfs_member", but we still do
   a zpool list, and all of those separately stat the device nodes to
   get the rdevs, multiple times...); see the first sketch after this
   list.

3. a cache is missing for the more expensive stuff, e.g., one that the
   rrd stat querying code could probably fill every Xth run or so, as
   it checks IO metrics anyway; see the second sketch after this list.
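
To illustrate point 2, a minimal sketch (in Rust, not the actual
proxmox-backup code, and assuming the serde_json crate) of getting the
filesystem type for all disks from a single lsblk call, so no extra
zpool list is needed just to detect ZFS members:

    use std::process::Command;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // one lsblk invocation for all devices; FSTYPE is "zfs_member"
        // for devices that belong to a zpool
        let output = Command::new("lsblk")
            .args(["--json", "-o", "NAME,FSTYPE"])
            .output()?;

        let parsed: serde_json::Value = serde_json::from_slice(&output.stdout)?;
        if let Some(devices) = parsed["blockdevices"].as_array() {
            // real code would also walk the nested "children" entries
            for dev in devices {
                let name = dev["name"].as_str().unwrap_or("?");
                let fstype = dev["fstype"].as_str().unwrap_or("-");
                println!("{name}: {fstype}");
            }
        }
        Ok(())
    }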
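
And for point 3, a rough sketch of such a cache; the names and the
String payload are made up, the real thing would hold the expensive
per-disk info (e.g., SMART data) and could be refreshed by the rrd stat
collector as a side effect:

    use std::collections::HashMap;
    use std::sync::Mutex;
    use std::time::{Duration, Instant};

    struct CachedEntry {
        fetched: Instant,
        data: String, // placeholder for the expensive per-disk info
    }

    pub struct DiskInfoCache {
        max_age: Duration,
        entries: Mutex<HashMap<String, CachedEntry>>,
    }

    impl DiskInfoCache {
        pub fn new(max_age: Duration) -> Self {
            Self { max_age, entries: Mutex::new(HashMap::new()) }
        }

        // serve from the cache while fresh, otherwise run the expensive
        // query and remember the result (lock held during the fetch for
        // simplicity here)
        pub fn get_or_update(&self, disk: &str, fetch: impl FnOnce() -> String) -> String {
            let mut entries = self.entries.lock().unwrap();
            if let Some(entry) = entries.get(disk) {
                if entry.fetched.elapsed() < self.max_age {
                    return entry.data.clone();
                }
            }
            let data = fetch();
            entries.insert(
                disk.to_string(),
                CachedEntry { fetched: Instant::now(), data: data.clone() },
            );
            data
        }
    }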

I'd start with point 2 above, as I saw quite a bit of optimization
potential from only a quick skim; some of that should hopefully produce
actual improvements.

Then I'd check where most of the time is spent, either by asking the
reporter nicely for timings of all relevant commands, or by just
checking ourselves; we have slow HW here too, and while not 90 disks,
even a tenth of that should show slow enough behavior to find the
major culprits with some tracing or even just
printf-timestamp-debugging.
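
For illustration, such printf-timestamp-debugging can be as simple as
the following sketch (the stage names are made up, just to show the
pattern):

    use std::time::Instant;

    // run a closure and print how long it took
    fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let result = f();
        eprintln!("{label} took {:?}", start.elapsed());
        result
    }

    fn main() {
        timed("lsblk", || { /* gather lsblk info */ });
        timed("zpool list", || { /* query zpools */ });
        timed("smart", || { /* per-disk SMART queries */ });
    }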


The _last_ thing I'd do here for this specific endpoint is extending
timeouts or inventing some fancy new light-weight task mechanism.
Optimizations like omitting expensive stuff, moving it out of the
"for-all-disks" list into a "per-single-disk" view, and caching the
"expensive but really interesting to always see" stuff go a long way.

Btw., PBS is still missing the Wipe-Disk feature from PVE.
