[pbs-devel] Scheduler causing connectivity issues?

Mon Jul 18 09:31:41 CEST 2022

Hi,

>You have 30% of runnable processes getting stalled due to waiting for IO. That
>naturally should not cause the request accept future to get starved, but it is
>the reason why it happened with the current (or better, old)
>architecture. Increasing available memory, so that the page cache can hold
>more entries, could already relieve that system a bit.

Thanks. Please note that /var/lib/proxmox is on a different set of disks
than the datastores. The root pool is on two PM883s; the datastore is on
lots of spinning disks with NVMe special devices. Not sure if that’s
relevant to your findings, but here you have it :)

A memory upgrade is somewhere on our roadmap.

>We improved on the reproducer we got locally by simulating a higher latency
>disk using dm-delay on a small single core VM.
>
>For one, we made libpve-storage-perl do more efficient list-snapshot
>requests if they can be filtered by VMID, and on the PBS side we moved most
>operations that cause IO (and are related to backup groups/snapshots) to a
>separate thread pool so that the main thread should be less
>congested/blocked.

Given the other responses in this thread, I’m not going to upgrade
production to a testing version yet. Please let me know if there is any
other info you need from me.
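
For anyone reading along in the archive, the thread-pool change quoted
above can be pictured with a minimal Python sketch. This is not PBS code
(PBS is written in Rust); `list_snapshots_blocking`, `heartbeat`, the
pool size and the delays are all hypothetical, made up for illustration.
It only shows the general idea: blocking IO runs in a separate pool so the
main loop stays free to accept requests.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def list_snapshots_blocking(vmid: int, delay: float = 0.2) -> list[str]:
    """Hypothetical stand-in for a slow, blocking disk operation."""
    time.sleep(delay)  # simulates a high-latency disk
    return [f"vm/{vmid}/snapshot-{i}" for i in range(3)]


async def heartbeat(ticks: list[float], interval: float = 0.02) -> None:
    """Stands in for the main loop accepting requests; records each wakeup."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def main() -> tuple[list[str], int]:
    ticks: list[float] = []
    io_pool = ThreadPoolExecutor(max_workers=4)  # assumed pool size
    loop = asyncio.get_running_loop()
    hb = asyncio.create_task(heartbeat(ticks))

    # The blocking call runs in the pool; the event loop keeps ticking.
    snapshots = await loop.run_in_executor(io_pool, list_snapshots_blocking, 100)

    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    io_pool.shutdown()
    return snapshots, len(ticks)


if __name__ == "__main__":
    snapshots, tick_count = asyncio.run(main())
    print(snapshots, "heartbeat ticks while waiting:", tick_count)
```

If the blocking function were called directly inside `main()` instead of
through `run_in_executor`, the heartbeat would not run at all until the
call returned, which is roughly the starvation described in the quote.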

-- 
Mark Schouten, CTO
Tuxis B.V.
mark at tuxis.nl