[pbs-devel] RFC: Scheduler for PBS
Max Carrara
m.carrara at proxmox.com
Fri Aug 9 16:20:30 CEST 2024
On Fri Aug 9, 2024 at 2:52 PM CEST, Dominik Csapak wrote:
> Hi,
>
> great to see that you tackle this!
>
> I read through the overview, which sounds fine, but I think that it
> should more reflect the actual issues, namely limitations in memory,
> threads, disk io and network.
>
> The actual reason people want to schedule things is to not overload the system
> (because of timeouts, hangs, etc.) so any scheduling system should consider
> not only the amount of jobs, but how much resources the job will/can
> utilize.
>
> E.g. when I tried to introduce multi-threaded tape backup (configurable threads
> per tape job), Thomas rightfully said that it's probably not a good idea, since
> making multiple parallel tape backup jobs increases the load by much more than before.
>
> I generally like the approach, but I personally would like to see some
> work with resource constraints, for example one could imagine a configurable
> amount of available threads and (configurable?) used thread by job type
>
> so i can set my available to e.g. 10 and if my tape backup jobs then get
> 4, i can start 2 in parallel but not more
>
> Such a system does not have to be included from the beginning IMO, but the
> architecture should be prepared for such things
>
> Does that make sense?
That does make sense, yes! Thanks for bringing this to our attention.
We've just discussed this off-list a bit and mostly agree on things like
the thread limit per worker - though to be sure, do you mean the
number of threads that are passed to e.g. a `ParallelHandler` and
similar?
The scheduler has no way to *really* enforce any limits, though with
the event-based architecture, it should be fairly trivial to just add
new fields for such limits to the scheduler's config.
We want to have a kind of "top-down control", so once the scheduler can
actually spawn and manage tasks itself (not like how it's done right
now, see my response to Chris), the scheduler could give the task a
separate thread pool for the stuff it wants to run in parallel. There
could even be different "types" of thread pools depending on the
purpose.
This is much easier said than done, but I'm honestly rather
confident that we can get this to work. I would prefer to have the
resource-checking and -management decoupled and walled off, so that the
scheduler itself isn't really concerned with that. Rather, it should ask
the (e.g.) `ResourceManager` if there are enough threads available for a
`JobType::TapeBackup` or something of the sort.
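
To make the idea a bit more concrete, here is a minimal sketch of how such
a decoupled `ResourceManager` could look, using Dominik's numbers (10
threads available, 4 per tape backup job). Nothing here exists in PBS:
`ThreadGrant`, `try_reserve`, `available_threads`, `JobType::Sync` and the
per-type cost table are names made up for illustration, and a plain
`Mutex`-guarded counter from the standard library stands in for whatever
primitive would actually be used.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical job types; only `TapeBackup` is named in this thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum JobType {
    TapeBackup,
    Sync,
}

/// Hypothetical bookkeeping for a global thread budget.
struct ResourceManager {
    /// Threads that have not been handed out yet.
    available: Arc<Mutex<usize>>,
    /// How many threads a job of each type is assumed to use.
    cost: HashMap<JobType, usize>,
}

/// Handed to a job on success; returns its threads to the budget on drop.
struct ThreadGrant {
    available: Arc<Mutex<usize>>,
    threads: usize,
}

impl Drop for ThreadGrant {
    fn drop(&mut self) {
        *self.available.lock().unwrap() += self.threads;
    }
}

impl ResourceManager {
    fn new(total_threads: usize, cost: HashMap<JobType, usize>) -> Self {
        Self { available: Arc::new(Mutex::new(total_threads)), cost }
    }

    /// Reserve threads for `job`. Returns `None` if the job type has no
    /// configured cost or if not enough threads are left.
    fn try_reserve(&self, job: JobType) -> Option<ThreadGrant> {
        let threads = *self.cost.get(&job)?;
        let mut available = self.available.lock().unwrap();
        if *available < threads {
            return None;
        }
        *available -= threads;
        Some(ThreadGrant { available: Arc::clone(&self.available), threads })
    }

    fn available_threads(&self) -> usize {
        *self.available.lock().unwrap()
    }
}

fn main() {
    let costs = HashMap::from([(JobType::TapeBackup, 4), (JobType::Sync, 2)]);
    let manager = ResourceManager::new(10, costs);

    let first = manager.try_reserve(JobType::TapeBackup);
    let second = manager.try_reserve(JobType::TapeBackup);
    let third = manager.try_reserve(JobType::TapeBackup);
    println!("first: {}, second: {}, third: {}",
        first.is_some(), second.is_some(), third.is_some());

    // Finishing a job (dropping its grant) frees its threads again.
    drop(first);
    println!("threads available: {}", manager.available_threads());
}
```

The scheduler would only ever see "granted" or "not granted"; how the
budget is tracked stays inside the manager.
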
Another thing we've been discussing just now was to just give the
spawned task a struct representing the limits it should abide by - that
would be a soft limit, but it would probably make things a lot easier.
(After all, passing a thread pool to the task also doesn't mean the task
*has* to use that thread pool...)
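
A rough sketch of the "limits struct" variant, just to show the shape of
it: `TaskLimits`, `max_threads` and `run_task` are hypothetical names, and
the summing of numbers is only a stand-in for real work. The limit is soft
in exactly the sense described above, since the task itself has to clamp
its worker count and nothing stops it from ignoring the struct.

```rust
use std::thread;

/// Hypothetical soft limits handed to a task when it is spawned.
#[derive(Clone, Copy, Debug)]
struct TaskLimits {
    /// Upper bound on worker threads the task should start.
    max_threads: usize,
}

/// Stand-in for a task body: sums `items` using several worker threads,
/// voluntarily capping the worker count at `limits.max_threads`.
/// Returns the worker cap that was applied and the computed sum.
fn run_task(items: Vec<u64>, wanted_threads: usize, limits: TaskLimits) -> (usize, u64) {
    // Honor the soft limit, but always allow at least one worker.
    let workers = wanted_threads.min(limits.max_threads).max(1);
    // Split the input so that at most `workers` slices are produced.
    let per_worker = items.len().div_ceil(workers).max(1);

    let total = thread::scope(|scope| {
        let handles: Vec<_> = items
            .chunks(per_worker)
            .map(|part| scope.spawn(move || part.iter().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    });

    (workers, total)
}

fn main() {
    let limits = TaskLimits { max_threads: 4 };
    // The task would like 16 threads, but is only supposed to use 4.
    let (workers, total) = run_task((1..=100).collect(), 16, limits);
    println!("capped at {workers} workers, sum = {total}");
}
```
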
One thing I just discovered is tokio's `Semaphore` [1], which we could use
to keep track of the resources we've been handing out.
So, IMO this is a good idea and something we definitely should consider
in the future, though I have a couple of questions:

1. How would you track & enforce memory limits? I think this is a much
   harder problem, to be honest.

2. In the same vein, how could one find out how much memory a given task
   will use? There's nothing that prevents tasks from just allocating
   more memory at will, obviously.

   Do you rather mean that if there's e.g. >90% memory being used (can
   be made configurable), that we're not spawning any additional tasks?

3. How would you limit disk IO? We definitely want to add a limit for
   the number of jobs that can run on a datastore at a time, so I guess
   that would also be indirectly included there..?

   (It could probably also be done with tokio's `Semaphore` [1], but
   we'd need some kind of abstraction on top of that, because we can
   still just read / write / open / close at will etc. We would need a
   uniform way of accessing disk resources and *not* use any other way
   to perform disk IO otherwise, which will be *hard*)

4. I guess network limits (e.g. bandwidth limits for sync jobs etc.)
   could just be enforced on the TCP socket, so this shouldn't be too
   hard. That way you could enforce individual rate limits for
   individual tasks. Though, probably also easier said than done. Can
   you elaborate some more on this, too?

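
For the ">90% memory used" variant from question 2, a minimal sketch of
such an admission check could look like the following. It assumes a Linux
host and reads `MemTotal` and `MemAvailable` from `/proc/meminfo`; the
function names and the choice to refuse new tasks when the data cannot be
read are made up for illustration. Note that this is only a snapshot taken
at spawn time and does nothing about tasks that allocate more memory later.

```rust
/// Parse the value (in kB) of one `/proc/meminfo` field, e.g. "MemTotal".
fn meminfo_kib(meminfo: &str, field: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        let (name, rest) = line.split_once(':')?;
        if name.trim() != field {
            return None;
        }
        rest.split_whitespace().next()?.parse().ok()
    })
}

/// Memory in use as an integer percentage (rounded down), derived from
/// `MemTotal` and `MemAvailable`.
fn used_percent(meminfo: &str) -> Option<u64> {
    let total = meminfo_kib(meminfo, "MemTotal")?;
    let available = meminfo_kib(meminfo, "MemAvailable")?;
    if total == 0 {
        return None;
    }
    Some(total.saturating_sub(available) * 100 / total)
}

/// Hypothetical admission check: allow a new task only while memory usage
/// is at or below `threshold` percent.
fn may_spawn(meminfo: &str, threshold: u64) -> bool {
    match used_percent(meminfo) {
        Some(used) => used <= threshold,
        // Be conservative if the data is missing or malformed.
        None => false,
    }
}

fn main() {
    // On non-Linux systems the file is missing and the check refuses.
    let meminfo = std::fs::read_to_string("/proc/meminfo").unwrap_or_default();
    println!("memory used: {:?}%", used_percent(&meminfo));
    println!("may spawn another task: {}", may_spawn(&meminfo, 90));
}
```
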
Thanks a lot for your input, you've given us lots of ideas as well! :)
[1]: https://docs.rs/tokio/latest/tokio/sync/struct.Semaphore.html