[pbs-devel] RFC: Scheduler for PBS

Fri Aug 9 16:20:30 CEST 2024

On Fri Aug 9, 2024 at 2:52 PM CEST, Dominik Csapak wrote:
> Hi,
>
> great to see that you tackle this!
>
> I read through the overview, which sounds fine, but I think that it
> should more reflect the actual issues, namely limitations in memory,
> threads, disk io and network.
>
> The actual reason people want to schedule things is to not overload the system
> (because of timeouts, hangs, etc.) so any scheduling system should consider
> not only the amount of jobs, but how much resources the job will/can
> utilize.
>
> E.g. when I tried to introduce multi-threaded tape backup (configurable threads
> per tape job), Thomas rightfully said that it's probably not a good idea, since
> running multiple parallel tape backup jobs increases the load by much more than before.
>
> I generally like the approach, but I personally would like to see some
> work with resource constraints, for example one could imagine a configurable
> amount of available threads and (configurable?) used thread by job type
>
> so i can set my available to e.g. 10 and if my tape backup jobs then get
> 4, i can start 2 in parallel but not more
>
> Such a system does not have to be included from the beginning IMO, but the
> architecture should be prepared for such things
>
> Does that make sense?

That does make sense, yes! Thanks for bringing this to our attention.

We've just discussed this off-list a bit and mostly agree on things like
the thread limit per worker. Just to be sure, though: do you mean the
number of threads that are passed to e.g. a `ParallelHandler` and
similar?

The scheduler doesn't have a way to *really* enforce any limits, though
with the event-based architecture, it should be fairly easy to add new
fields for such limits to the scheduler's config.

We want to have a kind of "top-down control", so once the scheduler can
actually spawn and manage tasks itself (not like how it's done right
now, see my response to Chris), the scheduler could give the task a
separate thread pool for the stuff it wants to run in parallel. There
could even be different "types" of thread pools depending on the
purpose.

This is much easier said than done, but I'm honestly rather confident
that we can get this to work. I would prefer to have the resource
checking and management decoupled and walled off, so that the scheduler
itself isn't really concerned with that. Rather, it should ask the
(e.g.) `ResourceManager` if there are enough threads available for a
`JobType::TapeBackup` or something of the sort.
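
To make the idea a bit more concrete, here is a rough sketch of what
asking such a `ResourceManager` could look like. Everything except the
names `ResourceManager` and `JobType::TapeBackup` is made up for
illustration (`try_reserve`, `release`, `JobType::Sync`, the thread
counts), and it uses a plain `std` counter behind a `Mutex` instead of
anything async, just to stay dependency-free:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical job types; only `TapeBackup` is mentioned in this thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum JobType {
    TapeBackup,
    Sync,
}

/// Hypothetical resource manager that only tracks a thread budget.
struct ResourceManager {
    /// Number of threads a job of each type is assumed to need.
    threads_per_job: HashMap<JobType, usize>,
    /// Threads that have not been handed out yet.
    available: Mutex<usize>,
}

impl ResourceManager {
    fn new(total_threads: usize, threads_per_job: HashMap<JobType, usize>) -> Self {
        Self {
            threads_per_job,
            available: Mutex::new(total_threads),
        }
    }

    /// Reserve the threads for one job. Returns the reserved amount, or
    /// `None` if the job type is unknown or the budget is exhausted.
    fn try_reserve(&self, job: JobType) -> Option<usize> {
        let needed = *self.threads_per_job.get(&job)?;
        let mut available = self.available.lock().unwrap();
        if *available >= needed {
            *available -= needed;
            Some(needed)
        } else {
            None
        }
    }

    /// Hand threads back once the job has finished.
    fn release(&self, threads: usize) {
        *self.available.lock().unwrap() += threads;
    }
}

fn main() {
    // Dominik's example: 10 threads available, 4 per tape backup job.
    let per_job = HashMap::from([(JobType::TapeBackup, 4), (JobType::Sync, 2)]);
    let manager = ResourceManager::new(10, per_job);

    let first = manager.try_reserve(JobType::TapeBackup);
    let second = manager.try_reserve(JobType::TapeBackup);
    let third = manager.try_reserve(JobType::TapeBackup);
    println!("{:?} {:?} {:?}", first, second, third);

    // Once a job finishes and returns its threads, the next one fits again.
    manager.release(first.unwrap_or(0));
    println!("{:?}", manager.try_reserve(JobType::TapeBackup));
}
```

With these numbers the first two tape backup jobs get their 4 threads
each and the third one is refused until one of them is released. The
scheduler only sees "yes" or "no" here and never touches the counter
itself, which is the kind of decoupling I have in mind.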

Another thing we've been discussing just now is to simply give the
spawned task a struct representing the limits it should abide by - that
would be a soft limit, but it would probably make things a lot easier.
(After all, passing a thread pool to the task also doesn't mean the task
*has* to use that thread pool...)
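
As a minimal sketch of the "limits struct" variant: the names
`TaskLimits`, `max_threads` and `plan_workers` below are hypothetical
and not taken from any existing code. The point is only that the task
clamps itself to what it was given, and nothing forces it to:

```rust
use std::thread;

/// Hypothetical soft limits handed to a task when it is spawned.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct TaskLimits {
    max_threads: usize,
}

/// The task decides how many workers to start and clamps its own wish
/// to the limit it was given. At least one worker is always started.
fn plan_workers(wanted: usize, limits: &TaskLimits) -> usize {
    wanted.min(limits.max_threads).max(1)
}

fn main() {
    let limits = TaskLimits { max_threads: 4 };

    // The task would like 16 workers, but abides by the soft limit.
    let workers = plan_workers(16, &limits);

    // Dummy workers that just return a value derived from their index.
    let handles: Vec<_> = (0..workers)
        .map(|id| thread::spawn(move || id * 2))
        .collect();
    let results: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();

    println!("started {} workers: {:?}", results.len(), results);
}
```

Since the clamping happens inside the task, a misbehaving task could
simply ignore `limits`, which is exactly why this would only be a soft
limit.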

One thing I just discovered is tokio's `Semaphore` [1], which we could use
to keep track of the resources we've been handing out.

So, IMO this is a good idea and something we definitely should consider
in the future, though I have a couple questions:

1. How would you track & enforce memory limits? I think this is a much
   harder problem, to be honest.

2. In the same vein, how could one find out how much memory a given task
   will use? There's nothing that prevents tasks from just allocating
   more memory at will, obviously.

   Do you rather mean that if there's e.g. >90% memory being used (can
   be made configurable), that we're not spawning any additional tasks?

3. How would you limit disk IO? We definitely want to add a limit for
   the number of jobs that can run on a datastore at a time, so I guess
   that would also be indirectly included there..?

   (It could probably also be done with tokio's `Semaphore` [1], but
   we'd need some kind of abstraction on top of that, because we can
   still just read / write / open / close at will etc. We would need a
   uniform way of accessing disk resources and must *not* perform disk
   IO in any other way, which will be *hard*.)

4. I guess network limits (e.g. bandwidth limits for sync jobs etc.)
   could just be enforced on the TCP socket, so this shouldn't be too
   hard. That way you could enforce individual rate limits for
   individual tasks. Though, probably also easier said than done. Can
   you elaborate some more on this, too?
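
For the threshold variant from question 2, a rough sketch could look
like the following. It assumes Linux and the `MemTotal` /
`MemAvailable` fields of `/proc/meminfo`; the function names and the
90% value are made up for illustration, and it does not try to track
memory per task at all:

```rust
use std::fs;

/// Extract a kB value such as `MemTotal` from `/proc/meminfo`-style text.
fn meminfo_kb(meminfo: &str, key: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        let rest = line.strip_prefix(key)?.strip_prefix(':')?;
        rest.split_whitespace().next()?.parse::<u64>().ok()
    })
}

/// Percentage of memory in use, based on `MemTotal` and `MemAvailable`.
fn used_percent(meminfo: &str) -> Option<u64> {
    let total = meminfo_kb(meminfo, "MemTotal")?;
    let available = meminfo_kb(meminfo, "MemAvailable")?;
    if total == 0 {
        return None;
    }
    Some(total.saturating_sub(available) * 100 / total)
}

/// Hypothetical gate: only allow spawning while usage is at or below
/// the threshold. If usage cannot be determined, refuse to spawn.
fn may_spawn(meminfo: &str, max_used_percent: u64) -> bool {
    matches!(used_percent(meminfo), Some(used) if used <= max_used_percent)
}

fn main() {
    // Read the real file on Linux; fall back to sample data elsewhere.
    let meminfo = fs::read_to_string("/proc/meminfo")
        .unwrap_or_else(|_| "MemTotal: 1000 kB\nMemAvailable: 50 kB\n".to_string());

    println!(
        "used: {:?}%, may spawn: {}",
        used_percent(&meminfo),
        may_spawn(&meminfo, 90)
    );
}
```

This only answers "should we start another task right now?" and says
nothing about how much memory a task will allocate afterwards, so the
harder part of questions 1 and 2 remains open.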

Thanks a lot for your input, you've given us lots of ideas as well! :)

[1]: https://docs.rs/tokio/latest/tokio/sync/struct.Semaphore.html