[pbs-devel] [PATCH proxmox-backup 0/4] improve push sync job log messages
Fabian Grünbichler
f.gruenbichler at proxmox.com
Thu Nov 21 20:32:56 CET 2024
> Gabriel Goller <g.goller at proxmox.com> hat am 21.11.2024 18:04 CET geschrieben:
>
>
> On 21.11.2024 17:26, Christian Ebner wrote:
> >On 11/21/24 17:06, Gabriel Goller wrote:
> >>I think I'd be nice if we use `.context()` instead of `format_err` and
> >>debug print ("{err:?}") instead of default ("{}") here. As these
> >>messages are all errors they shouldn't appear too often in the log and
> >>if an error happens, you get much more information.
> >>
> >>What do you think?
> >
> >Hmm, can check that out, but wouldn't that disrupt the single line
> >character of the log entries, prefixed by the time stamp?
>
> We could use "{err:#}", which will print everything in one line?
>
> Although on the other hand, I'd understand if you want to have strict
> control over what is displayed to the users, so no hard feelings on
> this one.
big tasks/job like the sync ones are a bit special w.r.t. error handling because they "catch" errors often at intermediate levels, log them, record that an error occur, but then proceed with a sensible next unit of work. so it's not possible to just add a lot of context to an error like with simple(r) API calls, where many/most errors can be treated as fatal, maybe requiring a bit of cleanup before bailing completely.
I think a good compromise is to treat those units of work as a sort of error scope, and
- add context where it is missing (e.g., low level file or network access, where just bubbling up errors might be meaningless)
- and then when logging the error, think about formatting
e.g., for syncs we have groups as lowest unit of work - if something fails within a group, we abort that group, but proceed with the next one. but if we just log "syncing group X failed - permission denied" that doesn't help whatsoever. it might be better to have three lines of warnings if something goes wrong, if those three lines contain appropriate information like
- request X / file access Y /.. failed with error Z (root cause)
- while processing snapshot A in group B (important context, because the request or path of the root cause might not tell us this)
- syncing group B failed (result of the error chain)
achieving that requires carefully analyzing all error sources/chains though. when in doubt, I'd rather have a bit too much context initially (that usually means too much is added at lower levels and we can optimize there) than too little, but within reason ;) we can't travel back in time and add needed information to debug issues to our or users' logs, a stray duplicate context can easily be filtered out though. of course that doesn't mean every error is allowed to grow 10 lines long by default, balance is still important even when doing a first pass/implementation of improved error messages!
and of course, the general rule applies - if we use errors with proper context, it should be done fairly consistently, so that the next level up can actually use it. it might be easier to use format_err or explicit additional `warn!` invocations to improve the situation immediately, and postpone adding proper contexts to do it for all of sync without time pressure.
More information about the pbs-devel
mailing list