[pve-devel] [RFC/PATCH qemu] PVE-Backup: avoid segfault issues upon backup-cancel

Wed May 25 10:10:43 CEST 2022

There might still be an edge case where completion and cancel race (I
didn't run into this in practice yet, but at a first glance it seems
possible):

1. job_exit -> job_completed -> job_finalize_single starts
2. pvebackup_co_complete_stream gets spawned in completion callback
3. job finalize_single finishes -> job's refcount hits zero -> job is freed
4. qmp_backup_cancel comes in and locks backup_state.backup_mutex before
pvebackup_co_complete_stream can remove the job from the di_list
5. qmp_backup_cancel/job_cancel_bh will operate on the already freed memory

It /would/ be fine if pvebackup_co_complete_stream is guaranteed to
run/take the backup_mutex before qmp_backup_cancel. It *is* spawned
earlier so maybe it is, but I haven't looked into ordering guarantees
for coroutines yet and it does have another yield point when taking
&backup_state.stat.lock, so I'm not so sure.

Possible fix: ref jobs when adding them to di_list and unref them when
removing them from di_list (instead of the proposed ref/unref used in
this patch).

Yet another issue (not directly related, but thematically): in
create_backup_jobs_bh, in the error case, job_cancel_sync is called for
each job, but since it's a transaction, the first call will cancel and
free all jobs, also leading to segfaults in scenarios where creation of
a non-first job fails. And even if the first one fails, the job_unref
there is also wrong, since the job was freed during job_cancel_sync.