[pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache

Fabian Grünbichler f.gruenbichler at proxmox.com
Mon Oct 27 11:59:30 CET 2025


On October 16, 2025 3:18 pm, Christian Ebner wrote:
> Chunks detected as corrupt have been renamed on both, the S3 backend
> and the local datastore cache, but not evicted from the in-memory
> cache containing the LRU chunk digests. This can lead to the chunks
> being considered as already present if their digest is still cached,
> and therefore not being re-inserted in the local store cache and S3
> backend on backup upload.
> 
> Fix this by not only renaming the local datastore's chunk marker
> file, but also removing it from the in-memory cache while holding the
> chunk store mutex lock to exclude interference from concurrent chunk
> inserts.
> 
> Reported-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
> Signed-off-by: Christian Ebner <c.ebner at proxmox.com>
> ---
>  pbs-datastore/src/datastore.rs | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index a7ea8fd96..c551679e3 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -2467,10 +2467,18 @@ impl DataStore {
>  
>          let _lock = self.inner.chunk_store.mutex().lock().unwrap();
>  
> -        match std::fs::rename(&path, &new_path) {
> +        let result = match std::fs::rename(&path, &new_path) {
>              Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),
>              Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),
>              Err(err) => bail!("could not rename corrupted chunk {path:?} - {err}"),
> +        };
> +
> +        if let Some(cache) = self.cache() {
> +            // Requirements for calling the unsafe method are met, since snapshots referencing the
> +            // corrupt chunk are to be considered corrupt. Ignore the error due to the missing file.
> +            let _ = unsafe { cache.remove(digest) };
>          }

not exactly related to this patch, but something I noticed while
reviewing it:

how do we protect against concurrent actions on the same chunk between these
steps (roughly sketched below)?
- verification detects the chunk as corrupt
- rename on the S3 level
- lock + rename and remove on the chunk store/cache level
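
to make the window I mean concrete, here is roughly the shape as I read it
(a simplified, hypothetical stand-in - none of the types or names below are
the actual datastore code):

use std::sync::Mutex;

// minimal stand-ins for the real types, only to make the locking window visible
struct ChunkStore {
    mutex: Mutex<()>,
}

struct S3Backend;

impl S3Backend {
    fn rename_corrupt(&self, _digest: &[u8; 32]) {
        // network round trip to the S3 backend
    }
}

fn handle_corrupt_chunk(store: &ChunkStore, s3: &S3Backend, digest: &[u8; 32]) {
    // 1) rename the object on the S3 level
    //    -> no chunk store lock is held yet, so a concurrent backup writer or a
    //       second verification task can act on the same digest in this window
    s3.rename_corrupt(digest);

    // 2) only now take the chunk store mutex
    let _guard = store.mutex.lock().unwrap();

    // 3) rename the local marker file and evict the digest from the in-memory
    //    LRU cache (the part this patch adds), then drop the lock
}

fn main() {
    let store = ChunkStore {
        mutex: Mutex::new(()),
    };
    handle_corrupt_chunk(&store, &S3Backend, &[0u8; 32]);
}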

let's say two verification tasks and a backup writer race, all involving
the same chunk:

- verification A and B detect chunk as corrupt
- verification A renames it on the S3 level
- verification A renames it on the chunk store level and removes it from
  the cache
- backup writer uploads it to S3
- verification B renames it on the S3 level

and now either:
- backup writer inserts it into the cache
- verification B renames it on the chunk store level and removes it from
  the cache

which means the backup writer creates a corrupt backup that is not detected
as such

or
- verification B attempts to rename it on the chunk store level
  (ENOENT) and removes it from the cache
- backup writer inserts it into the cache

which means we just broke cache coherency - the chunk exists in the
chunk store and cache, but not on S3 (anymore)


or another racy variant:
- verification A and B detect chunk as corrupt
- verification A renames it on the S3 level
- verification B tries to rename it on the S3 level, but that fails because
  the object no longer exists there??

I think we might need an additional synchronization mechanism for verification
involving S3 - maybe something along the lines of the (very rough) sketch below.
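
purely as an illustration of what I mean (none of these names exist in the
codebase) - a per-digest lock that both the verification path and the chunk
insert path hold for the full sequence:

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Digest = [u8; 32];

// hypothetical per-digest lock table, illustration only
#[derive(Default)]
struct DigestLocks {
    locks: Mutex<HashMap<Digest, Arc<Mutex<()>>>>,
}

impl DigestLocks {
    // hand out a lock dedicated to this digest: verification would hold it across
    // "detect corrupt -> rename on S3 -> rename marker -> evict from cache", and a
    // backup writer across "check cache -> upload -> insert", so the interleavings
    // above cannot happen for the same chunk
    fn lock_digest(&self, digest: &Digest) -> Arc<Mutex<()>> {
        let mut map = self.locks.lock().unwrap();
        Arc::clone(map.entry(*digest).or_default())
    }
}

fn main() {
    let locks = DigestLocks::default();
    let digest = [0u8; 32];
    let per_chunk = locks.lock_digest(&digest);
    let _held = per_chunk.lock().unwrap();
    // ... whole corrupt-chunk handling (or insert) sequence for `digest` here ...
}

in the real code this would of course have to be async-aware and/or file-based
(the S3 rename is a network round trip, so holding a plain std mutex across it
is not an option), but the point is just to serialize everything touching one
digest across the full detect/rename/evict respectively check/upload/insert
sequence.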

> +
> +        result
>      }
>  }
> -- 
> 2.47.3
> 