[pbs-devel] [PATCH proxmox-backup 3/5] garbage collection: add structure for optimized image iteration

Fabian Grünbichler f.gruenbichler at proxmox.com
Wed Mar 5 14:47:36 CET 2025


On February 21, 2025 3:01 pm, Christian Ebner wrote:
> Implements the GroupedImageList struct and methods, which groups
> index files (image) paths by a hierarchy for optimized iteration
> during phase 1 of garbage collection.
> 
> Currently, phase 1 of garbage collection iterates over all folders in
> the datastore, without considering any logical organization. This is
> to avoid missing image indices which might have unexpected paths,
> thereby deleting chunks which are still in use by these indices in GC
> phase 2.
> 
> The new structure helps to iterate over the index files in a more
> logical way, without missing strange paths. The hierarchical
> organization helps to avoid touching shared chunks of incremental
> snapshot backups in a backup group multiple times, by allowing
> tracking of these without excessive memory requirements.
> 
> Since deduplication happens on a per image basis for subsequent
> snapshots, the hierarchy is chosen as follows:
> - ns/group
> - image filename
> - snapshot timestamp
> 
> This allows to iterate over consecutive snapshots for the same images
> in the same backup namespace and group.
> 
> Signed-off-by: Christian Ebner <c.ebner at proxmox.com>
> ---
>  pbs-datastore/src/datastore.rs | 63 ++++++++++++++++++++++++++++++++++
>  1 file changed, 63 insertions(+)
> 
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index eda78193d..520f54548 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -1,4 +1,5 @@
>  use std::collections::{HashMap, HashSet};
> +use std::ffi::OsString;
>  use std::io::{self, Write};
>  use std::os::unix::ffi::OsStrExt;
>  use std::os::unix::io::AsRawFd;
> @@ -1573,3 +1574,65 @@ impl DataStore {
>          Ok(())
>      }
>  }
> +
> +struct GroupedImageList {
> +    groups: HashMap<String, HashMap<OsString, Vec<(i64, PathBuf)>>>,

this seems very complicated, couldn't we make it a lot simpler by doing

key: NS + Group (as tuple, using the actual types)
value: (snapshot => index paths)

or even a nested mapping of

NS -> Group -> Snapshot -> Images

and then simply reset when proceeding from one snapshot to the next? the
scope of "in-memory chunk digests" would still be bounded (in a way that
is in line with how we do it in many other parts, like when doing backup
+ restore), and the structure of this helper entity and the way it is
iterated over would feel much more natural.

we could go one step further, but that might eat some of our performance
gains here

- list all images like we did before and store the result in a
  collection that allows fast removal
- iterate normally over the datastore in a structured fashion using the
  existing helpers
-- when proceeding from one snapshot to the next, use the new "don't
retouch chunks" logic
-- remove each visited image path from the list of images
- if any images are left at the end in that list, mark those manually
  (strange or vanished paths, should hopefully be empty or irrelevant)

the main downside is that we'd have to iterate twice (well, not quite
twice, since the hierarchical iterators skip parts outside of the
"known" hierarchy), but we would save all this custom parse-back logic
here. if we keep the parse-back logic, then I think we want to have a
logical structure as well that follows how we normally do things.

I think that the list_images part of GC is by far the least expensive
though (for my test datastore where the GC runtime goes down from 52 to
17s with your patch series, listing images is 12ms of that ;)), so
effectively doing it twice might not hurt that much in practice..

> +    strange_path_images: Vec<PathBuf>,
> +}
> +
> +impl GroupedImageList {
> +    fn new() -> Self {
> +        Self {
> +            groups: HashMap::new(),
> +            strange_path_images: Vec::new(),
> +        }
> +    }
> +
> +    fn insert(&mut self, img: &Path, base_path: &Path) -> Result<(), Error> {
> +        let img = img.to_path_buf();
> +
> +        if let Some(backup_dir_path) = img.parent() {
> +            let backup_dir_path = backup_dir_path.strip_prefix(base_path)?;
> +
> +            if let Some(backup_dir_str) = backup_dir_path.to_str() {
> +                if let Ok((namespace, backup_dir)) =
> +                    pbs_api_types::parse_ns_and_snapshot(backup_dir_str)
> +                {
> +                    if let Some(filename) = img.file_name() {
> +                        let filename = filename.to_os_string();
> +                        let group_key = format!("{namespace}/{group}", group = backup_dir.group);
> +
> +                        if let Some(images) = self.groups.get_mut(&group_key) {
> +                            if let Some(snapshots) = images.get_mut(&filename) {
> +                                snapshots.push((backup_dir.time, img));
> +                            } else {
> +                                let snapshots = vec![(backup_dir.time, img)];
> +                                images.insert(filename, snapshots);
> +                            }
> +                        } else {
> +                            // ns/group not present, insert new
> +                            let snapshots = vec![(backup_dir.time, img)];
> +                            let mut images = HashMap::new();
> +                            images.insert(filename, snapshots);
> +                            self.groups.insert(group_key, images);
> +                        }

this whole part could then be simplified using the Entry API, e.g. if 

    groups: HashMap<(BackupNamespace, String), HashMap<i64, Vec<PathBuf>>>,

then we can just do

                    let group_key = (namespace, backup_dir.group.to_string());

                    let snapshot_map = self.groups.entry(group_key).or_default();
                    let images = snapshot_map.entry(backup_dir.time).or_default();
                    images.push(img);

or similar if we split NS and group..

> +                        return Ok(());
> +                    }
> +                }
> +            }
> +        }
> +
> +        self.strange_path_images.push(img);
> +        Ok(())
> +    }
> +
> +    fn len(&self) -> usize {
> +        let mut count = self.strange_path_images.len();
> +        for (_group, images) in self.groups.iter() {
> +            for (_image, snapshots) in images.iter() {
> +                count += snapshots.len();
> +            }
> +        }
> +        count
> +    }
> +}
> -- 
> 2.39.5
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 




More information about the pbs-devel mailing list