[pbs-devel] [PATCH proxmox-backup 3/5] garbage collection: add structure for optimized image iteration
Christian Ebner
c.ebner at proxmox.com
Fri Mar 7 09:24:00 CET 2025
On 3/5/25 14:47, Fabian Grünbichler wrote:
> On February 21, 2025 3:01 pm, Christian Ebner wrote:
>> Implements the GroupedImageList struct and methods, which groups
>> index files (image) paths by a hierarchy for optimized iteration
>> during phase 1 of garbage collection.
>>
>> Currently, phase 1 of garbage collection iterates over all folders in
>> the datastore, without considering any logical organization. This is
>> to avoid missing image indices which might have unexpected paths,
>> thereby deleting chunks which are still in use by these indices in GC
>> phase 2.
>>
>> The new structure helps to iterate over the index files in a more
>> logical way, without missing strange paths. The hierarchical
>> organization helps to avoid touching shared chunks of incremental
>> snapshot backups in a backup group multiple times, by allowing
>> tracking of these without excessive memory requirements.
>>
>> Since deduplication happens on a per image basis for subsequent
>> snapshots, the hierarchy is chosen as follows:
>> - ns/group
>> - image filename
>> - snapshot timestamp
>>
>> This allows to iterate over consecutive snapshots for the same images
>> in the same backup namespace and group.
>>
>> Signed-off-by: Christian Ebner <c.ebner at proxmox.com>
>> ---
>> pbs-datastore/src/datastore.rs | 63 ++++++++++++++++++++++++++++++++++
>> 1 file changed, 63 insertions(+)
>>
>> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
>> index eda78193d..520f54548 100644
>> --- a/pbs-datastore/src/datastore.rs
>> +++ b/pbs-datastore/src/datastore.rs
>> @@ -1,4 +1,5 @@
>> use std::collections::{HashMap, HashSet};
>> +use std::ffi::OsString;
>> use std::io::{self, Write};
>> use std::os::unix::ffi::OsStrExt;
>> use std::os::unix::io::AsRawFd;
>> @@ -1573,3 +1574,65 @@ impl DataStore {
>> Ok(())
>> }
>> }
>> +
>> +struct GroupedImageList {
>> + groups: HashMap<String, HashMap<OsString, Vec<(i64, PathBuf)>>>,
>
> this seems very complicated, couldn't we make it a lot simpler by doing
>
> key: NS + Group (as tuple, using the actual types)
> value: (snapshot => index paths)
>
> or even a nested mapping of
>
> NS -> Group -> Snapshot -> Images
>
> and then simply reset when proceeding from one snapshot to the next? the
> scope of "in-memory chunk digests" would still be bounded (in a way that
> is in line with how we do it in many other parts, like when doing backup
> + restore), and the structure of this helper entity and the way it is
> iterated over would feel much more natural.
The type signature looks more complex than what it is in the end, but I
do agree that this can be reduced/optimized.
>
> we could go one step further, but that might eat some of our performance
> gains here
>
> - list all images like we did before and store the result in a
> collection that allows fast removal
> - iterate normally over the datastore in a structured fashion using the
> existing helpers
> -- when proceeding from one snapshot to the next, use the new "don't
> retouch chunks" logic
> -- remove each visited image path from the list of images
> - if any images are left at the end in that list, mark those manually
> (strange or vanished paths, should hopefully be empty or irrelevant)
>
> the main downside is that we'd have to iterate twice (well, not quite
> twice, since the hierarchical iterators skip parts outside of the
> "known" hierarchy), but we would save all this custom parse-back logic
> here. if we keep the parse-back logic, then I think we want to have a
> logical structure as well that follows how we normally do things.
>
> I think that the list_images part of GC is by far the least expensive
> though (for my test datastore where the GC runtime goes down from 52 to
> 17s with your patch series, listing images is 12ms of that ;)), so
> effectively doing it twice might not hurt that much in practice..
I do like this approach more, as this will also detect snapshots as
"strange paths" when the helpers themself do fail to detect an index
file for some reason, not just the snapshot being off.
So I will reiterate the patches using this approach, thanks!
More information about the pbs-devel
mailing list