[pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs
Gabriel Goller
g.goller at proxmox.com
Mon Nov 4 17:02:21 CET 2024
On 04.11.2024 12:51, Fabian Grünbichler wrote:
>this doesn't really do what it says on the tin, see below.
>
>On October 18, 2024 11:09 am, Gabriel Goller wrote:
>> [snip]
>> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
>> index 414ec878d01a..c86fbb7568ab 100644
>> --- a/pbs-datastore/src/backup_info.rs
>> +++ b/pbs-datastore/src/backup_info.rs
>> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
>> use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>>
>> use pbs_api_types::{
>> - Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> + Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
>> + BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> };
>> use pbs_config::{open_backup_lockfile, BackupLockGuard};
>>
>> @@ -583,6 +584,16 @@ impl BackupDir {
>>
>> Ok(())
>> }
>> +
>> + /// Load the verify state from the manifest.
>> + pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
>> + self.load_manifest().and_then(|(m, _)| {
>> + let verify = m.unprotected["verify_state"].clone();
>> + serde_json::from_value::<SnapshotVerifyState>(verify)
>> + .map(|svs| svs.state)
>> + .map_err(Into::into)
>> + })
>> + }
>
>wouldn't it make more sense to have this as a getter for an optional
>SnapshotVerifyState on the BackupManifest?
>
>then it could go into its own commit, other call sites that load the
>verify state from a manifest could be adapted to it, and then this
>commit can also start using it?
You're right, the BackupManifest is actually loaded twice here. So I
could move this function to a getter on BackupManifest and then move its
construction (src/server/pull.rs:396):

    let manifest = BackupManifest::try_from(tmp_manifest_blob)?;

up to src/server/pull.rs:365. And instead of using the try_block to read
from the manifest, just call BackupManifest::try_from directly.
I'll see if I can make this work...
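Roughly what I have in mind (untested sketch, exact name and signature
still open; it would return None when the snapshot has no verify_state
yet):

// in pbs-datastore/src/manifest.rs (SnapshotVerifyState comes from
// pbs_api_types, as in the hunk above)
impl BackupManifest {
    /// Get the verify state stored in the manifest's unprotected
    /// section, or None if no verification has run yet.
    pub fn verify_state(&self) -> Result<Option<SnapshotVerifyState>, anyhow::Error> {
        let verify = self.unprotected["verify_state"].clone();
        if verify.is_null() {
            return Ok(None);
        }
        Ok(Some(serde_json::from_value::<SnapshotVerifyState>(verify)?))
    }
}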
>also see the comment further below about how the current implementation
>is very noisy if snapshots are newly synced as opposed to resynced..
>
>> }
>>
>> impl AsRef<pbs_api_types::BackupNamespace> for BackupDir {
>> diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs
>> index 6fdc69a9e645..fa9db92f3d11 100644
>> --- a/src/api2/config/sync.rs
>> +++ b/src/api2/config/sync.rs
>> [snip]
>> diff --git a/src/server/pull.rs b/src/server/pull.rs
>> index 3117f7d2c960..b2dd15d9d6db 100644
>> --- a/src/server/pull.rs
>> +++ b/src/server/pull.rs
>> @@ -7,12 +7,14 @@ use std::sync::{Arc, Mutex};
>> use std::time::SystemTime;
>>
>> use anyhow::{bail, format_err, Error};
>> +use nom::combinator::verify;
>
>I think this snuck in ;)
Oops, my fault :)
>> [snip]
>> @@ -175,9 +182,10 @@ async fn pull_index_chunks<I: IndexFile>(
>> target.cond_touch_chunk(&info.digest, false)
>> })?;
>> if chunk_exists {
>> - //info!("chunk {} exists {}", pos, hex::encode(digest));
>> + //info!("chunk exists {}", hex::encode(info.digest));
>
>this
>
>> return Ok::<_, Error>(());
>> }
>> +
>
>and this as well?
Yep, removed both, thanks for the heads-up!
>> [snip]
>> @@ -325,13 +333,15 @@ async fn pull_single_archive<'a>(
>> /// - (Re)download the manifest
>> /// -- if it matches, only download log and treat snapshot as already synced
>> /// - Iterate over referenced files
>> -/// -- if file already exists, verify contents
>> +/// -- if file already exists, verify contents or pull again if last
>> +/// verification failed and `resync_corrupt` is true
>> /// -- if not, pull it from the remote
>> /// - Download log if not already existing
>> async fn pull_snapshot<'a>(
>> reader: Arc<dyn SyncSourceReader + 'a>,
>> snapshot: &'a pbs_datastore::BackupDir,
>> downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> + resync_corrupt: bool,
>> ) -> Result<SyncStats, Error> {
>> let mut sync_stats = SyncStats::default();
>> let mut manifest_name = snapshot.full_path();
>> @@ -352,6 +362,14 @@ async fn pull_snapshot<'a>(
>> return Ok(sync_stats);
>> }
>>
>
>I think this part here is somewhat wrong ordering wise, or at least,
>unnecessarily expensive..
>
>if resync_corrupt is enabled, we want to (in this order!)
>- check the local snapshot for corruption, if it exists
>- if it is corrupt, we proceed with resyncing
>- if not, we only proceed with resyncing if it is the last snapshot in
> this group, and return early otherwise
>
>that way, we avoid redownloading all the manifests.. but see further
>below for another issue with the current implementation..
>
>> + let must_resync_existing = resync_corrupt
>> + && snapshot
>> + .verify_state()
>> + .inspect_err(|err| {
>> + tracing::error!("Failed to check verification state of snapshot: {err:?}")
>
>2024-11-04T12:34:57+01:00: Failed to check verification state of snapshot: unable to load blob '"/tank/pbs/ns/foobar/ns/test/ns/another_test/vm/900/2023-04-06T14:36:00Z/index.json.blob"' - No such file or directory (os error 2)
>
>this seems to be very noisy for newly synced snapshots, because the
>helper is implemented on BackupInfo instead of on BackupManifest..
True, but I think this is mostly because new backups don't have a
verify_state yet (and a missing verify_state doesn't necessarily mean
the snapshot is bad). Removed the log line as it's just noise anyway :)
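With the getter above returning an Option, the call site could also
treat a missing manifest or missing verify_state simply as "not
corrupt", so newly synced snapshots don't log anything at all (rough
sketch, untested):

let must_resync_existing = resync_corrupt
    && snapshot
        .load_manifest()
        .ok()
        .and_then(|(manifest, _)| manifest.verify_state().ok().flatten())
        .is_some_and(|svs| svs.state == VerifyState::Failed);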
>> + })
>> + .is_ok_and(|state| state == VerifyState::Failed);
>> +
>> if manifest_name.exists() {
>> let manifest_blob = proxmox_lang::try_block!({
>> let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
>> [snip]
>> @@ -528,6 +554,10 @@ async fn pull_group(
>> .enumerate()
>> .filter(|&(pos, ref dir)| {
>> source_snapshots.insert(dir.time);
>> + // If resync_corrupt is set, we go through all the remote snapshots
>> + if params.resync_corrupt {
>> + return true;
>> + }
>
>alternatively, we could check the local manifest here, and only include
>existing snapshots with a failed verification state, the last one and
>new ones? that way, we'd get more meaningful progress stats as well..
That's true, that would be cleaner. The downside is that we would have
to open/parse the BackupManifest twice.
I could write something like:

if params.resync_corrupt {
    let local_dir = params
        .target
        .store
        .backup_dir(target_ns.clone(), dir.clone());
    if let Ok(local_dir) = local_dir {
        if let Ok(VerifyState::Failed) = local_dir.verify_state() {
            return true;
        }
    }
}
>because right now, this will not only resync existing corrupt snapshots,
>but also ones that have been pruned locally, but not on the source
>(i.e., the other proposed "fixing" sync mode that syncs "missing"
>old snapshots, not just corrupt ones).
I couldn't find the mail where this was mentioned/discussed, but I'm
quite sure we agreed to just pull both and maybe separate them in a
future iteration/feature. Now that I think about it, though, the flag is
named `resync_corrupt`, so I'd expect it to only pull in the corrupt
snapshots. I actually agree with this change; it is probably also more
performant (reading the local BackupManifest twice should be faster than
pulling lots of unneeded manifests from the remote).
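Put together, the filter could then look roughly like this (sketch only,
reusing the verify_state helper from the snippet above); corrupt local
snapshots are always included, everything else falls through to the
existing last-sync check, so locally pruned snapshots are not re-pulled:

.filter(|&(pos, ref dir)| {
    source_snapshots.insert(dir.time);
    if params.resync_corrupt {
        if let Ok(local_dir) = params
            .target
            .store
            .backup_dir(target_ns.clone(), dir.clone())
        {
            if let Ok(VerifyState::Failed) = local_dir.verify_state() {
                return true;
            }
        }
    }
    if last_sync_time > dir.time {
        already_synced_skip_info.update(dir.time);
        return false;
    }
    // ... rest of the existing filter stays unchanged ...
    true
})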
>> if last_sync_time > dir.time {
>> already_synced_skip_info.update(dir.time);
>> return false;
>> @@ -566,7 +596,13 @@ async fn pull_group(
>> .source
>> .reader(source_namespace, &from_snapshot)
>> .await?;
>> - let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await;
>> + let result = pull_snapshot_from(
>> + reader,
>> + &to_snapshot,
>> + downloaded_chunks.clone(),
>> + params.resync_corrupt,
>> + )
>> + .await;
>>
>> progress.done_snapshots = pos as u64 + 1;
>> info!("percentage done: {progress}");