[pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs
Gabriel Goller
g.goller at proxmox.com
Thu Oct 17 15:20:10 CEST 2024
On 17.10.2024 08:15, Thomas Lamprecht wrote:
>On 15/10/2024 at 15:18, Gabriel Goller wrote:
>> This option allows us to "fix" corrupt snapshots (and/or their chunks)
>> by pulling them from another remote. When traversing the remote
>> snapshots, we check whether each one exists locally and, if it does, whether
>> its last verification failed. If the local snapshot is broken and the
>> `resync-corrupt` option is turned on, we pull in the remote snapshot,
>> overwriting the local one.
>>
>> This is very useful and has been requested a lot, as there is currently
>> no way to "fix" corrupt chunks/snapshots even if the user has a healthy
>> version of them on their offsite instance.
>>
>> Signed-off-by: Gabriel Goller <g.goller at proxmox.com>
>> Originally-by: Shannon Sterz <s.sterz at proxmox.com>
>
>those two trailers probably should be switched to uphold causal ordering.
Makes sense.
>In general, it would be _really_ nice to get some regression testing for
>sync jobs covering the selection with different matching and options.
>The lack of that doesn't have to block your/shannon's patch, but IMO it
>should be added rather sooner than later. Sync is becoming more and more
>complex, and the selection of snapshots is a very important and fundamental
>part of it.
Do you think we could mock ourselves out of this and unit-test it
somehow? For full regression tests we will need Lukas's CI work.
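Something along these lines is what I had in mind, purely a sketch with
made-up names (SnapshotInfo, select_for_pull), assuming the selection logic
gets factored out into a pure helper so it can be exercised without a
datastore:

struct SnapshotInfo {
    time: i64,
}

// Hypothetical helper: returns the positions of the remote snapshots that
// should be pulled, mirroring the filter in pull_group().
fn select_for_pull(
    snapshots: &[SnapshotInfo],
    last_sync_time: i64,
    resync_corrupt: bool,
) -> Vec<usize> {
    snapshots
        .iter()
        .enumerate()
        .filter(|&(_, snap)| {
            if resync_corrupt {
                return true; // consider every remote snapshot when resyncing
            }
            snap.time > last_sync_time
        })
        .map(|(pos, _)| pos)
        .collect()
}

#[test]
fn resync_corrupt_considers_already_synced_snapshots() {
    let snapshots = vec![SnapshotInfo { time: 1 }, SnapshotInfo { time: 2 }];
    // without resync-corrupt, only snapshots newer than the last sync are picked
    assert_eq!(select_for_pull(&snapshots, 1, false), vec![1usize]);
    // with resync-corrupt, every remote snapshot is considered
    assert_eq!(select_for_pull(&snapshots, 1, true), vec![0usize, 1]);
}

Something in that direction would at least cover the selection logic without
needing the full CI setup.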
>> ---
>> pbs-api-types/src/jobs.rs | 11 +++++
>> pbs-datastore/src/backup_info.rs | 13 +++++-
>> src/api2/config/sync.rs | 4 ++
>> src/api2/pull.rs | 9 +++-
>> src/bin/proxmox-backup-manager.rs | 4 +-
>> src/server/pull.rs | 68 +++++++++++++++++++++----------
>> 6 files changed, 84 insertions(+), 25 deletions(-)
>>
>> diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
>> index 868702bc059e..a3cd68bf5afd 100644
>> --- a/pbs-api-types/src/jobs.rs
>> +++ b/pbs-api-types/src/jobs.rs
>> @@ -498,6 +498,11 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>> .minimum(1)
>> .schema();
>>
>> +pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new(
>> + "If the verification failed for a local snapshot, try to pull its chunks again.",
>
>not only chunks, but also its indexes (which might also be a source of a bad
>verification result, I think)
Yes, correct, I adjusted the description above.
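It now reads roughly like this (the final wording in v2 might still differ):

pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new(
    "If the verification failed for a local snapshot, try to pull its indexes and chunks again.",
)
.schema();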
>> +)
>> +.schema();
>> +
>> #[api(
>> properties: {
>> id: {
>> @@ -552,6 +557,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>> schema: TRANSFER_LAST_SCHEMA,
>> optional: true,
>> },
>> + "resync-corrupt": {
>> + schema: RESYNC_CORRUPT_SCHEMA,
>> + optional: true,
>> + }
>> }
>> )]
>> #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
>> @@ -585,6 +594,8 @@ pub struct SyncJobConfig {
>> pub limit: RateLimitConfig,
>> #[serde(skip_serializing_if = "Option::is_none")]
>> pub transfer_last: Option<usize>,
>> + #[serde(skip_serializing_if = "Option::is_none")]
>> + pub resync_corrupt: Option<bool>,
>> }
>>
>> impl SyncJobConfig {
>> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
>> index 414ec878d01a..2d283dbe5cd9 100644
>> --- a/pbs-datastore/src/backup_info.rs
>> +++ b/pbs-datastore/src/backup_info.rs
>> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
>> use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>>
>> use pbs_api_types::{
>> - Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> + Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
>> + BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> };
>> use pbs_config::{open_backup_lockfile, BackupLockGuard};
>>
>> @@ -583,6 +584,16 @@ impl BackupDir {
>>
>> Ok(())
>> }
>> +
>> + /// Load the verify state from the manifest.
>> + pub fn verify_state(&self) -> Option<VerifyState> {
>> + self.load_manifest().ok().and_then(|(m, _)| {
>
>hmm, just ignoring errors here seems a bit odd; that might be a case where one
>might want to re-sync. So I'd prefer wrapping this in a Result, even if it's a bit
>tedious.
Converted it to (and returned) a `Result<VerifyState, anyhow::Error>` for now,
but I'd be wary about treating a failed `load_manifest` the same as a
`VerifyState::Failed`, especially because an unverified snapshot currently
just returns `Err(null...)`. So we would first need to extend `VerifyState`
with an `Unknown` variant.
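For reference, roughly what it looks like now (the error type and the exact
manifest handling might still change; the Unknown variant below is only to
illustrate the point and is not part of this series):

/// Load the verify state from the manifest.
pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
    let (manifest, _) = self.load_manifest()?;
    // assuming the verify state sits in the unprotected part of the manifest
    let verify = manifest.unprotected["verify_state"].clone();
    Ok(serde_json::from_value::<SnapshotVerifyState>(verify)?.state)
}

// possible future extension, not part of this patch:
// pub enum VerifyState { Ok, Failed, Unknown }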
>> @@ -332,6 +339,7 @@ async fn pull_snapshot<'a>(
>> reader: Arc<dyn SyncSourceReader + 'a>,
>> snapshot: &'a pbs_datastore::BackupDir,
>> downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> + resync_corrupt: bool,
>
>would be good to update the doc comment of this method to mention the new param
>in the description of steps
done.
>> ) -> Result<SyncStats, Error> {
>> let mut sync_stats = SyncStats::default();
>> let mut manifest_name = snapshot.full_path();
>> @@ -352,6 +360,8 @@ async fn pull_snapshot<'a>(
>> return Ok(sync_stats);
>> }
>>
>> + let snapshot_is_corrupt = snapshot.verify_state() == Some(VerifyState::Failed);
>
>Hmm, but currently you'd only need to get the state if resync_corrupt is true?
>Or at least if the local manifest already exists, if we want to allow re-syncing
>on some other conditions (but that can be adapted to that once required).
>
>e.g.:
>
>let must_resync_existing = resync_corrupt && snapshot.verify_state() == Some(VerifyState::Failed);
>
>That would make the checks below a bit easier to read.
Yep, done as well, this makes all the checks easier to read.
>And would it make sense to check the downloaded manifest's verify state, no point
>in re-pulling if the other side is also corrupt, or is that possibility already
>excluded somewhere else in this code path?
This should be covered, because:
- if the index is broken, we can't even open it
- if an unencrypted chunk is corrupted, the crc is checked when reading
it (pbs-client/src/remote_chunk_reader.rs:55)
- if an encrypted chunk is corrupted, the crc is checked as well
(pbs-datastore/src/data_blob.rs:81)
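Illustration only (not the actual reader code, compute_crc32 is just a
placeholder), but the pattern in both places boils down to:

fn check_chunk_crc(
    data: &[u8],
    stored_crc: u32,
    compute_crc32: impl Fn(&[u8]) -> u32,
) -> Result<(), anyhow::Error> {
    // a corrupt remote chunk fails here and never reaches the local datastore
    if compute_crc32(data) != stored_crc {
        anyhow::bail!("chunk CRC mismatch, refusing corrupt chunk from remote");
    }
    Ok(())
}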
>> +
>> if manifest_name.exists() {
>> let manifest_blob = proxmox_lang::try_block!({
>> let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
>> @@ -365,7 +375,9 @@ async fn pull_snapshot<'a>(
>> format_err!("unable to read local manifest {manifest_name:?} - {err}")
>> })?;
>>
>> - if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
>> + if (manifest_blob.raw_data() == tmp_manifest_blob.raw_data())
>
>unnecessary addition of parentheses; a == b && (!c && !d) would be fine too and less
>noisy to read, so to say.
Changed.
>> + && (!resync_corrupt && !snapshot_is_corrupt)
>
>This is equal to `.. && !(resync_corrupt || snapshot_is_corrupt)`, which makes it
>slightly easier to read for me, and I'm slightly confused as why it's not
>`.. && !(resync_corrupt && snapshot_is_corrupt)`, i.e., with the new variable above:
>
>if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
> // ...
>}
>
>Or why would the expression need to be different here?
Changed this as well after combining the flags.
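With the combined flag the whole check now reads roughly like this (sketch,
treating a manifest load error as "not failed" for now, see above):

let must_resync_existing =
    resync_corrupt && snapshot.verify_state().ok() == Some(VerifyState::Failed);

if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
    if !client_log_name.exists() {
        reader.try_download_client_log(&client_log_name).await?;
    };
    // ... unchanged early-return path ...
}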
>> + {
>> if !client_log_name.exists() {
>> reader.try_download_client_log(&client_log_name).await?;
>> };
>> @@ -381,7 +393,7 @@ async fn pull_snapshot<'a>(
>> let mut path = snapshot.full_path();
>> path.push(&item.filename);
>>
>> - if path.exists() {
>> + if !(resync_corrupt && snapshot_is_corrupt) && path.exists() {
>> match ArchiveType::from_path(&item.filename)? {
>> ArchiveType::DynamicIndex => {
>> let index = DynamicIndexReader::open(&path)?;
>> @@ -416,6 +428,10 @@ async fn pull_snapshot<'a>(
>> }
>> }
>>
>> + if resync_corrupt && snapshot_is_corrupt {
>> + info!("pulling snapshot {} again", snapshot.dir());
>
>maybe include the reason for re-pulling this here, e.g. something like:
>
>"re-syncing snapshot {} due to bad verification result"
>
>And wouldn't that get logged multiple times for backups with multiple archives,
>e.g. guests with multiple disks? It probably makes more sense to log that before
>the for loop.
Yep, you're right, moved it before the loop!
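So it now logs once per snapshot before going through the archives, roughly
(manifest variable name shortened here):

if must_resync_existing {
    info!(
        "re-syncing snapshot {} due to bad verification result",
        snapshot.dir()
    );
}

for item in manifest.files() {
    // ... per-archive handling as before ...
}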
>> + }
>> +
>> let stats =
>> pull_single_archive(reader.clone(), snapshot, item, downloaded_chunks.clone()).await?;
>> sync_stats.add(stats);
>> @@ -443,6 +459,7 @@ async fn pull_snapshot_from<'a>(
>> reader: Arc<dyn SyncSourceReader + 'a>,
>> snapshot: &'a pbs_datastore::BackupDir,
>> downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> + resync_corrupt: bool,
>> ) -> Result<SyncStats, Error> {
>> let (_path, is_new, _snap_lock) = snapshot
>> .datastore()
>> @@ -451,7 +468,7 @@ async fn pull_snapshot_from<'a>(
>> let sync_stats = if is_new {
>> info!("sync snapshot {}", snapshot.dir());
>>
>> - match pull_snapshot(reader, snapshot, downloaded_chunks).await {
>> + match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await {
>> Err(err) => {
>> if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir(
>> snapshot.backup_ns(),
>> @@ -469,7 +486,7 @@ async fn pull_snapshot_from<'a>(
>> }
>> } else {
>> info!("re-sync snapshot {}", snapshot.dir());
>> - pull_snapshot(reader, snapshot, downloaded_chunks).await?
>> + pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await?
>> };
>>
>> Ok(sync_stats)
>> @@ -528,21 +545,24 @@ async fn pull_group(
>> .enumerate()
>> .filter(|&(pos, ref dir)| {
>> source_snapshots.insert(dir.time);
>> - if last_sync_time > dir.time {
>> - already_synced_skip_info.update(dir.time);
>> - return false;
>> - } else if already_synced_skip_info.count > 0 {
>> - info!("{already_synced_skip_info}");
>> - already_synced_skip_info.reset();
>> - return true;
>> - }
>> + // If resync_corrupt is set, we go through all the remote snapshots
>> + if !params.resync_corrupt {
>
>maybe reverse that and use an explicit return? i.e.:
>
>if params.resync_corrupt {
> return true; // always check all remote snapshots when resyncing
>}
>
>// rest of the filter
>
>would avoid an indentation level for the non-bypass path and be slightly easier
>to read to me
True, fixed this as well.
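The filter now starts out like this (rest of it unchanged):

.filter(|&(pos, ref dir)| {
    source_snapshots.insert(dir.time);
    // always check all remote snapshots when resyncing
    if params.resync_corrupt {
        return true;
    }
    if last_sync_time > dir.time {
        already_synced_skip_info.update(dir.time);
        return false;
    } else if already_synced_skip_info.count > 0 {
        info!("{already_synced_skip_info}");
        already_synced_skip_info.reset();
        return true;
    }
    // ... rest of the filter ...
})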
Thanks for the review!