[pbs-devel] [PATCH proxmox-backup 4/9] fix #5982: garbage collection: check atime updates are honored
Christian Ebner
c.ebner at proxmox.com
Mon Mar 3 17:09:10 CET 2025
On 3/3/25 16:50, Fabian Grünbichler wrote:
> On March 3, 2025 4:23 pm, Christian Ebner wrote:
>> On 3/3/25 14:51, Fabian Grünbichler wrote:
>>> On February 19, 2025 5:48 pm, Christian Ebner wrote:
>>>> Check if the filesystem backing the chunk store actually updates the
>>>> atime to avoid potential data loss in phase 2 of garbage collection
>>>> in case the atime update is not honored.
>>>>
>>>> Perform the check before phase 1 of garbage collection, as well as
>>>> on datastore creation. The latter serves to detect early and disallow
>>>> datastore creation on filesystem configurations which would otherwise
>>>> most likely lead to data loss.
>>>>
>>>> Enable the atime update check by default, but allow opting out by
>>>> setting a datastore tuning parameter flag for backwards compatibility.
>>>> This is honored by both garbage collection and datastore creation.
>>>>
>>>> The check uses a 4 MiB fixed-size, unencrypted and compressed chunk
>>>> as test marker, inserted if not present. This all-zero chunk is very
>>>> likely present anyway for unencrypted backup contents with large
>>>> all-zero regions using fixed-size chunking (e.g. VMs).
>>>>
>>>> The check sets the chunk's atime twice, once to 1 second into the
>>>> past, and once to now. By this one can detect whether the atime is
>>>> set at all and whether it is also set if the filesystem is e.g.
>>>> mounted via relatime, which is intended to defer atime updates.
>>>
>>> that's not how relatime works? see below..
>>>
>>>>
>>>> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=5982
>>>> Signed-off-by: Christian Ebner <c.ebner at proxmox.com>
>>>> ---
>>>> changes since version 1:
>>>> - Use all-zero fixed size chunk for atime update test
>>>> - Test on datastore creation and before garbage collection phase 1
>>>> - fail datastore creation and garbage collection jobs if check fails,
>>>> but keep opt-out
>>>>
>>>> pbs-datastore/src/chunk_store.rs | 80 ++++++++++++++++++++++++++++++--
>>>> pbs-datastore/src/datastore.rs | 10 ++++
>>>> src/api2/config/datastore.rs | 1 +
>>>> 3 files changed, 86 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_store.rs
>>>> index 7bdcb0297..be57e3c87 100644
>>>> --- a/pbs-datastore/src/chunk_store.rs
>>>> +++ b/pbs-datastore/src/chunk_store.rs
>>>> @@ -1,6 +1,7 @@
>>>> use std::os::unix::io::AsRawFd;
>>>> use std::path::{Path, PathBuf};
>>>> use std::sync::{Arc, Mutex};
>>>> +use std::time::{Duration, SystemTime, UNIX_EPOCH};
>>>>
>>>> use anyhow::{bail, format_err, Error};
>>>> use tracing::info;
>>>> @@ -13,6 +14,7 @@ use proxmox_sys::process_locker::{
>>>> };
>>>> use proxmox_worker_task::WorkerTaskContext;
>>>>
>>>> +use crate::data_blob::DataChunkBuilder;
>>>> use crate::file_formats::{
>>>> COMPRESSED_BLOB_MAGIC_1_0, ENCRYPTED_BLOB_MAGIC_1_0, UNCOMPRESSED_BLOB_MAGIC_1_0,
>>>> };
>>>> @@ -96,6 +98,7 @@ impl ChunkStore {
>>>> uid: nix::unistd::Uid,
>>>> gid: nix::unistd::Gid,
>>>> sync_level: DatastoreFSyncLevel,
>>>> + atime_update_check: bool,
>>>> ) -> Result<Self, Error>
>>>> where
>>>> P: Into<PathBuf>,
>>>> @@ -150,7 +153,16 @@ impl ChunkStore {
>>>> }
>>>> }
>>>>
>>>> - Self::open(name, base, sync_level)
>>>> + let chunk_store = Self::open(name, base, sync_level)?;
>>>> + if atime_update_check {
>>>> + chunk_store
>>>> + .check_fs_atime_update()
>>>> + .map(|()| info!("atime update check successful."))?;
>>>> + } else {
>>>> + info!("atime update check skipped.");
>>>> + }
>>>> +
>>>> + Ok(chunk_store)
>>>> }
>>>>
>>>> fn lockfile_path<P: Into<PathBuf>>(base: P) -> PathBuf {
>>>> @@ -445,6 +457,51 @@ impl ChunkStore {
>>>> Ok(())
>>>> }
>>>>
>>>> + /// Check if atime updates are honored by the filesystem backing the chunk store.
>>>> + ///
>>>> + /// Sets the atime twice, once into the past by one second, checking that this was honored and
>>>> + /// then to now again, as otherwise it is not guaranteed to also cover time ranges by `relatime`.
>>>> + /// Uses a 4 MiB fixed size, compressed but unencrypted chunk to test. The chunk is inserted in
>>>> + /// the chunk store if not yet present.
>>>> + pub fn check_fs_atime_update(&self) -> Result<(), Error> {
>>>> + let (zero_chunk, digest) = DataChunkBuilder::build_zero_chunk(None, 4096 * 1024, true)?;
>>>> + self.insert_chunk(&zero_chunk, &digest)?;
>>>> +
>>>> + let past = SystemTime::now().duration_since(UNIX_EPOCH)? - Duration::from_secs(1);
>>>> + let times: [libc::timespec; 2] = [
>>>> + libc::timespec {
>>>> + tv_sec: past.as_secs().try_into()?,
>>>> + tv_nsec: past.subsec_nanos().into(),
>>>> + },
>>>> + libc::timespec {
>>>> + tv_sec: 0,
>>>> + tv_nsec: UTIME_OMIT,
>>>> + },
>>>> + ];
>>>
>>> this sets *atime* one second in the past, while not touching *mtime*.
>>> this is not something we ever need? if this fails, something is weird,
>>> but it wouldn't mean that atime handling was broken for our use case..
>>>
>>>> +
>>>> + use nix::NixPath;
>>>> + let (path, _digest) = self.chunk_path(&digest);
>>>> + path.with_nix_path(|cstr| unsafe {
>>>> + let tmp = libc::utimensat(-1, cstr.as_ptr(), &times[0], libc::AT_SYMLINK_NOFOLLOW);
>>>> + nix::errno::Errno::result(tmp)
>>>> + })??;
>>>> +
>>>> + let metadata_past = std::fs::metadata(&path).map_err(Error::from)?;
>>>> + let atime_past = metadata_past.accessed()?;
>>>> + if past != atime_past.duration_since(UNIX_EPOCH)? {
>>>> + bail!("setting atime to the past failed, is atime support enabled on datastore backing filesystem?");
>>>> + }
>>>> +
>>>> + self.cond_touch_chunk(&digest, true)?;
>>>> +
>>>> + let metadata_now = std::fs::metadata(&path).map_err(Error::from)?;
>>>> + let atime_now = metadata_now.accessed()?;
>>>> + if atime_past >= atime_now {
>>>> + bail!("atime update check failed, is atime support enabled on datastore backing filesystem?");
>>>> + }
>>>
>>> if something implements explicit timestamp updates to work like relatime
>>> does for read-triggered updates, this would still potentially fail -
>>> relatime covers checking *mtime* (which we haven't updated above!) first
>>> to see if *atime* should be updated when doing reads. it's basically a
>>> performance optimization (skip most atime updates for read-heavy work
>>> loads) without breaking a very specific use case (that of mutt abusing
>>> atime to know when it last read a local mailbox[0], which would break if
>>> atime were never updated on reads).
>>
>> Thanks a lot for the pointer! Need to dig a bit deeper into how relatime
>> works, as my understanding of it is clearly flawed.
>>
>>> I think what we actually want to do at GC time is:
>>>
>>> (potentially) insert the chunk
>>> stat the chunk
>>> touch the chunk (regularly)
>>> stat the chunk again
>>> compare the timestamps
>>
>> That was my initial approach, which failed because of the timestamp
>> granularity. Since I didn't want to wait in between the stat and the
>> touch, the approach of setting the atime into the past first came to
>> mind, hence my implementation. But based on your feedback I see that
>> this does not work.
>
> did you actually hit the granularity issue in practice? with which
> storage/file system?
Hmm, I had a datastore on ZFS with atime on, which triggered the check
for me on garbage collection; with an additional delay of 1 second,
however, it worked... Might have been an implementation error, though...
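
Roughly, the plain stat/touch/stat sequence with an explicit delay could
look like the following sketch, reusing the existing chunk_path() and
cond_touch_chunk() helpers (illustrative only; the function name and error
message are placeholders and the all-zero chunk is assumed to already have
been inserted via insert_chunk()):

// Sketch: verify that a regular touch (same code path as GC phase 1)
// moves the chunk's atime forward.
fn check_fs_atime_update_sketch(&self, digest: &[u8; 32]) -> Result<(), Error> {
    let (path, _digest_str) = self.chunk_path(digest);
    let atime_before = std::fs::metadata(&path)?.accessed()?;

    // wait to avoid spurious failures on filesystems with coarse
    // timestamp granularity; one second is negligible for a GC run
    std::thread::sleep(std::time::Duration::from_secs(1));

    self.cond_touch_chunk(digest, true)?;

    let atime_after = std::fs::metadata(&path)?.accessed()?;
    if atime_after <= atime_before {
        bail!("atime update check failed, is atime support enabled on the backing filesystem?");
    }
    Ok(())
}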
>
> because even with relatime or noatime on ext4 a sequence of
>
> stat foobar; touch -a foobar; stat foobar; touch -a foobar; stat foobar
>
> will give me different atimes for each stat. the same is true for ZFS no
> matter which atime settings I give the dataset.
>
>>> we now have three cases:
>>>
>>> A) the old atime is before the new atime => the storage doesn't discard
>>> *all* atime updates and we can proceed
>>> B) the old atime and the new atime are identical and atime is *before*
>>> mtime and ctime => atime should have been updated even under
>>> "relatime-like" semantics, very likely this storage is discarding all
>>> atime updates, we should abort with an error
>>> C) the old atime and the new atime are identical, but atime is *after*
>>> ctime/mtime => we need to check whether we just ran into relatime-like
>>> behaviour
>>>
>>> C means we either abort as well (if we don't want to support such
>>> setups), or we try to detect if setting both mtime and atime works, and
>>> force the GC into doing that
>>>
>>> we might want to add a "wait a second" between the first stat and the
>>> touching to account for filesystems with more granular timestamps
>>> (inserting touches, and other tasks might touch at "the same time" as
>>> well by accident), since the requested timestamps might be modified to
>>> become "the greatest value supported by the filesystem that is not
>>> greater than the specified time" (utimensat(2)), and one second doesn't
>>> really hurt in the scope of a GC..
>>>
>>> note that the 24h grace period is totally independent of this - it comes
>>> from lazytime and relates to how often changes of timestamps are
>>> *persisted on disk*. that shouldn't affect us for local storages (since
>>> we talk to the kernel and always get the updated view anyway, no matter
>>> if it is in-memory or on-disk. if the PBS system crashes GC is gone as
>>> well and needs to start over, no risk of inconsistency).
>>>
>>> we might have a problem if the storage uses lazytime internally and
>>> lives somewhere else
>>> - GC starts phase1
>>> - updates a few atimes
>>> - storage crashes and loses some of the updates while I/O to it blocks
>>> - storage comes back up
>>> - GC continues not knowing that some of the atime updates have been lost
>>> in the ether
>>> - GC enters phase2
>>> - chunk timestamp makes it eligible for deletion, even though it was
>>> touched in phase1
>>> - data loss
>>>
>>> I don't think we can really prevent this other than by ditching the
>>> whole atime mechanism and instead keep a list of written chunks in each
>>> backup writer, and generate such a list in phase1, merging them all and
>>> using that for phase2 (which is really expensive as it scales with the
>>> number of touched chunks and would require syncing up writers and GC).
>>> thankfully I'd expect most such failure scenarios to actually abort the
>>> GC anyway for I/O error reasons..
>>
>> Maybe one could use `statmount` to determine if the `SB_LAZYTIME` flag
>> in `smbuf.sb_flags` is set? But I assume that you meant the issue is
>> that this is not exposed by the filesystem...
>
> the local mount having lazytime doesn't matter (if it does, that means
> the local kernel handles the consistency and if the kernel goes
> away/crashes, so does the GC). the issue would be some storage appliance
> using it (or something with similar semantics) internally that would
> cause atime updates to "work" but then disappear later without
> interrupting the GC. I don't see how we can protect against that without
> switching to a different, more expensive mechanism.
Hmm, okay. But at least one should be able to distinguish local
filesystems from NAS by checking the /proc/mounts output, so as not to
add this penalty to users running datastores on local filesystems.
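
A rough sketch of such a heuristic, matching the datastore path against the
filesystem type of its mount entry in /proc/mounts (the type list and the
skipped handling of octal escapes in mount points are illustrative
shortcuts):

// Sketch: decide whether a path is backed by a network filesystem by
// looking at the filesystem type of the longest matching mount point
// in /proc/mounts.
fn is_network_filesystem(path: &std::path::Path) -> Result<bool, anyhow::Error> {
    let network_fs = ["nfs", "nfs4", "cifs", "smb3", "ceph", "fuse.sshfs", "glusterfs"];
    let mounts = std::fs::read_to_string("/proc/mounts")?;

    let mut best: Option<(usize, &str)> = None;
    for line in mounts.lines() {
        let mut fields = line.split_whitespace();
        let (Some(_dev), Some(mountpoint), Some(fstype)) =
            (fields.next(), fields.next(), fields.next())
        else {
            continue;
        };
        // longest mount point that is a prefix of the given path wins
        if path.starts_with(mountpoint) && best.map_or(true, |(len, _)| mountpoint.len() > len) {
            best = Some((mountpoint.len(), fstype));
        }
    }

    Ok(best.map_or(false, |(_, fstype)| network_fs.contains(&fstype)))
}

That of course only covers the obvious local vs. network mount case; a
storage appliance exporting via NFS but using lazytime-like semantics
internally would still not be detectable this way.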