[pve-devel] [RFC proxmox 4/7] cache: add new crate 'proxmox-cache'

Wolfgang Bumiller w.bumiller at proxmox.com
Tue Aug 22 13:56:21 CEST 2023


On Tue, Aug 22, 2023 at 12:08:38PM +0200, Max Carrara wrote:
> On 8/21/23 15:44, Lukas Wagner wrote:
> > For now, it contains a file-backed cache with expiration logic.
> > The cache should be safe to be accessed from multiple processes at
> > once.
> > 
> 
> This seems pretty neat! The cache implementation seems straightforward
> enough. I'll see if I can test it more thoroughly later.
> 
> However, in my opinion we should have a crate like
> "proxmox-collections" (or something of the sort) with modules for each
> data structure / collection similar to the standard library; I'm
> curious what others think about that. imo it would be a great
> opportunity to introduce that crate in this series, since you're
> already introducing one for the cache anyway.
> 
> So, proxmox-collections would look something like this:
> 
>   proxmox-collections
>   └── src
>       ├── cache
>       │   ├── mod.rs
>       │   └── shared_cache.rs
>       └── lib.rs
> 
> Let me know what you think!

Smaller crates help with compile time; I don't want to pull in a big
collections crate if I only need a single type of collection from it ;-)

> 
> > The cache stores values in a directory, based on the key.
> > E.g. key "foo" results in a file 'foo.json' in the given base
> > directory. If a new value is set, the file is atomically replaced.
> > The JSON file also contains some metadata, namely 'added_at' and
> > 'expire_in' - they are used for cache expiration.
> > 
> > Note: This cache is not suited to applications that
> >  - Might want to cache huge amounts of data, and/or access the cache
> >    very frequently (due to the overhead of JSON de/serialization)
> >  - Require arbitrary keys - right now, keys are limited by
> >    SAFE_ID_REGEX
> > 
> > The cache was developed for the use in pvestatd, in order to cache
> > e.g. storage plugin status. There, these limitations do not really
> > play any role.
> > 
> > Signed-off-by: Lukas Wagner <l.wagner at proxmox.com>
> > ---
> >  Cargo.toml                            |   1 +
> >  proxmox-cache/Cargo.toml              |  20 ++
> >  proxmox-cache/examples/performance.rs |  82 ++++++++
> >  proxmox-cache/src/lib.rs              |  40 ++++
> >  proxmox-cache/src/shared_cache.rs     | 263 ++++++++++++++++++++++++++
> >  5 files changed, 406 insertions(+)
> >  create mode 100644 proxmox-cache/Cargo.toml
> >  create mode 100644 proxmox-cache/examples/performance.rs
> >  create mode 100644 proxmox-cache/src/lib.rs
> >  create mode 100644 proxmox-cache/src/shared_cache.rs
> > 
> > diff --git a/Cargo.toml b/Cargo.toml
> > index e334ac1..940e1d0 100644
> > --- a/Cargo.toml
> > +++ b/Cargo.toml
> > @@ -5,6 +5,7 @@ members = [
> >      "proxmox-async",
> >      "proxmox-auth-api",
> >      "proxmox-borrow",
> > +    "proxmox-cache",
> >      "proxmox-client",
> >      "proxmox-compression",
> >      "proxmox-http",
> > diff --git a/proxmox-cache/src/lib.rs b/proxmox-cache/src/lib.rs
> > new file mode 100644
> > index 0000000..d496dc7
> > --- /dev/null
> > +++ b/proxmox-cache/src/lib.rs
> > @@ -0,0 +1,40 @@
> > +use anyhow::Error;
> > +use serde_json::Value;
> > +
> > +pub mod shared_cache;
> > +
> > +pub use shared_cache::SharedCache;
> > +
> > +trait TimeProvider {

Also mentioned off list: this trait will be dropped.
For tests, a `#[cfg(test)]` alternative will be used for fetching the
time.

We just never need a different provider outside of tests, and storing it
as `Box<dyn TimeProvider>` immediately drops all the auto traits from
the cache struct, since trait objects *only* ever get the auto traits
you *explicitly* define.
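
Roughly like this (just a sketch; the `FAKE_TIME` helper is made up):

    // sketch: a #[cfg(test)]'d time source instead of a stored
    // Box<dyn TimeProvider>
    #[cfg(not(test))]
    fn now() -> i64 {
        proxmox_time::epoch_i64()
    }

    #[cfg(test)]
    thread_local! {
        // tests can set this to simulate expiry and clock jumps
        static FAKE_TIME: std::cell::Cell<i64> = std::cell::Cell::new(0);
    }

    #[cfg(test)]
    fn now() -> i64 {
        FAKE_TIME.with(|t| t.get())
    }

This also keeps `SharedCache` `Send + Sync` for free, instead of having
to spell out `Box<dyn TimeProvider + Send + Sync>`.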

> > +    /// Returns the current time as a UNIX epoch (second resolution)
> > +    fn now(&self) -> i64;
> > +}
> > +
> > +struct DefaultTimeProvider;
> > +
> > +impl TimeProvider for DefaultTimeProvider {
> > +    fn now(&self) -> i64 {
> > +        proxmox_time::epoch_i64()
> > +    }
> > +}
> > +
> > +pub trait Cache {
> > +    /// Set or insert a cache entry.
> > +    ///
> > +    /// If `expires_in` is set, this entry will expire in the desired number of
> > +    /// seconds.
> > +    fn set<S: AsRef<str>>(
> > +        &self,
> > +        key: S,
> > +        value: Value,
> > +        expires_in: Option<i64>,
> > +    ) -> Result<(), Error>;
> > +
> > +    /// Delete a cache entry.
> > +    fn delete<S: AsRef<str>>(&self, key: S) -> Result<(), Error>;
> > +
> > +    /// Get a value from the cache.
> > +    ///
> > +    /// Expired entries will *not* be returned.
> > +    fn get<S: AsRef<str>>(&self, key: S) -> Result<Option<Value>, Error>;
> > +}
> 
> I don't necessarily think that a trait would be necessary in this
> case, as there's not really any other structure (that can be used as
> caching mechanism) that you're abstracting over. (more below)

Already somewhat mentioned this off list. This should not be a trait for
now. If we ever need one we can still add one.

> 
> > diff --git a/proxmox-cache/src/shared_cache.rs b/proxmox-cache/src/shared_cache.rs
> > new file mode 100644
> > index 0000000..be6212c
> > --- /dev/null
> > +++ b/proxmox-cache/src/shared_cache.rs
> > @@ -0,0 +1,263 @@
> > +use std::path::{Path, PathBuf};
> > +
> > +use anyhow::{bail, Error};
> > +use serde::{Deserialize, Serialize};
> > +use serde_json::Value;
> > +
> > +use proxmox_schema::api_types::SAFE_ID_FORMAT;
> > +use proxmox_sys::fs::CreateOptions;
> > +
> > +use crate::{Cache, DefaultTimeProvider, TimeProvider};
> > +
> > +/// A simple, file-backed cache that can be used from multiple processes concurrently.
> > +///
> > +/// Cache entries are stored as individual files inside a base directory. For instance,
> > +/// a cache entry with the key 'disk_stats' will result in a file 'disk_stats.json' inside
> > +/// the base directory. As the extension implies, the cached data will be stored as a JSON
> > +/// string.
> > +///
> > +/// For optimal performance, `SharedCache` should have its base directory in a `tmpfs`.
> > +///
> > +/// ## Key Space
> > +/// Due to the fact that cache keys are being directly used as filenames, they have to match the
> > +/// following regular expression: `[A-Za-z0-9_][A-Za-z0-9._\-]*`
> > +///
> > +/// ## Concurrency
> > +/// All cache operations are based on atomic file operations, thus accessing/updating the cache from
> > +/// multiple processes at the same time is safe.
> > +///
> > +/// ## Performance
> > +/// On a tmpfs:
> > +/// ```sh
> > +///   $ cargo run --release --example=performance
> > +///   inserting 100000 keys took 896.609758ms (8.966µs per key)
> > +///   getting 100000 keys took 584.874842ms (5.848µs per key)
> > +///   deleting 100000 keys took 247.742702ms (2.477µs per key)
> > +///
> > +/// Inserting/getting large objects might of course result in lower performance due to the cost
> > +/// of serialization.
> > +/// ```
> > +///
> > +pub struct SharedCache {
> > +    base_path: PathBuf,
> > +    time_provider: Box<dyn TimeProvider>,
> > +    create_options: CreateOptions,
> > +}
> 
> Instead, this should be generic:
> 
> pub struct SharedCache<K, V> { ... }
> 
> .. and maybe rename it to SharedFileCache to make it explicit that this
> operates on a file. (but that's more dependent on one's taste tbh)
> 
> .. and the impl block below ...

The main issue here is that it'll all boil down to one general "JSON
thingy", because it'll be used from perl: if we have multiple types to
instantiate here, we'd need separate perlmod `#[export]`s for each and
every type we need in our perl code.

OTOH, we can still have the crate side here generic and have the perl-rs
side just instantiate a single `SharedCache<String, serde_json::Value>`...

A generic `K` might make it annoying to work with paths, though, so IMO
that can just stay a string ;-)
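
Roughly like this on the perl-rs side, assuming the crate-side type were
made generic as suggested (hypothetical sketch; the package name, the
type alias and the constructor signature are all made up):

    #[perlmod::package(name = "PVE::RS::Cache", lib = "pve_rs")]
    mod export {
        use anyhow::Error;
        use serde_json::Value;

        use proxmox_cache::SharedCache;

        // the perl bindings only ever see this one concrete
        // instantiation, so a single set of #[export]s is enough
        type JsonCache = SharedCache<String, Value>;

        #[export]
        fn get(base_dir: String, key: String) -> Result<Option<Value>, Error> {
            JsonCache::new(base_dir)?.get(key)
        }
    }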

> 
> > +
> > +impl Cache for SharedCache {
> > +    fn set<S: AsRef<str>>(
> > +        &self,
> > +        key: S,
> > +        value: Value,

In any case, I'd definitely argue against `Value` here and just go with
a generic `V: Serialize` at least in the methods.

Why serialize twice? (Once from some struct to a `Value`, then from the
`Value` to JSON.)

Otherwise we could also just store `&[u8]` and have this be nothing more
than a thing to attach timeouts to bytes in files ;-)
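
I.e. something along these lines in shared_cache.rs (sketch only;
assumes `CachedItem` gets a generic payload type, see further below):

    impl SharedCache {
        pub fn set<V: Serialize>(
            &self,
            key: &str,
            value: V,
            expires_in: Option<i64>,
        ) -> Result<(), Error> {
            let item = CachedItem {
                value,
                added_at: proxmox_time::epoch_i64(), // or a #[cfg(test)]-able now()
                expires_in,
            };
            // serialize the caller's type straight to JSON, skipping the
            // intermediate serde_json::Value
            let serialized = serde_json::to_vec_pretty(&item)?;
            let path = self.get_path_for_key(key)?;
            proxmox_sys::fs::replace_file(path, &serialized, self.create_options.clone(), true)?;
            Ok(())
        }
    }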

> > +        expires_in: Option<i64>,
> > +    ) -> Result<(), Error> {
> > +        let path = self.get_path_for_key(key.as_ref())?;
> > +        let added_at = self.time_provider.now();
> > +
> > +        let item = CachedItem {
> > +            value,
> > +            added_at,
> > +            expires_in,
> > +        };
> > +
> > +        let serialized = serde_json::to_vec_pretty(&item)?;
> > +
> > +        // Atomically replace file
> > +        proxmox_sys::fs::replace_file(path, &serialized, self.create_options.clone(), true)?;
> > +        Ok(())
> > +    }
> > +
> > +    fn delete<S: AsRef<str>>(&self, key: S) -> Result<(), Error> {
> > +        let path = self.get_path_for_key(key.as_ref())?;
> > +        std::fs::remove_file(path)?;
> > +        Ok(())
> > +    }
> > +
> > +    fn get<S: AsRef<str>>(&self, key: S) -> Result<Option<Value>, Error> {

Same here of course.

> > +        let path = self.get_path_for_key(key.as_ref())?;
> > +
> > +        let value = if let Some(content) = proxmox_sys::fs::file_get_optional_contents(path)? {
> > +            let value: CachedItem = serde_json::from_slice(&content)?;

`CachedItem` would need to be generic, and here you could use
`CachedItem<&'_ serde_json::RawValue>`, which *should* just refer to a
subslice of `content`, deferring the payload parsing until later.

This way you avoid at least *some* overhead of the intermediate `Value`
while still being able to access the value's time.

Alternatively we could think of a more efficient format where the added
and expires fields are in a fixed-size binary header... but it
shouldn't matter much.
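
For illustration (sketch only; `RawValue` is behind serde_json's
`raw_value` feature):

    use serde::{Deserialize, Serialize};
    use serde_json::value::RawValue;

    // generic over the payload so reading back can borrow the raw JSON
    #[derive(Serialize, Deserialize)]
    struct CachedItem<V> {
        value: V,
        added_at: i64,
        expires_in: Option<i64>,
    }

    // `&RawValue` just points at the payload's bytes inside `content`, so
    // the expiry check can run without parsing the payload; if the entry
    // is still valid, `item.value.get()` yields the raw JSON text to parse
    fn peek(content: &[u8]) -> Result<CachedItem<&RawValue>, serde_json::Error> {
        serde_json::from_slice(content)
    }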

> > +
> > +            let now = self.time_provider.now();
> > +
> > +            if let Some(expires_in) = value.expires_in {
> > +                // Check if value is not expired yet. Also do not allow
> > +                // values from the future, in case we have clock jumps
> > +                if value.added_at + expires_in > now && value.added_at <= now {
> > +                    Some(value.value)
> > +                } else {
> > +                    None
> > +                }
> > +            } else {
> > +                Some(value.value)
> > +            }
> > +        } else {
> > +            None
> > +        };
> > +
> > +        Ok(value)
> > +    }
> > +}
> 
> ... can be replaced as follows, in order to make it similar to
> std::collections::{HashMap, BTreeMap}:
> 
> impl<K: AsRef<str>> SharedCache<K, Value> {
>     // Returns old value on successful insert, if given
>     fn insert(&self, k: K, v: Value) -> Result<Option<Value>, Error> {
>         // ...
>     }
> 
>     fn get(&self, k: K) -> Result<Option<Value>, Error> {
>         // ...
>     }
> 
>     fn remove(&self, k: K) -> Result<Option<Value>, Error> {
>         // ...
>     }
> }
> 
> If necessary / sensible, other methods (inspired by {HashMap, BTreeMap}) can
> be added as well, such as remove_entry, retain, clear, etc.

Using the naming found in std sure makes sense.

But please hold off on the `Entry` API for now; that one will bite ya ;-)




