[pve-devel] [RFC proxmox 4/7] cache: add new crate 'proxmox-cache'

Max Carrara m.carrara at proxmox.com
Tue Aug 22 15:52:32 CEST 2023


On 8/22/23 13:56, Wolfgang Bumiller wrote:
> On Tue, Aug 22, 2023 at 12:08:38PM +0200, Max Carrara wrote:
>> On 8/21/23 15:44, Lukas Wagner wrote:
>>> For now, it contains a file-backed cache with expiration logic.
>>> The cache should be safe to be accessed from multiple processes at
>>> once.
>>>
>>
>> This seems pretty neat! The cache implementation seems straightforward
>> enough. I'll see if I can test it more thoroughly later.
>>
>> However, in my opinion we should have a crate like
>> "proxmox-collections" (or something of the sort) with modules for each
>> data structure / collection similar to the standard library; I'm
>> curious what others think about that. imo it would be a great
>> opportunity to introduce that crate in this series, since you're
>> already introducing one for the cache anyway.
>>
>> So, proxmox-collections would look something like this:
>>
>>   proxmox-collections
>>   └── src
>>       ├── cache
>>       │   ├── mod.rs
>>       │   └── shared_cache.rs
>>       └── lib.rs
>>
>> Let me know what you think!
> 
> Smaller crates help with compile time, I don't want to pull in a big
> collections crate if I need one single type of collection from it ;-)
> 

They do indeed, but in this case I'd prefer a single crate that guards
each collection behind a feature, which has essentially the same effect.

Or as an alternative, let proxmox-collections be a meta-crate that
re-exports smaller crates in its own namespace; let those smaller
crates then be optional dependencies which are only pulled in and compiled
when their respective features are specified[0].

I'm not sure how the latter works in conjunction with deb packaging, however.

That being said, I'm otherwise not opposed against proxmox-cache becoming
its own crate, but I do think that we should group our code logically in some way.

>>
>>> The cache stores values in a directory, based on the key.
>>> E.g. key "foo" results in a file 'foo.json' in the given base
>>> directory. If a new value is set, the file is atomically replaced.
>>> The JSON file also contains some metadata, namely 'added_at' and
>>> 'expire_in' - they are used for cache expiration.
>>>
>>> Note: This cache is not suited to applications that
>>>  - Might want to cache huge amounts of data, and/or access the cache
>>>    very frequently (due to the overhead of JSON de/serialization)
>>>  - Require arbitrary keys - right now, keys are limited by
>>>    SAFE_ID_REGEX
>>>
>>> The cache was developed for the use in pvestatd, in order to cache
>>> e.g. storage plugin status. There, these limitations do not really
>>> play any role.
>>>
>>> Signed-off-by: Lukas Wagner <l.wagner at proxmox.com>
>>> ---
>>>  Cargo.toml                            |   1 +
>>>  proxmox-cache/Cargo.toml              |  20 ++
>>>  proxmox-cache/examples/performance.rs |  82 ++++++++
>>>  proxmox-cache/src/lib.rs              |  40 ++++
>>>  proxmox-cache/src/shared_cache.rs     | 263 ++++++++++++++++++++++++++
>>>  5 files changed, 406 insertions(+)
>>>  create mode 100644 proxmox-cache/Cargo.toml
>>>  create mode 100644 proxmox-cache/examples/performance.rs
>>>  create mode 100644 proxmox-cache/src/lib.rs
>>>  create mode 100644 proxmox-cache/src/shared_cache.rs
>>>
>>> diff --git a/Cargo.toml b/Cargo.toml
>>> index e334ac1..940e1d0 100644
>>> --- a/Cargo.toml
>>> +++ b/Cargo.toml
>>> @@ -5,6 +5,7 @@ members = [
>>>      "proxmox-async",
>>>      "proxmox-auth-api",
>>>      "proxmox-borrow",
>>> +    "proxmox-cache",
>>>      "proxmox-client",
>>>      "proxmox-compression",
>>>      "proxmox-http",
>>> diff --git a/proxmox-cache/src/lib.rs b/proxmox-cache/src/lib.rs
>>> new file mode 100644
>>> index 0000000..d496dc7
>>> --- /dev/null
>>> +++ b/proxmox-cache/src/lib.rs
>>> @@ -0,0 +1,40 @@
>>> +use anyhow::Error;
>>> +use serde_json::Value;
>>> +
>>> +pub mod shared_cache;
>>> +
>>> +pub use shared_cache::SharedCache;
>>> +
>>> +trait TimeProvider {
> 
> Also mentioned off list: this trait will be dropped.
> For tests, a `#[cfg(test)]` alternative will be used for fetching the
> time.
> 
> We just never need a different provider outside of tests, and storing it
> as `Box<dyn TimeProvider>` immediately drops all the auto traits from
> the cache struct, since trait objects *only* ever get the auto traits
> you *explicitly* define.
> 
>>> +    /// Returns the current time as a UNIX epoch (second resolution)
>>> +    fn now(&self) -> i64;
>>> +}
>>> +
>>> +struct DefaultTimeProvider;
>>> +
>>> +impl TimeProvider for DefaultTimeProvider {
>>> +    fn now(&self) -> i64 {
>>> +        proxmox_time::epoch_i64()
>>> +    }
>>> +}
>>> +
>>> +pub trait Cache {
>>> +    /// Set or insert a cache entry.
>>> +    ///
>>> +    /// If `expires_in` is set, this entry will expire in the desired number of
>>> +    /// seconds.
>>> +    fn set<S: AsRef<str>>(
>>> +        &self,
>>> +        key: S,
>>> +        value: Value,
>>> +        expires_in: Option<i64>,
>>> +    ) -> Result<(), Error>;
>>> +
>>> +    /// Delete a cache entry.
>>> +    fn delete<S: AsRef<str>>(&self, key: S) -> Result<(), Error>;
>>> +
>>> +    /// Get a value from the cache.
>>> +    ///
>>> +    /// Expired entries will *not* be returned.
>>> +    fn get<S: AsRef<str>>(&self, key: S) -> Result<Option<Value>, Error>;
>>> +}
>>
>> I don't necessarily think that a trait would be necessary in this
>> case, as there's not really any other structure (that can be used as
>> caching mechanism) that you're abstracting over. (more below)
> 
> Already somewhat mentioned this off list. This should not be a trait for
> now. If we ever need one we can still add one.
> 
>>
>>> diff --git a/proxmox-cache/src/shared_cache.rs b/proxmox-cache/src/shared_cache.rs
>>> new file mode 100644
>>> index 0000000..be6212c
>>> --- /dev/null
>>> +++ b/proxmox-cache/src/shared_cache.rs
>>> @@ -0,0 +1,263 @@
>>> +use std::path::{Path, PathBuf};
>>> +
>>> +use anyhow::{bail, Error};
>>> +use serde::{Deserialize, Serialize};
>>> +use serde_json::Value;
>>> +
>>> +use proxmox_schema::api_types::SAFE_ID_FORMAT;
>>> +use proxmox_sys::fs::CreateOptions;
>>> +
>>> +use crate::{Cache, DefaultTimeProvider, TimeProvider};
>>> +
>>> +/// A simple, file-backed cache that can be used from multiple processes concurrently.
>>> +///
>>> +/// Cache entries are stored as individual files inside a base directory. For instance,
>>> +/// a cache entry with the key 'disk_stats' will result in a file 'disk_stats.json' inside
>>> +/// the base directory. As the extension implies, the cached data will be stored as a JSON
>>> +/// string.
>>> +///
>>> +/// For optimal performance, `SharedCache` should have its base directory in a `tmpfs`.
>>> +///
>>> +/// ## Key Space
>>> +/// Due to the fact that cache keys are being directly used as filenames, they have to match the
>>> +/// following regular expression: `[A-Za-z0-9_][A-Za-z0-9._\-]*`
>>> +///
>>> +/// ## Concurrency
>>> +/// All cache operations are based on atomic file operations, thus accessing/updating the cache from
>>> +/// multiple processes at the same time is safe.
>>> +///
>>> +/// ## Performance
>>> +/// On a tmpfs:
>>> +/// ```sh
>>> +///   $ cargo run --release --example=performance
>>> +///   inserting 100000 keys took 896.609758ms (8.966µs per key)
>>> +///   getting 100000 keys took 584.874842ms (5.848µs per key)
>>> +///   deleting 100000 keys took 247.742702ms (2.477µs per key)
>>> +///
>>> +/// Inserting/getting large objects might of course result in lower performance due to the cost
>>> +/// of serialization.
>>> +/// ```
>>> +///
>>> +pub struct SharedCache {
>>> +    base_path: PathBuf,
>>> +    time_provider: Box<dyn TimeProvider>,
>>> +    create_options: CreateOptions,
>>> +}
>>
>> Instead, this should be generic:
>>
>> pub struct SharedCache<K, V> { ... }
>>
>> .. and maybe rename it to SharedFileCache to make it explicit that this
>> operates on a file. (but that's more dependent on one's taste tbh)
>>
>> .. and the impl block below ...
> 
> The main issue here is that it'll all boil down to one general "json
> thingy" because it'll be used from out of perl where if we have multiple
> types here to instantiate we'll need to create separate `perlmod
> #[export]`s for each and every type we need in our perl code.
> 
> OTOH, we can still have the crate side here generic and have the perl-rs
> side just instantiate a single `SharedCache<String, serde_json::Value>`...
> 

Or perhaps have the perl-rs side wrap `SharedCache<String, Value>` and expose
only the methods that are needed; I don't see the issue here ;-)

> A generic `K` might make it annoying to work with paths, though, so IMO
> that can just stay a string ;-)
> 
>>
>>> +
>>> +impl Cache for SharedCache {
>>> +    fn set<S: AsRef<str>>(
>>> +        &self,
>>> +        key: S,
>>> +        value: Value,
> 
> In any case, I'd definitely argue against `Value` here and just go with
> a generic `V: Serialize` at least in the methods.
> 
> Because why serialize twice? (Once from some struct to Value, then from
> Value to json.)
> 

ACK, that one I hadn't considered. Good point!

> Otherwise we could also just store `&[u8]` and have this be nothing more
> than a thing to attach timeouts to bytes in files ;-)
> 
>>> +        expires_in: Option<i64>,
>>> +    ) -> Result<(), Error> {
>>> +        let path = self.get_path_for_key(key.as_ref())?;
>>> +        let added_at = self.time_provider.now();
>>> +
>>> +        let item = CachedItem {
>>> +            value,
>>> +            added_at,
>>> +            expires_in,
>>> +        };
>>> +
>>> +        let serialized = serde_json::to_vec_pretty(&item)?;
>>> +
>>> +        // Atomically replace file
>>> +        proxmox_sys::fs::replace_file(path, &serialized, self.create_options.clone(), true)?;
>>> +        Ok(())
>>> +    }
>>> +
>>> +    fn delete<S: AsRef<str>>(&self, key: S) -> Result<(), Error> {
>>> +        let path = self.get_path_for_key(key.as_ref())?;
>>> +        std::fs::remove_file(path)?;
>>> +        Ok(())
>>> +    }
>>> +
>>> +    fn get<S: AsRef<str>>(&self, key: S) -> Result<Option<Value>, Error> {
> 
> Same here of course.
> 
>>> +        let path = self.get_path_for_key(key.as_ref())?;
>>> +
>>> +        let value = if let Some(content) = proxmox_sys::fs::file_get_optional_contents(path)? {
>>> +            let value: CachedItem = serde_json::from_slice(&content)?;
> 
> `CachedItem` would need to be generic, and here you could use
> `CachedItem<&'_ serde_json::RawValue>` which *should* just refer to a
> subslice in `content` here to defer the payload parsing to later.
> 
> This way you avoid at least *some* overhead of the intermediate `Value`
> while still being able to access the value's time.
> 
> Alternatively we could think of a more efficient format where the added
> and expires fields are in a fixed-sized binary header... but it
> shouldn't matter much.
> 
>>> +
>>> +            let now = self.time_provider.now();
>>> +
>>> +            if let Some(expires_in) = value.expires_in {
>>> +                // Check if value is not expired yet. Also do not allow
>>> +                // values from the future, in case we have clock jumps
>>> +                if value.added_at + expires_in > now && value.added_at <= now {
>>> +                    Some(value.value)
>>> +                } else {
>>> +                    None
>>> +                }
>>> +            } else {
>>> +                Some(value.value)
>>> +            }
>>> +        } else {
>>> +            None
>>> +        };
>>> +
>>> +        Ok(value)
>>> +    }
>>> +}
>>
>> ... can be replaced as follows, in order to make it similar to
>> std::collections::{HashMap, BTreeMap}:
>>
>> impl<K: AsRef<str>> for SharedCache<K, Value> {
>>     // Returns old value on successful insert, if given
>>     fn insert(&self, k: K, v: Value) -> Result<Option<Value>, Error> {
>>         // ...
>>     }
>>
>>     fn get(&self, k: K) -> Result<Option<Value>, Error> {
>>         // ...
>>     }
>>
>>     fn remove(&self, k: K) -> Result<Option<Value>, Error> {
>>         // ...
>>     }
>> }
>>
>> If necessary / sensible, other methods (inspired by {HashMap, BTreeMap} can
>> be added as well, such as remove_entry, retain, clear, etc.
> 
> Using the naming found in std sure makes sense.
> 
> But please hold off on the `Entry` API for now, that one will bite ya ;-)

[0]: https://doc.rust-lang.org/cargo/reference/features.html#optional-dependencies






More information about the pve-devel mailing list