[pve-devel] [RFC proxmox 4/7] cache: add new crate 'proxmox-cache'
Max Carrara
m.carrara at proxmox.com
Tue Aug 22 15:52:32 CEST 2023
On 8/22/23 13:56, Wolfgang Bumiller wrote:
> On Tue, Aug 22, 2023 at 12:08:38PM +0200, Max Carrara wrote:
>> On 8/21/23 15:44, Lukas Wagner wrote:
>>> For now, it contains a file-backed cache with expiration logic.
>>> The cache should be safe to be accessed from multiple processes at
>>> once.
>>>
>>
>> This seems pretty neat! The cache implementation seems straightforward
>> enough. I'll see if I can test it more thoroughly later.
>>
>> However, in my opinion we should have a crate like
>> "proxmox-collections" (or something of the sort) with modules for each
>> data structure / collection similar to the standard library; I'm
>> curious what others think about that. imo it would be a great
>> opportunity to introduce that crate in this series, since you're
>> already introducing one for the cache anyway.
>>
>> So, proxmox-collections would look something like this:
>>
>> proxmox-collections
>> └── src
>>     ├── cache
>>     │   ├── mod.rs
>>     │   └── shared_cache.rs
>>     └── lib.rs
>>
>> Let me know what you think!
>
> Smaller crates help with compile time, I don't want to pull in a big
> collections crate if I need one single type of collection from it ;-)
>
They do indeed, but in this case I'd prefer a single crate that guards
each collection behind a feature, which has essentially the same effect.
Or, as an alternative, let proxmox-collections be a meta-crate that
re-exports smaller crates in its own namespace; those smaller crates
would then be optional dependencies that are only pulled in and compiled
when their respective features are enabled[0] (rough sketch below).
I'm not sure how the latter plays together with deb packaging, however.
That being said, I'm not opposed to proxmox-cache becoming its own
crate, but I do think we should group our code logically in some way.
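
Roughly what I have in mind for the meta-crate variant (just a sketch;
names, paths and features here are placeholders):

    # proxmox-collections/Cargo.toml
    [dependencies]
    proxmox-cache = { path = "../proxmox-cache", optional = true }

    [features]
    cache = ["dep:proxmox-cache"]

    // proxmox-collections/src/lib.rs
    #[cfg(feature = "cache")]
    pub use proxmox_cache as cache;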
>>
>>> The cache stores values in a directory, based on the key.
>>> E.g. key "foo" results in a file 'foo.json' in the given base
>>> directory. If a new value is set, the file is atomically replaced.
>>> The JSON file also contains some metadata, namely 'added_at' and
>>> 'expire_in' - they are used for cache expiration.
>>>
>>> Note: This cache is not suited to applications that
>>> - Might want to cache huge amounts of data, and/or access the cache
>>> very frequently (due to the overhead of JSON de/serialization)
>>> - Require arbitrary keys - right now, keys are limited by
>>> SAFE_ID_REGEX
>>>
>>> The cache was developed for the use in pvestatd, in order to cache
>>> e.g. storage plugin status. There, these limitations do not really
>>> play any role.
>>>
>>> Signed-off-by: Lukas Wagner <l.wagner at proxmox.com>
>>> ---
>>> Cargo.toml | 1 +
>>> proxmox-cache/Cargo.toml | 20 ++
>>> proxmox-cache/examples/performance.rs | 82 ++++++++
>>> proxmox-cache/src/lib.rs | 40 ++++
>>> proxmox-cache/src/shared_cache.rs | 263 ++++++++++++++++++++++++++
>>> 5 files changed, 406 insertions(+)
>>> create mode 100644 proxmox-cache/Cargo.toml
>>> create mode 100644 proxmox-cache/examples/performance.rs
>>> create mode 100644 proxmox-cache/src/lib.rs
>>> create mode 100644 proxmox-cache/src/shared_cache.rs
>>>
>>> diff --git a/Cargo.toml b/Cargo.toml
>>> index e334ac1..940e1d0 100644
>>> --- a/Cargo.toml
>>> +++ b/Cargo.toml
>>> @@ -5,6 +5,7 @@ members = [
>>>      "proxmox-async",
>>>      "proxmox-auth-api",
>>>      "proxmox-borrow",
>>> +    "proxmox-cache",
>>>      "proxmox-client",
>>>      "proxmox-compression",
>>>      "proxmox-http",
>>> diff --git a/proxmox-cache/src/lib.rs b/proxmox-cache/src/lib.rs
>>> new file mode 100644
>>> index 0000000..d496dc7
>>> --- /dev/null
>>> +++ b/proxmox-cache/src/lib.rs
>>> @@ -0,0 +1,40 @@
>>> +use anyhow::Error;
>>> +use serde_json::Value;
>>> +
>>> +pub mod shared_cache;
>>> +
>>> +pub use shared_cache::SharedCache;
>>> +
>>> +trait TimeProvider {
>
> Also mentioned off list: this trait will be dropped.
> For tests, a `#[cfg(test)]` alternative will be used for fetching the
> time.
>
> We just never need a different provider outside of tests, and storing it
> as `Box<dyn TimeProvider>` immediately drops all the auto traits from
> the cache struct, since trait objects *only* ever get the auto traits
> you *explicitly* define.
>
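
Sounds reasonable to me. Something along these lines, I assume? (rough
sketch; `FAKE_NOW` is just a made-up name to illustrate the idea)

    #[cfg(test)]
    thread_local! {
        // lets tests control what "now" means
        static FAKE_NOW: std::cell::Cell<i64> = std::cell::Cell::new(0);
    }

    #[cfg(not(test))]
    fn now() -> i64 {
        proxmox_time::epoch_i64()
    }

    #[cfg(test)]
    fn now() -> i64 {
        FAKE_NOW.with(|t| t.get())
    }
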
>>> +    /// Returns the current time as a UNIX epoch (second resolution)
>>> +    fn now(&self) -> i64;
>>> +}
>>> +
>>> +struct DefaultTimeProvider;
>>> +
>>> +impl TimeProvider for DefaultTimeProvider {
>>> +    fn now(&self) -> i64 {
>>> +        proxmox_time::epoch_i64()
>>> +    }
>>> +}
>>> +
>>> +pub trait Cache {
>>> +    /// Set or insert a cache entry.
>>> +    ///
>>> +    /// If `expires_in` is set, this entry will expire in the desired number of
>>> +    /// seconds.
>>> +    fn set<S: AsRef<str>>(
>>> +        &self,
>>> +        key: S,
>>> +        value: Value,
>>> +        expires_in: Option<i64>,
>>> +    ) -> Result<(), Error>;
>>> +
>>> +    /// Delete a cache entry.
>>> +    fn delete<S: AsRef<str>>(&self, key: S) -> Result<(), Error>;
>>> +
>>> +    /// Get a value from the cache.
>>> +    ///
>>> +    /// Expired entries will *not* be returned.
>>> +    fn get<S: AsRef<str>>(&self, key: S) -> Result<Option<Value>, Error>;
>>> +}
>>
>> I don't think a trait is necessary in this case, as there's not really
>> any other structure (that could be used as a caching mechanism) that
>> you're abstracting over. (more below)
>
> Already somewhat mentioned this off list. This should not be a trait for
> now. If we ever need one we can still add one.
>
>>
>>> diff --git a/proxmox-cache/src/shared_cache.rs b/proxmox-cache/src/shared_cache.rs
>>> new file mode 100644
>>> index 0000000..be6212c
>>> --- /dev/null
>>> +++ b/proxmox-cache/src/shared_cache.rs
>>> @@ -0,0 +1,263 @@
>>> +use std::path::{Path, PathBuf};
>>> +
>>> +use anyhow::{bail, Error};
>>> +use serde::{Deserialize, Serialize};
>>> +use serde_json::Value;
>>> +
>>> +use proxmox_schema::api_types::SAFE_ID_FORMAT;
>>> +use proxmox_sys::fs::CreateOptions;
>>> +
>>> +use crate::{Cache, DefaultTimeProvider, TimeProvider};
>>> +
>>> +/// A simple, file-backed cache that can be used from multiple processes concurrently.
>>> +///
>>> +/// Cache entries are stored as individual files inside a base directory. For instance,
>>> +/// a cache entry with the key 'disk_stats' will result in a file 'disk_stats.json' inside
>>> +/// the base directory. As the extension implies, the cached data will be stored as a JSON
>>> +/// string.
>>> +///
>>> +/// For optimal performance, `SharedCache` should have its base directory in a `tmpfs`.
>>> +///
>>> +/// ## Key Space
>>> +/// Due to the fact that cache keys are being directly used as filenames, they have to match the
>>> +/// following regular expression: `[A-Za-z0-9_][A-Za-z0-9._\-]*`
>>> +///
>>> +/// ## Concurrency
>>> +/// All cache operations are based on atomic file operations, thus accessing/updating the cache from
>>> +/// multiple processes at the same time is safe.
>>> +///
>>> +/// ## Performance
>>> +/// On a tmpfs:
>>> +/// ```sh
>>> +/// $ cargo run --release --example=performance
>>> +/// inserting 100000 keys took 896.609758ms (8.966µs per key)
>>> +/// getting 100000 keys took 584.874842ms (5.848µs per key)
>>> +/// deleting 100000 keys took 247.742702ms (2.477µs per key)
>>> +///
>>> +/// Inserting/getting large objects might of course result in lower performance due to the cost
>>> +/// of serialization.
>>> +/// ```
>>> +///
>>> +pub struct SharedCache {
>>> +    base_path: PathBuf,
>>> +    time_provider: Box<dyn TimeProvider>,
>>> +    create_options: CreateOptions,
>>> +}
>>
>> Instead, this should be generic:
>>
>> pub struct SharedCache<K, V> { ... }
>>
>> .. and maybe rename it to SharedFileCache to make it explicit that this
>> operates on a file. (but that's more dependent on one's taste tbh)
>>
>> .. and the impl block below ...
>
> The main issue here is that it'll all boil down to one general "json
> thingy" because it'll be used from out of perl where if we have multiple
> types here to instantiate we'll need to create separate `perlmod
> #[export]`s for each and every type we need in our perl code.
>
> OTOH, we can still have the crate side here generic and have the perl-rs
> side just instantiate a single `SharedCache<String, serde_json::Value>`...
>
Or perhaps have the perl-rs side wrap `SharedCache<String, Value>` and expose
only the methods that are needed; I don't see the issue here ;-)
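
I.e. something roughly like this on the perl-rs side (sketch, leaving out
the perlmod boilerplate; the wrapper name is made up):

    // thin wrapper around the one instantiation the perl bindings need
    pub struct PveCache {
        inner: SharedCache<String, serde_json::Value>,
    }

    impl PveCache {
        pub fn get(&self, key: &str) -> Result<Option<serde_json::Value>, Error> {
            self.inner.get(key.to_owned())
        }

        // ... plus only the other methods the perl code actually uses
    }
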
> A generic `K` might make it annoying to work with paths, though, so IMO
> that can just stay a string ;-)
>
>>
>>> +
>>> +impl Cache for SharedCache {
>>> +    fn set<S: AsRef<str>>(
>>> +        &self,
>>> +        key: S,
>>> +        value: Value,
>
> In any case, I'd definitely argue against `Value` here and just go with
> a generic `V: Serialize` at least in the methods.
>
> Because why serialize twice? (Once from some struct to Value, then from
> Value to json.)
>
ACK, that one I hadn't considered. Good point!
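
Something like this, then? (untested sketch; assumes `CachedItem` becomes
generic over its value type)

    fn set<S: AsRef<str>, V: Serialize>(
        &self,
        key: S,
        value: V,
        expires_in: Option<i64>,
    ) -> Result<(), Error> {
        let path = self.get_path_for_key(key.as_ref())?;

        let item = CachedItem {
            value,
            added_at: proxmox_time::epoch_i64(),
            expires_in,
        };

        // serialize the caller's type directly, no intermediate Value
        let serialized = serde_json::to_vec_pretty(&item)?;
        proxmox_sys::fs::replace_file(path, &serialized, self.create_options.clone(), true)?;
        Ok(())
    }
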
> Otherwise we could also just store `&[u8]` and have this be nothing more
> than a thing to attach timeouts to bytes in files ;-)
>
>>> +        expires_in: Option<i64>,
>>> +    ) -> Result<(), Error> {
>>> +        let path = self.get_path_for_key(key.as_ref())?;
>>> +        let added_at = self.time_provider.now();
>>> +
>>> +        let item = CachedItem {
>>> +            value,
>>> +            added_at,
>>> +            expires_in,
>>> +        };
>>> +
>>> +        let serialized = serde_json::to_vec_pretty(&item)?;
>>> +
>>> +        // Atomically replace file
>>> +        proxmox_sys::fs::replace_file(path, &serialized, self.create_options.clone(), true)?;
>>> +        Ok(())
>>> +    }
>>> +
>>> +    fn delete<S: AsRef<str>>(&self, key: S) -> Result<(), Error> {
>>> +        let path = self.get_path_for_key(key.as_ref())?;
>>> +        std::fs::remove_file(path)?;
>>> +        Ok(())
>>> +    }
>>> +
>>> +    fn get<S: AsRef<str>>(&self, key: S) -> Result<Option<Value>, Error> {
>
> Same here of course.
>
>>> +        let path = self.get_path_for_key(key.as_ref())?;
>>> +
>>> +        let value = if let Some(content) = proxmox_sys::fs::file_get_optional_contents(path)? {
>>> +            let value: CachedItem = serde_json::from_slice(&content)?;
>
> `CachedItem` would need to be generic, and here you could use
> `CachedItem<&'_ serde_json::RawValue>` which *should* just refer to a
> subslice in `content` here to defer the payload parsing to later.
>
> This way you avoid at least *some* overhead of the intermediate `Value`
> while still being able to access the value's time.
>
> Alternatively we could think of a more efficient format where the added
> and expires fields are in a fixed-sized binary header... but it
> shouldn't matter much.
>
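
If I understand correctly, that would then look roughly like this? (sketch;
assumes serde_json's `raw_value` feature is enabled)

    use serde::Deserialize;
    use serde_json::value::RawValue;

    #[derive(Deserialize)]
    struct CachedItem<V> {
        value: V,
        added_at: i64,
        expires_in: Option<i64>,
    }

    // the payload is captured as a borrowed raw slice instead of being
    // parsed into a full Value ...
    let item: CachedItem<&RawValue> = serde_json::from_slice(&content)?;

    // ... so the expiry check can look at added_at/expires_in first, and
    // the payload is only deserialized afterwards:
    let value: Value = serde_json::from_str(item.value.get())?;
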
>>> +
>>> +            let now = self.time_provider.now();
>>> +
>>> +            if let Some(expires_in) = value.expires_in {
>>> +                // Check if value is not expired yet. Also do not allow
>>> +                // values from the future, in case we have clock jumps
>>> +                if value.added_at + expires_in > now && value.added_at <= now {
>>> +                    Some(value.value)
>>> +                } else {
>>> +                    None
>>> +                }
>>> +            } else {
>>> +                Some(value.value)
>>> +            }
>>> +        } else {
>>> +            None
>>> +        };
>>> +
>>> +        Ok(value)
>>> +    }
>>> +}
>>
>> ... can be replaced as follows, in order to make it similar to
>> std::collections::{HashMap, BTreeMap}:
>>
>> impl<K: AsRef<str>> SharedCache<K, Value> {
>>     // Returns the old value on a successful insert, if there was one
>>     fn insert(&self, k: K, v: Value) -> Result<Option<Value>, Error> {
>>         // ...
>>     }
>>
>>     fn get(&self, k: K) -> Result<Option<Value>, Error> {
>>         // ...
>>     }
>>
>>     fn remove(&self, k: K) -> Result<Option<Value>, Error> {
>>         // ...
>>     }
>> }
>>
>> If necessary / sensible, other methods (inspired by {HashMap, BTreeMap})
>> can be added as well, such as remove_entry, retain, clear, etc.
>
> Using the naming found in std sure makes sense.
>
> But please hold off on the `Entry` API for now, that one will bite ya ;-)
[0]: https://doc.rust-lang.org/cargo/reference/features.html#optional-dependencies