[pve-devel] applied: [PATCH access-control] auth key: fix double rotation in clusters

Fri Jul 15 13:39:35 CEST 2022

Am 13/07/2022 um 15:13 schrieb Fabian Grünbichler:
> there is a (hard to trigger) race that can cause a double rotation of
> the auth key, with potentially confusing fallout (various processes on
> different nodes having an inconsistent view of the current and previous
> auth keys, resulting in "random" invalid ticket errors until the next
> proper key rotation 24h later).
> 
> the underlying cause is that `stat()` calls are excempt from our
> otherwise non-cached/direct_io handling of pmxcfs I/O, which allows the
> following sequence of events to take place:
> 
> LAST: mtime of current auth key
> 
> - current epoch advances to LAST + 24h
> 
> the following can be arbitrarly interleaved between node A and B:
> - LAST+24h node A: pvedaemon/pvestatd on node A calls check_authkey(1)
> - LAST+24h node A: it returns 0 (rotation required)
> - LAST+24h node A: rotate_key() is called
> - LAST+24h node A: cfs_lock_authkey is called
> - LAST+24h node B: pvedaemon/pvestatd calls check_authkey(1)
> - LAST+24h node B: key is not yet cached in-memory by current process
> - LAST+24h node B: key file is opened, stat-ed, read, parsed, and content+mtime
>   is cached (the kernel will now cache this stat result for 1s unless
>   the path is opened)
> - LAST+24h node B: it returns 0 (rotation required)
> - LAST+24h node B: rotate_key() is called
> - LAST+24h node B: cfs_lock_authkey is called
> 
> the following is mutex-ed via a cfs_lock:
> - LAST+24h node A: lock is obtained
> - LAST+24h node A: check_authkey() is called
> - LAST+24h node A: key is stat-ed, mtime is still (correctly) LAST,
>   cached mtime and content are returned
> - LAST+24h node A: it returns 0 (rotation still required)
> - LAST+24h node A: get_pubkey() is called and returns current auth key
> - LAST+24h node A: new keypair is generated and persisted
> - LAST+24h node A: cfs_lock is released
> - LAST+24h node B: changes by node A are processed by pmxcfs
> - LAST+24h node B: lock is obtained
> - LAST+24h node B: check_authkey() is called
> - LAST+24h node B: key is stat-ed, mtime is (incorrectly!) still LAST
>   since the stat call is handled by the kernel/page cache, not by
>   pmxcfs, cached mtime and content are returned
> - LAST+24h node B: it returns 0 (rotation still required)
> - LAST+24h node B: get_pubkey() is called and returns either previous or
>   key written by node A (depending on whether page cache or pmxcfs
>   answers stat call)
> - LAST+24h node B: new keypair is generated, key returned by last
>   get_pubkey call is written as old key
> 
> the end result is that some nodes and process will treat the key
> generated by node A as "current", while others will treat the one
> generated by nodoe B as "current". both have the same mtime, so the
> in-memory cache hash won't be updated unless the service is restarted or
> another rotation happens. depending on who generated the ticket and who
> attempts validating it, a ticket might be rejected as invalid even
> though the generating party would treat it as valid, and time on all
> nodes is properly synced.
> 
> there seems to be now way for pmxcfs to pro-actively invalidate the page
> cache entry safely (since we'd need to do so while writes to the same
> path can happen concurrently), so work around by forcing an open/close
> at the (stat) call site which does the work for us. regular reads are
> not affected since those already bypass the page cache entirely anyway.
> 
> thankfully in almost all cases, the following sequence has enough
> synchronization overhead baked in to avoid triggering the issue almost
> entirely:
> 
> - cfs_lock
> - generate key
> - create tmp file for old key
> - write tmp file
> - rename tmp file into proper place
> - create tmp file for new pub key
> - write tmp file
> - rename tmp file into proper place
> - create tmp file for new priv key
> - write tmp file
> - rename tmp file into proper place
> - release lock
> 
> that being said, there has been at least one report where this was
> triggered in the wild from time to time.
> 
> it is easy to reproduce by increasing the attr_timeout and entry_timeout
> fuse settings inside pmxcfs to increase the time stat results are
> treated as valid/retained in the page cache:
> 
> -----8<-----
> 

applied, thanks!