[pve-devel] [PATCH access-control] auth key: fix double rotation in clusters

Wed Jul 13 15:13:59 CEST 2022

there is a (hard to trigger) race that can cause a double rotation of
the auth key, with potentially confusing fallout (various processes on
different nodes having an inconsistent view of the current and previous
auth keys, resulting in "random" invalid ticket errors until the next
proper key rotation 24h later).

the underlying cause is that `stat()` calls are exempt from our
otherwise non-cached/direct_io handling of pmxcfs I/O, which allows the
following sequence of events to take place:

LAST: mtime of current auth key

- current epoch advances to LAST + 24h

the following can be arbitrarily interleaved between node A and B:
- LAST+24h node A: pvedaemon/pvestatd on node A calls check_authkey(1)
- LAST+24h node A: it returns 0 (rotation required)
- LAST+24h node A: rotate_key() is called
- LAST+24h node A: cfs_lock_authkey is called
- LAST+24h node B: pvedaemon/pvestatd calls check_authkey(1)
- LAST+24h node B: key is not yet cached in-memory by current process
- LAST+24h node B: key file is opened, stat-ed, read, parsed, and content+mtime
  is cached (the kernel will now cache this stat result for 1s unless
  the path is opened)
- LAST+24h node B: it returns 0 (rotation required)
- LAST+24h node B: rotate_key() is called
- LAST+24h node B: cfs_lock_authkey is called

the following is mutex-ed via a cfs_lock:
- LAST+24h node A: lock is obtained
- LAST+24h node A: check_authkey() is called
- LAST+24h node A: key is stat-ed, mtime is still (correctly) LAST,
  cached mtime and content are returned
- LAST+24h node A: it returns 0 (rotation still required)
- LAST+24h node A: get_pubkey() is called and returns current auth key
- LAST+24h node A: new keypair is generated and persisted
- LAST+24h node A: cfs_lock is released
- LAST+24h node B: changes by node A are processed by pmxcfs
- LAST+24h node B: lock is obtained
- LAST+24h node B: check_authkey() is called
- LAST+24h node B: key is stat-ed, mtime is (incorrectly!) still LAST
  since the stat call is handled by the kernel/page cache, not by
  pmxcfs; cached mtime and content are returned
- LAST+24h node B: it returns 0 (rotation still required)
- LAST+24h node B: get_pubkey() is called and returns either the previous
  key or the key written by node A (depending on whether page cache or
  pmxcfs answers stat call)
- LAST+24h node B: new keypair is generated, key returned by last
  get_pubkey call is written as old key
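
the decision made under the lock can be illustrated with a toy model.
this is only a sketch and not PVE code: rotation_required() and all
variables are hypothetical stand-ins, and the only assumption is that
rotation is decided by comparing the key's mtime against a 24h lifetime:

```perl
use strict;
use warnings;

# toy model, not PVE code: rotation is needed once the key is 24h old
my $LIFETIME = 24 * 60 * 60;

sub rotation_required {
    my ($mtime, $now) = @_;
    return ($now - $mtime) >= $LIFETIME ? 1 : 0;
}

my $last = 1_000_000;           # LAST: mtime of the current auth key
my $now  = $last + $LIFETIME;   # current epoch is LAST + 24h

my $real_mtime = $last;         # what pmxcfs would report
my $rotations  = 0;

# node A holds the lock, sees the correct mtime and rotates
if (rotation_required($real_mtime, $now)) {
    $real_mtime = $now;
    $rotations++;
}

# node B holds the lock next, but stat() is answered from the cache
my $stale_mtime = $last;
if (rotation_required($stale_mtime, $now)) {
    $rotations++;               # the unwanted second rotation
}

# with a fresh stat result node B would have skipped the rotation
my $fresh_needed = rotation_required($real_mtime, $now);

print "rotations=$rotations fresh_needed=$fresh_needed\n";
```

in this model node B rotates a second time purely because it evaluated
the stale mtime; with the fresh value the re-check would have returned
early.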

the end result is that some nodes and processes will treat the key
generated by node A as "current", while others will treat the one
generated by node B as "current". both have the same mtime, so the
in-memory cache hash won't be updated unless the service is restarted or
another rotation happens. depending on who generated the ticket and who
attempts validating it, a ticket might be rejected as invalid even
though the generating party would treat it as valid and time on all
nodes is properly synced.

there seems to be no way for pmxcfs to pro-actively invalidate the page
cache entry safely (since we'd need to do so while writes to the same
path can happen concurrently), so work around by forcing an open/close
at the (stat) call site which does the work for us. regular reads are
not affected since those already bypass the page cache entirely anyway.

thankfully, in almost all cases the following sequence has enough
synchronization overhead baked in to avoid triggering the issue:

- cfs_lock
- generate key
- create tmp file for old key
- write tmp file
- rename tmp file into proper place
- create tmp file for new pub key
- write tmp file
- rename tmp file into proper place
- create tmp file for new priv key
- write tmp file
- rename tmp file into proper place
- release lock

that being said, there has been at least one report where this was
triggered in the wild from time to time.

it is easy to reproduce by increasing the attr_timeout and entry_timeout
fuse settings inside pmxcfs to increase the time stat results are
treated as valid/retained in the page cache:

-----8<-----
 diff --git a/data/src/pmxcfs.c b/data/src/pmxcfs.c
 index d78a248..e3e807b 100644
 --- a/data/src/pmxcfs.c
 +++ b/data/src/pmxcfs.c
 @@ -935,7 +935,7 @@ int main(int argc, char *argv[])
 
  	mkdir(CFSDIR, 0755);
 
 -	char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", NULL};
 +	char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", "-oentry_timeout=5", "-oattr_timeout=5", NULL};
 
  	struct fuse_args fuse_args = FUSE_ARGS_INIT(sizeof (fa)/sizeof(gpointer) - 1, fa);
 
----->8-----

in which case it's even easy to trigger more than double rotation in a
bigger test cluster (stopping all PVE services except for pve-cluster
helps avoid interference):

on a single node:
$ touch --date yesterday /etc/pve/authkey.pub

in parallel (e.g., via tmux synchronized panes):
-----8<-----
use strict;
use warnings;
use PVE::Cluster;
use PVE::AccessControl;
PVE::Cluster::cfs_update();

# ensure page cache entry is there
PVE::AccessControl::check_authkey(1);
PVE::AccessControl::check_authkey(1);
# now attempt rotation
PVE::AccessControl::rotate_authkey();
----->8-----

Thanks to Wolfgang Bumiller for assistance in triaging and exploring
various avenues of fixing.

Signed-off-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
---
apologies for the wall of text - but probably better to have too much
info preserved than too little ;)

 src/PVE/AccessControl.pm | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/PVE/AccessControl.pm b/src/PVE/AccessControl.pm
index 3725a7d..953f135 100644
--- a/src/PVE/AccessControl.pm
+++ b/src/PVE/AccessControl.pm
@@ -203,7 +203,16 @@ sub rotate_authkey {
     return if $authkey_lifetime == 0;
 
     PVE::Cluster::cfs_lock_authkey(undef, sub {
-	# re-check with lock to avoid double rotation in clusters
+	# stat() calls might be answered from the kernel page cache for up to
+	# 1s, so this special dance is needed to avoid a double rotation in
+	# clusters *despite* the cfs_lock context..
+
+	# drop in-process cache hash
+	$pve_auth_key_cache = {};
+	# force open/close of file to invalidate page cache entry
+	get_pubkey();
+	# now re-check with lock held and page cache invalidated so that stat()
+	# does the right thing, and any key updates by other nodes are visible.
 	return if check_authkey();
 
 	my $old = get_pubkey();
-- 
2.30.2