[pve-devel] [PATCH cluster 2/4] fix #4886: SSH: pin node's host key if available

Tue Jan 16 14:39:56 CET 2024

Thank you for the responses below, I just want to explain all the fuss in my reasoning, i.e. why I am bringing it up before this goes into a release.

The original way of handling SSH was meant to be pretty transparent to the user:

A. If they ran plain ssh, without any extra argv[], it would "just work" and it would not go on creating extra duplicate entries in e.g. local files.
B. If they added some stuff of their own, intuitively into local user config, it would "just blend in" and magically work for PVE-* code.
C. The "hijacked" part of the SSH config was the global one, but this made most sense logically as why would a machine that is part of a cluster need distinctive single-host-wide config.
D. Even then, the "hijacked" file was well-known (pun intended) format and use case (this I think is important), the idea was that the existing (ssh) tooling follows the symlinks. It was meant to be transparent, i.e. adding global item should be added globally, extrapolated to cluste-wide.

The code below runs well for PVE-* tooling and it plays nice with the old versions of PVE, but it does break: A, B, C and creates at least some inconvenience with D (I would expect extra support case load). Arguably the code now fixed the logical bug (not playing nice with ssh tooling) by simply removing C altogether.

I. The case for combo of UserKnownHostsFile (UserKHF) and GlobalKnownHostsFile (GlobalKHF) and use of "none" (in either case):

I-1. Injecting UserKHF instead of GlobalKHF for injecting the pinned keys breaks B, changes known behaviour C, there's no benefit to use it in this reversed combo.

I-2. Use of "none" (suppose we are talking UserKHF now in respect to (1) above) does not gain anything (maybe except for clarity during alpha testing) as you agree based on (2) in the quoted below, omitting it allows B to be preserved.

II. The case for KnownHostsCommand (KnownHC):

II-1. Even with amendments from I above, the A remains broken for human user, they might start piling up stale (duplicate) keys by invoking commands towards other cluster members without -o's, they will not know why e.g. PVE-* tooling works, but theirs only works from one node, but not another, or only with migration IP, not regular one, or why config change of PVE did not reflect into their shell experience. With KnownHC injected in their user config, this will never become an issue.

II-2. Should a user become acquainted with the two (now mandatory) -o's, they will at some point start running into cases where e.g. dues to StrictHostKeyChecking (StrictHKC), they are asked to add a machine, only to be asked again and again due to (3) as quoted below. With KnownHC injected in their user config, this can never happen.

II-3. Even better than the original setup, the use of KnownHC allows for e.g. adjustment in case of migration network is changed. I.e. the next ssh command trying to reach that new IP will be instantly pointed to the right hostkey, no need for somehow updating it in further locations.

II-4. Also, KnownHC allows for very good logging of situations when there is something wrong with specifically cluster hostkeys. The command is only executed if there was no match in UserKHF and GlobalKHF, there is no way it would silently fail because everytime there was no match in those two, the command will get to have its last say. So you catch every error explicitly, it won't become a red herring ruon forums.

II-5. The KnownHC is basically equivalent to -o GlobalKHF extra_maintained_file_that_can_become_stale, without taking away the Global feature from the user (and even current cluster config).

II-6. The KnownHC allows for much easier management of multiple keys or their rotation, instead of populating pinned keys into a different format (in non-intuitive use case), it could either copy/rsync the hostkeys as they are, or simply bring its own in addition (even better). It can ensure much better handling of situations when a new node is reintroduces under same IP/name without any extra effort or thought put into it.

Minor comments inline below still.

On Tue, Jan 16, 2024 at 10:00:10AM +0100, Fabian Grünbichler wrote:
> > Esi Y via pve-devel <pve-devel at lists.proxmox.com> hat am 15.01.2024 15:31 CET geschrieben:
> > On Mon, Jan 15, 2024 at 12:51:48PM +0100, Fabian Grünbichler wrote:
> > > > On Thu, Jan 11, 2024 at 11:51:16AM +0100, Fabian Grünbichler wrote:
> > > > > if the target node has already stored their SSH host key on pmxcfs, pin it and
> > > > > ignore the global known hosts information.
> > > > > 
> > > > > Signed-off-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
> > > > > ---
> > > > >  src/PVE/SSHInfo.pm | 15 ++++++++++++++-
> > > > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/src/PVE/SSHInfo.pm b/src/PVE/SSHInfo.pm
> > > > > index c351148..fad23bf 100644
> > > > > --- a/src/PVE/SSHInfo.pm
> > > > > +++ b/src/PVE/SSHInfo.pm
> > > > > @@ -49,11 +49,24 @@ sub get_ssh_info {
> > > > >  
> > > > >  sub ssh_info_to_command_base {
> > > > >      my ($info, @extra_options) = @_;
> > > > > +
> > > > > +    my $nodename = $info->{name};
> > > > > +
> > > > > +    my $known_hosts_file = "/etc/pve/nodes/$nodename/ssh_known_hosts";
> > > > > +    my $known_hosts_options = undef;
> > > > > +    if (-f $known_hosts_file) {
> > > > > +	$known_hosts_options = [
> > > > > +	    '-o', "UserKnownHostsFile=$known_hosts_file",
> > > > > +	    '-o', 'GlobalKnownHostsFile=none',
> > > > 
> > > > why does Global need to be none, even as this only applies if the snippet exists?
> > > 
> > > because we want to only let SSH look at our pinned file, not the regular one, which might contain bogus information. since our pinned file contains an entry for our host key alias which must match, the global file can never improve the situation, but it can cause a verification failure.
> > 
> > This might not work as expected.
> > 
> > 1. There will not be any verification failure if there is at least some valid key present. If wrong keys are present alongside a good one, it's a pass. If _only_ wrong keys are present, with StrictHostKeyChecking default (ask) it will outright stop.
> > 
> > 2. The Global none does not improve anything there. If no keys are present it will try to ask (under SKHC default), but no use in BatchMode.
> 
> technically true, but doesn't really matter for our use case. we only want to use our own pinned key (or maybe, keys, at some point in the future) for internal connections.
> 
> > 3. Using -o UserKHF alongside default SKHC, e.g. if run by someone even manually after a failed script without BatchMode, will have it crash for them because the pinned file cannot be updated by ssh properly due to the same issue as mentioned before regarding ssh-keygen -R. In this case the pmxcfs will cause it to crash again on link-unlink-rename() again [1].
> > 
> > [1] https://github.com/openssh/openssh-portable/blob/50080fa42f5f744b798ee29400c0710f1b59f50e/hostfile.c#L695
> 
> it doesn't crash, it just fails to work. and this is not the same issue as the original one at all, since previously running the suggest command would break the PVE setup by removing our symlink, whereas now it creates an empty temp file but preserves our setup.

I meant the ssh will crash on them connecting (come to think of it, I did not test if it connects the one time without updating the file or not at all), I do not think they will appreciate the error message and will bring it to clutter the forums.

It is true it is not the original issue per se, that was silent logical only fail (from ssh-keygen point of view, everything worked). This is "goto fail" in the case of pmxcfs right on the link(), it will clean up after itself unlink(temp) free(temp) free(back), but they will be no wiser from the "link" error.

> > 4. I suppose you did not like my suggestion re KnownHostsCommand [2] instead of "pinning", but giving -o's to ssh code where the files reside on pmxcfs is just creating the same problem (that e.g. keygen -R had) elsewhere depending if you plan e.g. multiline.
> 
> the only advantage of a KnownHostsCommand would be to avoid the above (tiny) issue in interactive use cases. our use case is by definition not interactive. the only situation where this should arise in practice is if you manually rotate the SSH host key of a node already in the cluster. even then, it will solve itself after a reboot (or manual invocation of pvecm updatecerts, which should definitely be noted in a yet-to-be-written "keys/secrets and rotating them" section of the docs).

I do not think there is currently anything self-healing after reboot.

> the command approach has similar problems though:
> - if it outputs a non-matching host key line, the connection will be aborted (so this is stricter than the file based solution! which is especially problematic if we extend this to handle all key types, since then a rotation of one of them would already trip it up)

It better outright throws an exit code in that case, see above the main answer. It is not stricter, the command only goes to run when there was no match in User|GlobalKHF, if even the command can't tell what one wants, I better have it exit code with clear error (you reached here-be-dragons you should have never reached), it should not be ordinary ssh failed to connect. My suggestion was to always return just one line only, the one for which the SSH was run, that is possible with %H and other options. So multikeys are a treat with that approach.

> - it internally treats the command option as if it were a file, leading to very nice output like this:
> 
> Offending RSA key in KnownHostsCommand-HOSTNAME:2
>   remove with:
>   ssh-keygen -f "KnownHostsCommand-HOSTNAME" -R "XXX"
> Host key for XXX has changed and you have requested strict checking.
> Host key verification failed.
> 
> (XXX is my hostname, the rest is output exactly like it is!)

That's so cool. :) But can you elaborate how you tested this? Was this with User|GlobalKHF none at the same time? Because this makes no sense unless there is bad keys and no good key. And at that I would expect this message only if there were no keys in the files, but only bad keys in this output. But I suggested to produce no output in case %H can't be matched.

But then again, it will not crash, just fail to run. I mean, if someone comes to forum that this is what was shown to them (and other output from the command itself), then they ran it, then nothing still works, they did not break anything (additional) by following those instructions and error is known from glancing at the first post/support email.

> last but not least, switching to a command is always possible as follow-up since it's entirely on the client side anyway and requires no coordination across the cluster - the command would just output the contents of the file anyhow.
> 

My point was, it should not just output the snippet, it should make a tailor-made one-line output that is always up to date, without needing any -o HostKeyAlias even. If connecting to IP that now newly set up migration IP, it will give proper result, for instance. But this is not necessary to do outright, as you say you can initially use it with -o HostKA even.

I mostly consider it clean addition because it's literally not touching anything User|GlobalKHF and thus makes it impossible for someone to try to run ssh tooling on the semi-products of known_hosts_selfref snippets which will be hard to maintain if multi-line or multi-pattern-match single line is needed.