[pve-devel] [RFC cluster 0/2] fix #4886: improve SSH handling

Thu Jan 11 13:23:08 CET 2024

On Tue, Jan 09, 2024 at 09:57:30AM +0100, Fabian Grünbichler wrote:
> > Esi Y via pve-devel <pve-devel at lists.proxmox.com> hat am 09.01.2024 06:01 CET geschrieben:
> > On Thu, Dec 21, 2023 at 10:53:11AM +0100, Fabian Grünbichler wrote:
> > > RFC since this would be a bigger change in how we approach intra-cluster
> > > SSH access.
> > > 
> > > there are still a few parts that currently don't use SSHInfo, but
> > > would need to be switched over if we want to pursue this approach:
> > > 
> > > - get_vnc_connection_info in PVE::API2::Nodes
> > > - 'upload' API endpoint in PVE::API2::Storage::Status
> > > - SSH proxy in pvesh
> > > 
> > > these changes would need to happen coordinated with the patches from
> > > this RFC series!
> > 
> > Not necessarily.
> 
> if we do the unmerge as well, then yes - else a node with unmerged known host would fail to connect to nodes that joined the cluster after unmerging.

First of all, I have just seen the barrage of related patches re the helper, so let me say I think it's good to refactor and bring into consistency invoking ssh by that means.

I apologise for late reply, it belongs to here though and I believe is still relevant with your currently pursued solution as well.

PVE (and unsuspected users)  would not need to depend on the helper if it had not depended on options such as -o HostKeyAlias. For the sake of brewity I filed bug report #5170 [1] re details of that, here I would just pitch that dropping in a config file into /etc/ssh/ssh_config.d/ would be much better than including options into each and every cluster SSH, a config file from which user connections would benefit as well and which would not run the risk of keeping stale entries in "remote" known_hosts partial blocks. The solution is to allow GlobalKnownHostsFile and UserKnownHostsFile as-is, but to add KnownHostsCommand to the client config. This allows for:
1. All entries to be considered alongside with cluster ones;
2. Check what is being sought for with %H and adapt approprietaly - no more worries about all the configs being in sync;
3. Log whenever something when wrong;
3. User never worrying about getting different ssh conn from a command-line and botching any one particulars machine local ones.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5170

>   
> > > next steps afterwards:
> > > - unmerge known hosts in `pvecm updatecerts`, instead of merging
> > > -- to disentangle regular ssh from intra-cluster SSH
> > 
> > Both of these could be accomplished with codebase/complexity reduction in an approach to:
> 
> I am not sure what "both" means here? there is only a single thing quoted ;)

Fair enough, I expected there will still be some merging necessary, now those are just snippets. The current name could be confusing to a user, e.g. would be better called self_known_hosts or such, but see above, no need for that at all.

> > 1.  eliminate shared ssh config and key files, i.e. completely remove any need for symlinks; and
> 
> that's the evaluate part further below.

This is for another thread apparently, the issue is that this proposed known_hosts fix e.g. still does not do anything for the duplicates in authorized_keys.

> > 2.  instead initialise each node at join (first one at cluster creation) into their respective node-local files with ssh certs; whilst
> 
> versus just copying the host key, which is far simpler?

I will leave the certs discussion out of this thread to not have it convoluted, the issue with certs in current PVE architectural model is that the CA resides on all nodes anyhow. Initialising here meant nothing else than "HostCertificate /etc/ssh/clusterhostkey-cert.pub" > /etc/ssh/sshd_config.d/clustercerts.conf.

> > 3.  keeping the setup maintenance-free, since any single key addition/refresh does not need to propagate any individual keys around the cluster; meanwhile
> 
> yes it does, changing the key would entail revoking the old one (and distributing that!) and signing the new one.

This is not fair counterpoint due to this fix still not addressing bug report #4670 [2] re authorized_keys. Revoking is an append to a file rather than deleting entries from existing, so simpler.

[2] https://bugzilla.proxmox.com/show_bug.cgi?id=4670

> > 4.  no requirement for additional -o options;
> 
> we already need -o options anyway, so there is no downside to adding additional ones

See above detailed in bug #5170 [1], I just wish this does not exacetbate it, as it is an opportunity to fix 2in1.

> > 5.  no requirement for sshd config appends, just drop-ins;
> 
> there's no need for sshd config changes either with the patches here?

The approach as originally proposed (without KnownHostsCommand) is changing existing functionality - the keys are gone from the machine and it requires non-intuitive manual -o's. So no config changes, but behaviour change that would need documenting at least in the release notes.

> > 6.  existing X509 CA key can be reused for ssh PKI as well.
> 
> that might no actually be the best of ideas ;)

Out of scope of this thread, but I take this as a snide remark as in this particular case it's literally format conversion of what's already there with no security implications.

openssl x509 -in /etc/pve/pve-root-ca.pem -inform pem -pubkey -noout | ssh-keygen -f /dev/stdin -i -m PKCS8 > /etc/pve/pve-root-ca.pub

> > > -- to allow `ssh-keygen -f .. -R ..` to work properly again
> > 
> > Will always work with local-only files. In the other approach, the -R will still not work with a file stored on pmxcfs.
> 
> yes, the -R will work with a file stored on pmxcfs. just not with a symlink, no matter where it points. which is why the next step above lists unmerging the known hosts to get rid of the symlink.

-R does not work if fs does not support hardlinks [3], but it can be also addressed with the KnownHostsCommand, otherwise if you make the snippet files multiline we are back to the same problem should someone need to run it on pmxcfs

[3] https://github.com/openssh/openssh-portable/blob/64e0600f23c6dec36c3875392ac95b8a9100c2d6/ssh-keygen.c#L1368-L1382

> > > -- existing keys would still be preserved for not-yet-upgraded nodes, so this
> > >    should be do-able without waiting for a major release..
> > 
> > With the ssh certs, the old (non-cert) keys could be safely left behind in the pmxcfs location, upgraded nodes would append the previously shared content into local files, old nodes would not make use of the new keys until upgraded, but will keep working with the old to the extent they used to work. Universal remedy for any legacy issues would be to upgrade the node.
> 
> the same is true for the patches here:
> - updated nodes publish their own key, and use published keys if available
> - non-updated nodes will still have the symlink and use the "old" merged file

Yes, this was mostly meant as the certs are not any worse or more disruptive than iterative approach.

> > 
> > > - evaluate whether we want to split out
> > > -- the client config (we currently force a cipher order there)
> > > -- the client key (could live in /etc/pve/priv instead?)
> > > -- or even the sshd instance altogether (would allow not touching the
> > >    regular sshd config at all)
> > 
> > Non-issue in the ssh certs approach.
> 
> on the contrary, all of the above are also valid for cert-based auth..

I will keep this out of scope here, but happy to discuss this if interested - I find it to be adding complexity where it is not needed.

> > What's the counter-argument to ssh certs, given the simlicity in comparison with both the old approach and the intially suggested one here?
> 
> the one sentence summary - it doesn't get us closer to where we want to end up (either getting rid of SSH entirely, or full disentangling PVE-internal SSH use from the regular, default sshd instance), but adds more complexity instead.

I could have provided POC which would be equivalent to ~6 shell lines that need to be executed upon cluster join addressing both known_hosts and authorized_keys with no mandatory argv[], but I suspect the patches are well on the way now and unless it disrupts the components that now make use of the new helper - everyone prefers what they know and what has worked so far. I happen to known ssh certs, so hence my original take. I believe the certs are also future proof (without extra effort) when it comes to e.g. keys rotation (expiries built in), something PVE does not take into account as of today and tracked separately as bug report #5174 [4].

[4] https://bugzilla.proxmox.com/show_bug.cgi?id=5174

ONE FINAL POINT:

The original merging, juggling, accounting for hashed known_hosts, etc. elaborate codepath - I humbly believe - was all done in the interest of making SSH transparent to the user within the cluster. The proposed approach here is much cleaner, but it takes out that original objective off the feature list. I suspect it was important enough then to make it worth of all the efforts going into this. It would be a pity if that is lost when it does not have to be. The certs have that built-in, but the KnownHostsCommand also retains that without any additional risk.