[pve-devel] applied: [PATCH storage 1/3] fix random hangs on reboot with active CephFS mount ordering cycle

Thomas Lamprecht t.lamprecht at proxmox.com
Wed Jan 29 19:58:22 CET 2020

Commit 54e0b0034bd6654c566cb4ae7d4a5953c48cd1ca introduced the
"_netdev" mount option for PVE 5.3. The systemd generator then
correctly translated it into the following ordering dependencies:
> Wants=network-online.target
> Before=umount.target remote-fs.target
> After=remote-fs-pre.target system.slice network.target network-online.target -.mount

This worked well and all were happy. With the current systemd in 6.0,
the local-fs ordering dependencies sometimes get generated as well.
This is fallout from an attempt to better handle nested mount
hierarchies, where a .mount unit needs to be mounted after, or
unmounted before, its parent mount is processed. It seems that this
sometimes glitches, so that a "RequiresMountsFor=/mnt/pve" gets thrown
in, which in turn sometimes results in the local-fs ordering
constraints being added.
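
A rough way to spot an affected mount unit is to check whether its
After=/Before= properties reference both the local-fs and the remote-fs
target families. The sketch below assumes the properties are available
as plain "Key=value" lines; the helper names and the second sample are
hypothetical and not part of this patch.

```perl
use strict;
use warnings;

# Hypothetical helper: scan "After=..." / "Before=..." lines and record which
# target families (local-fs*, remote-fs*) the unit is ordered against.
sub ordering_families {
    my ($text) = @_;
    my %seen;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^(?:After|Before)=(.*)$/;
        my @units = split ' ', $1;
        for my $unit (@units) {
            $seen{local}  = 1 if $unit =~ /^local-fs(?:-pre)?\.target$/;
            $seen{remote} = 1 if $unit =~ /^remote-fs(?:-pre)?\.target$/;
        }
    }
    return \%seen;
}

# A unit ordered against both families is a candidate for the cycle.
sub mixes_local_and_remote {
    my ($text) = @_;
    my $seen = ordering_families($text);
    return ($seen->{local} && $seen->{remote}) ? 1 : 0;
}

# Ordering as quoted in the commit message (the expected, good case).
my $good = join("\n",
    'Before=umount.target remote-fs.target',
    'After=remote-fs-pre.target system.slice network.target network-online.target -.mount',
);

# Made-up example of the problematic case with local-fs targets mixed in.
my $bad = join("\n",
    'Before=umount.target remote-fs.target local-fs.target',
    'After=remote-fs-pre.target local-fs-pre.target network-online.target -.mount',
);

printf "good sample mixes families: %s\n", mixes_local_and_remote($good) ? 'yes' : 'no';
printf "bad sample mixes families:  %s\n", mixes_local_and_remote($bad)  ? 'yes' : 'no';
```

The input could be taken from something like
"systemctl show -p After -p Before <unit>", but that wiring is left
out here.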

The issue is that a unit must not have ordering dependencies on all of
local-fs, local-fs-pre, remote-fs and remote-fs-pre, as that produces
an ordering cycle. Systemd tries to resolve that cycle by randomly
dropping one constraint and retrying. With luck, the affected unit is
not an important one and all goes well. Most of the time one isn't
that lucky and something important gets dropped, for example:

> Jan 24 18:43:05 prod1 systemd[1]: sysinit.target: Found ordering cycle on systemd-timesyncd.service/stop
> Jan 24 18:43:05 prod1 systemd[1]: sysinit.target: Found dependency on systemd-tmpfiles-setup.service/stop
> Jan 24 18:43:05 prod1 systemd[1]: sysinit.target: Found dependency on local-fs.target/stop
> Jan 24 18:43:05 prod1 systemd[1]: sysinit.target: Found dependency on mnt-pve-cephfs.mount/stop
> Jan 24 18:43:05 prod1 systemd[1]: sysinit.target: Found dependency on remote-fs-pre.target/stop
> Jan 24 18:43:05 prod1 systemd[1]: sysinit.target: Found dependency on rbdmap.service/stop
> Jan 24 18:43:05 prod1 systemd[1]: sysinit.target: Found dependency on sysinit.target/stop
> Jan 24 18:43:05 prod1 systemd[1]: sysinit.target: Job remote-fs-pre.target/stop deleted to break ordering cycle starting with sysinit.target/stop
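
To find out which job was sacrificed on a given host, the cycle
messages can be condensed into the chain of involved units plus the
deleted job. This is only a sketch: the helper name is hypothetical,
and the sample lines are the ones from the log above with timestamp
and host name trimmed.

```perl
use strict;
use warnings;

# Sample lines taken from the log excerpt above (timestamp and host removed).
my @log = (
    'systemd[1]: sysinit.target: Found ordering cycle on systemd-timesyncd.service/stop',
    'systemd[1]: sysinit.target: Found dependency on systemd-tmpfiles-setup.service/stop',
    'systemd[1]: sysinit.target: Found dependency on local-fs.target/stop',
    'systemd[1]: sysinit.target: Found dependency on mnt-pve-cephfs.mount/stop',
    'systemd[1]: sysinit.target: Found dependency on remote-fs-pre.target/stop',
    'systemd[1]: sysinit.target: Found dependency on rbdmap.service/stop',
    'systemd[1]: sysinit.target: Found dependency on sysinit.target/stop',
    'systemd[1]: sysinit.target: Job remote-fs-pre.target/stop deleted to break ordering cycle starting with sysinit.target/stop',
);

# Hypothetical helper: collect the units named in the cycle report and the
# unit whose job systemd deleted to break the cycle.
sub summarize_cycle {
    my (@lines) = @_;
    my (@chain, $deleted);
    for my $line (@lines) {
        if ($line =~ m{Found (?:ordering cycle|dependency) on (\S+)/(?:start|stop)}) {
            push @chain, $1;
        } elsif ($line =~ m{Job (\S+)/(?:start|stop) deleted to break ordering cycle}) {
            $deleted = $1;
        }
    }
    return { chain => \@chain, deleted => $deleted };
}

my $res = summarize_cycle(@log);
print "cycle:   ", join(' -> ', @{$res->{chain}}), "\n";
print "dropped: ", ($res->{deleted} // 'none'), "\n";
```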

Then, most of the time the host reboot hangs for ~10 minutes, often
showing scapegoat units like pve-ha-lrm as the cause of the hang
(even if no HA is configured >.<).

This behavior is fixed in newer systemd versions, e.g., v244 from
buster-backports, but that is not a real option for us for now.

So until 7.0 we generate the mount unit with the correct dependencies
ourselves, directly in the ephemeral, tmpfs-backed /run/systemd/system
path, and start it.
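
As a standalone illustration of that approach, the sketch below
derives the unit file name from the mount point and renders a mount
unit with default dependencies switched off. It is not the code from
this patch: escape_path is a simplified approximation of systemd's
path escaping, the Conflicts= line is an assumption (it stands in for
what the default dependencies would otherwise provide), and the
server address and mount options are made-up sample values. Nothing
is written to /run and nothing is started.

```perl
use strict;
use warnings;

# Simplified path escaping for unit names (approximation, hypothetical helper).
sub escape_path {
    my ($path) = @_;
    $path =~ s{/+}{/}g;            # collapse repeated slashes
    $path =~ s{^/|/$}{}g;          # drop leading and trailing slash
    return '-' if $path eq '';     # the root directory maps to "-"
    # hex-escape everything outside the allowed set (this includes "-")
    $path =~ s{([^a-zA-Z0-9:_./])}{sprintf('\\x%02x', ord($1))}ge;
    $path =~ s{^\.}{\\x2e};        # a leading dot is escaped as well
    $path =~ s{/}{-}g;             # path separators become dashes
    return $path;
}

# Render the unit text; ordering lines follow the commit message and diff.
sub render_mount_unit {
    my ($where, $type, $what, $opts) = @_;
    return join("\n",
        '[Unit]',
        'DefaultDependencies=no',
        'Wants=network-online.target',
        'Conflicts=umount.target',    # assumption, see lead-in
        'Before=umount.target remote-fs.target',
        'After=systemd-journald.socket system.slice network.target -.mount remote-fs-pre.target network-online.target',
        '',
        '[Mount]',
        "Where=$where",
        "What=$what",
        "Type=$type",
        "Options=$opts",
        '',
    );
}

my $where   = '/mnt/pve/cephfs';
my $unit_fn = escape_path($where) . '.mount';

# Sample source and options are placeholders, not values from the patch.
print "# would be written to /run/systemd/system/$unit_fn\n";
print render_mount_unit($where, 'ceph', '192.0.2.10:/', 'name=admin');
```

Since /run is a tmpfs, a unit placed there does not survive a reboot,
which is what makes it suitable for a unit that is regenerated on
every mount.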

While FUSE gets only the local-fs ordering constraint, it seems to cope
very well with respect to such symptoms. But it _is_ racy and probably
only works because systemd stops it early, as it has hardly any
ordering constraints at all. It should be moved in the future
nonetheless; as there's a mount.fuse.ceph helper, that should not be
an issue.

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
 PVE/Storage/CephFSPlugin.pm | 60 +++++++++++++++++++++++++++++--------
 1 file changed, 47 insertions(+), 13 deletions(-)

diff --git a/PVE/Storage/CephFSPlugin.pm b/PVE/Storage/CephFSPlugin.pm
index dcf961c..bc1ca3e 100644
--- a/PVE/Storage/CephFSPlugin.pm
+++ b/PVE/Storage/CephFSPlugin.pm
@@ -7,7 +7,7 @@ use IO::File;
 use Net::IP;
 use File::Path;
-use PVE::Tools qw(run_command);
+use PVE::Tools qw(run_command file_set_contents);
 use PVE::ProcFSTools;
 use PVE::Storage::Plugin;
 use PVE::JSONSchema qw(get_standard_option);
@@ -37,6 +37,7 @@ sub cephfs_is_mounted {
     return undef;
 # FIXME: duplicate of api/diskmanage one, move to common helper (pve-common's
 #        Tools or Systemd ?)
 sub systemd_escape {
@@ -55,6 +56,41 @@ sub systemd_escape {
     return $val;
+# FIXME: remove in PVE 7.0 where systemd is recent enough to not have those
+#        local-fs/remote-fs dependency cycles generated for _netdev mounts...
+sub systemd_netmount {
+    my ($where, $type, $what, $opts) = @_;
+# don't do default deps, systemd v241 generator produces ordering deps on both
+# local-fs(-pre) and remote-fs(-pre) targets if we use the required _netdev
+# option. Over three corners this gets us an ordering cycle on shutdown, which
+# may make shutdown hang if the random cycle breaking hits the "wrong" unit to
+# delete.
+    my $unit =  <<"EOF";
+Before=umount.target remote-fs.target
+After=systemd-journald.socket system.slice network.target -.mount remote-fs-pre.target network-online.target
+    my $unit_fn = systemd_escape($where, 1) . ".mount";
+    my $unit_path = "/run/systemd/system/$unit_fn";
+    file_set_contents($unit_path, $unit);
+    run_command(['systemctl', 'start', $unit_fn], errmsg => "mount error");
 sub cephfs_mount {
     my ($scfg, $storeid) = @_;
@@ -77,22 +113,20 @@ sub cephfs_mount {
 	    push @$cmd, '-r', $subdir if !($subdir =~ m|^/$|);
 	    push @$cmd, $mountpoint;
 	    push @$cmd, '--conf', $configfile if defined($configfile);
+	    if ($scfg->{options}) {
+		push @$cmd, '-o', $scfg->{options};
+	    }
+	    run_command($cmd, errmsg => "mount error");
     } else {
 	my $source = "$server:$subdir";
-	$cmd = ['/bin/mount', '-t', 'ceph', $source, $mountpoint, '-o', "name=$cmd_option->{userid}"];
-	push @$cmd, '-o', "secretfile=$secretfile" if defined($secretfile);
+	my @opts = ( "name=$cmd_option->{userid}" );
+	push @opts, "secretfile=$secretfile" if defined($secretfile);
+	push @opts, $scfg->{options} if $scfg->{options};
-	# tell systemd that we're network dependent, else it umounts us to late
-	# on shutdown, when we couldn't connect to the active MDS and thus
-	# unmount hangs and delays shutdown/reboot (man systemd.mount).
-	push @$cmd, '-o', '_netdev';
+	systemd_netmount($mountpoint, 'ceph', $source, join(',', @opts));
-    if ($scfg->{options}) {
-	push @$cmd, '-o', $scfg->{options};
-    }
-    run_command($cmd, errmsg => "mount error");
 # Configuration
