[pve-devel] [PATCH cluster v4] cfs_lock: address race where alarm triggers with lock accquired
Thomas Lamprecht
t.lamprecht at proxmox.com
Wed Nov 8 14:57:23 CET 2017
As mkdir can possibly hang forever we need to enforce a timeout on
it. But this was made in such a way so that a small time window
existed where the lock could be acquired successfully but the alarm
triggered still, leaving around an unused lock for 120 seconds.
Wrap only the mkdir call itself in an alarm and save its result
directly in a $got_lock variable, this minimizes the window as far as
possible from the perl side.
This is also easier to track for humans reading the code and should
cope better against code changes, e.g., it does not breaks just if an
error message typo got corrected a few lines above.
---
changes v3 -> v4:
* do not calculate the remaining timeout manually, simply reuse the
return value from alarm
data/PVE/Cluster.pm | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/data/PVE/Cluster.pm b/data/PVE/Cluster.pm
index 70ce250..20d74d2 100644
--- a/data/PVE/Cluster.pm
+++ b/data/PVE/Cluster.pm
@@ -861,6 +861,7 @@ my $cfs_lock = sub {
my ($lockid, $timeout, $code, @param) = @_;
my $res;
+ my $got_lock = 0;
# this timeout is for aquire the lock
$timeout = 10 if !$timeout;
@@ -877,22 +878,23 @@ my $cfs_lock = sub {
die "$msg: pve cluster filesystem not online.\n";
}
- local $SIG{ALRM} = sub { die "got lock request timeout\n"; };
+ my $timeout_err = sub { die "$msg: got lock request timeout\n"; };
+ local $SIG{ALRM} = $timeout_err;
- alarm ($timeout);
+ while (1) {
+ alarm ($timeout);
+ $got_lock = mkdir($filename);
+ $timeout = alarm(0);
+
+ last if $got_lock;
+
+ $timeout_err->() if $timeout == 0;
- if (!(mkdir $filename)) {
print STDERR "trying to aquire cfs lock '$lockid' ...";
- while (1) {
- if (!(mkdir $filename)) {
- (utime 0, 0, $filename); # cfs unlock request
- } else {
- print STDERR " OK\n";
- last;
- }
- sleep(1);
- }
+ utime (0, 0, $filename); # cfs unlock request
+ sleep(1);
}
+ print STDERR " OK\n";
# fixed command timeout: cfs locks have a timeout of 120
# using 60 gives us another 60 seconds to abort the task
--
2.11.0
More information about the pve-devel
mailing list