[pve-devel] [PATCH pve-kernel] Backport rbd patches to fix lock issues

Mon Sep 6 11:22:03 CEST 2021

Hello,

We had a strange locking issue, described in 
https://tracker.ceph.com/issues/42757

We rebuild the pve-kernel with the two specified patches: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8798d070d416d18a75770fc19787e96705073f43 
and 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed9eb71085ecb7ded9a5118cec2ab70667cc7350 


This fixed our issue and we didn't have any virtual machines randomly 
stuck anymore.

We do think it's probably useful to apply the patch for everybody to 
avoid funny locking issue that could happens with bad network conditions.

This is a patch for pve-kernel with the two patches.

I do hope it's the correct way to submit the change :)

Have a nice day,

---
  patches/kernel/0010-rbd-always-kick.patch | 76 +++++++++++++++++++++
  patches/kernel/0011-rbd-dont-hold.patch   | 80 +++++++++++++++++++++++
  2 files changed, 156 insertions(+)
  create mode 100644 patches/kernel/0010-rbd-always-kick.patch
  create mode 100644 patches/kernel/0011-rbd-dont-hold.patch

diff --git a/patches/kernel/0010-rbd-always-kick.patch 
b/patches/kernel/0010-rbd-always-kick.patch
new file mode 100644
index 0000000..798bc70
--- /dev/null
+++ b/patches/kernel/0010-rbd-always-kick.patch
@@ -0,0 +1,76 @@
+From ba378b796088f5660050d2f2d1e0dcf555a080e6 Mon Sep 17 00:00:00 2001
+From: Ilya Dryomov <idryomov at gmail.com>
+Date: Sat, 3 Jul 2021 11:56:55 +0200
+Subject: rbd: always kick acquire on "acquired" and "released" 
notifications
+
+commit 8798d070d416d18a75770fc19787e96705073f43 upstream.
+
+Skipping the "lock has been released" notification if the lock owner
+is not what we expect based on owner_cid can lead to I/O hangs.
+One example is our own notifications: because owner_cid is cleared
+in rbd_unlock(), when we get our own notification it is processed as
+unexpected/duplicate and maybe_kick_acquire() isn't called.  If a peer
+that requested the lock then doesn't go through with acquiring it,
+I/O requests that came in while the lock was being quiesced would
+be stalled until another I/O request is submitted and kicks acquire
+from rbd_img_exclusive_lock().
+
+This makes the comment in rbd_release_lock() actually true: prior to
+this change the canceled work was being requeued in response to the
+"lock has been acquired" notification from rbd_handle_acquired_lock().
+
+Cc: stable at vger.kernel.org # 5.3+
+Signed-off-by: Ilya Dryomov <idryomov at gmail.com>
+Tested-by: Robin Geuze <robin.geuze at nl.team.blue>
+Signed-off-by: Greg Kroah-Hartman <gregkh at linuxfoundation.org>
+---
+ drivers/block/rbd.c | 20 +++++++-------------
+ 1 file changed, 7 insertions(+), 13 deletions(-)
+
+(limited to 'drivers/block/rbd.c')
+
+diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
+index ca509a7a164ab..55745d6953f02 100644
+--- a/drivers/block/rbd.c
++++ b/drivers/block/rbd.c
+@@ -4340,15 +4340,11 @@ static void rbd_handle_acquired_lock(struct 
rbd_device *rbd_dev, u8 struct_v,
+       if (!rbd_cid_equal(&cid, &rbd_empty_cid)) {
+               down_write(&rbd_dev->lock_rwsem);
+               if (rbd_cid_equal(&cid, &rbd_dev->owner_cid)) {
+-                      /*
+-                       * we already know that the remote client is
+-                       * the owner
+-                       */
+-                      up_write(&rbd_dev->lock_rwsem);
+-                      return;
++                      dout("%s rbd_dev %p cid %llu-%llu == owner_cid\n",
++                           __func__, rbd_dev, cid.gid, cid.handle);
++              } else {
++                      rbd_set_owner_cid(rbd_dev, &cid);
+               }
+-
+-              rbd_set_owner_cid(rbd_dev, &cid);
+               downgrade_write(&rbd_dev->lock_rwsem);
+       } else {
+               down_read(&rbd_dev->lock_rwsem);
+@@ -4373,14 +4369,12 @@ static void rbd_handle_released_lock(struct 
rbd_device *rbd_dev, u8 struct_v,
+       if (!rbd_cid_equal(&cid, &rbd_empty_cid)) {
+               down_write(&rbd_dev->lock_rwsem);
+               if (!rbd_cid_equal(&cid, &rbd_dev->owner_cid)) {
+-                      dout("%s rbd_dev %p unexpected owner, cid 
%llu-%llu != owner_cid %llu-%llu\n",
++                      dout("%s rbd_dev %p cid %llu-%llu != owner_cid 
%llu-%llu\n",
+                            __func__, rbd_dev, cid.gid, cid.handle,
+                            rbd_dev->owner_cid.gid, 
rbd_dev->owner_cid.handle);
+-                      up_write(&rbd_dev->lock_rwsem);
+-                      return;
++              } else {
++                      rbd_set_owner_cid(rbd_dev, &rbd_empty_cid);
+               }
+-
+-              rbd_set_owner_cid(rbd_dev, &rbd_empty_cid);
+               downgrade_write(&rbd_dev->lock_rwsem);
+       } else {
+               down_read(&rbd_dev->lock_rwsem);
+--
+cgit 1.2.3-1.el7
+
diff --git a/patches/kernel/0011-rbd-dont-hold.patch 
b/patches/kernel/0011-rbd-dont-hold.patch
new file mode 100644
index 0000000..5421ef8
--- /dev/null
+++ b/patches/kernel/0011-rbd-dont-hold.patch
@@ -0,0 +1,80 @@
+From 13066d6628f04194fea7c2b6a497ebc9d1d0df5f Mon Sep 17 00:00:00 2001
+From: Ilya Dryomov <idryomov at gmail.com>
+Date: Sat, 3 Jul 2021 11:31:26 +0200
+Subject: rbd: don't hold lock_rwsem while running_list is being drained
+
+commit ed9eb71085ecb7ded9a5118cec2ab70667cc7350 upstream.
+
+Currently rbd_quiesce_lock() holds lock_rwsem for read while blocking
+on releasing_wait completion.  On the I/O completion side, each image
+request also needs to take lock_rwsem for read.  Because rw_semaphore
+implementation doesn't allow new readers after a writer has indicated
+interest in the lock, this can result in a deadlock if something that
+needs to take lock_rwsem for write gets involved.  For example:
+
+1. watch error occurs
+2. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
+   releases lock_rwsem
+3. after reestablishing the watch, rbd_reregister_watch() takes
+   lock_rwsem for write and calls rbd_reacquire_lock()
+4. rbd_quiesce_lock() downgrades lock_rwsem to for read and blocks on
+   releasing_wait until running_list becomes empty
+5. another watch error occurs
+6. rbd_watch_errcb() blocks trying to take lock_rwsem for write
+7. no in-flight image request can complete and delete itself from
+   running_list because lock_rwsem won't be granted anymore
+
+A similar scenario can occur with "lock has been acquired" and "lock
+has been released" notification handers which also take lock_rwsem for
+write to update owner_cid.
+
+We don't actually get anything useful from sitting on lock_rwsem in
+rbd_quiesce_lock() -- owner_cid updates certainly don't need to be
+synchronized with.  In fact the whole owner_cid tracking logic could
+probably be removed from the kernel client because we don't support
+proxied maintenance operations.
+
+Cc: stable at vger.kernel.org # 5.3+
+URL: https://tracker.ceph.com/issues/42757
+Signed-off-by: Ilya Dryomov <idryomov at gmail.com>
+Tested-by: Robin Geuze <robin.geuze at nl.team.blue>
+Signed-off-by: Greg Kroah-Hartman <gregkh at linuxfoundation.org>
+---
+ drivers/block/rbd.c | 12 +++++-------
+ 1 file changed, 5 insertions(+), 7 deletions(-)
+
+(limited to 'drivers/block/rbd.c')
+
+diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
+index bf2f0373a3b2f..ca509a7a164ab 100644
+--- a/drivers/block/rbd.c
++++ b/drivers/block/rbd.c
+@@ -4239,8 +4239,6 @@ again:
+
+ static bool rbd_quiesce_lock(struct rbd_device *rbd_dev)
+ {
+-      bool need_wait;
+-
+       dout("%s rbd_dev %p\n", __func__, rbd_dev);
+       lockdep_assert_held_write(&rbd_dev->lock_rwsem);
+
+@@ -4252,11 +4250,11 @@ static bool rbd_quiesce_lock(struct rbd_device 
*rbd_dev)
+        */
+       rbd_dev->lock_state = RBD_LOCK_STATE_RELEASING;
+ rbd_assert(!completion_done(&rbd_dev->releasing_wait));
+-      need_wait = !list_empty(&rbd_dev->running_list);
+-      downgrade_write(&rbd_dev->lock_rwsem);
+-      if (need_wait)
+- wait_for_completion(&rbd_dev->releasing_wait);
+-      up_read(&rbd_dev->lock_rwsem);
++      if (list_empty(&rbd_dev->running_list))
++              return true;
++
++      up_write(&rbd_dev->lock_rwsem);
++      wait_for_completion(&rbd_dev->releasing_wait);
+
+       down_write(&rbd_dev->lock_rwsem);
+       if (rbd_dev->lock_state != RBD_LOCK_STATE_RELEASING)
+--
+cgit 1.2.3-1.el7
+
-- 
2.20.1