[PVE-User] Backup of one VM always fails

Arjen leesteken at protonmail.ch
Fri Dec 4 11:36:25 CET 2020


On Fri, 2020-12-04 at 11:22 +0100, Frank Thommen wrote:
> 
> On 04/12/2020 09:30, Frank Thommen wrote:
> > > On Thursday, December 3, 2020 10:16 PM, Frank Thommen
> > > <f.thommen at dkfz-heidelberg.de> wrote:
> > > 
> > > > 
> > > > Dear all,
> > > > 
> > > > on our PVE cluster, the backup of one specific VM always fails
> > > > (which worries us, as it is our GitLab instance). The general
> > > > backup plan is "back up all VMs at 00:30". In the confirmation
> > > > email we see that the backup of this specific VM takes six to
> > > > seven hours and then fails. The error message in the overview
> > > > table used to be:
> > > > 
> > > > vma_queue_write: write error - Broken pipe
> > > > 
> > > > With detailed log
> > > > 
> > > > -------------------------------------------------------------
> > > > 
> > > > 
> > > > 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
> > > > 123: 2020-12-01 02:53:08 INFO: status = running
> > > > 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
> > > > 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
> > > > 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0'
> > > > 'ceph-rbd:vm-123-disk-0' 20G
> > > > 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1'
> > > > 'ceph-rbd:vm-123-disk-2' 1000G
> > > > 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2'
> > > > 'ceph-rbd:vm-123-disk-3' 2T
> > > > 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
> > > > 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
> > > > 123: 2020-12-01 02:53:09 INFO: creating archive
> > > > '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-
> > > > 02_53_08.vma.lzo'
> > > > 123: 2020-12-01 02:53:09 INFO: started backup task
> > > > 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
> > > > 123: 2020-12-01 02:53:12 INFO: status: 0%
> > > > (167772160/3294239916032),
> > > > sparse 0% (31563776), duration 3, read/write 55/45 MB/s
> > > > [... etc. etc. ...]
> > > > 123: 2020-12-01 09:42:14 INFO: status: 35%
> > > > (1170252365824/3294239916032), sparse 0% (26845003776),
> > > > duration 24545,
> > > > read/write 59/56 MB/s
> > > > 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error -
> > > > Broken
> > > > pipe
> > > > 123: 2020-12-01 09:42:14 INFO: aborting backup job
> > > > 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed -
> > > > vma_queue_write: write error - Broken pipe
> > > > 
> > > > -------------------------------------------------------------
> > > > 
> > > > Since the recent upgrade to the newest PVE release, the error is
> > > > 
> > > > VM 123 qmp command 'query-backup' failed - got timeout
> > > > 
> > > > with log
> > > > 
> > > > -------------------------------------------------------------
> > > > 
> > > > 
> > > > 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
> > > > 123: 2020-12-03 03:29:00 INFO: status = running
> > > > 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
> > > > 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0'
> > > > 'ceph-rbd:vm-123-disk-0' 20G
> > > > 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1'
> > > > 'ceph-rbd:vm-123-disk-2' 1000G
> > > > 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2'
> > > > 'ceph-rbd:vm-123-disk-3' 2T
> > > > 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
> > > > 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
> > > > 123: 2020-12-03 03:29:01 INFO: creating vzdump archive
> > > > '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-
> > > > 03_29_00.vma.lzo'
> > > > 123: 2020-12-03 03:29:01 INFO: started backup task
> > > > 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
> > > > 123: 2020-12-03 03:29:01 INFO: resuming VM again
> > > > 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s,
> > > > read:
> > > > 94.7 MiB/s, write: 51.7 MiB/s
> > > > [... etc. etc. ...]
> > > > 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h
> > > > 36m 7s,
> > > > read: 57.3 MiB/s, write: 53.6 MiB/s
> > > > 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command
> > > > 'query-backup' failed - got timeout
> > > > 123: 2020-12-03 09:22:57 INFO: aborting backup job
> > > > 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command
> > > > 'backup-cancel' failed - unable to connect to VM 123 qmp socket -
> > > > timeout after 5981 retries
> > > > 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123
> > > > qmp command 'query-backup' failed - got timeout
> > > > 
> > > > 
> > > > The VM has some fairly large vdisks (20G, 1T and 2T), all stored
> > > > in Ceph. There is still plenty of free space in Ceph.
> > > > 
> > > > Can anyone give us some hint on how to investigate and debug
> > > > this
> > > > further?
> > > 
> > > Because it is a write error, maybe we should look at the backup
> > > destination.
> > > Maybe it is a network connection issue? Maybe something wrong
> > > with the
> > > host? Maybe the disk is full?
> > > Which storage are you using for backup? Can you show us the
> > > corresponding entry in /etc/pve/storage.cfg?
> > 
> > We are backing up to cephfs with still 8 TB or so free.
> > 
> > /etc/pve/storage.cfg is
> > ------------
> > dir: local
> >          path /var/lib/vz
> >          content vztmpl,backup,iso
> > 
> > dir: data
> >          path /data
> >          content snippets,images,backup,iso,rootdir,vztmpl
> > 
> > cephfs: cephfs
> >          path /mnt/pve/cephfs
> >          content backup,vztmpl,iso
> >          maxfiles 5
> > 
> > rbd: ceph-rbd
> >          content images,rootdir
> >          krbd 0
> >          pool pve-pool1
> > ------------
> > 
> 
> The problem has reached a new level of urgency: for the past two
> days, the VM becomes inaccessible after each failed backup and has
> to be stopped and started manually from the PVE UI.

I don't see anything wrong with the configuration that you shared.
Was anything changed in the last few days since the last successful
backup? Any updates from Proxmox? Changes to the network?
I know very little about Ceph and clusters, sorry.
What makes this VM different, except for the size of the disks?
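
To put the disk size in perspective, here is a rough back-of-the-envelope
sketch (not a Proxmox tool). The function name and the regular expression
are made up for illustration, and they assume the older status-line format
quoted above, with the wrapped log line joined back into one line. It
extrapolates the total run time from the last progress line before the
failure:

```python
import re

# Assumed format (older vzdump progress output, as quoted in this thread):
# "status: 35% (<bytes done>/<bytes total>), sparse ..., duration <seconds>, ..."
STATUS_RE = re.compile(r"status: \d+% \((\d+)/(\d+)\), .*?duration (\d+)")


def estimate_total_seconds(status_line: str) -> float:
    """Extrapolate the full backup duration from one progress line,
    assuming the average transfer rate so far stays constant."""
    match = STATUS_RE.search(status_line)
    if match is None:
        raise ValueError("line is not in the expected status format")
    done, total, elapsed = (int(group) for group in match.groups())
    if done == 0:
        raise ValueError("no data transferred yet, cannot extrapolate")
    return elapsed * total / done


# Last progress line before the first failure, joined into a single line.
line = (
    "INFO: status: 35% (1170252365824/3294239916032), "
    "sparse 0% (26845003776), duration 24545, read/write 59/56 MB/s"
)
hours = estimate_total_seconds(line) / 3600
print(f"estimated full run: about {hours:.1f} hours")
```

At the pace seen in the log, a complete run would take roughly 19 hours,
so the job would still be running long after the 00:30 start. Whether that
has anything to do with the failure is only a guess on my part.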

best regards, Arjen




