[PVE-User] Backup of one VM always fails

Eneko Lacunza elacunza at binovo.es
Mon Dec 14 16:12:59 CET 2020


Hi all,

Sorry for joining the thread late.

We've also been having backup issues with some VMs for a few weeks now.

We see a different log though:

INFO: Starting Backup of VM 176 (qemu)
INFO: Backup started at 2020-12-12 00:27:44
INFO: status = running
INFO: VM Name: buildbot-slave-odoo01-d10
INFO: include disk 'scsi0' 'proxmox_r3_ssd:vm-176-disk-0' 10G
INFO: include disk 'scsi1' 'proxmox_r3_ssd:vm-176-disk-1' 45G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/txoritokieta-proxmox/dump/vzdump-qemu-176-2020_12_12-00_27_44.vma.lzo'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '95fde1e6-5349-41ef-ad82-a271ac34556e'
INFO: resuming VM again
INFO: 0% (524.0 MiB of 55.0 GiB) in 3s, read: 174.7 MiB/s, write: 162.8 MiB/s
INFO: 1% (1.1 GiB of 55.0 GiB) in 6s, read: 184.0 MiB/s, write: 181.3 MiB/s
INFO: 2% (1.6 GiB of 55.0 GiB) in 9s, read: 174.7 MiB/s, write: 173.2 MiB/s
INFO: 3% (2.1 GiB of 55.0 GiB) in 12s, read: 174.8 MiB/s, write: 173.6 MiB/s
INFO: 4% (2.6 GiB of 55.0 GiB) in 15s, read: 179.9 MiB/s, write: 169.0 MiB/s
INFO: 5% (3.1 GiB of 55.0 GiB) in 18s, read: 185.3 MiB/s, write: 184.0 MiB/s
INFO: 6% (3.7 GiB of 55.0 GiB) in 21s, read: 188.0 MiB/s, write: 186.7 MiB/s
INFO: 7% (3.9 GiB of 55.0 GiB) in 24s, read: 59.4 MiB/s, write: 59.4 MiB/s
INFO: 8% (4.4 GiB of 55.0 GiB) in 1m 52s, read: 6.5 MiB/s, write: 6.2 MiB/s
INFO: 9% (5.0 GiB of 55.0 GiB) in 1m 55s, read: 181.0 MiB/s, write: 181.0 MiB/s
INFO: 10% (5.7 GiB of 55.0 GiB) in 1m 59s, read: 183.2 MiB/s, write: 183.2 MiB/s
INFO: 11% (6.2 GiB of 55.0 GiB) in 2m 2s, read: 181.3 MiB/s, write: 171.6 MiB/s
INFO: 12% (6.7 GiB of 55.0 GiB) in 2m 5s, read: 177.3 MiB/s, write: 177.3 MiB/s
INFO: 13% (7.3 GiB of 55.0 GiB) in 2m 8s, read: 185.3 MiB/s, write: 185.3 MiB/s
INFO: 14% (7.8 GiB of 55.0 GiB) in 2m 11s, read: 185.3 MiB/s, write: 185.3 MiB/s
INFO: 15% (8.3 GiB of 55.0 GiB) in 2m 14s, read: 178.7 MiB/s, write: 169.7 MiB/s
INFO: 16% (9.0 GiB of 55.0 GiB) in 3m 28s, read: 8.6 MiB/s, write: 8.6 MiB/s
INFO: 17% (9.5 GiB of 55.0 GiB) in 3m 31s, read: 184.1 MiB/s, write: 184.1 MiB/s
INFO: 18% (10.0 GiB of 55.0 GiB) in 3m 34s, read: 186.7 MiB/s, write: 178.7 MiB/s
INFO: 19% (10.6 GiB of 55.0 GiB) in 3m 37s, read: 178.7 MiB/s, write: 177.3 MiB/s
INFO: 20% (11.1 GiB of 55.0 GiB) in 3m 40s, read: 186.7 MiB/s, write: 186.7 MiB/s
INFO: 21% (11.6 GiB of 55.0 GiB) in 3m 43s, read: 174.7 MiB/s, write: 174.7 MiB/s
INFO: 22% (12.1 GiB of 55.0 GiB) in 3m 46s, read: 173.3 MiB/s, write: 172.4 MiB/s
INFO: 23% (12.7 GiB of 55.0 GiB) in 3m 49s, read: 178.7 MiB/s, write: 178.7 MiB/s
INFO: 24% (13.2 GiB of 55.0 GiB) in 6m 28s, read: 3.7 MiB/s, write: 3.7 MiB/s
INFO: 25% (13.8 GiB of 55.0 GiB) in 6m 31s, read: 186.7 MiB/s, write: 186.7 MiB/s
INFO: 26% (14.4 GiB of 55.0 GiB) in 6m 50s, read: 30.9 MiB/s, write: 30.9 MiB/s
INFO: 27% (15.0 GiB of 55.0 GiB) in 6m 54s, read: 167.0 MiB/s, write: 167.0 MiB/s
INFO: 28% (15.6 GiB of 55.0 GiB) in 6m 57s, read: 186.8 MiB/s, write: 186.8 MiB/s
INFO: 29% (16.1 GiB of 55.0 GiB) in 7m 0s, read: 183.7 MiB/s, write: 180.5 MiB/s
INFO: 30% (16.5 GiB of 55.0 GiB) in 8m 0s, read: 7.3 MiB/s, write: 7.3 MiB/s
INFO: 31% (17.1 GiB of 55.0 GiB) in 8m 20s, read: 30.9 MiB/s, write: 30.9 MiB/s
INFO: 32% (17.8 GiB of 55.0 GiB) in 8m 24s, read: 165.5 MiB/s, write: 165.4 MiB/s
INFO: 33% (18.3 GiB of 55.0 GiB) in 8m 27s, read: 169.7 MiB/s, write: 164.1 MiB/s
INFO: 34% (18.7 GiB of 55.0 GiB) in 8m 30s, read: 153.0 MiB/s, write: 153.0 MiB/s
INFO: 35% (19.4 GiB of 55.0 GiB) in 8m 34s, read: 168.9 MiB/s, write: 168.9 MiB/s
INFO: 36% (20.0 GiB of 55.0 GiB) in 9m 54s, read: 7.7 MiB/s, write: 7.7 MiB/s
INFO: 37% (20.5 GiB of 55.0 GiB) in 9m 57s, read: 171.7 MiB/s, write: 171.2 MiB/s
INFO: 38% (21.0 GiB of 55.0 GiB) in 10m 0s, read: 167.0 MiB/s, write: 167.0 MiB/s
INFO: 39% (21.5 GiB of 55.0 GiB) in 10m 3s, read: 185.3 MiB/s, write: 185.3 MiB/s
INFO: 40% (22.0 GiB of 55.0 GiB) in 10m 6s, read: 184.0 MiB/s, write: 183.5 MiB/s
INFO: 41% (22.7 GiB of 55.0 GiB) in 11m 6s, read: 11.3 MiB/s, write: 11.3 MiB/s
INFO: 42% (23.2 GiB of 55.0 GiB) in 11m 9s, read: 180.0 MiB/s, write: 180.0 MiB/s
INFO: 43% (23.8 GiB of 55.0 GiB) in 11m 13s, read: 136.0 MiB/s, write: 136.0 MiB/s
INFO: 44% (24.3 GiB of 55.0 GiB) in 11m 16s, read: 174.7 MiB/s, write: 174.7 MiB/s
INFO: 45% (24.8 GiB of 55.0 GiB) in 11m 19s, read: 178.7 MiB/s, write: 178.7 MiB/s
INFO: 46% (25.3 GiB of 55.0 GiB) in 11m 22s, read: 185.0 MiB/s, write: 184.0 MiB/s
INFO: 47% (25.9 GiB of 55.0 GiB) in 11m 25s, read: 188.3 MiB/s, write: 188.3 MiB/s
INFO: 48% (26.4 GiB of 55.0 GiB) in 11m 28s, read: 186.7 MiB/s, write: 178.5 MiB/s
INFO: 49% (27.0 GiB of 55.0 GiB) in 11m 31s, read: 177.3 MiB/s, write: 177.3 MiB/s
lzop: Input/output error: <stdout>
Warning: unable to close filehandle GEN6465 properly: Input/output error at /usr/share/perl5/PVE/VZDump/QemuServer.pm line 752.
ERROR: VM 176 qmp command 'query-backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 176 qmp command 'backup-cancel' failed - unable to connect to VM 176 qmp socket - timeout after 5986 retries
ERROR: Backup of VM 176 failed - VM 176 qmp command 'query-backup' failed - got timeout

This was last Saturday; we had to stop/start the VM this morning. We back
up this VM on Saturdays; it has hung on Nov 14th, Nov 21st, Nov 28th and
Dec 12th. It "strangely" skipped hanging on Dec 5th :-)

The VM is on SSD-backed Ceph storage and runs Debian 10. We have other
almost identical VMs (various buildbot workers) that are not affected...
The host is a Ryzen 1700.
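For what it's worth, the periodic stalls visible in the log above (write rate dropping from ~180 MiB/s to under 10 MiB/s roughly once a minute) can be pulled out with a few lines of Python. This is only a sketch; the regex assumes the vzdump progress-line format shown in the log:

```python
import re

# Matches vzdump progress lines of the form seen above, e.g.
# "INFO: 8% (4.4 GiB of 55.0 GiB) in 1m 52s, read: 6.5 MiB/s, write: 6.2 MiB/s"
LINE = re.compile(
    r"INFO: (\d+)% .* in (?:(\d+)m )?(\d+)s, "
    r"read: ([\d.]+) MiB/s, write: ([\d.]+) MiB/s"
)

def stalls(log_text, threshold_mib_s=20.0):
    """Return (percent, elapsed_seconds, write_MiB_s) for slow entries."""
    out = []
    for m in LINE.finditer(log_text):
        pct = int(m.group(1))
        secs = int(m.group(2) or 0) * 60 + int(m.group(3))
        write = float(m.group(5))
        if write < threshold_mib_s:
            out.append((pct, secs, write))
    return out

log = """INFO: 7% (3.9 GiB of 55.0 GiB) in 24s, read: 59.4 MiB/s, write: 59.4 MiB/s
INFO: 8% (4.4 GiB of 55.0 GiB) in 1m 52s, read: 6.5 MiB/s, write: 6.2 MiB/s
INFO: 9% (5.0 GiB of 55.0 GiB) in 1m 55s, read: 181.0 MiB/s, write: 181.0 MiB/s"""

print(stalls(log))  # flags the stalled 8% entry
```

Run against the full log, this makes the once-a-minute stall pattern easy to correlate with Ceph or network activity on the host.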

# pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-1
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-3
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-15
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2
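Since the log shows the guest-agent fs-freeze/fs-thaw step right before the backup task starts, one thing worth trying before the next scheduled run is exercising the agent by hand. A sketch for the VM above (VMID 176 is taken from the log; run this on the PVE node hosting the VM; the `qm agent` subcommands are the standard guest-agent commands):

```shell
# Exercise the QEMU guest agent of VM 176 by hand to rule out an
# fs-freeze hang (backups that stall around 'fs-freeze' often point here).
if command -v qm >/dev/null 2>&1; then
    qm agent 176 ping              # should return almost immediately
    qm agent 176 fsfreeze-freeze   # if this hangs, vzdump's fs-freeze will too
    qm agent 176 fsfreeze-status
    qm agent 176 fsfreeze-thaw     # always thaw again promptly
    result="agent exercised"
else
    result="qm not found - run this on a Proxmox VE node"
fi
echo "$result"
```

If fsfreeze-freeze blocks here, the problem is inside the guest (a wedged filesystem or a busy database), not in vzdump itself.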

El 4/12/20 a las 15:20, Frank Thommen escribió:
> Lots of clicking to configure the other 30 VMs, but yes, that's 
> probably the most appropriate thing to do now :-)  I have to arrange 
> for a free NFS share first, though, as there is no free local disk 
> space for 3+ TB of backups...
>
>
> On 04/12/2020 15:00, Yannis Milios wrote:
>> Can you try removing this specific VM from the normal backup schedule
>> and then create a new test schedule for it, if possible to a different
>> backup target (NFS, local, etc.)?
>>
>>
>>
>>> On Fri, 4 Dec 2020 at 11:10, Frank Thommen <f.thommen at dkfz-heidelberg.de> wrote:
>>
>>> On 04/12/2020 11:36, Arjen via pve-user wrote:
>>>> On Fri, 2020-12-04 at 11:22 +0100, Frank Thommen wrote:
>>>>>
>>>>> On 04/12/2020 09:30, Frank Thommen wrote:
>>>>>>> On Thursday, December 3, 2020 10:16 PM, Frank Thommen
>>>>>>> <f.thommen at dkfz-heidelberg.de> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> on our PVE cluster, the backup of a specific VM always fails
>>>>>>>> (which worries us, as it is our GitLab instance). The general
>>>>>>>> backup plan is "back up all VMs at 00:30". In the confirmation
>>>>>>>> email we see that the backup of this specific VM takes six to
>>>>>>>> seven hours and then fails. The error message in the overview
>>>>>>>> table used to be:
>>>>>>>> The error message in the overview table used to be:
>>>>>>>>
>>>>>>>> vma_queue_write: write error - Broken pipe
>>>>>>>>
>>>>>>>> With detailed log
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
>>>>>>>> 123: 2020-12-01 02:53:08 INFO: status = running
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: creating archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: started backup task 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
>>>>>>>> 123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032), sparse 0% (31563776), duration 3, read/write 55/45 MB/s
>>>>>>>> [... etc. etc. ...]
>>>>>>>> 123: 2020-12-01 09:42:14 INFO: status: 35% (1170252365824/3294239916032), sparse 0% (26845003776), duration 24545, read/write 59/56 MB/s
>>>>>>>> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken pipe
>>>>>>>> 123: 2020-12-01 09:42:14 INFO: aborting backup job
>>>>>>>> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed - vma_queue_write: write error - Broken pipe
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------
>>>>>>
>>>>>>>> Since the recent upgrade to the newest PVE release it's
>>>>>>>>
>>>>>>>> VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>>
>>>>>>>> with log
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: status = running
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: started backup task 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: resuming VM again
>>>>>>>> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read: 94.7 MiB/s, write: 51.7 MiB/s
>>>>>>>> [... etc. etc. ...]
>>>>>>>> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s, read: 57.3 MiB/s, write: 53.6 MiB/s
>>>>>>>> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>> 123: 2020-12-03 09:22:57 INFO: aborting backup job
>>>>>>>> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel' failed - unable to connect to VM 123 qmp socket - timeout after 5981 retries
>>>>>>>> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>>
>>>>>>>>
>>>>>>>> The VM has some quite big vdisks (20G, 1T and 2T). All stored
>>>>>>>> in Ceph.
>>>>>>>> There is still plenty of space in Ceph.
>>>>>>>>
>>>>>>>> Can anyone give us some hint on how to investigate and debug
>>>>>>>> this
>>>>>>>> further?
>>>>>>>
>>>>>>> Because it is a write error, maybe we should look at the backup
>>>>>>> destination. Maybe it is a network connection issue? Maybe something
>>>>>>> is wrong with the host? Maybe the disk is full?
>>>>>>> Which storage are you using for backup? Can you show us the
>>>>>>> corresponding entry in /etc/pve/storage.cfg?
>>>>>>
>>>>>> We are backing up to cephfs with still 8 TB or so free.
>>>>>>
>>>>>> /etc/pve/storage.cfg is
>>>>>> ------------
>>>>>> dir: local
>>>>>>           path /var/lib/vz
>>>>>>           content vztmpl,backup,iso
>>>>>>
>>>>>> dir: data
>>>>>>           path /data
>>>>>>           content snippets,images,backup,iso,rootdir,vztmpl
>>>>>>
>>>>>> cephfs: cephfs
>>>>>>           path /mnt/pve/cephfs
>>>>>>           content backup,vztmpl,iso
>>>>>>           maxfiles 5
>>>>>>
>>>>>> rbd: ceph-rbd
>>>>>>           content images,rootdir
>>>>>>           krbd 0
>>>>>>           pool pve-pool1
>>>>>> ------------
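Following up on the suggestion earlier in the thread to try a different backup target: a one-off test target could be declared alongside the entries above in /etc/pve/storage.cfg. This is only a sketch; the storage ID, server address (a documentation address here) and export path are placeholders to adapt:

```
nfs: backup-test
        export /export/pve-backup-test
        server 192.0.2.10
        path /mnt/pve/backup-test
        content backup
        maxfiles 2
```

Backing up just the problem VM to such a target would separate a cephfs-side write problem from a problem with the VM itself.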
>>>>>>
>>>>>
>>>>> The problem has reached a new level of urgency: for the last two
>>>>> days, each time the backup fails the VM becomes inaccessible and
>>>>> has to be stopped and started manually from the PVE UI.
>>>>
>>>> I don't see anything wrong with the configuration that you shared.
>>>> Was anything changed in the last few days since the last successful
>>>> backup? Any updates from Proxmox? Changes to the network?
>>>> I know very little about Ceph and clusters, sorry.
>>>> What makes this VM different, except for the size of the disks?
>>>
>>> On December 1st the hypervisor was updated to PVE 6.3-2 (I think
>>> from 6.1-3).  After that the error message changed slightly and, in
>>> hindsight, since then the VM stops being accessible after the failed
>>> backup.
>>>
>>> However: The VM never ever backed up successfully, not even before the
>>> PVE upgrade.  It's just that no one really took notice of it.
>>>
>>> The VM is not really special.  It's our only Debian VM (but I hope
>>> that's not an issue :-), and it was migrated 1:1 from oVirt by
>>> importing the disk images.  But we have a few other such VMs and
>>> they run and back up just fine.
>>>
>>> No network changes. Basically nothing changed that I could think of.
>>>
>>> But to be clear: our current main problem is the failing backup, not
>>> the crash.
>>>
>>>
>>> Cheers, Frank
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> pve-user mailing list
>>> pve-user at lists.proxmox.com
>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>
>>> -- 
>> Sent from Gmail Mobile
>
>


-- 
Eneko Lacunza                | +34 943 569 206
                              | elacunza at binovo.es
Zuzendari teknikoa           | https://www.binovo.es
Director técnico             | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun




