[PVE-User] Proxmox and glusterfs: VMs get corrupted

Wed May 31 16:55:21 CEST 2023

>What else could I try? Would it perhaps make sense to switch the
>image format from qcow2 to raw?

yes, i'd switch to raw as a test. and turn discard off.
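
a rough sketch of how that test could look on the pve host, assuming the test VM 200 and the scsi0 line from your config quoted below. VMID, DISK, STORAGE and the run helper are just names i picked for illustration. by default it only prints the commands; set RUN=1 to really execute them.

```shell
#!/bin/sh
# Sketch: retest VM 200 with discard off and a raw image instead of qcow2.
# VMID, DISK and STORAGE are assumptions taken from the thread; adjust them.
VMID="${VMID:-200}"
DISK="${DISK:-scsi0}"
STORAGE="${STORAGE:-gfs_vms}"

# Print each command by default; set RUN=1 to execute it on the pve host.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Shut the VM down so the image is not in use while it is converted.
run qm shutdown "$VMID"

# Same disk as in the VM config, but with discard=ignore instead of discard=on.
run qm set "$VMID" "--$DISK" \
    "$STORAGE:$VMID/vm-$VMID-disk-0.qcow2,aio=threads,discard=ignore"

# Full copy of the disk onto the same storage, converted to raw.
# The old qcow2 image stays around as an unused disk unless --delete is given.
run qm move-disk "$VMID" "$DISK" "$STORAGE" --format raw

# Show the resulting config to confirm the new volume and its options.
run qm config "$VMID"
```

keep in mind that raw on this storage means no qcow2 snapshots, so this is meant as a test to narrow things down, not as a final setup.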

>we chose qcow2 mainly because of
>snapshots and the space savings

same here. it works quite well for us, even though we run qcow2 on top of zfs datasets....

>We chose glusterfs because it seemed the least complicated to us
>and because we are somewhat wary of e.g. Ceph.

i feel exactly the same way, which is why i have steered clear of ceph so far and have been
eyeing glusterfs for a while, but i get the impression that in the proxmox context (or in
general) it is somehow very exotic (otherwise someone might have answered here:
https://forum.proxmox.com/threads/glusterfs-sharding-afr-question.118554/ ).

for that reason we have gone without shared storage so far.

i'm rather once bitten, twice shy when it comes to san and shared storage. i used to be
saddled with SANs, including SAN virtualization with IBM SVC, and i have never had so much
trouble with IT, nor slept so badly, as back then.

at the moment we only have local storage, rely on cold-standby spare hardware, and replicate
the local storage with sanoid to a failover system.

i would also prefer a simple, easily maintainable online/redundancy solution, though.
i can't warm to ceph. and when i read about the troubles and
challenges people have with it, i'm frankly glad NOT to be saddled with it.

regards
roland

On 31.05.23 at 16:23, Christian Schoepplein wrote:
> Hello Roland,
>
> thanks for your answer and tips.
>
> I have now written several hundred large and very large files to
> /mnt/pve/gfs_vms and compared the md5 sums; no problems at all.
> Reading works fine too.
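
one caveat on that test: /mnt/pve/gfs_vms is the mounted path of the volume, while the kvm process further down opens the images via gluster:// URLs, so a clean result on the mount may not cover exactly the path the VMs use. if you want to repeat the check in a scripted way, here is a minimal sketch. roundtrip_check and its arguments are names i made up, it is not a pve tool.

```shell
#!/bin/sh
# Round-trip check: write random files to a directory, read them back and
# compare md5 sums with the local originals.

# roundtrip_check DIR COUNT SIZE_MB  (hypothetical helper, not a pve tool)
roundtrip_check() {
    target="$1"; count="$2"; size_mb="$3"
    src=$(mktemp -d) || return 2
    mkdir -p "$target" || return 2
    bad=0
    i=1
    while [ "$i" -le "$count" ]; do
        # Random data, so every file has distinct content.
        dd if=/dev/urandom of="$src/f$i" bs=1M count="$size_mb" 2>/dev/null
        cp "$src/f$i" "$target/f$i"
        i=$((i + 1))
    done
    # Flush dirty pages; reads may still be served from the page cache.
    sync
    i=1
    while [ "$i" -le "$count" ]; do
        want=$(md5sum < "$src/f$i")
        got=$(md5sum < "$target/f$i")
        if [ "$want" != "$got" ]; then
            echo "MISMATCH: $target/f$i"
            bad=$((bad + 1))
        fi
        i=$((i + 1))
    done
    rm -rf "$src"
    echo "checked $count files, $bad mismatches"
    [ "$bad" -eq 0 ]
}
```

it could be called like `roundtrip_check /mnt/pve/gfs_vms/roundtrip-test 20 1024`, where the roundtrip-test subdirectory is only an example name. the written files are left in place so they can be checked again later.
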
>
> If I set aio to threads, the problem with the broken VMs unfortunately
> seems to get even worse. I have the following in the VM config:
>
> scsi0: gfs_vms:200/vm-200-disk-0.qcow2,discard=on,aio=threads,size=10444M
>
> Is that correct? Judging by the process list it should be:
>
> root     1708993  4.3  1.7 3370764 1174016 ?     Sl   15:32   1:40 /usr/bin/kvm -id 200
>   -name testvm,debug-threads=on
>   -no-shutdown
>   -chardev socket,id=qmp,path=/var/run/qemu-server/200.qmp,server=on,wait=off
>   -mon chardev=qmp,mode=control
>   -chardev socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5
>   -mon chardev=qmp-event,mode=control
>   -pidfile /var/run/qemu-server/200.pid
>   -daemonize
>   -smbios type=1,uuid=0da99a1f-a9ac-4999-a6c4-203cd39ff72e
>   -smp 1,sockets=1,cores=1,maxcpus=1
>   -nodefaults
>   -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg
>   -vnc unix:/var/run/qemu-server/200.vnc,password=on
>   -cpu kvm64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep
>   -m 2048
>   -object memory-backend-ram,id=ram-node0,size=2048M
>   -numa node,nodeid=0,cpus=0,memdev=ram-node0
>   -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg
>   -device vmgenid,guid=dc4109a1-7b6f-4735-9685-ca50a38744e2
>   -device usb-tablet,id=tablet,bus=ehci.0,port=1
>   -chardev socket,id=serial0,path=/var/run/qemu-server/200.serial0,server=on,wait=off
>   -device isa-serial,chardev=serial0
>   -device VGA,id=vga,bus=pcie.0,addr=0x1
>   -chardev socket,path=/var/run/qemu-server/200.qga,server=on,wait=off,id=qga0
>   -device virtio-serial,id=qga0,bus=pci.0,addr=0x8
>   -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0
>   -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on
>   -iscsi initiator-name=iqn.1993-08.org.debian:01:cbb6926f959d
>   -drive file=gluster://gluster1.linova.de/gfs_vms/images/200/vm-200-cloudinit.qcow2,if=none,id=drive-ide2,media=cdrom,aio=io_uring
>   -device ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2
>   -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5
>   -drive file=gluster://gluster1.linova.de/gfs_vms/images/200/vm-200-disk-0.qcow2,if=none,id=drive-scsi0,aio=threads,discard=on,format=qcow2,cache=none,detect-zeroes=unmap
>   -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=101
>   -netdev type=tap,id=net0,ifname=tap200i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on
>   -device virtio-net-pci,mac=5E:1F:9A:04:D6:6C,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=1024
>   -machine type=q35+pve0
>
> I'll now try the whole thing again with a local storage backend,
> but I assume it will work with that.
>
> Unfortunately the gluster stuff was set up by a colleague, so if that is
> the cause, I'll have to dig into it more closely...
>
> We chose glusterfs because it seemed the least complicated to us
> and because we are somewhat wary of e.g. Ceph.
>
> What else could I try? Would it perhaps make sense to switch the
> image format from qcow2 to raw? We chose qcow2 mainly because of
> snapshots and the space savings; if that does not work properly with
> glusterfs, we may have to look at that again as well.
>
> So far I have only ever run virtual machines with libvirt, without
> central storage. So a lot of new topics are coming together right now,
> all of them fairly complex :-(. I would therefore be glad of any tip for a
> sensible setup :-).
>
> Ciao and thanks,
>
>    Christian
>
>
> On Tue, May 30, 2023 at 06:46:51PM +0200, Roland wrote:
>> if /mnt/pve/gfs_vms is a writeable path from inside the pve host, did you check whether there is
>> also corruption when reading/writing large files there, comparing with md5sum after the copy?
>>
>> furthermore, i remember there was a gluster/qcow2 issue with aio=native some years ago.
>> could you retry with aio=threads for the virtual disks?
>>
>> regards
>> roland
>>
>> On 30.05.23 at 18:32, Christian Schoepplein wrote:
>>> Hi,
>>>
>>> we are testing the current proxmox version with a glusterfs storage backend
>>> and have a strange issue with files getting corrupted inside the virtual
>>> machines. For whatever reason, from one moment to the next binaries can no
>>> longer be executed, scripts are damaged and so on. In the logs I get errors
>>> like this:
>>>
>>> May 30 11:22:36 ns1 dockerd[1234]: time="2023-05-30T11:22:36.874765091+02:00" level=warning msg="Running modprobe bridge br_netfilter failed with message: modprobe: ERROR: could not insert 'bridge': Exec format error\nmodprobe: ERROR: could not insert 'br_netfilter': Exec format error\ninsmod /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko \ninsmod /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko \n, error: exit status 1"
>>>
>>> On such a broken system, file reports the following:
>>>
>>> root@ns1:~# file /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko
>>> /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko: data
>>> root@ns1:~#
>>>
>>> On a normal system it looks like this:
>>>
>>> root@gluster1:~# file /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko
>>> /lib/modules/5.15.0-72-generic/kernel/net/802/stp.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=1084f7cfcffbd4c607724fba287c0ea7fc5775
>>> root@gluster1:~#
>>>
>>> Not only kernel modules are affected. I saw the same behaviour for
>>> scripts, icinga check modules, the sendmail binary and so on; I think it is
>>> totally random :-(.
>>>
>>> We have the problems with newly installed VMs, VMs cloned from a template
>>> created on our proxmox host, and with VMs which we used before with libvirtd
>>> and migrated to our new proxmox machine. So IMHO it cannot be related to
>>> the way we create new virtual machines...
>>>
>>> We are using the following software:
>>>
>>> root@proxmox1:~# pveversion -v
>>> proxmox-ve: 7.4-1 (running kernel: 5.15.104-1-pve)
>>> pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
>>> pve-kernel-5.15: 7.4-1
>>> pve-kernel-5.15.104-1-pve: 5.15.104-2
>>> pve-kernel-5.15.102-1-pve: 5.15.102-1
>>> ceph-fuse: 15.2.17-pve1
>>> corosync: 3.1.7-pve1
>>> criu: 3.15-1+pve-1
>>> glusterfs-client: 9.2-1
>>> ifupdown2: 3.1.0-1+pmx3
>>> ksm-control-daemon: 1.4-1
>>> libjs-extjs: 7.0.0-1
>>> libknet1: 1.24-pve2
>>> libproxmox-acme-perl: 1.4.4
>>> libproxmox-backup-qemu0: 1.3.1-1
>>> libproxmox-rs-perl: 0.2.1
>>> libpve-access-control: 7.4-2
>>> libpve-apiclient-perl: 3.2-1
>>> libpve-common-perl: 7.3-4
>>> libpve-guest-common-perl: 4.2-4
>>> libpve-http-server-perl: 4.2-3
>>> libpve-rs-perl: 0.7.5
>>> libpve-storage-perl: 7.4-2
>>> libspice-server1: 0.14.3-2.1
>>> lvm2: 2.03.11-2.1
>>> lxc-pve: 5.0.2-2
>>> lxcfs: 5.0.3-pve1
>>> novnc-pve: 1.4.0-1
>>> proxmox-backup-client: 2.4.1-1
>>> proxmox-backup-file-restore: 2.4.1-1
>>> proxmox-kernel-helper: 7.4-1
>>> proxmox-mail-forward: 0.1.1-1
>>> proxmox-mini-journalreader: 1.3-1
>>> proxmox-widget-toolkit: 3.6.5
>>> pve-cluster: 7.3-3
>>> pve-container: 4.4-3
>>> pve-docs: 7.4-2
>>> pve-edk2-firmware: 3.20230228-2
>>> pve-firewall: 4.3-1
>>> pve-firmware: 3.6-4
>>> pve-ha-manager: 3.6.0
>>> pve-i18n: 2.12-1
>>> pve-qemu-kvm: 7.2.0-8
>>> pve-xtermjs: 4.16.0-1
>>> qemu-server: 7.4-3
>>> smartmontools: 7.2-pve3
>>> spiceterm: 3.2-2
>>> swtpm: 0.8.0~bpo11+3
>>> vncterm: 1.7-1
>>> zfsutils-linux: 2.1.9-pve1
>>> root@proxmox1:~#
>>>
>>> root@proxmox1:~# cat /etc/pve/storage.cfg
>>> dir: local
>>>           path /var/lib/vz
>>>           content rootdir,iso,images,vztmpl,backup,snippets
>>>
>>> zfspool: local-zfs
>>>           pool rpool/data
>>>           content images,rootdir
>>>           sparse 1
>>>
>>> glusterfs: gfs_vms
>>>           path /mnt/pve/gfs_vms
>>>           volume gfs_vms
>>>           content images
>>>           prune-backups keep-all=1
>>>           server gluster1.linova.de
>>>           server2 gluster2.linova.de
>>>
>>> root@proxmox1:~#
>>>
>>> The config of a typical VM looks like this:
>>>
>>> root@proxmox1:~# cat /etc/pve/qemu-server/101.conf
>>> #ns1
>>> agent: enabled=1,fstrim_cloned_disks=1
>>> boot: c
>>> bootdisk: scsi0
>>> cicustom: user=local:snippets/user-data
>>> cores: 1
>>> hotplug: disk,network,usb
>>> ide2: gfs_vms:101/vm-101-cloudinit.qcow2,media=cdrom,size=4M
>>> ipconfig0: ip=10.200.32.9/22,gw=10.200.32.1
>>> kvm: 1
>>> machine: q35
>>> memory: 2048
>>> meta: creation-qemu=7.2.0,ctime=1683718002
>>> name: ns1
>>> nameserver: 10.200.0.5
>>> net0: virtio=1A:61:75:25:C6:30,bridge=vmbr0
>>> numa: 1
>>> ostype: l26
>>> scsi0: gfs_vms:101/vm-101-disk-0.qcow2,discard=on,size=10444M
>>> scsihw: virtio-scsi-pci
>>> searchdomain: linova.de
>>> serial0: socket
>>> smbios1: uuid=e2f503fe-4a66-4085-86c0-bb692add6b7a
>>> sockets: 1
>>> vmgenid: 3be6ec9d-7cfd-47c0-9f86-23c2e3ce5103
>>>
>>> root@proxmox1:~#
>>>
>>> Our glusterfs storage backend consists of three servers, all running Ubuntu
>>> 22.04 and glusterfs version 10.1. There are no errors in the logs on the
>>> glusterfs hosts when a VM crashes, and because icinga plugins sometimes
>>> also get corrupted, I have a very exact time range in which to search the
>>> logs for errors and warnings.
>>>
>>> However, I think it has something to do with our glusterfs setup. If I clone
>>> a VM from a template I get the following:
>>>
>>> root@proxmox1:~# qm clone 9000 200 --full --name testvm --description "testvm" --storage gfs_vms
>>> create full clone of drive ide2 (gfs_vms:9000/vm-9000-cloudinit.qcow2)
>>> Formatting
>>> 'gluster://gluster1.linova.de/gfs_vms/images/200/vm-200-cloudinit.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=4194304 lazy_refcounts=off refcount_bits=16
>>> [2023-05-30 16:18:17.753152 +0000] I
>>> [io-stats.c:3706:ios_sample_buf_size_configure] 0-gfs_vms: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
>>> [2023-05-30 16:18:17.876879 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:17.877606 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:17.878275 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:27.761247 +0000] I [io-stats.c:4038:fini] 0-gfs_vms:
>>> io-stats translator unloaded
>>> [2023-05-30 16:18:28.766999 +0000] I
>>> [io-stats.c:3706:ios_sample_buf_size_configure] 0-gfs_vms: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
>>> [2023-05-30 16:18:28.936449 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-0:
>>> All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:28.937547 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:28.938115 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:38.774387 +0000] I [io-stats.c:4038:fini] 0-gfs_vms:
>>> io-stats translator unloaded
>>> create full clone of drive scsi0 (gfs_vms:9000/base-9000-disk-0.qcow2)
>>> Formatting
>>> 'gluster://gluster1.linova.de/gfs_vms/images/200/vm-200-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=10951327744 lazy_refcounts=off refcount_bits=16
>>> [2023-05-30 16:18:39.962238 +0000] I
>>> [io-stats.c:3706:ios_sample_buf_size_configure] 0-gfs_vms: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
>>> [2023-05-30 16:18:40.084300 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:40.084996 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:40.085505 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:49.970199 +0000] I [io-stats.c:4038:fini] 0-gfs_vms:
>>> io-stats translator unloaded
>>> [2023-05-30 16:18:50.975729 +0000] I
>>> [io-stats.c:3706:ios_sample_buf_size_configure] 0-gfs_vms: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
>>> [2023-05-30 16:18:51.768619 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:51.769330 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:18:51.769822 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:19:00.984578 +0000] I [io-stats.c:4038:fini] 0-gfs_vms:
>>> io-stats translator unloaded
>>> transferred 0.0 B of 10.2 GiB (0.00%)
>>> [2023-05-30 16:19:02.030902 +0000] I
>>> [io-stats.c:3706:ios_sample_buf_size_configure] 0-gfs_vms: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
>>> transferred 112.8 MiB of 10.2 GiB (1.08%)
>>> transferred 230.8 MiB of 10.2 GiB (2.21%)
>>> transferred 340.5 MiB of 10.2 GiB (3.26%)
>>> ...
>>> transferred 10.1 GiB of 10.2 GiB (99.15%)
>>> transferred 10.2 GiB of 10.2 GiB (100.00%)
>>> transferred 10.2 GiB of 10.2 GiB (100.00%)
>>> [2023-05-30 16:19:29.804006 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:19:29.804807 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:19:29.805486 +0000] E [MSGID: 108006]
>>> [afr-common.c:6140:__afr_handle_child_down_event] 0-gfs_vms-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
>>> [2023-05-30 16:19:32.044693 +0000] I [io-stats.c:4038:fini] 0-gfs_vms:
>>> io-stats translator unloaded
>>> root@proxmox1:~#
>>>
>>> Is this message about the subvolumes being down normal, or might this be
>>> the reason for our strange problems?
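
to see whether the bricks are really unreachable around that time, you could look at the volume from one of the gluster servers. a small sketch, assuming the gluster CLI is installed there and the volume is named gfs_vms as in your storage.cfg. it only runs read-only queries, and VOLUME is just a variable name i picked.

```shell
#!/bin/sh
# Sketch: read-only health queries for the gluster volume behind the pve storage.
# VOLUME defaults to the name from storage.cfg in this thread; adjust if needed.
VOLUME="${VOLUME:-gfs_vms}"

if command -v gluster >/dev/null 2>&1; then
    # Volume layout and the options that are set on it.
    gluster volume info "$VOLUME"
    # Which brick processes are online, with their ports and PIDs.
    gluster volume status "$VOLUME"
    # Entries that still need healing on the replicated subvolumes.
    gluster volume heal "$VOLUME" info
else
    echo "gluster CLI not found; run this on one of the gluster servers"
fi
```

if all bricks show as online and the heal list is empty while the clone still prints those errors, that would at least suggest the message is not about the bricks being permanently down.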
>>>
>>> I have no idea how to debug the problem further, so any helpful idea or hint
>>> would be great. Please also let me know if I can provide more info regarding
>>> our setup.
>>>
>>> Ciao and thanks a lot,
>>>
>>>     Schoepp
>>>