From martin at proxmox.com Wed May 4 12:27:27 2022 From: martin at proxmox.com (Martin Maurer) Date: Wed, 4 May 2022 12:27:27 +0200 Subject: [PVE-User] Proxmox VE 7.2 released! Message-ID: Hi all, we're excited to announce the release of Proxmox Virtual Environment 7.2. It's based on Debian 11.3 "Bullseye" but using a newer Linux kernel 5.15.30, QEMU 6.2, LXC 4, Ceph 16.2.7, and OpenZFS 2.1.4 and countless enhancements and bugfixes. Here is a selection of the highlights - Support for the accelerated virtio-gl (VirGL) display driver - Notes templates for backup jobs (e.g. add the name of your VMs and CTs to the backup notes) - Ceph erasure code support - Updated existing and new LXC container templates (New: Ubuntu 22.04, Devuan 4.0, Alpine 3.15) - ISO: Updated memtest86+ to the completely rewritten 6.0b version, adding support for UEFI and modern memory like DDR5 - and many more GUI enhancements As always, we have included countless bugfixes and improvements on many places; see the release notes for all details. Release notes https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_7.2 Press release https://www.proxmox.com/en/news/press-releases/proxmox-virtual-environment-7-2-available Video tutorial https://www.proxmox.com/en/training/video-tutorials/item/what-s-new-in-proxmox-ve-7-2 Download https://www.proxmox.com/en/downloads Alternate ISO download: https://enterprise.proxmox.com/iso Documentation https://pve.proxmox.com/pve-docs Community Forum https://forum.proxmox.com Bugtracker https://bugzilla.proxmox.com Source code https://git.proxmox.com We want to shout out a big THANK YOU to our active community for all your intensive feedback, testing, bug reporting and patch submitting! FAQ Q: Can I upgrade Proxmox VE 7.0 or 7.1 to 7.2 via GUI? A: Yes. Q: Can I upgrade Proxmox VE 6.4 to 7.2 with apt? A: Yes, please follow the upgrade instructions on https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0 Q: Can I install Proxmox VE 7.2 on top of Debian 11.x "Bullseye"? A: Yes, see https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_11_Bullseye Q: Can I upgrade my Proxmox VE 6.4 cluster with Ceph Octopus to 7.2 with Ceph Octopus/Pacific? A: This is a two step process. First, you have to upgrade Proxmox VE from 6.4 to 7.2, and afterwards upgrade Ceph from Octopus to Pacific. There are a lot of improvements and changes, so please follow exactly the upgrade documentation: https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0 https://pve.proxmox.com/wiki/Ceph_Octopus_to_Pacific Q: Where can I get more information about feature updates? A: Check the https://pve.proxmox.com/wiki/Roadmap, https://forum.proxmox.com/, the https://lists.proxmox.com/, and/or subscribe to our https://www.proxmox.com/en/news. -- Best Regards, Martin Maurer martin at proxmox.com https://www.proxmox.com From dziobek at hlrs.de Wed May 4 14:10:59 2022 From: dziobek at hlrs.de (Martin Dziobek) Date: Wed, 4 May 2022 14:10:59 +0200 Subject: [PVE-User] Proxmox VE 7.2 - Problem of understanding 'bridge-disable-mac-learning' In-Reply-To: References: Message-ID: <20220504141059.48ab3303@schleppmd.hlrs.de> Dear all, In the Release Notes of 7.2, it says: "Administrators can now disable MAC learning on a bridge in /etc/network/interfaces with the bridge-disable-mac-learning flag. 
This reduces the number of packets flooded on all ports (for unknown MAC addresses), preventing issues with certain hosting providers (for example, Hetzner), which resulted in the Proxmox VE node getting disconnected" where as in descriptions of how to disable mac bridge learning for example on https://www.xmodulo.com/disable-mac-learning-linux-bridge.html it says: "Once MAC learning is turned off, a Linux bridge will flood every incoming packet to the rest of the ports. Understand this implication before proceeding." So flooding is reduced *or* increased ... May someone shed a light on this ? Best regards, Martin From s.ivanov at proxmox.com Wed May 4 15:39:59 2022 From: s.ivanov at proxmox.com (Stoiko Ivanov) Date: Wed, 4 May 2022 15:39:59 +0200 Subject: [PVE-User] Proxmox VE 7.2 - Problem of understanding 'bridge-disable-mac-learning' In-Reply-To: <20220504141059.48ab3303@schleppmd.hlrs.de> References: <20220504141059.48ab3303@schleppmd.hlrs.de> Message-ID: <20220504153959.19ffadc1@rosa.proxmox.com> hi, On Wed, 4 May 2022 14:10:59 +0200 Martin Dziobek wrote: > Dear all, > > In the Release Notes of 7.2, it says: > > "Administrators can now disable MAC learning on a bridge in /etc/network/interfaces with the bridge-disable-mac-learning flag. > This reduces the number of packets flooded on all ports (for unknown MAC addresses), preventing issues with certain hosting > providers (for example, Hetzner), which resulted in the Proxmox VE node getting disconnected" > > where as in descriptions of how to disable mac bridge learning > for example on https://www.xmodulo.com/disable-mac-learning-linux-bridge.html > > it says: > > "Once MAC learning is turned off, a Linux bridge will flood every incoming packet to the rest of the ports. > Understand this implication before proceeding." > > So flooding is reduced *or* increased ... > > May someone shed a light on this ? I think the commit message of the relevant commit describes the situation quite well: https://git.proxmox.com/?p=pve-common.git;a=commit;h=354ec8dee37d481ebae49b488349a8e932dce736 it disables learning on the individual ports - but at the same time also the unicast_flood flag is set to false - see `man 8 bridge` - so I'd expect the combination of the 2 to work as advertised (and will try to rephrase the release note entry a bit too be less confusing) I hope this helps! Best regards, stoiko From Alexandre.DERUMIER at groupe-cyllene.com Thu May 5 15:24:31 2022 From: Alexandre.DERUMIER at groupe-cyllene.com (DERUMIER, Alexandre) Date: Thu, 5 May 2022 13:24:31 +0000 Subject: [PVE-User] Proxmox VE 7.2 - Problem of understanding 'bridge-disable-mac-learning' In-Reply-To: <20220504153959.19ffadc1@rosa.proxmox.com> References: <20220504141059.48ab3303@schleppmd.hlrs.de> <20220504153959.19ffadc1@rosa.proxmox.com> Message-ID: <3eace3432f6ca87b660f01390a8cf13395322e12.camel@groupe-cyllene.com> mmm,looking at the git, it seem that qemu-server && pve-container patch es to register mac address in bridge are not applied ... [pve-devel] [PATCH V2 qemu-server 0/3] add disable bridge learning feature https://lists.proxmox.com/pipermail/pve-devel/2022-March/052210.html [pve-devel] [PATCH V2 pve-container 0/1] add disable bridge learning feature https://lists.proxmox.com/pipermail/pve-devel/2022-March/052206.html Le mercredi 04 mai 2022 ? 
15:39 +0200, Stoiko Ivanov a ?crit?: > hi, > > > On Wed, 4 May 2022 14:10:59 +0200 > Martin Dziobek wrote: > > > Dear all, > > > > In the Release Notes of 7.2, it says: > > > > "Administrators can now disable MAC learning on a bridge in > > /etc/network/interfaces with the bridge-disable-mac-learning flag. > > This reduces the number of packets flooded on all ports (for > > unknown MAC addresses), preventing issues with certain hosting > > providers (for example, Hetzner), which resulted in the Proxmox VE > > node getting disconnected" > > > > where as in descriptions of how to disable mac bridge learning > > for example on? > > https://antiphishing.cetsi.fr/proxy/v3?i=ZUcyY1RmWEJYTXg4endZcf4pHMlLXnVUx16Ppu9iYP8&r=N3ZnQkVkbG1hOHVwcWFJNMLpdiUetyglobBNT6FebFASxxZ1q4z56SmutCfWl0tQ&f=RkdqNzdIQkFjZzVZTkZxbZ21HjwKhyMg-rZGU8E0XD_frmmy_SGxhjX_N0NdVXVt8hYCzR91DADKO1rwT7UlwQ&u=https%3A//www.xmodulo.com/disable-mac-learning-linux-bridge.html&k=YkLs > > > > it says: > > > > "Once MAC learning is turned off, a Linux bridge will flood every > > incoming packet to the rest of the ports. > > Understand this implication before proceeding." > > > > So flooding is reduced *or* increased ... > > > > May someone shed a light on this ? > I think the commit message of the relevant commit describes the > situation > quite well: > https://antiphishing.cetsi.fr/proxy/v3?i=ZUcyY1RmWEJYTXg4endZcf4pHMlLXnVUx16Ppu9iYP8&r=N3ZnQkVkbG1hOHVwcWFJNMLpdiUetyglobBNT6FebFASxxZ1q4z56SmutCfWl0tQ&f=RkdqNzdIQkFjZzVZTkZxbZ21HjwKhyMg-rZGU8E0XD_frmmy_SGxhjX_N0NdVXVt8hYCzR91DADKO1rwT7UlwQ&u=https%3A//git.proxmox.com/%3Fp%3Dpve-common.git%3Ba%3Dcommit%3Bh%3D354ec8dee37d481ebae49b488349a8e932dce736&k=YkLs > > it disables learning on the individual ports - but at the same time > also > the unicast_flood flag is set to false - see `man 8 bridge` - so I'd > expect the combination of the 2 to work as advertised > (and will try to rephrase the release note entry a bit too be less > confusing) > > I hope this helps! > > Best regards, > stoiko > > > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://antiphishing.cetsi.fr/proxy/v3?i=ZUcyY1RmWEJYTXg4endZcf4pHMlLXnVUx16Ppu9iYP8&r=N3ZnQkVkbG1hOHVwcWFJNMLpdiUetyglobBNT6FebFASxxZ1q4z56SmutCfWl0tQ&f=RkdqNzdIQkFjZzVZTkZxbZ21HjwKhyMg-rZGU8E0XD_frmmy_SGxhjX_N0NdVXVt8hYCzR91DADKO1rwT7UlwQ&u=https%3A//lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user&k=YkLs > From mlist at jarasoft.net Thu May 5 15:56:42 2022 From: mlist at jarasoft.net (Jack Raats) Date: Thu, 5 May 2022 15:56:42 +0200 Subject: [PVE-User] Proces BOOTFB Message-ID: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net> Hi, At this moment I use proxmox 7.2-3. Before this version I could run a VM passthrough. On proxmox 7.2-3 I get an error that BAR1 doesn't have memory anymore. The memory ism occupied by a proces called BOOTFB. What is this proces doing? How to get the passthroug thing working again? Thanks Jack Raats From leesteken+proxmox at pm.me Thu May 5 16:08:41 2022 From: leesteken+proxmox at pm.me (Arjen) Date: Thu, 05 May 2022 14:08:41 +0000 Subject: [PVE-User] Proces BOOTFB In-Reply-To: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net> References: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net> Message-ID: On Thursday, May 5th, 2022 at 15:56, Jack Raats wrote: > Hi, > > At this moment I use proxmox 7.2-3. Before this version I could run a VM > passthrough. > On proxmox 7.2-3 I get an error that BAR1 doesn't have memory anymore. 
> The memory ism occupied by a proces called BOOTFB. > > What is this proces doing? > How to get the passthroug thing working again? > > Thanks > Jack Raats Before (with kernel 5.13) using video=efifb:off video=vesafb:off would fix this (at the expense of boot messages). With 7.2 (or kernel 5.15), I would expect video=simplefb:off to fix this, but I my experience this does not work for every GPU. I found that, for AMD GPUs, unblacklisting amdgpu AND not early binding to vfio_pci AND removing those video= parameters works best. amdgpu just takes over from the bootfb, and does release the GPU nicely to vfio_pci when starting the VM. (Of course, for AMD vendor-reset and reset_method=device_specific might be required.) I don't know if this also works for nouveau or i915. I hope this helps, Arjen From ralf.storm at konzept-is.de Fri May 6 13:10:21 2022 From: ralf.storm at konzept-is.de (storm) Date: Fri, 6 May 2022 13:10:21 +0200 Subject: [PVE-User] Network Mismatch after upgrade to 7.2 In-Reply-To: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net> References: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net> Message-ID: <60d3fb98-0c90-d38d-fd4e-3840c17c1367@konzept-is.de> Hello, on one of my nodes I have total chaos in the network configuration after upgrading from 7.1 to 7.2 (liscensed pve-enterprise repo) I have some interfaces in the GUI, which are not in the system, the cli shows something totally different one actively used for a client network disappeared totally the mac address reported by the connected switch for this network cannot be found on the node :( Any clue whats the issue and how to resolve this? best regards Ralf From elacunza at binovo.es Fri May 6 13:20:17 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Fri, 6 May 2022 13:20:17 +0200 Subject: [PVE-User] Network Mismatch after upgrade to 7.2 In-Reply-To: <60d3fb98-0c90-d38d-fd4e-3840c17c1367@konzept-is.de> References: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net> <60d3fb98-0c90-d38d-fd4e-3840c17c1367@konzept-is.de> Message-ID: <641296a8-89e5-d4c8-7514-8bb179dc5b8f@binovo.es> Hi, Maybe kernel changed names of the interfaces. To fix the issue, you must change old interface names with new names in /etc/network/interfaces El 6/5/22 a las 13:10, storm escribi?: > Hello, > > on one of my nodes I have total chaos in the network configuration > after upgrading from 7.1 to 7.2 (liscensed pve-enterprise repo) > > I have some interfaces in the GUI, which are not in the system, the > cli shows something totally different > > one actively used for a client network disappeared totally > > the mac address reported by the connected switch for this network > cannot be found on the node :( > > > Any clue whats the issue and how to resolve this? > > best regards > > Ralf > > > > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user > Eneko Lacunza Zuzendari teknikoa | Director t?cnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2? izda. 
Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From mlist at jarasoft.net Fri May 6 13:23:48 2022 From: mlist at jarasoft.net (Jack Raats) Date: Fri, 6 May 2022 13:23:48 +0200 Subject: [PVE-User] Proces BOOTFB In-Reply-To: References: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net> Message-ID: Op 05-05-2022 om 16:08 schreef Arjen: > On Thursday, May 5th, 2022 at 15:56, Jack Raats wrote: > >> Hi, >> >> At this moment I use proxmox 7.2-3. Before this version I could run a VM >> passthrough. >> On proxmox 7.2-3 I get an error that BAR1 doesn't have memory anymore. >> The memory ism occupied by a proces called BOOTFB. >> >> What is this proces doing? >> How to get the passthroug thing working again? >> >> Thanks >> Jack Raats > Before (with kernel 5.13) using video=efifb:off video=vesafb:off would fix this (at the expense of boot messages). > With 7.2 (or kernel 5.15), I would expect video=simplefb:off to fix this, but I my experience this does not work for every GPU. > > I found that, for AMD GPUs, unblacklisting amdgpu AND not early binding to vfio_pci AND removing those video= parameters works best. > amdgpu just takes over from the bootfb, and does release the GPU nicely to vfio_pci when starting the VM. > (Of course, for AMD vendor-reset and reset_method=device_specific might be required.) > I don't know if this also works for nouveau or i915. > > I hope this helps, > Arjen I've tried all the possible, but nothing works... Until I started the old kernel and everything worked perfectly! I think that amdgpu, which is included in the kernel, doesn't takes over from bootfb Greetings, Jack Raats From elacunza at binovo.es Wed May 11 16:35:24 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Wed, 11 May 2022 16:35:24 +0200 Subject: PVE 7.2 unstability Message-ID: Hi all, Yesterday we upgraded a 5-node cluster to PVE 7.2 from PVE 7.1: # pveversion -v proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve) pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1) pve-kernel-5.15: 7.2-3 pve-kernel-helper: 7.2-3 pve-kernel-5.13: 7.1-9 pve-kernel-5.15.35-1-pve: 5.15.35-2 pve-kernel-5.13.19-6-pve: 5.13.19-15 pve-kernel-5.13.19-1-pve: 5.13.19-3 ceph: 16.2.7 ceph-fuse: 16.2.7 corosync: 3.1.5-pve2 criu: 3.15-1+pve-1 glusterfs-client: 9.2-1 ifupdown: residual config ifupdown2: 3.1.0-1+pmx3 libjs-extjs: 7.0.0-1 libknet1: 1.22-pve2 libproxmox-acme-perl: 1.4.2 libproxmox-backup-qemu0: 1.2.0-1 libpve-access-control: 7.1-8 libpve-apiclient-perl: 3.2-1 libpve-common-perl: 7.1-6 libpve-guest-common-perl: 4.1-2 libpve-http-server-perl: 4.1-1 libpve-storage-perl: 7.2-2 libspice-server1: 0.14.3-2.1 lvm2: 2.03.11-2.1 lxc-pve: 4.0.12-1 lxcfs: 4.0.12-pve1 novnc-pve: 1.3.0-3 proxmox-backup-client: 2.1.8-1 proxmox-backup-file-restore: 2.1.8-1 proxmox-mini-journalreader: 1.3-1 proxmox-widget-toolkit: 3.4-10 pve-cluster: 7.2-1 pve-container: 4.2-1 pve-docs: 7.2-2 pve-edk2-firmware: 3.20210831-2 pve-firewall: 4.2-5 pve-firmware: 3.4-2 pve-ha-manager: 3.3-4 pve-i18n: 2.7-1 pve-qemu-kvm: 6.2.0-5 pve-xtermjs: 4.16.0-1 qemu-server: 7.2-2 smartmontools: 7.2-pve3 spiceterm: 3.2-2 swtpm: 0.7.1~bpo11+1 vncterm: 1.7-1 zfsutils-linux: 2.1.4-pve1 We're seen since them some unstability with our VMs, some of them start consuming a full CPU core without explanation. We have seen this issue only with Linux VMs, mostly Debian 9,10,11 (but that's the most common OS in out 80+ VMs). Issue happens with 1 core, 2 cores and 4 cores. 
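For anyone trying to narrow this down on their own nodes: a generic way to spot which guests are burning a core is plain procps, nothing Proxmox-specific (the assumption below is that guests show up as /usr/bin/kvm -id <vmid> processes, which is how qemu-server starts them):

  # busiest QEMU/KVM processes first; the -id argument is the VMID
  ps -eo pid,pcpu,etime,args --sort=-pcpu | grep '[k]vm -id' | head -n 5
  # then, inside the suspect guest, see where the time is going (user/system/steal)
  top -b -n 1 | head -n 15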
This issue seems to be easily reproduced bulk-migrating VMs. Not all bulk-migrated VMs show the issue, but some do. We see the issue in not recently migrated VMs too. Some of the VMs show a timelapse in syslog. For example, in our "release" VM: May 10 15:11:18 release systemd[1]: Stopping User Runtime Directory /run/user/1003... May 10 15:11:18 release systemd[1]: run-user-1003.mount: Succeeded. May 10 15:11:18 release systemd[1]: user-runtime-dir at 1003.service: Succeeded. May 10 15:11:18 release systemd[1]: Stopped User Runtime Directory /run/user/1003. May 10 15:11:18 release systemd[1]: Removed slice User Slice of UID 1003. Jan 15 06:42:04 release systemd[1]: Starting Daily apt download activities... Jan 15 06:42:04 release mariadbd[453]: 850115? 6:42:04 [ERROR] mysqld got signal 11 ; Jan 15 06:42:04 release mariadbd[453]: This could be because you hit a bug. It is also possible that this binary Jan 15 06:42:04 release mariadbd[453]: or one of the libraries it was linked against is corrupt, improperly built, Jan 15 06:42:04 release mariadbd[453]: or misconfigured. This error can also be caused by malfunctioning hardware. Jan 15 06:42:04 release mariadbd[453]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs Jan 15 06:42:04 release mariadbd[453]: We will try our best to scrape up some info that will hopefully help Jan 15 06:42:04 release mariadbd[453]: diagnose the problem, but since we have already crashed, Jan 15 06:42:04 release mariadbd[453]: something is definitely wrong and this may fail. Jan 15 06:42:04 release mariadbd[453]: Server version: 10.5.15-MariaDB-0+deb11u1 Jan 15 06:42:04 release mariadbd[453]: key_buffer_size=134217728 Jan 15 06:42:04 release mariadbd[453]: read_buffer_size=131072 Jan 15 06:42:04 release mariadbd[453]: max_used_connections=3 Jan 15 06:42:04 release mariadbd[453]: max_threads=153 Jan 15 06:42:04 release mariadbd[453]: thread_count=0 Jan 15 06:42:04 release mariadbd[453]: It is possible that mysqld could use up to Jan 15 06:42:04 release mariadbd[453]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467872 K? bytes of memory Jan 15 06:42:04 release mariadbd[453]: Hope that's ok; if not, decrease some variables in the equation. Jan 15 06:42:04 release mariadbd[453]: Thread pointer: 0x0 Jan 15 06:42:04 release mariadbd[453]: Attempting backtrace. You can use the following information to find out Jan 15 06:42:04 release mariadbd[453]: where mysqld died. If you see no messages after this, something went Jan 15 06:42:04 release mariadbd[453]: terribly wrong... Jan 15 06:42:04 release mariadbd[453]: stack_bottom = 0x0 thread_stack 0x49000 Jan 15 06:42:04 release systemd[1]: Starting Online ext4 Metadata Check for All Filesystems... Jan 15 06:42:04 release systemd[1]: Starting Clean php session files... Jan 15 06:42:04 release systemd[1]: Starting Cleanup of Temporary Directories... Jan 15 06:42:04 release systemd[1]: Starting Rotate log files... Jan 15 06:42:04 release systemd[1]: Starting Daily man-db regeneration... Jan 15 06:42:04 release systemd[1]: e2scrub_all.service: Succeeded. Jan 15 06:42:04 release systemd[1]: Finished Online ext4 Metadata Check for All Filesystems. Jan 15 06:42:04 release systemd[1]: systemd-tmpfiles-clean.service: Succeeded. Jan 15 06:42:04 release systemd[1]: Finished Cleanup of Temporary Directories. Jan 15 06:42:04 release systemd[1]: phpsessionclean.service: Succeeded. Jan 15 06:42:04 release systemd[1]: Finished Clean php session files. 
Jan 15 06:42:04 release systemd[1]: apt-daily.service: Succeeded. Jan 15 06:42:04 release systemd[1]: Finished Daily apt download activities. Jan 15 06:42:04 release systemd[1]: Starting Daily apt upgrade and clean activities... Jan 15 06:42:04 release systemd[1]: apt-daily-upgrade.service: Succeeded. Jan 15 06:42:04 release systemd[1]: Finished Daily apt upgrade and clean activities. Jan 15 06:42:04 release systemd[1]: Reloading The Apache HTTP Server. Jan 15 06:42:04 release systemd[1]: Looping too fast. Throttling execution a little. [...reset...] Is anyone seeing this issue? Those servers have AMD Ryzen procesors. Cheers Eneko Lacunza Zuzendari teknikoa | Director t?cnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From elacunza at binovo.es Thu May 12 09:33:29 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Thu, 12 May 2022 09:33:29 +0200 Subject: [PVE-User] PVE 7.2 unstability In-Reply-To: References: Message-ID: Hi all, Definitively there's some issue with time. I just captured this event in a Debian 11 VM: May 12 09:18:42 monitor-cloud systemd[1]: Stopped User Runtime Directory /run/user/1001. May 12 09:18:42 monitor-cloud systemd[1]: Removed slice User Slice of UID 1001. May 12 17:32:35 monitor-cloud icinga2[943]: [2022-05-12 17:32:35 +0200] information/Application: We jumped forward in time: 29633.8 seconds [...reset...] May 12 09:30:43 monitor-cloud kernel: [??? 0.000000] Linux version 5.10.0-13-amd64 (debian-kernel at lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.106-1 (2022-03-17) May 12 09:30:43 monitor-cloud kernel: [??? 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.10.0-13-amd64 root=/dev/mapper/monitor--cloud--vg-root ro quiet May 12 09:30:43 monitor-cloud kernel: [??? 0.000000] x86/fpu: x87 FPU will use FXSAVE May 12 09:30:43 monitor-cloud kernel: [??? 0.000000] BIOS-provided physical RAM map: Is VM clock managed by qemu/kvm? Thanks El 11/5/22 a las 16:35, Eneko Lacunza via pve-user escribi?: > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user Eneko Lacunza Zuzendari teknikoa | Director t?cnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From elacunza at binovo.es Thu May 12 15:15:09 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Thu, 12 May 2022 15:15:09 +0200 Subject: [PVE-User] PVE 7.2 unstability In-Reply-To: References: Message-ID: <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es> Hi, This time VM didn't crash but kernel noticed the time issue: May 12 09:48:57 monitor-cloud systemd[1]: session-38.scope: Succeeded. May 12 18:08:57 monitor-cloud kernel: [31097.014795] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large: May 12 18:08:57 monitor-cloud kernel: [31097.014803] clocksource:?????????????????????? 'kvm-clock' wd_now: a601cec4c1d9 wd_last: 8aba12430493 mask: ffffffffffffffff May 12 18:08:57 monitor-cloud kernel: [31097.014806] clocksource:?????????????????????? 
'tsc' cs_now: 1f116978cedab cs_last: 19f66ae2f9bbb mask: ffffffffffffffff May 12 18:08:57 monitor-cloud kernel: [31097.014810] tsc: Marking TSC unstable due to clocksource watchdog May 12 09:49:02 monitor-cloud systemd[1]: Starting Clean php session files... Seems that the issue is more easily triggered live migrating the VMs, another VM just hung but no time-issues in syslog (I had to hard reset...) We have downgraded from pve-qemu-kvm:amd64 6.2.0-5 to 6.2.0-2 (version before issues started) We have downgraded from qemu-server from 7.2-2 to 7.1-4 (version before issues started): Issue continues. We have seen that when bulk migrating VMs from node1 to node2, VMs in node2 ALSO start to have issues. We'll try setting max workers for bulk actions to 1 next. El 12/5/22 a las 9:33, Eneko Lacunza via pve-user escribi?: > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user Eneko Lacunza Zuzendari teknikoa | Director t?cnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From elacunza at binovo.es Thu May 12 16:57:14 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Thu, 12 May 2022 16:57:14 +0200 Subject: [PVE-User] PVE 7.2 unstability In-Reply-To: <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es> References: <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es> Message-ID: <051c0129-770e-dc29-c1e3-3b8ad904e6fb@binovo.es> Hi, Finally we have worked around this issue downgrading to kernel 5.13: apt-get install proxmox-ve=7.1-1; apt-get remove pve-kernel-5.15.35-1-pve (+reboot) No need to downgrade pve-qemu-kvm no qemu-server . Sadly VMs running on kernel 5.15.35-1 will crash on live migration :-( Cheers El 12/5/22 a las 15:15, Eneko Lacunza escribi?: > Hi, > > This time VM didn't crash but kernel noticed the time issue: > > May 12 09:48:57 monitor-cloud systemd[1]: session-38.scope: Succeeded. > May 12 18:08:57 monitor-cloud kernel: [31097.014795] clocksource: > timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable > because the skew is too large: > May 12 18:08:57 monitor-cloud kernel: [31097.014803] > clocksource:?????????????????????? 'kvm-clock' wd_now: a601cec4c1d9 > wd_last: 8aba12430493 mask: ffffffffffffffff > May 12 18:08:57 monitor-cloud kernel: [31097.014806] > clocksource:?????????????????????? 'tsc' cs_now: 1f116978cedab > cs_last: 19f66ae2f9bbb mask: ffffffffffffffff > May 12 18:08:57 monitor-cloud kernel: [31097.014810] tsc: Marking TSC > unstable due to clocksource watchdog > May 12 09:49:02 monitor-cloud systemd[1]: Starting Clean php session > files... > > Seems that the issue is more easily triggered live migrating the VMs, > another VM just hung but no time-issues in syslog (I had to hard reset...) > > We have downgraded from pve-qemu-kvm:amd64 6.2.0-5 to 6.2.0-2 (version > before issues started) > > We have downgraded from qemu-server from 7.2-2 to 7.1-4 (version > before issues started): > > Issue continues. > > We have seen that when bulk migrating VMs from node1 to node2, VMs in > node2 ALSO start to have issues. > > We'll try setting max workers for bulk actions to 1 next. 
> > > El 12/5/22 a las 9:33, Eneko Lacunza via pve-user escribi?: >> _______________________________________________ >> pve-user mailing list >> pve-user at lists.proxmox.com >> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user > Eneko Lacunza Zuzendari teknikoa | Director t?cnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From alain.pean at c2n.upsaclay.fr Thu May 12 17:12:59 2022 From: alain.pean at c2n.upsaclay.fr (=?UTF-8?Q?Alain_P=c3=a9an?=) Date: Thu, 12 May 2022 17:12:59 +0200 Subject: [PVE-User] PVE 7.2 unstability In-Reply-To: References: <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es> Message-ID: <7d6f9cc7-1dfe-63aa-b166-64d7f9b4c816@c2n.upsaclay.fr> Le 12/05/2022 ? 16:57, Eneko Lacunza via pve-user a ?crit?: > Finally we have worked around this issue downgrading to kernel 5.13: > > apt-get install proxmox-ve=7.1-1; apt-get remove > pve-kernel-5.15.35-1-pve (+reboot) > > No need to downgrade pve-qemu-kvm no qemu-server . > > Sadly VMs running on kernel 5.15.35-1 will crash on live migration :-( Hi Eneko, It is strange, as I don't see anybody saying they saw this problem on the forum : https://forum.proxmox.com/threads/proxmox-ve-7-2-released.108970/page-3 Also, I installed a few weeks ago the kernel 3.15.30-1 that was available for test on PVE 7.1, on my production servers, that solved for me another problem (windows VM not rebooting correctly), and I don't see the problem you encountered. # uname -r 5.15.30-1-pve I will test the upgrade shortly. Alain -- Administrateur Syst?me/R?seau C2N Centre de Nanosciences et Nanotechnologies (UMR 9001) Boulevard Thomas Gobert (ex Avenue de La Vauve), 91120 Palaiseau Tel : 01-70-27-06-88 Bureau A255 From elacunza at binovo.es Thu May 12 18:35:10 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Thu, 12 May 2022 18:35:10 +0200 Subject: [PVE-User] PVE 7.2 unstability In-Reply-To: <7d6f9cc7-1dfe-63aa-b166-64d7f9b4c816@c2n.upsaclay.fr> References: <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es> <7d6f9cc7-1dfe-63aa-b166-64d7f9b4c816@c2n.upsaclay.fr> Message-ID: <0fb383be-c5bd-8e6d-7319-fc3b7e65a453@binovo.es> Hi Alain, El 12/5/22 a las 17:12, Alain P?an escribi?: > Le 12/05/2022 ? 16:57, Eneko Lacunza via pve-user a ?crit?: >> Finally we have worked around this issue downgrading to kernel 5.13: >> >> apt-get install proxmox-ve=7.1-1; apt-get remove >> pve-kernel-5.15.35-1-pve (+reboot) >> >> No need to downgrade pve-qemu-kvm no qemu-server . >> >> Sadly VMs running on kernel 5.15.35-1 will crash on live migration :-( > > > It is strange, as I don't see anybody saying they saw this problem on > the forum : > https://forum.proxmox.com/threads/proxmox-ve-7-2-released.108970/page-3 > I think Bengt Nolin in the first page is reporting something like this. > Also, I installed a few weeks ago the kernel 3.15.30-1 that was > available for test on PVE 7.1, on my production servers, that solved > for me another problem (windows VM not rebooting correctly), and I > don't see the problem you encountered. > > # uname -r > 5.15.30-1-pve > > I will test the upgrade shortly. Our problem has been a headache in our tests today :) I asure you it is there, and it is fixed downgrading kernel. I don't know why it's happening, but VMs' clock seems to broke suddenly and spectacularly... :) Nodes have Ryzen CPUs, and storage is Ceph. 
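In case it helps to compare notes, the guest-side clock state can be read via the standard Linux sysfs interface; this is a generic check, not anything Proxmox-specific:

  # clocksource the guest kernel is currently using (kvm-clock is the usual value under KVM)
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  # clocksources it could fall back to
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
  # any TSC / clocksource watchdog complaints logged so far
  dmesg | grep -iE 'tsc|clocksource'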
Network is 10G for Ceph/migrations, 10G for VMs/cluster. Cheers Eneko Lacunza Zuzendari teknikoa | Director t?cnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From gilberto.nunes32 at gmail.com Thu May 12 19:23:04 2022 From: gilberto.nunes32 at gmail.com (Gilberto Ferreira) Date: Thu, 12 May 2022 14:23:04 -0300 Subject: [PVE-User] PVE 7.2 unstability In-Reply-To: References: <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es> <7d6f9cc7-1dfe-63aa-b166-64d7f9b4c816@c2n.upsaclay.fr> Message-ID: Hi there. A couple of friends also complain about kernel 5.15, regarding WIndows and Linux VMS weird behavior. After downgrading to 5.13 everything seems to be ok. --- Gilberto Nunes Ferreira Em qui., 12 de mai. de 2022 ?s 13:35, Eneko Lacunza via pve-user < pve-user at lists.proxmox.com> escreveu: > > > > ---------- Forwarded message ---------- > From: Eneko Lacunza > To: pve-user at lists.proxmox.com > Cc: > Bcc: > Date: Thu, 12 May 2022 18:35:10 +0200 > Subject: Re: [PVE-User] PVE 7.2 unstability > Hi Alain, > > El 12/5/22 a las 17:12, Alain P?an escribi?: > > Le 12/05/2022 ? 16:57, Eneko Lacunza via pve-user a ?crit : > >> Finally we have worked around this issue downgrading to kernel 5.13: > >> > >> apt-get install proxmox-ve=7.1-1; apt-get remove > >> pve-kernel-5.15.35-1-pve (+reboot) > >> > >> No need to downgrade pve-qemu-kvm no qemu-server . > >> > >> Sadly VMs running on kernel 5.15.35-1 will crash on live migration :-( > > > > > > It is strange, as I don't see anybody saying they saw this problem on > > the forum : > > https://forum.proxmox.com/threads/proxmox-ve-7-2-released.108970/page-3 > > > > I think Bengt Nolin in the first page is reporting something like this. > > > Also, I installed a few weeks ago the kernel 3.15.30-1 that was > > available for test on PVE 7.1, on my production servers, that solved > > for me another problem (windows VM not rebooting correctly), and I > > don't see the problem you encountered. > > > > # uname -r > > 5.15.30-1-pve > > > > I will test the upgrade shortly. > > Our problem has been a headache in our tests today :) I asure you it is > there, and it is fixed downgrading kernel. > > I don't know why it's happening, but VMs' clock seems to broke suddenly > and spectacularly... :) > > Nodes have Ryzen CPUs, and storage is Ceph. Network is 10G for > Ceph/migrations, 10G for VMs/cluster. > > Cheers > > Eneko Lacunza > Zuzendari teknikoa | Director t?cnico > Binovo IT Human Project > > Tel. +34 943 569 206 |https://www.binovo.es > Astigarragako Bidea, 2 - 2? izda. 
Oficina 10-11, 20180 Oiartzun > > https://www.youtube.com/user/CANALBINOVO > https://www.linkedin.com/company/37269706/ > > > > ---------- Forwarded message ---------- > From: Eneko Lacunza via pve-user > To: pve-user at lists.proxmox.com > Cc: Eneko Lacunza > Bcc: > Date: Thu, 12 May 2022 18:35:10 +0200 > Subject: Re: [PVE-User] PVE 7.2 unstability > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user > From elacunza at binovo.es Fri May 13 09:46:17 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Fri, 13 May 2022 09:46:17 +0200 Subject: [PVE-User] PVE 7.2 unstability In-Reply-To: References: <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es> <7d6f9cc7-1dfe-63aa-b166-64d7f9b4c816@c2n.upsaclay.fr> Message-ID: <7f326bd4-db6e-2a61-66d9-11b579603803@binovo.es> I have filled a bug: https://bugzilla.proxmox.com/show_bug.cgi?id=4057 El 12/5/22 a las 19:23, Gilberto Ferreira escribi?: > Hi there. > A couple of friends also complain about kernel 5.15, regarding WIndows > and Linux VMS weird behavior. > After downgrading to 5.13 everything seems to be ok. > > --- > Gilberto Nunes Ferreira > > > > > > > Em qui., 12 de mai. de 2022 ?s 13:35, Eneko Lacunza via pve-user > escreveu: > > > > > ---------- Forwarded message ---------- > From:?Eneko Lacunza > To: pve-user at lists.proxmox.com > Cc: > Bcc: > Date:?Thu, 12 May 2022 18:35:10 +0200 > Subject:?Re: [PVE-User] PVE 7.2 unstability > Hi Alain, > > El 12/5/22 a las 17:12, Alain P?an escribi?: > > Le 12/05/2022 ? 16:57, Eneko Lacunza via pve-user a ?crit?: > >> Finally we have worked around this issue downgrading to kernel > 5.13: > >> > >> apt-get install proxmox-ve=7.1-1; apt-get remove > >> pve-kernel-5.15.35-1-pve (+reboot) > >> > >> No need to downgrade pve-qemu-kvm no qemu-server . > >> > >> Sadly VMs running on kernel 5.15.35-1 will crash on live > migration :-( > > > > > > It is strange, as I don't see anybody saying they saw this > problem on > > the forum : > > > https://forum.proxmox.com/threads/proxmox-ve-7-2-released.108970/page-3 > > > > I think Bengt Nolin in the first page is reporting something like > this. > > > Also, I installed a few weeks ago the kernel 3.15.30-1 that was > > available for test on PVE 7.1, on my production servers, that > solved > > for me another problem (windows VM not rebooting correctly), and I > > don't see the problem you encountered. > > > > # uname -r > > 5.15.30-1-pve > > > > I will test the upgrade shortly. > > Our problem has been a headache in our tests today :) I asure you > it is > there, and it is fixed downgrading kernel. > > I don't know why it's happening, but VMs' clock seems to broke > suddenly > and spectacularly... :) > > Nodes have Ryzen CPUs, and storage is Ceph. Network is 10G for > Ceph/migrations, 10G for VMs/cluster. > > Cheers > > Eneko Lacunza > Zuzendari teknikoa | Director t?cnico > Binovo IT Human Project > > Tel. +34 943 569 206 |https://www.binovo.es > Astigarragako Bidea, 2 - 2? izda. 
Oficina 10-11, 20180 Oiartzun > > https://www.youtube.com/user/CANALBINOVO > https://www.linkedin.com/company/37269706/ > > > > ---------- Forwarded message ---------- > From:?Eneko Lacunza via pve-user > To: pve-user at lists.proxmox.com > Cc:?Eneko Lacunza > Bcc: > Date:?Thu, 12 May 2022 18:35:10 +0200 > Subject:?Re: [PVE-User] PVE 7.2 unstability > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user > Eneko Lacunza Zuzendari teknikoa | Director t?cnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From sebastian at debianfan.de Tue May 17 06:54:43 2022 From: sebastian at debianfan.de (sebastian at debianfan.de) Date: Tue, 17 May 2022 06:54:43 +0200 Subject: [PVE-User] Directory /var/log/journal Message-ID: <2cd22c8a-31f7-964a-81f6-530897cf5112@debianfan.de> Hello @all, whats up with this directory. Is it possible to delete all the files in this directory while the server is running. Would there be any problems if i delete the files now without rebooting the pve-Host? I don't need log files this time - i need space on the partition. Tnx Sebastian From nada at verdnatura.es Tue May 17 08:10:46 2022 From: nada at verdnatura.es (nada) Date: Tue, 17 May 2022 08:10:46 +0200 Subject: [PVE-User] Directory /var/log/journal In-Reply-To: <2cd22c8a-31f7-964a-81f6-530897cf5112@debianfan.de> References: <2cd22c8a-31f7-964a-81f6-530897cf5112@debianfan.de> Message-ID: hi Sebastian depends on type of journal you have check your storage config at /etc/systemd/journald.conf in case you have persistent journal you may clean it e.g. clean old journals each month /usr/bin/journalctl --vacuum-time=1months --rotate Nada On 2022-05-17 06:54, sebastian at debianfan.de wrote: > Hello @all, > > whats up with this directory. > > Is it possible to delete all the files in this directory while the > server is running. > > Would there be any problems if i delete the files now without > rebooting the pve-Host? > > I don't need log files this time - i need space on the partition. > > Tnx > > Sebastian > > > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user From sebastian at debianfan.de Tue May 17 11:35:03 2022 From: sebastian at debianfan.de (sebastian at debianfan.de) Date: Tue, 17 May 2022 11:35:03 +0200 Subject: [PVE-User] Directory /var/log/journal In-Reply-To: References: <2cd22c8a-31f7-964a-81f6-530897cf5112@debianfan.de> Message-ID: it is possible to delete all the files and reboot the host without "problems" ? i don't need the journal Am 17.05.2022 um 08:10 schrieb nada: > hi Sebastian > depends on type of? journal you have > check your storage config at > /etc/systemd/journald.conf > in case you have persistent journal you may clean it > e.g. clean old journals each month > /usr/bin/journalctl --vacuum-time=1months --rotate > Nada > > On 2022-05-17 06:54, sebastian at debianfan.de wrote: >> Hello @all, >> >> whats up with this directory. >> >> Is it possible to delete all the files in this directory while the >> server is running. >> >> Would there be any problems if i delete the files now without >> rebooting the pve-Host? >> >> I don't need log files this time - i need space on the partition. 
>> >> Tnx >> >> Sebastian >> >> >> _______________________________________________ >> pve-user mailing list >> pve-user at lists.proxmox.com >> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user > > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user > From uwe.sauter.de at gmail.com Tue May 17 11:43:09 2022 From: uwe.sauter.de at gmail.com (Uwe Sauter) Date: Tue, 17 May 2022 11:43:09 +0200 Subject: [PVE-User] Directory /var/log/journal In-Reply-To: References: <2cd22c8a-31f7-964a-81f6-530897cf5112@debianfan.de> Message-ID: <8b61e7fe-8c51-752f-d8f4-f3521f9615bb@gmail.com> Then you should change the configuration in /etc/systemd/journald.conf to not save your journal, then reboot, then remove the directory. Am 17.05.22 um 11:35 schrieb sebastian at debianfan.de: > it is possible to delete all the files and reboot the host without "problems" ? > > i don't need the journal > > Am 17.05.2022 um 08:10 schrieb nada: >> hi Sebastian >> depends on type of? journal you have >> check your storage config at >> /etc/systemd/journald.conf >> in case you have persistent journal you may clean it >> e.g. clean old journals each month >> /usr/bin/journalctl --vacuum-time=1months --rotate >> Nada >> >> On 2022-05-17 06:54, sebastian at debianfan.de wrote: >>> Hello @all, >>> >>> whats up with this directory. >>> >>> Is it possible to delete all the files in this directory while the >>> server is running. >>> >>> Would there be any problems if i delete the files now without >>> rebooting the pve-Host? >>> >>> I don't need log files this time - i need space on the partition. >>> >>> Tnx >>> >>> Sebastian >>> >>> >>> _______________________________________________ >>> pve-user mailing list >>> pve-user at lists.proxmox.com >>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user >> >> _______________________________________________ >> pve-user mailing list >> pve-user at lists.proxmox.com >> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user >> > > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user From gaio at lilliput.linux.it Wed May 18 10:04:33 2022 From: gaio at lilliput.linux.it (Marco Gaiarin) Date: Wed, 18 May 2022 10:04:33 +0200 Subject: [PVE-User] Severe disk corruption: PBS, SATA Message-ID: We are depicting some vary severe disk corruption on one of our installation, that is indeed a bit 'niche' but... 
PVE 6.4 host on a Dell PowerEdge T340: root at sdpve1:~# uname -a Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux Debian squeeze i386 on guest: sdinny:~# uname -a Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux boot disk defined as: sata0: local-zfs:vm-120-disk-0,discard=on,size=100G After enabling PBS, everytime the backup of the VM start: root at sdpve1:~# grep vzdump /var/log/syslog.1 May 17 20:27:17 sdpve1 pvedaemon[24825]: starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot May 17 20:36:50 sdpve1 pvedaemon[24825]: end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: OK May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys at admin --quiet 1 --mailnotification failure --storage pbs-BP) May 17 22:00:02 sdpve1 vzdump[1738]: starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys at admin May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu) May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50) May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu) May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17) May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu) May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52) May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully May 17 23:31:02 sdpve1 vzdump[1738]: end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: OK The VM depicted some massive and severe IO trouble: May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out May 17 22:40:48 sdinny kernel: [124793.000749] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY } May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out May 17 22:40:48 sdinny kernel: [124793.002175] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY } May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out May 17 22:40:48 sdinny kernel: [124793.003559] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY } May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out May 17 22:40:48 sdinny kernel: [124793.004894] res 
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY } [...] May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300) May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100 May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device reported invalid CHS sector 0 May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete VM is still 'alive', and works. But we was forced to do a reboot (power outgage) and after that all the partition of the disk desappeared, we were forced to restore them with some tools like 'testdisk'. Partition on backups the same, desappeared. Note that there's also a 'plain' local backup that run on sunday, and this backup task seems does not generate trouble (but still seems to have partition desappeared, thus was done after an I/O error). We have hit a Kernel/Qemu bug? -- E sempre allegri bisogna stare, che il nostro piangere fa male al Re fa male al ricco, al Cardinale, diventan tristi se noi piangiam... (Fo, Jannacci) From elacunza at binovo.es Wed May 18 10:53:04 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Wed, 18 May 2022 10:53:04 +0200 Subject: [PVE-User] Severe disk corruption: PBS, SATA In-Reply-To: References: Message-ID: Hi Marco, I would try changing that sata0 disk to virtio-blk (maybe in a clone VM first). I think squeeze will support it; then try PBS backup again. El 18/5/22 a las 10:04, Marco Gaiarin escribi?: > We are depicting some vary severe disk corruption on one of our > installation, that is indeed a bit 'niche' but... 
> > PVE 6.4 host on a Dell PowerEdge T340: > root at sdpve1:~# uname -a > Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux > > Debian squeeze i386 on guest: > sdinny:~# uname -a > Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux > > boot disk defined as: > sata0: local-zfs:vm-120-disk-0,discard=on,size=100G > > > After enabling PBS, everytime the backup of the VM start: > > root at sdpve1:~# grep vzdump /var/log/syslog.1 > May 17 20:27:17 sdpve1 pvedaemon[24825]: starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: > May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot > May 17 20:36:50 sdpve1 pvedaemon[24825]: end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: OK > May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys at admin --quiet 1 --mailnotification failure --storage pbs-BP) > May 17 22:00:02 sdpve1 vzdump[1738]: starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: > May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys at admin > May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu) > May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50) > May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu) > May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17) > May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu) > May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52) > May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully > May 17 23:31:02 sdpve1 vzdump[1738]: end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: OK > > The VM depicted some massive and severe IO trouble: > > May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen > May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED > May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out > May 17 22:40:48 sdinny kernel: [124793.000749] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY } > May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED > May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out > May 17 22:40:48 sdinny kernel: [124793.002175] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY } > May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED > May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out > May 17 22:40:48 sdinny kernel: [124793.003559] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY } > May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED > May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 
61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out > May 17 22:40:48 sdinny kernel: [124793.004894] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY } > [...] > May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link > May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100 > May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete > > VM is still 'alive', and works. > But we was forced to do a reboot (power outgage) and after that all the > partition of the disk desappeared, we were forced to restore them with > some tools like 'testdisk'. > Partition on backups the same, desappeared. > > > Note that there's also a 'plain' local backup that run on sunday, and this > backup task seems does not generate trouble (but still seems to have > partition desappeared, thus was done after an I/O error). > > > We have hit a Kernel/Qemu bug? > Eneko Lacunza Zuzendari teknikoa | Director t?cnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From jmr.richardson at gmail.com Wed May 18 17:24:53 2022 From: jmr.richardson at gmail.com (JR Richardson) Date: Wed, 18 May 2022 10:24:53 -0500 Subject: [PVE-User] VMware SD-WAN Virtual Edge Not Working Message-ID: Hey Folks, We are testing deployment for using VMware/Velo virtual edge appliance on Prox, hypervisor Dell R630 specs: 40 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2 Sockets) Linux 5.4.174-2-pve #1 SMP PVE 5.4.174-2 (Thu, 10 Mar 2022 15:58:44 +0100) pve-manager/6.4-14/15e2bf61 VMware SD-WAN edge appliance version 4.3.1 latest GA release. We can get the VM started OK and connected to orchestrator except the VPN tunnels are not coming up. We are using 'host' processor type and see all the required CPU Flags available to the VM. We are running a bunch of these virtual edge vms on VMware ESXi hypervisors with no issues, but looking to change over to using Proxmox. 
The only error we get in orchestration when diagnosing the problem is "Edge dataplane service failed" and there is no vpn traffic coming from the VM so it's like something with the VM is not able to access some resource needed to start VPN services. AES-NI, SSSE3, SSE4, RDTSC, RDSEED, RDRAND instruction sets are all available to the VM. Is anyone else successful deploying VMware SD-WAN appliances with Proxmox/KVM or seeing the same issue I'm having? We're opening a support case with VMware, but no word back from them yet. Thanks. JR From nada at verdnatura.es Wed May 18 18:20:40 2022 From: nada at verdnatura.es (nada) Date: Wed, 18 May 2022 18:20:40 +0200 Subject: [PVE-User] Severe disk corruption: PBS, SATA In-Reply-To: References: Message-ID: hi Marco you used some local ZFS filesystem according to your info, so you may try zfs list zpool list -v zpool history zpool import ... zpool replace ... all the best Nada On 2022-05-18 10:04, Marco Gaiarin wrote: > We are depicting some vary severe disk corruption on one of our > installation, that is indeed a bit 'niche' but... > > PVE 6.4 host on a Dell PowerEdge T340: > root at sdpve1:~# uname -a > Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 > 11:08:47 +0100) x86_64 GNU/Linux > > Debian squeeze i386 on guest: > sdinny:~# uname -a > Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 > GNU/Linux > > boot disk defined as: > sata0: local-zfs:vm-120-disk-0,discard=on,size=100G > > > After enabling PBS, everytime the backup of the VM start: > > root at sdpve1:~# grep vzdump /var/log/syslog.1 > May 17 20:27:17 sdpve1 pvedaemon[24825]: starting task > UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: > May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup > job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd > --remove 0 --mode snapshot > May 17 20:36:50 sdpve1 pvedaemon[24825]: end task > UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: OK > May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 > --mode snapshot --mailto sys at admin --quiet 1 --mailnotification > failure --storage pbs-BP) > May 17 22:00:02 sdpve1 vzdump[1738]: starting task > UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: > May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: > vzdump 100 101 120 --mailnotification failure --quiet 1 --mode > snapshot --storage pbs-BP --mailto sys at admin > May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 > (qemu) > May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 > (00:00:50) > May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 > (qemu) > May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 > (00:01:17) > May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 > (qemu) > May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 > (01:28:52) > May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished > successfully > May 17 23:31:02 sdpve1 vzdump[1738]: end task > UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: OK > > The VM depicted some massive and severe IO trouble: > > May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception > Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen > May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed > command: WRITE FPDMA QUEUED > May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd > 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out > May 
17 22:40:48 sdinny kernel: [124793.000749] res > 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY > } > May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed > command: WRITE FPDMA QUEUED > May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd > 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out > May 17 22:40:48 sdinny kernel: [124793.002175] res > 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY > } > May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed > command: WRITE FPDMA QUEUED > May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd > 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out > May 17 22:40:48 sdinny kernel: [124793.003559] res > 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY > } > May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed > command: WRITE FPDMA QUEUED > May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd > 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out > May 17 22:40:48 sdinny kernel: [124793.004894] res > 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY > } > [...] > May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting > link > May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 > Gbps (SStatus 113 SControl 300) > May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for > UDMA/100 > May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device > reported invalid CHS sector 0 > May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete > > VM is still 'alive', and works. > But we was forced to do a reboot (power outgage) and after that all the > partition of the disk desappeared, we were forced to restore them with > some tools like 'testdisk'. > Partition on backups the same, desappeared. > > > Note that there's also a 'plain' local backup that run on sunday, and > this > backup task seems does not generate trouble (but still seems to have > partition desappeared, thus was done after an I/O error). > > > We have hit a Kernel/Qemu bug? 
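For reference, a minimal sketch of the kind of pool health checks suggested above; the pool name "rpool" is an assumption, substitute whatever "zpool list" reports on the affected host:

    zpool status -v rpool                # overall pool health plus per-device read/write/checksum error counters
    zpool list -v rpool                  # capacity and state of every vdev in the pool
    zfs list -o name,used,avail,refer    # datasets and zvols (e.g. the VM disk zvols) and their usage
    zpool scrub rpool                    # re-read every block and verify it against its checksum in the background
    zpool history rpool                  # review past operations on the pool

If "zpool status" stays clean while the guest keeps logging SATA timeouts, the problem is more likely above the pool (the emulated SATA controller in QEMU) than below it.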
From jmr.richardson at gmail.com Wed May 18 18:49:32 2022 From: jmr.richardson at gmail.com (JR Richardson) Date: Wed, 18 May 2022 11:49:32 -0500 Subject: [PVE-User] VMware SD-WAN Virtual Edge Not Working [SOLVED] In-Reply-To: References: Message-ID: Quick update: I changed the CPU to a single socket with multiple cores, and the appliance started acting as expected. Not sure why, but when using multiple sockets, even with NUMA enabled, the VM would not fully work. I guess something in the appliance code checks for socket/core configs and requires a single socket only. Hope this helps. Regards. JR On Wed, May 18, 2022 at 10:24 AM JR Richardson wrote: > > Hey Folks, > > We are testing deployment for using VMware/Velo virtual edge appliance > on Prox, hypervisor Dell R630 specs: > 40 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2 Sockets) > Linux 5.4.174-2-pve #1 SMP PVE 5.4.174-2 (Thu, 10 Mar 2022 15:58:44 +0100) > pve-manager/6.4-14/15e2bf61 > > VMware SD-WAN edge appliance version 4.3.1 latest GA release. > > We can get the VM started OK and connected to orchestrator except the > VPN tunnels are not coming up. We are using 'host' processor type and > see all the required CPU Flags available to the VM. We are running a > bunch of these virtual edge vms on VMware ESXi hypervisors with no > issues, but looking to change over to using Proxmox. > > The only error we get in orchestration when diagnosing the problem is > "Edge dataplane service failed" and there is no vpn traffic coming > from the VM so it's like something with the VM is not able to access > some resource needed to start VPN services. AES-NI, SSSE3, SSE4, > RDTSC, RDSEED, RDRAND instruction sets are all available to the VM. > > Is anyone else successful deploying VMware SD-WAN appliances with > Proxmox/KVM or seeing the same issue I'm having? We're opening a > support case with VMware, but no word back from them yet. > > Thanks. > JR From wolf at wolfspyre.com Thu May 19 06:07:05 2022 From: wolf at wolfspyre.com (Wolf Noble) Date: Wed, 18 May 2022 23:07:05 -0500 Subject: [PVE-User] Severe disk corruption: PBS, SATA In-Reply-To: References: Message-ID: <4212DB65-25CD-491E-8380-E7D43B9063BF@wolfspyre.com> From over here in the cheap seats, another potential strangeness injector: ZFS + any sort of RAID controller which plays the abstraction game between raw disk and the OS can cause any number of weird and painful scenarios. ZFS believes it has an accurate idea of the underlying disks. It does its voodoo wholly believing that it's solely responsible for dealing with data durability. With a RAID controller in between playing the shell game with IO, things USUALLY work... RIGHT UNTIL THEY DON'T. I'm sure you're well aware of this, and have probably already mitigated this concern with a JBOD controller, or something that isn't preventing the OS (and thus ZFS) from talking directly to the disks... but it felt worth pointing out on the off chance that this got overlooked. Hope you are well and the gremlins are promptly discovered and put back into their comfortable chairs so they can resume their harmless heckling. W [= The contents of this message have been written, read, processed, erased, sorted, sniffed, compressed, rewritten, misspelled, overcompensated, lost, found, and most importantly delivered entirely with recycled electrons =] > On May 18, 2022, at 11:21, nada wrote: > > hi Marco > you used some local ZFS filesystem according to your info, so you may try > > zfs list > zpool list -v > zpool history > zpool import ... > zpool replace ...
> > all the best > Nada > >> On 2022-05-18 10:04, Marco Gaiarin wrote: >> We are depicting some vary severe disk corruption on one of our >> installation, that is indeed a bit 'niche' but... >> PVE 6.4 host on a Dell PowerEdge T340: >> root at sdpve1:~# uname -a >> Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 >> 11:08:47 +0100) x86_64 GNU/Linux >> Debian squeeze i386 on guest: >> sdinny:~# uname -a >> Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux >> boot disk defined as: >> sata0: local-zfs:vm-120-disk-0,discard=on,size=100G >> After enabling PBS, everytime the backup of the VM start: >> root at sdpve1:~# grep vzdump /var/log/syslog.1 >> May 17 20:27:17 sdpve1 pvedaemon[24825]: starting task >> UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: >> May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup >> job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd >> --remove 0 --mode snapshot >> May 17 20:36:50 sdpve1 pvedaemon[24825]: end task >> UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: OK >> May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 >> --mode snapshot --mailto sys at admin --quiet 1 --mailnotification >> failure --storage pbs-BP) >> May 17 22:00:02 sdpve1 vzdump[1738]: starting task >> UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: >> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: >> vzdump 100 101 120 --mailnotification failure --quiet 1 --mode >> snapshot --storage pbs-BP --mailto sys at admin >> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu) >> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50) >> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu) >> May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17) >> May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu) >> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52) >> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully >> May 17 23:31:02 sdpve1 vzdump[1738]: end task >> UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: OK >> The VM depicted some massive and severe IO trouble: >> May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception >> Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen >> May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed >> command: WRITE FPDMA QUEUED >> May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd >> 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out >> May 17 22:40:48 sdinny kernel: [124793.000749] res >> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) >> May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY } >> May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed >> command: WRITE FPDMA QUEUED >> May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd >> 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out >> May 17 22:40:48 sdinny kernel: [124793.002175] res >> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) >> May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY } >> May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed >> command: WRITE FPDMA QUEUED >> May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd >> 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out >> May 17 22:40:48 sdinny kernel: [124793.003559] res >> 
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) >> May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY } >> May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed >> command: WRITE FPDMA QUEUED >> May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd >> 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out >> May 17 22:40:48 sdinny kernel: [124793.004894] res >> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) >> May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY } >> [...] >> May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link >> May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 >> Gbps (SStatus 113 SControl 300) >> May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100 >> May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device >> reported invalid CHS sector 0 >> May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete >> VM is still 'alive', and works. >> But we was forced to do a reboot (power outgage) and after that all the >> partition of the disk desappeared, we were forced to restore them with >> some tools like 'testdisk'. >> Partition on backups the same, desappeared. >> Note that there's also a 'plain' local backup that run on sunday, and this >> backup task seems does not generate trouble (but still seems to have >> partition desappeared, thus was done after an I/O error). >> We have hit a Kernel/Qemu bug? > > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user > From elacunza at binovo.es Thu May 19 16:57:44 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Thu, 19 May 2022 16:57:44 +0200 Subject: PVE 7.2 - Avago MegaRAID broken? Message-ID: Hi all, Today we installed PVE 7.1 (ISO) in a relatively old machine. Installation was fine and Proxmox has booted OK. But after configuring non-subscription repository and upgrading to PVE 7.2/kernel 5.15, Proxmox won't boot anymore: Kernel will print lots of messages like "DMAR: DRHD: handling fault status reg 3" "DMAR: [DMA Read NO_PASID] Request device [02:00.0] fault ???? 4b311000 [fault reason 0x06] PTE Readaccess is not set. (???? there are some missing chars in photos I shot, sorry). 
After about 2,5 minutes, it would open a shell in initramfs, complaining pve vg was not found and "Gave up waiting for root file system device". I suspected of a faulty controller first, but after booting with 5.13 kernel (even the latest one as of today, -6) all was fine again. We have removed 5.15 kernel, and rebooted 2-3 times, all is good now. :-) Controller is Avago MegaRAID SAS-MFI BIOS Version 6.36.00.2 (Build Sep 11, 2017) HA -0 (Bus 2 Dev 0) AVAGO MegaRAID SAS 9341-4i FW package: 24.21.0-0025 Product AVAGO MegaRAID SAS 9341-4i? is listed as Revision 4.680.01-??? (shot cut there, sorry) Controller has 3 WDC Gold 4TB in RAID5 attached. This is worked-around now, but I'm starting to worry about latest 5.15 kernel in PVE... :-) Thanks Eneko Lacunza Zuzendari teknikoa | Director t?cnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From s.ivanov at proxmox.com Thu May 19 18:31:38 2022 From: s.ivanov at proxmox.com (Stoiko Ivanov) Date: Thu, 19 May 2022 18:31:38 +0200 Subject: [PVE-User] PVE 7.2 - Avago MegaRAID broken? In-Reply-To: References: Message-ID: <20220519183138.1647bed6@rosa.proxmox.com> Hi, On Thu, 19 May 2022 16:57:44 +0200 Eneko Lacunza via pve-user wrote: > Hi all, > > Today we installed PVE 7.1 (ISO) in a relatively old machine. any more details on what kind of machine this is (CPU generation, if it's an older HP/Dell/Supermicro server or consumerhardware)? > Kernel will print lots of messages like > > "DMAR: DRHD: handling fault status reg 3" > "DMAR: [DMA Read NO_PASID] Request device [02:00.0] fault ???? 4b311000 > [fault reason 0x06] PTE Readaccess is not set. could you please try (in that order, and until one the suggestions fixes the issue): * adding `iommu=pt` to the kernel cmdline * adding `intel_iommu=off` to the kernel cmdline we have updated the known-issues section of the release-notes to suggest this already after a few similar reports with older hardware/unusual setups in our community forum: https://pve.proxmox.com/wiki/Roadmap#7.2-known-issues see https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline for instruction on how to edit the cmdline. > This is worked-around now, but I'm starting to worry about latest 5.15 > kernel in PVE... :-) > I think we have similar reports with each new kernel-series - mostly with older systems, which need to install a small workaround (usually module parameter or kernel cmdline switch). Our tests on many machines in our testlab (covering the past 10 years of hardware more or less well) all did not show any general issues - but it's sadly always a hit and miss. Please let us know if the changes help Kind regards, stoiko From elacunza at binovo.es Thu May 19 18:50:26 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Thu, 19 May 2022 18:50:26 +0200 Subject: [PVE-User] PVE 7.2 - Avago MegaRAID broken? In-Reply-To: <20220519183138.1647bed6@rosa.proxmox.com> References: <20220519183138.1647bed6@rosa.proxmox.com> Message-ID: <493da81e-e10c-f494-7055-1ab5e0760cd0@binovo.es> Hi Stoiko, El 19/5/22 a las 18:31, Stoiko Ivanov escribi?: > >> Today we installed PVE 7.1 (ISO) in a relatively old machine. > any more details on what kind of machine this is > (CPU generation, if it's an older HP/Dell/Supermicro server or > consumerhardware)? The system is in a customer site, but I'll try to gather detailed data tomorrow. 
CPU is a Xeon E or E3, I can't recall exact model right now. The server has an Asus motherboard; this puts it in the consumer-server category or something like that I guess :-) > Kernel will print lots of messages like > > "DMAR: DRHD: handling fault status reg 3" > "DMAR: [DMA Read NO_PASID] Request device [02:00.0] fault ???? 4b311000 > [fault reason 0x06] PTE Readaccess is not set. > could you please try (in that order, and until one the suggestions fixes > the issue): > * adding `iommu=pt` to the kernel cmdline > * adding `intel_iommu=off` to the kernel cmdline > we have updated the known-issues section of the release-notes to suggest > this already after a few similar reports with older hardware/unusual > setups in our community forum: > https://pve.proxmox.com/wiki/Roadmap#7.2-known-issues Ok, I didn't notice those known issues, will check them next time. I think I will be unable to try this soon as the system is not local, but if I can I will report back, thanks. > > see > https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline > for instruction on how to edit the cmdline. > >> This is worked-around now, but I'm starting to worry about latest 5.15 >> kernel in PVE... :-) >> > I think we have similar reports with each new kernel-series - mostly with > older systems, which need to install a small workaround (usually module > parameter or kernel cmdline switch). > Our tests on many machines in our testlab (covering the past 10 years of > hardware more or less well) all did not show any general issues - but > it's sadly always a hit and miss. Sure, there's way too much hardware out there, this wasn't intended to be a complaint, at least not about your excellent work at Proxmox :) The intent was to warn other users, but your known-issues section in the release notes is good too. It's the first time I notice this mix of issues with a new kernel version in more than 10 years of experience with Proxmox (or Linux on servers), but maybe it's also that our maintained server base is expanding. > Please let us know if the changes help > Thanks for your helpful replies, I will try to test your suggestions and will reply back with the results. Regards Eneko Lacunza Zuzendari teknikoa | Director técnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2ª izda. Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From piccardi at truelite.it Thu May 19 19:58:09 2022 From: piccardi at truelite.it (Simone Piccardi) Date: Thu, 19 May 2022 19:58:09 +0200 Subject: Strange problem on bridge after upgrade to proxmox 7 Message-ID: <4c3b82eb-06ac-eb4b-64b1-8f7e54b9c15e@truelite.it> Hi, I have a very strange networking problem on a Proxmox server, which emerged after upgrading from 6.4 to 7.
These are the results of pveversion on the server: root at lama10:~# pveversion -V proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve) pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1) pve-kernel-5.15: 7.2-3 pve-kernel-helper: 7.2-3 pve-kernel-5.13: 7.1-9 pve-kernel-5.15.35-1-pve: 5.15.35-2 pve-kernel-5.13.19-6-pve: 5.13.19-15 ceph-fuse: 14.2.21-1 corosync: 3.1.5-pve2 criu: 3.15-1+pve-1 glusterfs-client: 9.2-1 ifupdown: residual config ifupdown2: 3.1.0-1+pmx3 ksm-control-daemon: 1.4-1 libjs-extjs: 7.0.0-1 libknet1: 1.22-pve2 libproxmox-acme-perl: 1.4.2 libproxmox-backup-qemu0: 1.2.0-1 libpve-access-control: 7.1-8 libpve-apiclient-perl: 3.2-1 libpve-common-perl: 7.1-6 libpve-guest-common-perl: 4.1-2 libpve-http-server-perl: 4.1-1 libpve-storage-perl: 7.2-2 libspice-server1: 0.14.3-2.1 lvm2: 2.03.11-2.1 lxc-pve: 4.0.12-1 lxcfs: 4.0.12-pve1 novnc-pve: 1.3.0-3 proxmox-backup-client: 2.1.8-1 proxmox-backup-file-restore: 2.1.8-1 proxmox-mini-journalreader: 1.3-1 proxmox-widget-toolkit: 3.4-10 pve-cluster: 7.2-1 pve-container: 4.2-1 pve-docs: 7.2-2 pve-edk2-firmware: 3.20210831-2 pve-firewall: 4.2-5 pve-firmware: 3.4-2 pve-ha-manager: 3.3-4 pve-i18n: 2.7-1 pve-qemu-kvm: 6.2.0-5 pve-xtermjs: 4.16.0-1 qemu-server: 7.2-2 smartmontools: 7.2-pve3 spiceterm: 3.2-2 swtpm: 0.7.1~bpo11+1 vncterm: 1.7-1 zfsutils-linux: 2.1.4-pve1 The server has 4 network interfaces, bound in pairs in active-passive mode, then bridged. This is its /etc/network/interfaces: auto eth0 iface eth0 inet manual auto eth1 iface eth1 inet manual auto eth2 iface eth2 inet manual auto eth3 iface eth3 inet manual auto bond0 iface bond0 inet manual bond-slaves eth0 eth1 bond-miimon 100 bond-mode active-backup bond-primary eth0 auto bond1 iface bond1 inet manual bond-slaves eth2 eth3 bond-miimon 100 bond-mode active-backup bond-primary eth2 auto vmbr0 iface vmbr0 inet static address 192.168.250.110/23 gateway 192.168.250.254 bridge-ports bond0 bridge-stp off bridge-fd 0 auto vmbr1 iface vmbr1 inet static address 192.168.223.110/24 bridge-ports bond1 bridge-stp off bridge-fd 0 The network problems come up only when connecting to the virtual machines hosted by the server (no containers are used); there is no problem at all connecting to the server itself. The only anomaly I could find is that the bridge seems to see the MAC address of some of the VMs as coming from the wrong internal port, so they become unreachable. To explain what this means, I put 3 test VMs on the server (two Debian 11 and a Windows one, just to exclude problems at the operating system level) using the vmbr1 bridge; their tap interfaces are: root at lama10:~# brctl show vmbr1 bridge name bridge id STP enabled interfaces vmbr1 8000.7a576e974a37 no bond1 tap403i0 tap404i0 tap603i0 Sometimes some of them are working and some are not. When I was writing this email the VM 404 was not working.
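(For reference, the same forwarding-table and per-port learning information can also be read with the iproute2 "bridge" tool — a sketch only, reusing device and MAC names from the surrounding output; the commands below are an equivalent, not something taken from the original post:

    bridge fdb show br vmbr1 | grep -i be:47:4c:d5:5d:a9    # which port the bridge currently associates with the VM's MAC
    bridge -d link show dev tap404i0                        # per-port flags such as learning and flood

)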
Looking at the tap404i0 MAC address I get: root at lama10:~# ip -br link show dev tap404i0 tap404i0 UNKNOWN 26:6f:0c:19:95:58 while the 404 VM's own MAC address is: root at lama10:~# grep vmbr1 /etc/pve/qemu-server/404.conf net0: virtio=BE:47:4C:D5:5D:A9,bridge=vmbr1 and when I look at these MAC addresses as seen inside vmbr1 I get: root at lama10:~# brctl showmacs vmbr1 | egrep -i '(26:6f:0c:19:95:58|BE:47:4C:D5:5D:A9)' 4 26:6f:0c:19:95:58 yes 0.00 4 26:6f:0c:19:95:58 yes 0.00 1 be:47:4c:d5:5d:a9 no 0.65 Doing the same for another VM that was working (MAC addresses found as above) I get instead: root at lama10:~# brctl showmacs vmbr1 | egrep -i '(92:4f:ec:7e:8a:e1|DE:A3:E6:96:0C:6E)' 3 92:4f:ec:7e:8a:e1 yes 0.00 3 92:4f:ec:7e:8a:e1 yes 0.00 3 de:a3:e6:96:0c:6e no 2.32 Note: with "working" I mean that a VM is normally reachable over the network without packet loss. I checked multiple times and on other servers, and in all working cases the ports inside the vmbrX switch are the same for the TAP MAC and the VM MAC, as expected. When not working, the VM's own MAC always seems to be associated with port 1 (the one of the bonding interface). What I find in a "not working" VM is that the ARP reply is never received (looking with tcpdump run from the console). The ARP requests are sent, and seen on other VMs or on the host, but no replies are seen. Whether a VM works seems almost random (or at least I could not find a pattern up to now). After stopping and restarting the above working VM I got it not working anymore and the port on the bridge changed: root at lama10:~# brctl showmacs vmbr1 | egrep -i '(92:4f:ec:7e:8a:e1|DE:A3:E6:96:0C:6E)' 3 92:4f:ec:7e:8a:e1 yes 0.00 3 92:4f:ec:7e:8a:e1 yes 0.00 1 de:a3:e6:96:0c:6e no 0.86 What makes this behaviour "strange" is that the other two identical machines with the same Proxmox version (they are in a cluster with this one, and inside a blade rack) are working just fine. And there is no problem on the cluster (like I said, no network problems at all for the server itself). The only difference on the other two fully working nodes is that their bonding is configured as LACP. That was not possible for this one; it got loop error messages when configured, so I had to remove that configuration to avoid disturbing the other two nodes, where all production VMs were migrated and are running without problems. But another standalone server (with the same Proxmox version as all the other ones) that's outside the blade rack and is also configured with active-passive bonding, is working fine. So despite the difference in network configuration between all these servers I still cannot imagine how the different kind of bonding or the use of a different switch can have an impact on this problem. In the previous example I cannot ping the 404 VM either from the server itself or from the other working VMs hosted inside the server, and this kind of traffic is completely internal, done inside vmbr1. So I'm asking for directions on what to search for, and where to look to find how the ports inside the bridge are allocated, or any other suggestion that could shed some light on this issue. Simone -- Simone Piccardi Truelite Srl piccardi at truelite.it (email/jabber) Via Monferrato, 6 Tel. +39-347-1032433 50142 Firenze http://www.truelite.it Tel.
+39-055-7879597 From mailinglists at xaq.nl Thu May 19 20:58:11 2022 From: mailinglists at xaq.nl (Richard Lucassen) Date: Thu, 19 May 2022 20:58:11 +0200 Subject: [PVE-User] Strange problem on bridge after upgrade to proxmox 7 In-Reply-To: References: Message-ID: <20220519205811.27302a796725fbc679a33948@xaq.nl> On Thu, 19 May 2022 19:58:09 +0200 Simone Piccardi via pve-user wrote: > Hi, I have a very strange networking problem on a Proxmox server, > emerged after upgrading from 6.4 to 7. I have no idea if this can have something to do with it, but not a very long time ago I had two Dell R210 servers connected through a simple failover bond0. The issue I found was that somehow these bond0 devices on two *different* servers got the *same* fixed MAC address. After some searching I stumbled upon this: https://blog.sigterm.se/posts/a-bonding-exercise/ I had some discussion afterward with Patrik and I ended up adding a fixed MAC address in the /etc/network/interfaces stanza, e.g.: hwaddress ether 4a:89:66:60:e4:97 I just want to point out this phenomenon because you can get the weirdest behaviour if you have two devices with the same MAC. I tested loading the bonding module on some workstations: modprobe -v bonding ip link show bond0 and see what address it gets; it depends on this value: cat /sys/class/net/bond0/addr_assign_type which may be different from host to host. To remove the module: modprobe -rv bonding I had no time to dive deeper into this matter, I just worked around it by adding the "hwaddress ether" in the bond0 stanza. This works fine. My 2cts, R. -- richard lucassen http://contact.xaq.nl/ From alwin at antreich.com Thu May 19 21:06:28 2022 From: alwin at antreich.com (Alwin Antreich) Date: Thu, 19 May 2022 21:06:28 +0200 Subject: [PVE-User] Strange problem on bridge after upgrade to proxmox 7 In-Reply-To: References: Message-ID: On May 19, 2022 7:58:09 PM GMT+02:00, Simone Piccardi via pve-user wrote: >_______________________________________________ >pve-user mailing list >pve-user at lists.proxmox.com >https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user Hi Simone, Have you seen this section in the upgrade guide? https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0#Linux_Bridge_MAC-Address_Change In our case, we had an identical machine-id on two of our hosts and that killed the network for both. Cheers, Alwin From piccardi at truelite.it Fri May 20 14:21:39 2022 From: piccardi at truelite.it (Simone Piccardi) Date: Fri, 20 May 2022 14:21:39 +0200 Subject: [PVE-User] Strange problem on bridge after upgrade to proxmox 7 In-Reply-To: References: Message-ID: <561cd0e7-f6d3-71da-f75d-4b5500c9611a@truelite.it> On 19/05/22 21:06, Alwin Antreich wrote: > Have you seen this section in the upgrade guide? > https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0#Linux_Bridge_MAC-Address_Change > Yes, I read that one, and it gave me some headache, but not on these servers. Thanks anyway for the answer. The upgrade went fine for all the servers of the cluster (the external one was installed directly with 7), their machine-ids are all different and network communication between hosts has no problems. > In our case, we had an identical machine-id on two of our hosts and that killed the network for both. The problem I have is just between this specific host and the VMs it is hosting (also between VMs on the same network on other hosts and the ones inside this server, but that is just because these are unreachable from the host itself).
Simone -- Simone Piccardi Truelite Srl piccardi at truelite.it (email/jabber) Via Monferrato, 6 Tel. +39-347-1032433 50142 Firenze http://www.truelite.it Tel. +39-055-7879597 From piccardi at truelite.it Fri May 20 15:00:30 2022 From: piccardi at truelite.it (Simone Piccardi) Date: Fri, 20 May 2022 15:00:30 +0200 Subject: [PVE-User] Strange problem on bridge after upgrade to proxmox 7 In-Reply-To: <20220519205811.27302a796725fbc679a33948@xaq.nl> References: <20220519205811.27302a796725fbc679a33948@xaq.nl> Message-ID: <5982355a-14a7-0669-5e1a-5cc1d15e8622@truelite.it> On 19/05/22 20:58, Richard Lucassen wrote: > I have no idea if this can have something to do with it, but not a very > long time ago I had two Dell R210 servers connected through a simple > failover bond0. The issue I found was that somehow these bond0 devices > on two *different* servers got the *same* fixed MAC address. After some > searching I stumbled upon this: > > https://blog.sigterm.se/posts/a-bonding-exercise/ > That's a very interesting read, thanks for the link. Anyway I checked all bridge MAC addresses on all servers, and they are all different (they were installed independently anyway and have different machine-ids). > I had some discussion afterward with Patrik and I ended up in adding a > fixed MAC address in the /etc/network/interfaces stanza, e.g.: > > hwaddress ether 4a:89:66:60:e4:97 > I think this can solve MAC conflicts (I'll try it anyway) but I'm not seeing any duplicate MAC, and I only have problems between the VMs inside a bridge and this specific host. The host is perfectly reachable from everywhere. What I have is the same problem described here (I found this just this morning): https://forum.proxmox.com/threads/ve-7-1-10-slow-to-forward-arp-replies-over-bridge.106429/ a VM has a port inside the bridge that does not match the one used by its tap interface. I'll investigate the issue explained in this article referenced there: https://bugs.launchpad.net/neutron/+bug/1738659 Simone -- Simone Piccardi Truelite Srl piccardi at truelite.it (email/jabber) Via Monferrato, 6 Tel. +39-347-1032433 50142 Firenze http://www.truelite.it Tel. +39-055-7879597 From gaio at lilliput.linux.it Fri May 20 13:24:33 2022 From: gaio at lilliput.linux.it (Marco Gaiarin) Date: Fri, 20 May 2022 13:24:33 +0200 Subject: [PVE-User] Severe disk corruption: PBS, SATA In-Reply-To: ; from SmartGate on Sat, May 21, 2022 at 10:06:01AM +0200 References: Message-ID: Mandi! Eneko Lacunza via pve-user In chel di` si favelave... > I would try changing that sata0 disk to virtio-blk (maybe in a clone VM > first). I think squeeze will support it; then try PBS backup again. Disks migrated to 'Virtio Block'; now we are doing some tests, but it seems to work well. Thanks. To others: it seems it is not a ZFS problem, the same cluster runs other VMs without fuss... anyway thanks. -- Once someone asked Mahatma Gandhi what he thought about Western civilization. "I think it would be a good idea," he replied. From gaio at lilliput.linux.it Fri May 20 13:22:03 2022 From: gaio at lilliput.linux.it (Marco Gaiarin) Date: Fri, 20 May 2022 13:22:03 +0200 Subject: [PVE-User] Experimenting with bond on a non-LACP switch... Message-ID: I'm doing some experimentation on a switch that seems not to support LACP, even though it claims to; it is a Netgear GS724Tv2: https://www.downloads.netgear.com/files/GDC/GS724Tv2/enus_ds_gs724t.pdf The data sheet says: Port Trunking - Manual as per IEEE802.3ad Link Aggregation and 'IEEE802.3ad Link Aggregation' is LACP, right?
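For what it's worth, a quick way to check from the Linux side whether the switch is actually negotiating LACP — a sketch, the bond name is an assumption:

    cat /proc/net/bonding/bond0    # with bond-mode 802.3ad, check the "802.3ad info" section;
                                   # a partner system MAC of 00:00:00:00:00:00 on the slaves means
                                   # the switch is not answering with LACP PDUs (i.e. static trunking only)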
Anyway, I'm experimenting a bit with other bonding modes, having (un)expected results and troubles, but in: https://pve.proxmox.com/wiki/Network_Configuration#_linux_bond I've stumbled upon this sentence: If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode on the bonding interfaces, other modes are unsupported. What exactly does that mean?! Thanks. -- Many Italians dreamed of seeing Berlusconi in a police van, sooner or later... (Stardust, from i.n.n-a) From laurentfdumont at gmail.com Sun May 22 03:29:58 2022 From: laurentfdumont at gmail.com (Laurent Dumont) Date: Sat, 21 May 2022 21:29:58 -0400 Subject: [PVE-User] Experimenting with bond on a non-LACP switch... In-Reply-To: References: Message-ID: It's not made very clear from the documentation. I assume there are good technical reasons why the cluster traffic would be impacted. Afaik, Proxmox leverages corosync which can leverage multicast for the cluster checks. I don't think it can be badly impacted by LACP but something to keep in mind. There is this old thread with a similar discussion : https://forum.proxmox.com/threads/cluster-lacp.90668/ On Sat, May 21, 2022 at 4:10 AM Marco Gaiarin wrote: > > I'm doing some experimentation on a switch that seems does not support > LACP, > even thus claim that; is a Netgear GS724Tv2: > > > https://www.downloads.netgear.com/files/GDC/GS724Tv2/enus_ds_gs724t.pdf > > data sheet say: > > Port Trunking - Manual as per IEEE802.3ad Link Aggregation > > and 'IEEE802.3ad Link Aggregation' is LACP, right? > > > Anyway, i'm experimenting a bit with other bonding mode, having > (un)expected > results and troubles, but in: > > https://pve.proxmox.com/wiki/Network_Configuration#_linux_bond > > i've stumble upon that sentence: > > If you intend to run your cluster network on the bonding > interfaces, then you have to use active-passive mode on the bonding > interfaces, other modes are unsupported. > > What exactly mean?! Thanks. > > -- > Molti italiani sognavano di vedere Berlusconi in un cellulare, > prima o poi... (Stardust?, da i.n.n-a) > > > > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user > > From wolf at wolfspyre.com Sun May 22 06:12:28 2022 From: wolf at wolfspyre.com (Wolf Noble) Date: Sat, 21 May 2022 23:12:28 -0500 Subject: [PVE-User] Experimenting with bond on a non-LACP switch... In-Reply-To: References: Message-ID: Good Catch Marco! I'd not seen that when I read through that page, but I just re-read it.... My read is that it can introduce ODD edge-case complications. My synthesis of this information is outlined here. I encourage anyone to correct my misunderstandings. Network abstraction gets complicated QUICKLY. Network gear vendors implement their support for the different bonding modes in subtly different ways. Firewalls have their own quirks. Abstractions on top of abstractions on top of abstractions on top of abstractions on top of .... okay you get the point. We want to avoid asymmetric pathing where possible, because stuff gets quirky and edge-casey quickly. The fewer explicitly supported virtual topologies, the fewer scenarios the engineering teams need to scrutinize the COMPLEX edge case behaviors of, resulting in a better experience for EVERYONE.... Here's what I mean: LACP: This is a pretty well known and consistently implemented aggregation mechanism. The behaviors of network interfaces and switching hardware that are involved are pretty consistent. This GENERALLY works fine. The only time I've seen it get a little wonky is that LACP across switch chassis can behave oddly..
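For concreteness, a minimal /etc/network/interfaces sketch of the two bond flavours discussed in this thread; interface names and which NICs go where are assumptions, not taken from any poster's setup:

    auto bond0
    iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        # LACP: both links active, requires an 802.3ad/LACP-capable switch
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

    auto bond1
    iface bond1 inet manual
        bond-slaves eno3 eno4
        bond-miimon 100
        # active-backup: one link carries traffic, the other is a hot standby;
        # works with any switch, no LACP support needed
        bond-mode active-backup
        bond-primary eno3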
the behaviors of network interfaces and switching hardware that are involved are pretty consistent. This GENERALLY works fine. the only time I've seen it get a little wonky is LACP across switch chassis behavior can be odd.. -------------- next part -------------- When node 2 sends traffic destined for node 1 via lacp_b.1 ... it may traverse the trunked interconnect, it may not. (it depends) active-backup: I talk, and listen on link0... if link 0 goes down, I switchover... I don't really pay attention to linkl1 otherwise. active-passive: I talk on link0. I listen on link0 and link1. The downside that I've seen here: Arp caching can get wonky, and packets that SHOULD be directed to node0 link0 get directed to node0 link1... or sometimes packets directed to node0.link0 have a destination mac address of the hwaddr of link1 and so get delivered to link1 ... There MAY be some oddities that manifest with this configuration. depending on (node scope configuration) sysctl settings, node0 could just ignore those packets, resulting in weird behavior with the various balance algorithms nodes will see a different hardware addresses for each other, again, this isn't *USUALLY* a problem, but there are still some dragons that lurk within the trunking/bonding code... hardware checksumming can get whacky... especially when VLANs get mixed in... My gut tells me that the main reason for this advise is that using LACP or active/backup provides sufficient durability while introducing as little edge-case wonky as possible, which generally speaking is a GoodThing?? when it comes to intra-cluster-comms. I could be wrong, so don't take this as gospel... if anyone has a better explanation, or can point out my flawed logic, by all means, chime in! :) Hope my understanding HELPS... if it doesn't, throw it away and ignore it ;) ?W This message created and transmitted using 100% recycled electrons. > On May 21, 2022, at 03:11, Marco Gaiarin wrote: > > ? > I'm doing some experimentation on a switch that seems does not support LACP, > even thus claim that; is a Netgear GS724Tv2: > > https://www.downloads.netgear.com/files/GDC/GS724Tv2/enus_ds_gs724t.pdf > > data sheet say: > > Port Trunking - Manual as per IEEE802.3ad Link Aggregation > > and 'IEEE802.3ad Link Aggregation' is LACP, right? > > > Anyway, i'm experimenting a bit with other bonding mode, having (un)expected > results and troubles, but in: > > https://pve.proxmox.com/wiki/Network_Configuration#_linux_bond > > i've stumble upon that sentence: > > If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode on the bonding interfaces, other modes are unsupported. > > What exactly mean?! Thanks. > > -- > Molti italiani sognavano di vedere Berlusconi in un cellulare, > prima o poi... (Stardust?, da i.n.n-a) > > > > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user > From gaio at lilliput.linux.it Tue May 24 23:17:12 2022 From: gaio at lilliput.linux.it (Marco Gaiarin) Date: Tue, 24 May 2022 23:17:12 +0200 Subject: [PVE-User] Experimenting with bond on a non-LACP switch... In-Reply-To: ; from SmartGate on Tue, May 24, 2022 at 23:36:01PM +0200 References: Message-ID: Mandi! Laurent Dumont In chel di` si favelave... 
> There is this old thread with a similar discussion : > https://forum.proxmox.com/threads/cluster-lacp.90668/ Apart from the fact that I have other clusters where corosync runs against LACP bonds without troubles, I've a little simpler question: >> If you intend to run your cluster network on the bonding >> interfaces, then you have to use active-passive mode on the bonding >> interfaces, other modes are unsupported. What is 'active-passive'?! Is it the same as 'active-backup'? It seems a terminology inconsistency to me... -- If you can't find anyone it means we've run off to the sea-shells (bash, tcsh, csh...) (Possi) From wolf at wolfspyre.com Wed May 25 04:11:09 2022 From: wolf at wolfspyre.com (Wolf Noble) Date: Tue, 24 May 2022 21:11:09 -0500 Subject: [PVE-User] Experimenting with bond on a non-LACP switch... In-Reply-To: References: Message-ID: <2853D8F3-A64B-43A6-B923-306902AA9949@wolfspyre.com> I was condensing several modes (I emit traffic on one interface but listen on all bond members) into 'active-passive'. This: https://help.ubuntu.com/community/UbuntuBonding does a better job explaining the different modes... hope that helps... ( sorry for the confusion ) W [= The contents of this message have been written, read, processed, erased, sorted, sniffed, compressed, rewritten, misspelled, overcompensated, lost, found, and most importantly delivered entirely with recycled electrons =] > On May 24, 2022, at 16:40, Marco Gaiarin wrote: > > Mandi! Laurent Dumont > In chel di` si favelave... > >> There is this old thread with a similar discussion : >> https://forum.proxmox.com/threads/cluster-lacp.90668/ > > Apart that i have other clusters where corosync run against LACP bonds > without roubles, i've a little simpler question: > >>> If you intend to run your cluster network on the bonding >>> interfaces, then you have to use active-passive mode on the bonding >>> interfaces, other modes are unsupported. > > What is 'active-passive'?! Is the same of 'active-backup'? seems a > terminology inconsistency to me... > > -- > Se non trovi nessuno vuol dire che siamo scappati alle sei-shell (bash, > tcsh,csh...) (Possi) > > > > _______________________________________________ > pve-user mailing list > pve-user at lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user > From elacunza at binovo.es Mon May 30 11:55:51 2022 From: elacunza at binovo.es (Eneko Lacunza) Date: Mon, 30 May 2022 11:55:51 +0200 Subject: [PVE-User] PVE 7.2 - Avago MegaRAID broken? In-Reply-To: References: <20220519183138.1647bed6@rosa.proxmox.com> Message-ID: Hi, El 19/5/22 a las 18:50, Eneko Lacunza via pve-user escribió: > > El 19/5/22 a las 18:31, Stoiko Ivanov escribió: >> >>> Today we installed PVE 7.1 (ISO) in a relatively old machine. >> any more details on what kind of machine this is >> (CPU generation, if it's an older HP/Dell/Supermicro server or >> consumerhardware)? > > The system is in a customer site, but I'll try to gather detailed data
>> could you please try (in that order, and until one the suggestions fixes >> the issue): >> * adding `iommu=pt` to the kernel cmdline >> * adding `intel_iommu=off` to the kernel cmdline >> we have updated the known-issues section of the release-notes to suggest >> this already after a few similar reports with older hardware/unusual >> setups in our community forum: >> https://pve.proxmox.com/wiki/Roadmap#7.2-known-issues > > Ok, I didn't notice those known issues, will check them next time. I > think I will be unable to try this shortly as system is not local, but > if I can will report back, thanks. This fixes the issue (iommu=pt). Thanks a lot for pointing this out. Regards Eneko Lacunza Zuzendari teknikoa | Director t?cnico Binovo IT Human Project Tel. +34 943 569 206 |https://www.binovo.es Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun https://www.youtube.com/user/CANALBINOVO https://www.linkedin.com/company/37269706/ From rightkicktech at gmail.com Tue May 31 17:17:57 2022 From: rightkicktech at gmail.com (Alex K) Date: Tue, 31 May 2022 18:17:57 +0300 Subject: [PVE-User] Mount issue of thin LVM on boot Message-ID: Hi All, I have created a thin LVM on top the existing *data* thin pool, as follows: lvcreate -V2T -T pve/data -n vms The resulting *vms* LVM is a thin LVM volume which then I format and mount it at boot. I perform this so as to be able to setup a gluster setup with three nodes where the bricks will resize in the mountpoint of the thin LVM volume. I have observed that when the hosts boot randomly, perhaps after a power cut, they might temporarily lose quorum, which is expected until the nodes are booted and the quorum is met. On loss of the quorum the mount point of the thin LVM I have created is not able to mount and I suspect that this is due to Proxmox not enabling the data thin pool if quorum is not met. Is this the case? Can someone confirm this or provide any idea/hint what could be wrong with this approach? I was thinking of checking if creating the thin LVM on top of a different pool and volume group that Proxmox does not manage might resolve the issue. Thanx, Alex