From martin at proxmox.com  Wed May  4 12:27:27 2022
From: martin at proxmox.com (Martin Maurer)
Date: Wed, 4 May 2022 12:27:27 +0200
Subject: [PVE-User] Proxmox VE 7.2 released!
Message-ID: <e32512e9-ebbb-ceea-d2b8-0662111057e3@proxmox.com>

Hi all,

we're excited to announce the release of Proxmox Virtual Environment 
7.2. It's based on Debian 11.3 "Bullseye" but using a newer Linux kernel 
5.15.30, QEMU 6.2, LXC 4, Ceph 16.2.7, and OpenZFS 2.1.4 and countless 
enhancements and bugfixes.

Here is a selection of the highlights

- Support for the accelerated virtio-gl (VirGL) display driver
- Notes templates for backup jobs (e.g. add the name of your VMs and CTs 
to the backup notes)
- Ceph erasure code support
- Updated existing and new LXC container templates (New: Ubuntu 22.04, 
Devuan 4.0, Alpine 3.15)
- ISO: Updated memtest86+ to the completely rewritten 6.0b version, 
adding support for UEFI and modern memory like DDR5
- and many more GUI enhancements

As always, we have included countless bugfixes and improvements on many 
places; see the release notes for all details.

Release notes
https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_7.2

Press release
https://www.proxmox.com/en/news/press-releases/proxmox-virtual-environment-7-2-available

Video tutorial
https://www.proxmox.com/en/training/video-tutorials/item/what-s-new-in-proxmox-ve-7-2

Download
https://www.proxmox.com/en/downloads
Alternate ISO download:
https://enterprise.proxmox.com/iso

Documentation
https://pve.proxmox.com/pve-docs

Community Forum
https://forum.proxmox.com

Bugtracker
https://bugzilla.proxmox.com

Source code
https://git.proxmox.com

We want to shout out a big THANK YOU to our active community for all 
your intensive feedback, testing, bug reporting and patch submitting!

FAQ
Q: Can I upgrade Proxmox VE 7.0 or 7.1 to 7.2 via GUI?
A: Yes.

Q: Can I upgrade Proxmox VE 6.4 to 7.2 with apt?
A: Yes, please follow the upgrade instructions on 
https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0

Q: Can I install Proxmox VE 7.2 on top of Debian 11.x "Bullseye"?
A: Yes, see 
https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_11_Bullseye

Q: Can I upgrade my Proxmox VE 6.4 cluster with Ceph Octopus to 7.2 with 
Ceph Octopus/Pacific?
A: This is a two step process. First, you have to upgrade Proxmox VE 
from 6.4 to 7.2, and afterwards upgrade Ceph from Octopus to Pacific. 
There are a lot of improvements and changes, so please follow exactly 
the upgrade documentation:
https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0
https://pve.proxmox.com/wiki/Ceph_Octopus_to_Pacific

Q: Where can I get more information about feature updates?
A: Check the https://pve.proxmox.com/wiki/Roadmap, 
https://forum.proxmox.com/, the https://lists.proxmox.com/, and/or 
subscribe to our https://www.proxmox.com/en/news.
-- 
Best Regards,

Martin Maurer

martin at proxmox.com
https://www.proxmox.com


From dziobek at hlrs.de  Wed May  4 14:10:59 2022
From: dziobek at hlrs.de (Martin Dziobek)
Date: Wed, 4 May 2022 14:10:59 +0200
Subject: [PVE-User] Proxmox VE 7.2 - Problem of understanding
 'bridge-disable-mac-learning'
In-Reply-To: <e32512e9-ebbb-ceea-d2b8-0662111057e3@proxmox.com>
References: <e32512e9-ebbb-ceea-d2b8-0662111057e3@proxmox.com>
Message-ID: <20220504141059.48ab3303@schleppmd.hlrs.de>

Dear all,

In the Release Notes of 7.2, it says:

"Administrators can now disable MAC learning on a bridge in /etc/network/interfaces with the bridge-disable-mac-learning flag.
This reduces the number of packets flooded on all ports (for unknown MAC addresses), preventing issues with certain hosting 
providers (for example, Hetzner), which resulted in the Proxmox VE node getting disconnected"

where as in descriptions of how to disable mac bridge learning
for example on  https://www.xmodulo.com/disable-mac-learning-linux-bridge.html

it says:

"Once MAC learning is turned off, a Linux bridge will flood every incoming packet to the rest of the ports. 
Understand this implication before proceeding."

So flooding is reduced *or* increased ...

May someone shed a light on this ?

Best regards,
Martin


From s.ivanov at proxmox.com  Wed May  4 15:39:59 2022
From: s.ivanov at proxmox.com (Stoiko Ivanov)
Date: Wed, 4 May 2022 15:39:59 +0200
Subject: [PVE-User] Proxmox VE 7.2 - Problem of understanding
 'bridge-disable-mac-learning'
In-Reply-To: <20220504141059.48ab3303@schleppmd.hlrs.de>
References: <e32512e9-ebbb-ceea-d2b8-0662111057e3@proxmox.com>
 <20220504141059.48ab3303@schleppmd.hlrs.de>
Message-ID: <20220504153959.19ffadc1@rosa.proxmox.com>

hi,


On Wed, 4 May 2022 14:10:59 +0200
Martin Dziobek <dziobek at hlrs.de> wrote:

> Dear all,
> 
> In the Release Notes of 7.2, it says:
> 
> "Administrators can now disable MAC learning on a bridge in /etc/network/interfaces with the bridge-disable-mac-learning flag.
> This reduces the number of packets flooded on all ports (for unknown MAC addresses), preventing issues with certain hosting 
> providers (for example, Hetzner), which resulted in the Proxmox VE node getting disconnected"
> 
> where as in descriptions of how to disable mac bridge learning
> for example on  https://www.xmodulo.com/disable-mac-learning-linux-bridge.html
> 
> it says:
> 
> "Once MAC learning is turned off, a Linux bridge will flood every incoming packet to the rest of the ports. 
> Understand this implication before proceeding."
> 
> So flooding is reduced *or* increased ...
> 
> May someone shed a light on this ?
I think the commit message of the relevant commit describes the situation
quite well:
https://git.proxmox.com/?p=pve-common.git;a=commit;h=354ec8dee37d481ebae49b488349a8e932dce736

it disables learning on the individual ports - but at the same time also
the unicast_flood flag is set to false - see `man 8 bridge` - so I'd
expect the combination of the 2 to work as advertised
(and will try to rephrase the release note entry a bit too be less
confusing)

I hope this helps!

Best regards,
stoiko


From Alexandre.DERUMIER at groupe-cyllene.com  Thu May  5 15:24:31 2022
From: Alexandre.DERUMIER at groupe-cyllene.com (DERUMIER, Alexandre)
Date: Thu, 5 May 2022 13:24:31 +0000
Subject: [PVE-User] Proxmox VE 7.2 - Problem of understanding
 'bridge-disable-mac-learning'
In-Reply-To: <20220504153959.19ffadc1@rosa.proxmox.com>
References: <e32512e9-ebbb-ceea-d2b8-0662111057e3@proxmox.com>
 <20220504141059.48ab3303@schleppmd.hlrs.de>
 <20220504153959.19ffadc1@rosa.proxmox.com>
Message-ID: <3eace3432f6ca87b660f01390a8cf13395322e12.camel@groupe-cyllene.com>

mmm,looking at the git, it seem that qemu-server && pve-container patch
es to register mac address in bridge are not applied ...


[pve-devel] [PATCH V2 qemu-server 0/3] add disable bridge learning
feature
	https://lists.proxmox.com/pipermail/pve-devel/2022-March/052210.html
	
[pve-devel] [PATCH V2 pve-container 0/1] add disable bridge learning
feature
		https://lists.proxmox.com/pipermail/pve-devel/2022-March/052206.html


Le mercredi 04 mai 2022 ? 15:39 +0200, Stoiko Ivanov a ?crit?:
> hi,
> 
> 
> On Wed, 4 May 2022 14:10:59 +0200
> Martin Dziobek <dziobek at hlrs.de> wrote:
> 
> > Dear all,
> > 
> > In the Release Notes of 7.2, it says:
> > 
> > "Administrators can now disable MAC learning on a bridge in
> > /etc/network/interfaces with the bridge-disable-mac-learning flag.
> > This reduces the number of packets flooded on all ports (for
> > unknown MAC addresses), preventing issues with certain hosting 
> > providers (for example, Hetzner), which resulted in the Proxmox VE
> > node getting disconnected"
> > 
> > where as in descriptions of how to disable mac bridge learning
> > for example on?
> > https://antiphishing.cetsi.fr/proxy/v3?i=ZUcyY1RmWEJYTXg4endZcf4pHMlLXnVUx16Ppu9iYP8&r=N3ZnQkVkbG1hOHVwcWFJNMLpdiUetyglobBNT6FebFASxxZ1q4z56SmutCfWl0tQ&f=RkdqNzdIQkFjZzVZTkZxbZ21HjwKhyMg-rZGU8E0XD_frmmy_SGxhjX_N0NdVXVt8hYCzR91DADKO1rwT7UlwQ&u=https%3A//www.xmodulo.com/disable-mac-learning-linux-bridge.html&k=YkLs
> > 
> > it says:
> > 
> > "Once MAC learning is turned off, a Linux bridge will flood every
> > incoming packet to the rest of the ports. 
> > Understand this implication before proceeding."
> > 
> > So flooding is reduced *or* increased ...
> > 
> > May someone shed a light on this ?
> I think the commit message of the relevant commit describes the
> situation
> quite well:
> https://antiphishing.cetsi.fr/proxy/v3?i=ZUcyY1RmWEJYTXg4endZcf4pHMlLXnVUx16Ppu9iYP8&r=N3ZnQkVkbG1hOHVwcWFJNMLpdiUetyglobBNT6FebFASxxZ1q4z56SmutCfWl0tQ&f=RkdqNzdIQkFjZzVZTkZxbZ21HjwKhyMg-rZGU8E0XD_frmmy_SGxhjX_N0NdVXVt8hYCzR91DADKO1rwT7UlwQ&u=https%3A//git.proxmox.com/%3Fp%3Dpve-common.git%3Ba%3Dcommit%3Bh%3D354ec8dee37d481ebae49b488349a8e932dce736&k=YkLs
> 
> it disables learning on the individual ports - but at the same time
> also
> the unicast_flood flag is set to false - see `man 8 bridge` - so I'd
> expect the combination of the 2 to work as advertised
> (and will try to rephrase the release note entry a bit too be less
> confusing)
> 
> I hope this helps!
> 
> Best regards,
> stoiko
> 
> 
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://antiphishing.cetsi.fr/proxy/v3?i=ZUcyY1RmWEJYTXg4endZcf4pHMlLXnVUx16Ppu9iYP8&r=N3ZnQkVkbG1hOHVwcWFJNMLpdiUetyglobBNT6FebFASxxZ1q4z56SmutCfWl0tQ&f=RkdqNzdIQkFjZzVZTkZxbZ21HjwKhyMg-rZGU8E0XD_frmmy_SGxhjX_N0NdVXVt8hYCzR91DADKO1rwT7UlwQ&u=https%3A//lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user&k=YkLs
> 


From mlist at jarasoft.net  Thu May  5 15:56:42 2022
From: mlist at jarasoft.net (Jack Raats)
Date: Thu, 5 May 2022 15:56:42 +0200
Subject: [PVE-User] Proces BOOTFB
Message-ID: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net>

Hi,

At this moment I use proxmox 7.2-3. Before this version I could run a VM 
passthrough.
On proxmox 7.2-3 I get an error that BAR1 doesn't have memory anymore.
The memory ism occupied by a proces called BOOTFB.

What is this proces doing?
How to get the passthroug thing working again?

Thanks
Jack Raats


From leesteken+proxmox at pm.me  Thu May  5 16:08:41 2022
From: leesteken+proxmox at pm.me (Arjen)
Date: Thu, 05 May 2022 14:08:41 +0000
Subject: [PVE-User] Proces BOOTFB
In-Reply-To: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net>
References: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net>
Message-ID: <UyYrAzFtnjlhdvS8wW11OcjHWAXtk1jVkwCutnvdp-yHIcVnLdDcZRihkaHiLqL-IH2UlRSapLmquQAjeahyzIDls0B1AxXEtmsokHavSK8=@pm.me>

On Thursday, May 5th, 2022 at 15:56, Jack Raats <mlist at jarasoft.net> wrote:

> Hi,
>
> At this moment I use proxmox 7.2-3. Before this version I could run a VM
> passthrough.
> On proxmox 7.2-3 I get an error that BAR1 doesn't have memory anymore.
> The memory ism occupied by a proces called BOOTFB.
>
> What is this proces doing?
> How to get the passthroug thing working again?
>
> Thanks
> Jack Raats

Before (with kernel 5.13) using video=efifb:off video=vesafb:off would fix this (at the expense of boot messages).
With 7.2 (or kernel 5.15), I would expect video=simplefb:off to fix this, but I my experience this does not work for every GPU.

I found that, for AMD GPUs, unblacklisting amdgpu AND not early binding to vfio_pci AND removing those video= parameters works best.
amdgpu just takes over from the bootfb, and does release the GPU nicely to vfio_pci when starting the VM.
(Of course, for AMD vendor-reset and reset_method=device_specific might be required.)
I don't know if this also works for nouveau or i915.

I hope this helps,
Arjen


From ralf.storm at konzept-is.de  Fri May  6 13:10:21 2022
From: ralf.storm at konzept-is.de (storm)
Date: Fri, 6 May 2022 13:10:21 +0200
Subject: [PVE-User] Network Mismatch after upgrade to 7.2
In-Reply-To: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net>
References: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net>
Message-ID: <60d3fb98-0c90-d38d-fd4e-3840c17c1367@konzept-is.de>

Hello,

on one of my nodes I have total chaos in the network configuration after 
upgrading from 7.1 to 7.2 (liscensed pve-enterprise repo)

I have some interfaces in the GUI, which are not in the system, the cli 
shows something totally different

one actively used for a client network disappeared totally

the mac address reported by the connected switch for this network cannot 
be found on the node :(


Any clue whats the issue and how to resolve this?

best regards

Ralf


From elacunza at binovo.es  Fri May  6 13:20:17 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Fri, 6 May 2022 13:20:17 +0200
Subject: [PVE-User] Network Mismatch after upgrade to 7.2
In-Reply-To: <60d3fb98-0c90-d38d-fd4e-3840c17c1367@konzept-is.de>
References: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net>
 <60d3fb98-0c90-d38d-fd4e-3840c17c1367@konzept-is.de>
Message-ID: <641296a8-89e5-d4c8-7514-8bb179dc5b8f@binovo.es>

Hi,

Maybe kernel changed names of the interfaces.

To fix the issue, you must change old interface names with new names in 
/etc/network/interfaces

El 6/5/22 a las 13:10, storm escribi?:
> Hello,
>
> on one of my nodes I have total chaos in the network configuration 
> after upgrading from 7.1 to 7.2 (liscensed pve-enterprise repo)
>
> I have some interfaces in the GUI, which are not in the system, the 
> cli shows something totally different
>
> one actively used for a client network disappeared totally
>
> the mac address reported by the connected switch for this network 
> cannot be found on the node :(
>
>
> Any clue whats the issue and how to resolve this?
>
> best regards
>
> Ralf
>
>
>
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From mlist at jarasoft.net  Fri May  6 13:23:48 2022
From: mlist at jarasoft.net (Jack Raats)
Date: Fri, 6 May 2022 13:23:48 +0200
Subject: [PVE-User] Proces BOOTFB
In-Reply-To: <UyYrAzFtnjlhdvS8wW11OcjHWAXtk1jVkwCutnvdp-yHIcVnLdDcZRihkaHiLqL-IH2UlRSapLmquQAjeahyzIDls0B1AxXEtmsokHavSK8=@pm.me>
References: <6c2c5815-01c6-23fc-9580-2c428fe6290c@jarasoft.net>
 <UyYrAzFtnjlhdvS8wW11OcjHWAXtk1jVkwCutnvdp-yHIcVnLdDcZRihkaHiLqL-IH2UlRSapLmquQAjeahyzIDls0B1AxXEtmsokHavSK8=@pm.me>
Message-ID: <a2d24f02-d819-260f-a436-87124d7f66cd@jarasoft.net>


Op 05-05-2022 om 16:08 schreef Arjen:
> On Thursday, May 5th, 2022 at 15:56, Jack Raats <mlist at jarasoft.net> wrote:
>
>> Hi,
>>
>> At this moment I use proxmox 7.2-3. Before this version I could run a VM
>> passthrough.
>> On proxmox 7.2-3 I get an error that BAR1 doesn't have memory anymore.
>> The memory ism occupied by a proces called BOOTFB.
>>
>> What is this proces doing?
>> How to get the passthroug thing working again?
>>
>> Thanks
>> Jack Raats
> Before (with kernel 5.13) using video=efifb:off video=vesafb:off would fix this (at the expense of boot messages).
> With 7.2 (or kernel 5.15), I would expect video=simplefb:off to fix this, but I my experience this does not work for every GPU.
>
> I found that, for AMD GPUs, unblacklisting amdgpu AND not early binding to vfio_pci AND removing those video= parameters works best.
> amdgpu just takes over from the bootfb, and does release the GPU nicely to vfio_pci when starting the VM.
> (Of course, for AMD vendor-reset and reset_method=device_specific might be required.)
> I don't know if this also works for nouveau or i915.
>
> I hope this helps,
> Arjen

I've tried all the possible, but nothing works...
Until I started the old kernel and everything worked perfectly!

I think that amdgpu, which is included in the kernel, doesn't takes over 
from bootfb

Greetings,
Jack Raats


From elacunza at binovo.es  Wed May 11 16:35:24 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Wed, 11 May 2022 16:35:24 +0200
Subject: PVE 7.2 unstability
Message-ID: <d3e435e5-c304-af5d-c59f-12034a8ba898@binovo.es>

Hi all,

Yesterday we upgraded a 5-node cluster to PVE 7.2 from PVE 7.1:

# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.35-1-pve: 5.15.35-2
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-1-pve: 5.13.19-3
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

We're seen since them some unstability with our VMs, some of them start 
consuming a full CPU core without explanation.

We have seen this issue only with Linux VMs, mostly Debian 9,10,11 (but 
that's the most common OS in out 80+ VMs). Issue happens with 1 core, 2 
cores and 4 cores.

This issue seems to be easily reproduced bulk-migrating VMs. Not all 
bulk-migrated VMs show the issue, but some do. We see the issue in not 
recently migrated VMs too.

Some of the VMs show a timelapse in syslog. For example, in our 
"release" VM:

May 10 15:11:18 release systemd[1]: Stopping User Runtime Directory 
/run/user/1003...
May 10 15:11:18 release systemd[1]: run-user-1003.mount: Succeeded.
May 10 15:11:18 release systemd[1]: user-runtime-dir at 1003.service: 
Succeeded.
May 10 15:11:18 release systemd[1]: Stopped User Runtime Directory 
/run/user/1003.
May 10 15:11:18 release systemd[1]: Removed slice User Slice of UID 1003.
Jan 15 06:42:04 release systemd[1]: Starting Daily apt download 
activities...
Jan 15 06:42:04 release mariadbd[453]: 850115? 6:42:04 [ERROR] mysqld 
got signal 11 ;
Jan 15 06:42:04 release mariadbd[453]: This could be because you hit a 
bug. It is also possible that this binary
Jan 15 06:42:04 release mariadbd[453]: or one of the libraries it was 
linked against is corrupt, improperly built,
Jan 15 06:42:04 release mariadbd[453]: or misconfigured. This error can 
also be caused by malfunctioning hardware.
Jan 15 06:42:04 release mariadbd[453]: To report this bug, see 
https://mariadb.com/kb/en/reporting-bugs
Jan 15 06:42:04 release mariadbd[453]: We will try our best to scrape up 
some info that will hopefully help
Jan 15 06:42:04 release mariadbd[453]: diagnose the problem, but since 
we have already crashed,
Jan 15 06:42:04 release mariadbd[453]: something is definitely wrong and 
this may fail.
Jan 15 06:42:04 release mariadbd[453]: Server version: 
10.5.15-MariaDB-0+deb11u1
Jan 15 06:42:04 release mariadbd[453]: key_buffer_size=134217728
Jan 15 06:42:04 release mariadbd[453]: read_buffer_size=131072
Jan 15 06:42:04 release mariadbd[453]: max_used_connections=3
Jan 15 06:42:04 release mariadbd[453]: max_threads=153
Jan 15 06:42:04 release mariadbd[453]: thread_count=0
Jan 15 06:42:04 release mariadbd[453]: It is possible that mysqld could 
use up to
Jan 15 06:42:04 release mariadbd[453]: key_buffer_size + 
(read_buffer_size + sort_buffer_size)*max_threads = 467872 K? bytes of 
memory
Jan 15 06:42:04 release mariadbd[453]: Hope that's ok; if not, decrease 
some variables in the equation.
Jan 15 06:42:04 release mariadbd[453]: Thread pointer: 0x0
Jan 15 06:42:04 release mariadbd[453]: Attempting backtrace. You can use 
the following information to find out
Jan 15 06:42:04 release mariadbd[453]: where mysqld died. If you see no 
messages after this, something went
Jan 15 06:42:04 release mariadbd[453]: terribly wrong...
Jan 15 06:42:04 release mariadbd[453]: stack_bottom = 0x0 thread_stack 
0x49000
Jan 15 06:42:04 release systemd[1]: Starting Online ext4 Metadata Check 
for All Filesystems...
Jan 15 06:42:04 release systemd[1]: Starting Clean php session files...
Jan 15 06:42:04 release systemd[1]: Starting Cleanup of Temporary 
Directories...
Jan 15 06:42:04 release systemd[1]: Starting Rotate log files...
Jan 15 06:42:04 release systemd[1]: Starting Daily man-db regeneration...
Jan 15 06:42:04 release systemd[1]: e2scrub_all.service: Succeeded.
Jan 15 06:42:04 release systemd[1]: Finished Online ext4 Metadata Check 
for All Filesystems.
Jan 15 06:42:04 release systemd[1]: systemd-tmpfiles-clean.service: 
Succeeded.
Jan 15 06:42:04 release systemd[1]: Finished Cleanup of Temporary 
Directories.
Jan 15 06:42:04 release systemd[1]: phpsessionclean.service: Succeeded.
Jan 15 06:42:04 release systemd[1]: Finished Clean php session files.
Jan 15 06:42:04 release systemd[1]: apt-daily.service: Succeeded.
Jan 15 06:42:04 release systemd[1]: Finished Daily apt download activities.
Jan 15 06:42:04 release systemd[1]: Starting Daily apt upgrade and clean 
activities...
Jan 15 06:42:04 release systemd[1]: apt-daily-upgrade.service: Succeeded.
Jan 15 06:42:04 release systemd[1]: Finished Daily apt upgrade and clean 
activities.
Jan 15 06:42:04 release systemd[1]: Reloading The Apache HTTP Server.
Jan 15 06:42:04 release systemd[1]: Looping too fast. Throttling 
execution a little.
[...reset...]

Is anyone seeing this issue?

Those servers have AMD Ryzen procesors.

Cheers

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From elacunza at binovo.es  Thu May 12 09:33:29 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Thu, 12 May 2022 09:33:29 +0200
Subject: [PVE-User] PVE 7.2 unstability
In-Reply-To: <mailman.119.1652279766.362.pve-user@lists.proxmox.com>
References: <mailman.119.1652279766.362.pve-user@lists.proxmox.com>
Message-ID: <c40b64ff-d0ea-1429-8b1a-be430fba8ecc@binovo.es>

Hi all,

Definitively there's some issue with time. I just captured this event in 
a Debian 11 VM:

May 12 09:18:42 monitor-cloud systemd[1]: Stopped User Runtime Directory 
/run/user/1001.
May 12 09:18:42 monitor-cloud systemd[1]: Removed slice User Slice of 
UID 1001.
May 12 17:32:35 monitor-cloud icinga2[943]: [2022-05-12 17:32:35 +0200] 
information/Application: We jumped forward in time: 29633.8 seconds
[...reset...]
May 12 09:30:43 monitor-cloud kernel: [??? 0.000000] Linux version 
5.10.0-13-amd64 (debian-kernel at lists.debian.org) (gcc-10 (Debian 
10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 
SMP Debian 5.10.106-1 (2022-03-17)
May 12 09:30:43 monitor-cloud kernel: [??? 0.000000] Command line: 
BOOT_IMAGE=/vmlinuz-5.10.0-13-amd64 
root=/dev/mapper/monitor--cloud--vg-root ro quiet
May 12 09:30:43 monitor-cloud kernel: [??? 0.000000] x86/fpu: x87 FPU 
will use FXSAVE
May 12 09:30:43 monitor-cloud kernel: [??? 0.000000] BIOS-provided 
physical RAM map:

Is VM clock managed by qemu/kvm?

Thanks

El 11/5/22 a las 16:35, Eneko Lacunza via pve-user escribi?:
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From elacunza at binovo.es  Thu May 12 15:15:09 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Thu, 12 May 2022 15:15:09 +0200
Subject: [PVE-User] PVE 7.2 unstability
In-Reply-To: <mailman.131.1652340849.362.pve-user@lists.proxmox.com>
References: <mailman.119.1652279766.362.pve-user@lists.proxmox.com>
 <mailman.131.1652340849.362.pve-user@lists.proxmox.com>
Message-ID: <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es>

Hi,

This time VM didn't crash but kernel noticed the time issue:

May 12 09:48:57 monitor-cloud systemd[1]: session-38.scope: Succeeded.
May 12 18:08:57 monitor-cloud kernel: [31097.014795] clocksource: 
timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable 
because the skew is too large:
May 12 18:08:57 monitor-cloud kernel: [31097.014803] 
clocksource:?????????????????????? 'kvm-clock' wd_now: a601cec4c1d9 
wd_last: 8aba12430493 mask: ffffffffffffffff
May 12 18:08:57 monitor-cloud kernel: [31097.014806] 
clocksource:?????????????????????? 'tsc' cs_now: 1f116978cedab cs_last: 
19f66ae2f9bbb mask: ffffffffffffffff
May 12 18:08:57 monitor-cloud kernel: [31097.014810] tsc: Marking TSC 
unstable due to clocksource watchdog
May 12 09:49:02 monitor-cloud systemd[1]: Starting Clean php session 
files...

Seems that the issue is more easily triggered live migrating the VMs, 
another VM just hung but no time-issues in syslog (I had to hard reset...)

We have downgraded from pve-qemu-kvm:amd64 6.2.0-5 to 6.2.0-2 (version 
before issues started)

We have downgraded from qemu-server from 7.2-2 to 7.1-4 (version before 
issues started):

Issue continues.

We have seen that when bulk migrating VMs from node1 to node2, VMs in 
node2 ALSO start to have issues.

We'll try setting max workers for bulk actions to 1 next.


El 12/5/22 a las 9:33, Eneko Lacunza via pve-user escribi?:
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From elacunza at binovo.es  Thu May 12 16:57:14 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Thu, 12 May 2022 16:57:14 +0200
Subject: [PVE-User] PVE 7.2 unstability
In-Reply-To: <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es>
References: <mailman.119.1652279766.362.pve-user@lists.proxmox.com>
 <mailman.131.1652340849.362.pve-user@lists.proxmox.com>
 <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es>
Message-ID: <051c0129-770e-dc29-c1e3-3b8ad904e6fb@binovo.es>

Hi,

Finally we have worked around this issue downgrading to kernel 5.13:

apt-get install proxmox-ve=7.1-1; apt-get remove 
pve-kernel-5.15.35-1-pve (+reboot)

No need to downgrade pve-qemu-kvm no qemu-server .

Sadly VMs running on kernel 5.15.35-1 will crash on live migration :-(

Cheers

El 12/5/22 a las 15:15, Eneko Lacunza escribi?:
> Hi,
>
> This time VM didn't crash but kernel noticed the time issue:
>
> May 12 09:48:57 monitor-cloud systemd[1]: session-38.scope: Succeeded.
> May 12 18:08:57 monitor-cloud kernel: [31097.014795] clocksource: 
> timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable 
> because the skew is too large:
> May 12 18:08:57 monitor-cloud kernel: [31097.014803] 
> clocksource:?????????????????????? 'kvm-clock' wd_now: a601cec4c1d9 
> wd_last: 8aba12430493 mask: ffffffffffffffff
> May 12 18:08:57 monitor-cloud kernel: [31097.014806] 
> clocksource:?????????????????????? 'tsc' cs_now: 1f116978cedab 
> cs_last: 19f66ae2f9bbb mask: ffffffffffffffff
> May 12 18:08:57 monitor-cloud kernel: [31097.014810] tsc: Marking TSC 
> unstable due to clocksource watchdog
> May 12 09:49:02 monitor-cloud systemd[1]: Starting Clean php session 
> files...
>
> Seems that the issue is more easily triggered live migrating the VMs, 
> another VM just hung but no time-issues in syslog (I had to hard reset...)
>
> We have downgraded from pve-qemu-kvm:amd64 6.2.0-5 to 6.2.0-2 (version 
> before issues started)
>
> We have downgraded from qemu-server from 7.2-2 to 7.1-4 (version 
> before issues started):
>
> Issue continues.
>
> We have seen that when bulk migrating VMs from node1 to node2, VMs in 
> node2 ALSO start to have issues.
>
> We'll try setting max workers for bulk actions to 1 next.
>
>
> El 12/5/22 a las 9:33, Eneko Lacunza via pve-user escribi?:
>> _______________________________________________
>> pve-user mailing list
>> pve-user at lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From alain.pean at c2n.upsaclay.fr  Thu May 12 17:12:59 2022
From: alain.pean at c2n.upsaclay.fr (=?UTF-8?Q?Alain_P=c3=a9an?=)
Date: Thu, 12 May 2022 17:12:59 +0200
Subject: [PVE-User] PVE 7.2 unstability
In-Reply-To: <mailman.152.1652367475.362.pve-user@lists.proxmox.com>
References: <mailman.119.1652279766.362.pve-user@lists.proxmox.com>
 <mailman.131.1652340849.362.pve-user@lists.proxmox.com>
 <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es>
 <mailman.152.1652367475.362.pve-user@lists.proxmox.com>
Message-ID: <7d6f9cc7-1dfe-63aa-b166-64d7f9b4c816@c2n.upsaclay.fr>

Le 12/05/2022 ? 16:57, Eneko Lacunza via pve-user a ?crit?:
> Finally we have worked around this issue downgrading to kernel 5.13:
>
> apt-get install proxmox-ve=7.1-1; apt-get remove 
> pve-kernel-5.15.35-1-pve (+reboot)
>
> No need to downgrade pve-qemu-kvm no qemu-server .
>
> Sadly VMs running on kernel 5.15.35-1 will crash on live migration :-(

Hi Eneko,

It is strange, as I don't see anybody saying they saw this problem on 
the forum :
https://forum.proxmox.com/threads/proxmox-ve-7-2-released.108970/page-3

Also, I installed a few weeks ago the kernel 3.15.30-1 that was 
available for test on PVE 7.1, on my production servers, that solved for 
me another problem (windows VM not rebooting correctly), and I don't see 
the problem you encountered.

# uname -r
5.15.30-1-pve

I will test the upgrade shortly.

Alain

-- 
Administrateur Syst?me/R?seau
C2N Centre de Nanosciences et Nanotechnologies (UMR 9001)
Boulevard Thomas Gobert (ex Avenue de La Vauve), 91120 Palaiseau
Tel : 01-70-27-06-88 Bureau A255


From elacunza at binovo.es  Thu May 12 18:35:10 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Thu, 12 May 2022 18:35:10 +0200
Subject: [PVE-User] PVE 7.2 unstability
In-Reply-To: <7d6f9cc7-1dfe-63aa-b166-64d7f9b4c816@c2n.upsaclay.fr>
References: <mailman.119.1652279766.362.pve-user@lists.proxmox.com>
 <mailman.131.1652340849.362.pve-user@lists.proxmox.com>
 <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es>
 <mailman.152.1652367475.362.pve-user@lists.proxmox.com>
 <7d6f9cc7-1dfe-63aa-b166-64d7f9b4c816@c2n.upsaclay.fr>
Message-ID: <0fb383be-c5bd-8e6d-7319-fc3b7e65a453@binovo.es>

Hi Alain,

El 12/5/22 a las 17:12, Alain P?an escribi?:
> Le 12/05/2022 ? 16:57, Eneko Lacunza via pve-user a ?crit?:
>> Finally we have worked around this issue downgrading to kernel 5.13:
>>
>> apt-get install proxmox-ve=7.1-1; apt-get remove 
>> pve-kernel-5.15.35-1-pve (+reboot)
>>
>> No need to downgrade pve-qemu-kvm no qemu-server .
>>
>> Sadly VMs running on kernel 5.15.35-1 will crash on live migration :-(
>
>
> It is strange, as I don't see anybody saying they saw this problem on 
> the forum :
> https://forum.proxmox.com/threads/proxmox-ve-7-2-released.108970/page-3
>

I think Bengt Nolin in the first page is reporting something like this.

> Also, I installed a few weeks ago the kernel 3.15.30-1 that was 
> available for test on PVE 7.1, on my production servers, that solved 
> for me another problem (windows VM not rebooting correctly), and I 
> don't see the problem you encountered.
>
> # uname -r
> 5.15.30-1-pve
>
> I will test the upgrade shortly.

Our problem has been a headache in our tests today :) I asure you it is 
there, and it is fixed downgrading kernel.

I don't know why it's happening, but VMs' clock seems to broke suddenly 
and spectacularly... :)

Nodes have Ryzen CPUs, and storage is Ceph. Network is 10G for 
Ceph/migrations, 10G for VMs/cluster.

Cheers

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From gilberto.nunes32 at gmail.com  Thu May 12 19:23:04 2022
From: gilberto.nunes32 at gmail.com (Gilberto Ferreira)
Date: Thu, 12 May 2022 14:23:04 -0300
Subject: [PVE-User] PVE 7.2 unstability
In-Reply-To: <mailman.156.1652373323.362.pve-user@lists.proxmox.com>
References: <mailman.119.1652279766.362.pve-user@lists.proxmox.com>
 <mailman.131.1652340849.362.pve-user@lists.proxmox.com>
 <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es>
 <mailman.152.1652367475.362.pve-user@lists.proxmox.com>
 <7d6f9cc7-1dfe-63aa-b166-64d7f9b4c816@c2n.upsaclay.fr>
 <mailman.156.1652373323.362.pve-user@lists.proxmox.com>
Message-ID: <CAOKSTBsHG7=D=YJc9G9TCmf2GAmLuS_LT+TW3CAMW9Jki7Jr7A@mail.gmail.com>

Hi there.
A couple of friends also complain about kernel 5.15, regarding WIndows and
Linux VMS weird behavior.
After downgrading to 5.13 everything seems to be ok.

---
Gilberto Nunes Ferreira


Em qui., 12 de mai. de 2022 ?s 13:35, Eneko Lacunza via pve-user <
pve-user at lists.proxmox.com> escreveu:

>
>
>
> ---------- Forwarded message ----------
> From: Eneko Lacunza <elacunza at binovo.es>
> To: pve-user at lists.proxmox.com
> Cc:
> Bcc:
> Date: Thu, 12 May 2022 18:35:10 +0200
> Subject: Re: [PVE-User] PVE 7.2 unstability
> Hi Alain,
>
> El 12/5/22 a las 17:12, Alain P?an escribi?:
> > Le 12/05/2022 ? 16:57, Eneko Lacunza via pve-user a ?crit :
> >> Finally we have worked around this issue downgrading to kernel 5.13:
> >>
> >> apt-get install proxmox-ve=7.1-1; apt-get remove
> >> pve-kernel-5.15.35-1-pve (+reboot)
> >>
> >> No need to downgrade pve-qemu-kvm no qemu-server .
> >>
> >> Sadly VMs running on kernel 5.15.35-1 will crash on live migration :-(
> >
> >
> > It is strange, as I don't see anybody saying they saw this problem on
> > the forum :
> > https://forum.proxmox.com/threads/proxmox-ve-7-2-released.108970/page-3
> >
>
> I think Bengt Nolin in the first page is reporting something like this.
>
> > Also, I installed a few weeks ago the kernel 3.15.30-1 that was
> > available for test on PVE 7.1, on my production servers, that solved
> > for me another problem (windows VM not rebooting correctly), and I
> > don't see the problem you encountered.
> >
> > # uname -r
> > 5.15.30-1-pve
> >
> > I will test the upgrade shortly.
>
> Our problem has been a headache in our tests today :) I asure you it is
> there, and it is fixed downgrading kernel.
>
> I don't know why it's happening, but VMs' clock seems to broke suddenly
> and spectacularly... :)
>
> Nodes have Ryzen CPUs, and storage is Ceph. Network is 10G for
> Ceph/migrations, 10G for VMs/cluster.
>
> Cheers
>
> Eneko Lacunza
> Zuzendari teknikoa | Director t?cnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 |https://www.binovo.es
> Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun
>
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/
>
>
>
> ---------- Forwarded message ----------
> From: Eneko Lacunza via pve-user <pve-user at lists.proxmox.com>
> To: pve-user at lists.proxmox.com
> Cc: Eneko Lacunza <elacunza at binovo.es>
> Bcc:
> Date: Thu, 12 May 2022 18:35:10 +0200
> Subject: Re: [PVE-User] PVE 7.2 unstability
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>


From elacunza at binovo.es  Fri May 13 09:46:17 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Fri, 13 May 2022 09:46:17 +0200
Subject: [PVE-User] PVE 7.2 unstability
In-Reply-To: <CAOKSTBsHG7=D=YJc9G9TCmf2GAmLuS_LT+TW3CAMW9Jki7Jr7A@mail.gmail.com>
References: <mailman.119.1652279766.362.pve-user@lists.proxmox.com>
 <mailman.131.1652340849.362.pve-user@lists.proxmox.com>
 <1b089826-050a-35ca-20b8-611f8bf64cf0@binovo.es>
 <mailman.152.1652367475.362.pve-user@lists.proxmox.com>
 <7d6f9cc7-1dfe-63aa-b166-64d7f9b4c816@c2n.upsaclay.fr>
 <mailman.156.1652373323.362.pve-user@lists.proxmox.com>
 <CAOKSTBsHG7=D=YJc9G9TCmf2GAmLuS_LT+TW3CAMW9Jki7Jr7A@mail.gmail.com>
Message-ID: <7f326bd4-db6e-2a61-66d9-11b579603803@binovo.es>


I have filled a bug: https://bugzilla.proxmox.com/show_bug.cgi?id=4057

El 12/5/22 a las 19:23, Gilberto Ferreira escribi?:
> Hi there.
> A couple of friends also complain about kernel 5.15, regarding WIndows 
> and Linux VMS weird behavior.
> After downgrading to 5.13 everything seems to be ok.
>
> ---
> Gilberto Nunes Ferreira
>
>
>
>
>
>
> Em qui., 12 de mai. de 2022 ?s 13:35, Eneko Lacunza via pve-user 
> <pve-user at lists.proxmox.com> escreveu:
>
>
>
>
>     ---------- Forwarded message ----------
>     From:?Eneko Lacunza <elacunza at binovo.es>
>     To: pve-user at lists.proxmox.com
>     Cc:
>     Bcc:
>     Date:?Thu, 12 May 2022 18:35:10 +0200
>     Subject:?Re: [PVE-User] PVE 7.2 unstability
>     Hi Alain,
>
>     El 12/5/22 a las 17:12, Alain P?an escribi?:
>     > Le 12/05/2022 ? 16:57, Eneko Lacunza via pve-user a ?crit?:
>     >> Finally we have worked around this issue downgrading to kernel
>     5.13:
>     >>
>     >> apt-get install proxmox-ve=7.1-1; apt-get remove
>     >> pve-kernel-5.15.35-1-pve (+reboot)
>     >>
>     >> No need to downgrade pve-qemu-kvm no qemu-server .
>     >>
>     >> Sadly VMs running on kernel 5.15.35-1 will crash on live
>     migration :-(
>     >
>     >
>     > It is strange, as I don't see anybody saying they saw this
>     problem on
>     > the forum :
>     >
>     https://forum.proxmox.com/threads/proxmox-ve-7-2-released.108970/page-3
>     >
>
>     I think Bengt Nolin in the first page is reporting something like
>     this.
>
>     > Also, I installed a few weeks ago the kernel 3.15.30-1 that was
>     > available for test on PVE 7.1, on my production servers, that
>     solved
>     > for me another problem (windows VM not rebooting correctly), and I
>     > don't see the problem you encountered.
>     >
>     > # uname -r
>     > 5.15.30-1-pve
>     >
>     > I will test the upgrade shortly.
>
>     Our problem has been a headache in our tests today :) I asure you
>     it is
>     there, and it is fixed downgrading kernel.
>
>     I don't know why it's happening, but VMs' clock seems to broke
>     suddenly
>     and spectacularly... :)
>
>     Nodes have Ryzen CPUs, and storage is Ceph. Network is 10G for
>     Ceph/migrations, 10G for VMs/cluster.
>
>     Cheers
>
>     Eneko Lacunza
>     Zuzendari teknikoa | Director t?cnico
>     Binovo IT Human Project
>
>     Tel. +34 943 569 206 |https://www.binovo.es
>     Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun
>
>     https://www.youtube.com/user/CANALBINOVO
>     https://www.linkedin.com/company/37269706/
>
>
>
>     ---------- Forwarded message ----------
>     From:?Eneko Lacunza via pve-user <pve-user at lists.proxmox.com>
>     To: pve-user at lists.proxmox.com
>     Cc:?Eneko Lacunza <elacunza at binovo.es>
>     Bcc:
>     Date:?Thu, 12 May 2022 18:35:10 +0200
>     Subject:?Re: [PVE-User] PVE 7.2 unstability
>     _______________________________________________
>     pve-user mailing list
>     pve-user at lists.proxmox.com
>     https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From sebastian at debianfan.de  Tue May 17 06:54:43 2022
From: sebastian at debianfan.de (sebastian at debianfan.de)
Date: Tue, 17 May 2022 06:54:43 +0200
Subject: [PVE-User] Directory /var/log/journal
Message-ID: <2cd22c8a-31f7-964a-81f6-530897cf5112@debianfan.de>

Hello @all,

whats up with this directory.

Is it possible to delete all the files in this directory while the 
server is running.

Would there be any problems if i delete the files now without rebooting 
the pve-Host?

I don't need log files this time - i need space on the partition.

Tnx

Sebastian


From nada at verdnatura.es  Tue May 17 08:10:46 2022
From: nada at verdnatura.es (nada)
Date: Tue, 17 May 2022 08:10:46 +0200
Subject: [PVE-User] Directory /var/log/journal
In-Reply-To: <2cd22c8a-31f7-964a-81f6-530897cf5112@debianfan.de>
References: <2cd22c8a-31f7-964a-81f6-530897cf5112@debianfan.de>
Message-ID: <c65574e4d1105a016ecdd381b9626ec6@verdnatura.es>

hi Sebastian
depends on type of  journal you have
check your storage config at
/etc/systemd/journald.conf
in case you have persistent journal you may clean it
e.g. clean old journals each month
/usr/bin/journalctl --vacuum-time=1months --rotate
Nada

On 2022-05-17 06:54, sebastian at debianfan.de wrote:
> Hello @all,
> 
> whats up with this directory.
> 
> Is it possible to delete all the files in this directory while the
> server is running.
> 
> Would there be any problems if i delete the files now without
> rebooting the pve-Host?
> 
> I don't need log files this time - i need space on the partition.
> 
> Tnx
> 
> Sebastian
> 
> 
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user


From sebastian at debianfan.de  Tue May 17 11:35:03 2022
From: sebastian at debianfan.de (sebastian at debianfan.de)
Date: Tue, 17 May 2022 11:35:03 +0200
Subject: [PVE-User] Directory /var/log/journal
In-Reply-To: <c65574e4d1105a016ecdd381b9626ec6@verdnatura.es>
References: <2cd22c8a-31f7-964a-81f6-530897cf5112@debianfan.de>
 <c65574e4d1105a016ecdd381b9626ec6@verdnatura.es>
Message-ID: <f3f80143-c951-08c5-d5d6-b4458b13ff1b@debianfan.de>

it is possible to delete all the files and reboot the host without 
"problems" ?

i don't need the journal

Am 17.05.2022 um 08:10 schrieb nada:
> hi Sebastian
> depends on type of? journal you have
> check your storage config at
> /etc/systemd/journald.conf
> in case you have persistent journal you may clean it
> e.g. clean old journals each month
> /usr/bin/journalctl --vacuum-time=1months --rotate
> Nada
> 
> On 2022-05-17 06:54, sebastian at debianfan.de wrote:
>> Hello @all,
>>
>> whats up with this directory.
>>
>> Is it possible to delete all the files in this directory while the
>> server is running.
>>
>> Would there be any problems if i delete the files now without
>> rebooting the pve-Host?
>>
>> I don't need log files this time - i need space on the partition.
>>
>> Tnx
>>
>> Sebastian
>>
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user at lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
> 
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
> 


From uwe.sauter.de at gmail.com  Tue May 17 11:43:09 2022
From: uwe.sauter.de at gmail.com (Uwe Sauter)
Date: Tue, 17 May 2022 11:43:09 +0200
Subject: [PVE-User] Directory /var/log/journal
In-Reply-To: <f3f80143-c951-08c5-d5d6-b4458b13ff1b@debianfan.de>
References: <2cd22c8a-31f7-964a-81f6-530897cf5112@debianfan.de>
 <c65574e4d1105a016ecdd381b9626ec6@verdnatura.es>
 <f3f80143-c951-08c5-d5d6-b4458b13ff1b@debianfan.de>
Message-ID: <8b61e7fe-8c51-752f-d8f4-f3521f9615bb@gmail.com>

Then you should change the configuration in /etc/systemd/journald.conf to not save your journal,
then reboot, then remove the directory.

Am 17.05.22 um 11:35 schrieb sebastian at debianfan.de:
> it is possible to delete all the files and reboot the host without "problems" ?
> 
> i don't need the journal
> 
> Am 17.05.2022 um 08:10 schrieb nada:
>> hi Sebastian
>> depends on type of? journal you have
>> check your storage config at
>> /etc/systemd/journald.conf
>> in case you have persistent journal you may clean it
>> e.g. clean old journals each month
>> /usr/bin/journalctl --vacuum-time=1months --rotate
>> Nada
>>
>> On 2022-05-17 06:54, sebastian at debianfan.de wrote:
>>> Hello @all,
>>>
>>> whats up with this directory.
>>>
>>> Is it possible to delete all the files in this directory while the
>>> server is running.
>>>
>>> Would there be any problems if i delete the files now without
>>> rebooting the pve-Host?
>>>
>>> I don't need log files this time - i need space on the partition.
>>>
>>> Tnx
>>>
>>> Sebastian
>>>
>>>
>>> _______________________________________________
>>> pve-user mailing list
>>> pve-user at lists.proxmox.com
>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user at lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>
> 
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user


From gaio at lilliput.linux.it  Wed May 18 10:04:33 2022
From: gaio at lilliput.linux.it (Marco Gaiarin)
Date: Wed, 18 May 2022 10:04:33 +0200
Subject: [PVE-User] Severe disk corruption: PBS, SATA
Message-ID: <ftleli-4kd.ln1@hermione.lilliput.linux.it>


We are depicting some vary severe disk corruption on one of our
installation, that is indeed a bit 'niche' but...

PVE 6.4 host on a Dell PowerEdge T340:
	root at sdpve1:~# uname -a
	Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux

Debian squeeze i386 on guest:
	sdinny:~# uname -a
	Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux

boot disk defined as:
	sata0: local-zfs:vm-120-disk-0,discard=on,size=100G


After enabling PBS, everytime the backup of the VM start:

 root at sdpve1:~# grep vzdump /var/log/syslog.1 
 May 17 20:27:17 sdpve1 pvedaemon[24825]: <root at pam> starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam:
 May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot
 May 17 20:36:50 sdpve1 pvedaemon[24825]: <root at pam> end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: OK
 May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys at admin --quiet 1 --mailnotification failure --storage pbs-BP)
 May 17 22:00:02 sdpve1 vzdump[1738]: <root at pam> starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam:
 May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys at admin
 May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu)
 May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50)
 May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu)
 May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17)
 May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu)
 May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52)
 May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully
 May 17 23:31:02 sdpve1 vzdump[1738]: <root at pam> end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: OK

The VM depicted some massive and severe IO trouble:

 May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
 May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED
 May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
 May 17 22:40:48 sdinny kernel: [124793.000749]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY }
 May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED
 May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
 May 17 22:40:48 sdinny kernel: [124793.002175]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY }
 May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED
 May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
 May 17 22:40:48 sdinny kernel: [124793.003559]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY }
 May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED
 May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
 May 17 22:40:48 sdinny kernel: [124793.004894]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY }
 [...]
 May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link
 May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
 May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100
 May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete

VM is still 'alive', and works.
But we was forced to do a reboot (power outgage) and after that all the
partition of the disk desappeared, we were forced to restore them with
some tools like 'testdisk'.
Partition on backups the same, desappeared.


Note that there's also a 'plain' local backup that run on sunday, and this
backup task seems does not generate trouble (but still seems to have
partition desappeared, thus was done after an I/O error).


We have hit a Kernel/Qemu bug?

-- 
  E sempre allegri bisogna stare, che il nostro piangere fa male al Re
  fa male al ricco, al Cardinale,
  diventan tristi se noi piangiam...			(Fo, Jannacci)


From elacunza at binovo.es  Wed May 18 10:53:04 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Wed, 18 May 2022 10:53:04 +0200
Subject: [PVE-User] Severe disk corruption: PBS, SATA
In-Reply-To: <ftleli-4kd.ln1@hermione.lilliput.linux.it>
References: <ftleli-4kd.ln1@hermione.lilliput.linux.it>
Message-ID: <e3244c5a-7389-52fa-390f-6eb5ac0ac1fc@binovo.es>

Hi Marco,

I would try changing that sata0 disk to virtio-blk (maybe in a clone VM 
first). I think squeeze will support it; then try PBS backup again.

El 18/5/22 a las 10:04, Marco Gaiarin escribi?:
> We are depicting some vary severe disk corruption on one of our
> installation, that is indeed a bit 'niche' but...
>
> PVE 6.4 host on a Dell PowerEdge T340:
> 	root at sdpve1:~# uname -a
> 	Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux
>
> Debian squeeze i386 on guest:
> 	sdinny:~# uname -a
> 	Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux
>
> boot disk defined as:
> 	sata0: local-zfs:vm-120-disk-0,discard=on,size=100G
>
>
> After enabling PBS, everytime the backup of the VM start:
>
>   root at sdpve1:~# grep vzdump /var/log/syslog.1
>   May 17 20:27:17 sdpve1 pvedaemon[24825]: <root at pam> starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam:
>   May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot
>   May 17 20:36:50 sdpve1 pvedaemon[24825]: <root at pam> end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: OK
>   May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys at admin --quiet 1 --mailnotification failure --storage pbs-BP)
>   May 17 22:00:02 sdpve1 vzdump[1738]: <root at pam> starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam:
>   May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys at admin
>   May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu)
>   May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50)
>   May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu)
>   May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17)
>   May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu)
>   May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52)
>   May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully
>   May 17 23:31:02 sdpve1 vzdump[1738]: <root at pam> end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: OK
>
> The VM depicted some massive and severe IO trouble:
>
>   May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
>   May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED
>   May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
>   May 17 22:40:48 sdinny kernel: [124793.000749]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>   May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY }
>   May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED
>   May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
>   May 17 22:40:48 sdinny kernel: [124793.002175]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>   May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY }
>   May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED
>   May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
>   May 17 22:40:48 sdinny kernel: [124793.003559]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>   May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY }
>   May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED
>   May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
>   May 17 22:40:48 sdinny kernel: [124793.004894]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>   May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY }
>   [...]
>   May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link
>   May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>   May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100
>   May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete
>
> VM is still 'alive', and works.
> But we was forced to do a reboot (power outgage) and after that all the
> partition of the disk desappeared, we were forced to restore them with
> some tools like 'testdisk'.
> Partition on backups the same, desappeared.
>
>
> Note that there's also a 'plain' local backup that run on sunday, and this
> backup task seems does not generate trouble (but still seems to have
> partition desappeared, thus was done after an I/O error).
>
>
> We have hit a Kernel/Qemu bug?
>

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From jmr.richardson at gmail.com  Wed May 18 17:24:53 2022
From: jmr.richardson at gmail.com (JR Richardson)
Date: Wed, 18 May 2022 10:24:53 -0500
Subject: [PVE-User] VMware SD-WAN Virtual Edge Not Working
Message-ID: <CA+U74VPvyJOVybnBduVzuR3mgdZ_mSm=vP5FeTBTCsu_2y+Taw@mail.gmail.com>

Hey Folks,

We are testing deployment for using VMware/Velo virtual edge appliance
on Prox, hypervisor Dell R630 specs:
40 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2 Sockets)
Linux 5.4.174-2-pve #1 SMP PVE 5.4.174-2 (Thu, 10 Mar 2022 15:58:44 +0100)
pve-manager/6.4-14/15e2bf61

VMware SD-WAN edge appliance version 4.3.1 latest GA release.

We can get the VM started OK and connected to orchestrator except the
VPN tunnels are not coming up. We are using 'host' processor type and
see all the required CPU Flags available to the VM. We are running a
bunch of these virtual edge vms on VMware ESXi hypervisors with no
issues, but looking to change over to using Proxmox.

The only error we get in orchestration when diagnosing the problem is
"Edge dataplane service failed" and there is no vpn traffic coming
from the VM so it's like something with the VM is not able to access
some resource needed to start VPN services. AES-NI, SSSE3, SSE4,
RDTSC, RDSEED, RDRAND instruction sets are all available to the VM.

Is anyone else successful deploying VMware SD-WAN appliances with
Proxmox/KVM or seeing the same issue I'm having? We're opening a
support case with VMware, but no word back from them yet.

Thanks.
JR


From nada at verdnatura.es  Wed May 18 18:20:40 2022
From: nada at verdnatura.es (nada)
Date: Wed, 18 May 2022 18:20:40 +0200
Subject: [PVE-User] Severe disk corruption: PBS, SATA
In-Reply-To: <ftleli-4kd.ln1@hermione.lilliput.linux.it>
References: <ftleli-4kd.ln1@hermione.lilliput.linux.it>
Message-ID: <a2c163bfd7b2e28a6bf940351a024e59@verdnatura.es>

hi Marco
you used some local ZFS filesystem according to your info, so you may 
try

zfs list
zpool list -v
zpool history
zpool import ...
zpool replace ...

all the best
Nada

On 2022-05-18 10:04, Marco Gaiarin wrote:
> We are depicting some vary severe disk corruption on one of our
> installation, that is indeed a bit 'niche' but...
> 
> PVE 6.4 host on a Dell PowerEdge T340:
> 	root at sdpve1:~# uname -a
> 	Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021
> 11:08:47 +0100) x86_64 GNU/Linux
> 
> Debian squeeze i386 on guest:
> 	sdinny:~# uname -a
> 	Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 
> GNU/Linux
> 
> boot disk defined as:
> 	sata0: local-zfs:vm-120-disk-0,discard=on,size=100G
> 
> 
> After enabling PBS, everytime the backup of the VM start:
> 
>  root at sdpve1:~# grep vzdump /var/log/syslog.1
>  May 17 20:27:17 sdpve1 pvedaemon[24825]: <root at pam> starting task
> UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam:
>  May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup
> job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd
> --remove 0 --mode snapshot
>  May 17 20:36:50 sdpve1 pvedaemon[24825]: <root at pam> end task
> UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: OK
>  May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120
> --mode snapshot --mailto sys at admin --quiet 1 --mailnotification
> failure --storage pbs-BP)
>  May 17 22:00:02 sdpve1 vzdump[1738]: <root at pam> starting task
> UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam:
>  May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job:
> vzdump 100 101 120 --mailnotification failure --quiet 1 --mode
> snapshot --storage pbs-BP --mailto sys at admin
>  May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 
> (qemu)
>  May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 
> (00:00:50)
>  May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 
> (qemu)
>  May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 
> (00:01:17)
>  May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 
> (qemu)
>  May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 
> (01:28:52)
>  May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished 
> successfully
>  May 17 23:31:02 sdpve1 vzdump[1738]: <root at pam> end task
> UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: OK
> 
> The VM depicted some massive and severe IO trouble:
> 
>  May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception
> Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
>  May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed
> command: WRITE FPDMA QUEUED
>  May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd
> 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
>  May 17 22:40:48 sdinny kernel: [124793.000749]          res
> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>  May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY 
> }
>  May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed
> command: WRITE FPDMA QUEUED
>  May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd
> 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
>  May 17 22:40:48 sdinny kernel: [124793.002175]          res
> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>  May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY 
> }
>  May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed
> command: WRITE FPDMA QUEUED
>  May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd
> 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
>  May 17 22:40:48 sdinny kernel: [124793.003559]          res
> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>  May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY 
> }
>  May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed
> command: WRITE FPDMA QUEUED
>  May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd
> 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
>  May 17 22:40:48 sdinny kernel: [124793.004894]          res
> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>  May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY 
> }
>  [...]
>  May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting 
> link
>  May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5
> Gbps (SStatus 113 SControl 300)
>  May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for 
> UDMA/100
>  May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device
> reported invalid CHS sector 0
>  May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete
> 
> VM is still 'alive', and works.
> But we was forced to do a reboot (power outgage) and after that all the
> partition of the disk desappeared, we were forced to restore them with
> some tools like 'testdisk'.
> Partition on backups the same, desappeared.
> 
> 
> Note that there's also a 'plain' local backup that run on sunday, and 
> this
> backup task seems does not generate trouble (but still seems to have
> partition desappeared, thus was done after an I/O error).
> 
> 
> We have hit a Kernel/Qemu bug?


From jmr.richardson at gmail.com  Wed May 18 18:49:32 2022
From: jmr.richardson at gmail.com (JR Richardson)
Date: Wed, 18 May 2022 11:49:32 -0500
Subject: [PVE-User] VMware SD-WAN Virtual Edge Not Working [SOLVED]
In-Reply-To: <CA+U74VPvyJOVybnBduVzuR3mgdZ_mSm=vP5FeTBTCsu_2y+Taw@mail.gmail.com>
References: <CA+U74VPvyJOVybnBduVzuR3mgdZ_mSm=vP5FeTBTCsu_2y+Taw@mail.gmail.com>
Message-ID: <CA+U74VPico-nqmKWFv-B+sKr=JVKzuVAcOTOUsgpzKQF5MqsWA@mail.gmail.com>

Quick update, changed CPU to single socket and multi-core, appliance
started acting as expected. Not sure why but when using multi-sockets,
even with NUMA enabled, the VM would not fully work. I guess something
in the appliance code checks for socket/core configs and requires a
single socket only.

Hope this helps.

Regards.
JR

On Wed, May 18, 2022 at 10:24 AM JR Richardson <jmr.richardson at gmail.com> wrote:
>
> Hey Folks,
>
> We are testing deployment for using VMware/Velo virtual edge appliance
> on Prox, hypervisor Dell R630 specs:
> 40 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2 Sockets)
> Linux 5.4.174-2-pve #1 SMP PVE 5.4.174-2 (Thu, 10 Mar 2022 15:58:44 +0100)
> pve-manager/6.4-14/15e2bf61
>
> VMware SD-WAN edge appliance version 4.3.1 latest GA release.
>
> We can get the VM started OK and connected to orchestrator except the
> VPN tunnels are not coming up. We are using 'host' processor type and
> see all the required CPU Flags available to the VM. We are running a
> bunch of these virtual edge vms on VMware ESXi hypervisors with no
> issues, but looking to change over to using Proxmox.
>
> The only error we get in orchestration when diagnosing the problem is
> "Edge dataplane service failed" and there is no vpn traffic coming
> from the VM so it's like something with the VM is not able to access
> some resource needed to start VPN services. AES-NI, SSSE3, SSE4,
> RDTSC, RDSEED, RDRAND instruction sets are all available to the VM.
>
> Is anyone else successful deploying VMware SD-WAN appliances with
> Proxmox/KVM or seeing the same issue I'm having? We're opening a
> support case with VMware, but no word back from them yet.
>
> Thanks.
> JR


From wolf at wolfspyre.com  Thu May 19 06:07:05 2022
From: wolf at wolfspyre.com (Wolf Noble)
Date: Wed, 18 May 2022 23:07:05 -0500
Subject: [PVE-User] Severe disk corruption: PBS, SATA
In-Reply-To: <a2c163bfd7b2e28a6bf940351a024e59@verdnatura.es>
References: <a2c163bfd7b2e28a6bf940351a024e59@verdnatura.es>
Message-ID: <4212DB65-25CD-491E-8380-E7D43B9063BF@wolfspyre.com>

from over here in the cheap seats, another potential strangeness injector:

zfs + any sort of raid controller which plays the abstraction game between raw disk and the OS can cause any number of weird and painful scenarios.

ZFS believes it has an accurate idea of the underlying disks.

it does it?s voodoo wholly believing that it?s solely responsible for dealing with data durability.

with a raid controller in between playing the shell game with IO, things USUALLY work?. RIGHT UNTIL THEY DONT.

i?m sure you?re well aware of this, and have probably already mitigated this concern with a jbod controller, or something that isn?t preventing the OS (and thus ZFS) from talking directly to the disks? but It felt worth pointing out on the off chance that this got overlooked.

hope you are well and the gremlins are promptly discovered and put back into their comfortable chairs so they can resume their harmless heckling.


?W


[= The contents of this message have been written, read, processed, erased, sorted, sniffed, compressed, rewritten, misspelled, overcompensated, lost, found, and most importantly delivered entirely with recycled electrons =]

> On May 18, 2022, at 11:21, nada <nada at verdnatura.es> wrote:
> 
> ?hi Marco
> you used some local ZFS filesystem according to your info, so you may try
> 
> zfs list
> zpool list -v
> zpool history
> zpool import ...
> zpool replace ...
> 
> all the best
> Nada
> 
>> On 2022-05-18 10:04, Marco Gaiarin wrote:
>> We are depicting some vary severe disk corruption on one of our
>> installation, that is indeed a bit 'niche' but...
>> PVE 6.4 host on a Dell PowerEdge T340:
>>    root at sdpve1:~# uname -a
>>    Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021
>> 11:08:47 +0100) x86_64 GNU/Linux
>> Debian squeeze i386 on guest:
>>    sdinny:~# uname -a
>>    Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux
>> boot disk defined as:
>>    sata0: local-zfs:vm-120-disk-0,discard=on,size=100G
>> After enabling PBS, everytime the backup of the VM start:
>> root at sdpve1:~# grep vzdump /var/log/syslog.1
>> May 17 20:27:17 sdpve1 pvedaemon[24825]: <root at pam> starting task
>> UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam:
>> May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup
>> job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd
>> --remove 0 --mode snapshot
>> May 17 20:36:50 sdpve1 pvedaemon[24825]: <root at pam> end task
>> UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: OK
>> May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120
>> --mode snapshot --mailto sys at admin --quiet 1 --mailnotification
>> failure --storage pbs-BP)
>> May 17 22:00:02 sdpve1 vzdump[1738]: <root at pam> starting task
>> UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam:
>> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job:
>> vzdump 100 101 120 --mailnotification failure --quiet 1 --mode
>> snapshot --storage pbs-BP --mailto sys at admin
>> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu)
>> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50)
>> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu)
>> May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17)
>> May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu)
>> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52)
>> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully
>> May 17 23:31:02 sdpve1 vzdump[1738]: <root at pam> end task
>> UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: OK
>> The VM depicted some massive and severe IO trouble:
>> May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception
>> Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
>> May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed
>> command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd
>> 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.000749]          res
>> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY }
>> May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed
>> command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd
>> 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.002175]          res
>> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY }
>> May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed
>> command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd
>> 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.003559]          res
>> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY }
>> May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed
>> command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd
>> 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.004894]          res
>> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY }
>> [...]
>> May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link
>> May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5
>> Gbps (SStatus 113 SControl 300)
>> May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100
>> May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device
>> reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete
>> VM is still 'alive', and works.
>> But we was forced to do a reboot (power outgage) and after that all the
>> partition of the disk desappeared, we were forced to restore them with
>> some tools like 'testdisk'.
>> Partition on backups the same, desappeared.
>> Note that there's also a 'plain' local backup that run on sunday, and this
>> backup task seems does not generate trouble (but still seems to have
>> partition desappeared, thus was done after an I/O error).
>> We have hit a Kernel/Qemu bug?
> 
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
> 


From elacunza at binovo.es  Thu May 19 16:57:44 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Thu, 19 May 2022 16:57:44 +0200
Subject: PVE 7.2 - Avago MegaRAID broken?
Message-ID: <c93844be-d13b-32c0-87bb-66ef8a18bd1b@binovo.es>

Hi all,

Today we installed PVE 7.1 (ISO) in a relatively old machine.

Installation was fine and Proxmox has booted OK. But after configuring 
non-subscription repository and upgrading to PVE 7.2/kernel 5.15, 
Proxmox won't boot anymore:

Kernel will print lots of messages like

"DMAR: DRHD: handling fault status reg 3"
"DMAR: [DMA Read NO_PASID] Request device [02:00.0] fault ???? 4b311000 
[fault reason 0x06] PTE Readaccess is not set.

(???? there are some missing chars in photos I shot, sorry).

After about 2,5 minutes, it would open a shell in initramfs, complaining 
pve vg was not found and "Gave up waiting for root file system device".

I suspected of a faulty controller first, but after booting with 5.13 
kernel (even the latest one as of today, -6) all was fine again.

We have removed 5.15 kernel, and rebooted 2-3 times, all is good now. :-)

Controller is

Avago MegaRAID SAS-MFI BIOS
Version 6.36.00.2 (Build Sep 11, 2017)
HA -0 (Bus 2 Dev 0) AVAGO MegaRAID SAS 9341-4i
FW package: 24.21.0-0025

Product AVAGO MegaRAID SAS 9341-4i? is listed as Revision 4.680.01-??? 
(shot cut there, sorry)

Controller has 3 WDC Gold 4TB in RAID5 attached.

This is worked-around now, but I'm starting to worry about latest 5.15 
kernel in PVE... :-)

Thanks

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From s.ivanov at proxmox.com  Thu May 19 18:31:38 2022
From: s.ivanov at proxmox.com (Stoiko Ivanov)
Date: Thu, 19 May 2022 18:31:38 +0200
Subject: [PVE-User] PVE 7.2 - Avago MegaRAID broken?
In-Reply-To: <mailman.94.1652972304.356.pve-user@lists.proxmox.com>
References: <mailman.94.1652972304.356.pve-user@lists.proxmox.com>
Message-ID: <20220519183138.1647bed6@rosa.proxmox.com>

Hi,

On Thu, 19 May 2022 16:57:44 +0200
Eneko Lacunza via pve-user <pve-user at lists.proxmox.com> wrote:

> Hi all,
> 
> Today we installed PVE 7.1 (ISO) in a relatively old machine.
any more details on what kind of machine this is
(CPU generation, if it's an older HP/Dell/Supermicro server or
consumerhardware)?

> Kernel will print lots of messages like
> 
> "DMAR: DRHD: handling fault status reg 3"
> "DMAR: [DMA Read NO_PASID] Request device [02:00.0] fault ???? 4b311000 
> [fault reason 0x06] PTE Readaccess is not set.
could you please try (in that order, and until one the suggestions fixes
the issue):
* adding `iommu=pt` to the kernel cmdline
* adding `intel_iommu=off` to the kernel cmdline
we have updated the known-issues section of the release-notes to suggest
this already after a few similar reports with older hardware/unusual
setups in our community forum:
https://pve.proxmox.com/wiki/Roadmap#7.2-known-issues

see
https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline
for instruction on how to edit the cmdline.
 
> This is worked-around now, but I'm starting to worry about latest 5.15 
> kernel in PVE... :-)
> 
I think we have similar reports with each new kernel-series - mostly with
older systems, which need to install a small workaround (usually module
parameter or kernel cmdline switch).
Our tests on many machines in our testlab (covering the past 10 years of
hardware more or less well) all did not show any general issues - but
it's sadly always a hit and miss.

Please let us know if the changes help

Kind regards,
stoiko


From elacunza at binovo.es  Thu May 19 18:50:26 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Thu, 19 May 2022 18:50:26 +0200
Subject: [PVE-User] PVE 7.2 - Avago MegaRAID broken?
In-Reply-To: <20220519183138.1647bed6@rosa.proxmox.com>
References: <mailman.94.1652972304.356.pve-user@lists.proxmox.com>
 <20220519183138.1647bed6@rosa.proxmox.com>
Message-ID: <493da81e-e10c-f494-7055-1ab5e0760cd0@binovo.es>

Hi Stoiko,

El 19/5/22 a las 18:31, Stoiko Ivanov escribi?:
>
>> Today we installed PVE 7.1 (ISO) in a relatively old machine.
> any more details on what kind of machine this is
> (CPU generation, if it's an older HP/Dell/Supermicro server or
> consumerhardware)?

The system is in a customer site, but I'll try to gather detailed data 
tomorrow. CPU is a Xeon E or E3, I can't recall exact model right now. 
Server has Asus motherboard; this puts it in consumerserver or something 
like that I guess :-)

> Kernel will print lots of messages like
>
> "DMAR: DRHD: handling fault status reg 3"
> "DMAR: [DMA Read NO_PASID] Request device [02:00.0] fault ???? 4b311000
> [fault reason 0x06] PTE Readaccess is not set.
> could you please try (in that order, and until one the suggestions fixes
> the issue):
> * adding `iommu=pt` to the kernel cmdline
> * adding `intel_iommu=off` to the kernel cmdline
> we have updated the known-issues section of the release-notes to suggest
> this already after a few similar reports with older hardware/unusual
> setups in our community forum:
> https://pve.proxmox.com/wiki/Roadmap#7.2-known-issues

Ok, I didn't notice those known issues, will check them next time. I 
think I will be unable to try this shortly as system is not local, but 
if I can will report back, thanks.

>
> see
> https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline
> for instruction on how to edit the cmdline.
>   
>> This is worked-around now, but I'm starting to worry about latest 5.15
>> kernel in PVE... :-)
>>
> I think we have similar reports with each new kernel-series - mostly with
> older systems, which need to install a small workaround (usually module
> parameter or kernel cmdline switch).
> Our tests on many machines in our testlab (covering the past 10 years of
> hardware more or less well) all did not show any general issues - but
> it's sadly always a hit and miss.

Sure, there's way too much hardware out there, this wasn't intended to 
be a complaint, not at least about your excelent work at Proxmox :) The 
intent was to warn other users, but your Known issues in release notes 
are good too.

It's the first time I notice this mix of issues with a new kernel 
version in more than 10 year experience with Proxmox (or Linux on 
servers), but maybe it's also that our maintained server base is expanding.

> Please let us know if the changes help
>

Thanks for your helpfull replies, will try to test your suggestions and 
will reply back with the results.

Regards

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From piccardi at truelite.it  Thu May 19 19:58:09 2022
From: piccardi at truelite.it (Simone Piccardi)
Date: Thu, 19 May 2022 19:58:09 +0200
Subject: Strange problem on bridge after upgrade to proxmox 7
Message-ID: <4c3b82eb-06ac-eb4b-64b1-8f7e54b9c15e@truelite.it>

Hi, I have a very strange networking problem on a Proxmox server, 
emerged after upgrading from 6.4 to 7.

These the results of pveversion on the server:

root at lama10:~# pveversion -V
proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.35-1-pve: 5.15.35-2
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

The server has 4 network interfaces, bound in pairs in active-passive 
mode, then bridged. This is its /etc/network/interfaces:

auto eth0
iface eth0 inet manual
auto eth1
iface eth1 inet manual
auto eth2
iface eth2 inet manual
auto eth3
iface eth3 inet manual
auto bond0
iface bond0 inet manual
	bond-slaves eth0 eth1
	bond-miimon 100
	bond-mode active-backup
	bond-primary eth0
auto bond1
iface bond1 inet manual
	bond-slaves eth2 eth3
	bond-miimon 100
	bond-mode active-backup
	bond-primary eth2
auto vmbr0
iface vmbr0 inet static
	address 192.168.250.110/23
	gateway 192.168.250.254
	bridge-ports bond0
	bridge-stp off
	bridge-fd 0
auto vmbr1
iface vmbr1 inet static
	address 192.168.223.110/24
	bridge-ports bond1
	bridge-stp off
	bridge-fd 0


The network problems comes only for connectiong to the virtual machines 
hosted by the server (no container are used), there is no problem at all 
for connecting to the server. The only anomaly I could find is that it 
seems that the bridge makes mac-address of some of the VM coming from a 
wrong internal port, so they become unreachable.

To explain what this means, I put 3 test VM on the server (two debian 11 
and a windows one, just to exclude problem at operating system level) 
using vmbr1 bridge; their tap interfaces are:

root at lama10:~# brctl show vmbr1
bridge name	bridge id		STP enabled	interfaces
vmbr1		8000.7a576e974a37	no		bond1
							tap403i0
							tap404i0
							tap603i0

Sometime some of them are working and some are not. When I was writing 
this email the VM 404 was not working. Looking at tap404i0 mac address I 
got:

root at lama10:~# ip -br link show dev tap404i0
tap404i0         UNKNOWN        26:6f:0c:19:95:58 
<BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP>

while the 404 VM own mac address is:

root at lama10:~# grep vmbr1 /etc/pve/qemu-server/404.conf
net0: virtio=BE:47:4C:D5:5D:A9,bridge=vmbr1

and when I look at these mac address seen inside vmbr1 I got:

root at lama10:~# brctl showmacs vmbr1 | egrep -i 
'(26:6f:0c:19:95:58|BE:47:4C:D5:5D:A9)'
   4	26:6f:0c:19:95:58	yes		   0.00
   4	26:6f:0c:19:95:58	yes		   0.00
   1	be:47:4c:d5:5d:a9	no		   0.65

doing the same for another VM that was working (mac address are found as 
above) I found instead:

root at lama10:~# brctl showmacs vmbr1 | egrep -i 
'(92:4f:ec:7e:8a:e1|DE:A3:E6:96:0C:6E)'
   3	92:4f:ec:7e:8a:e1	yes		   0.00
   3	92:4f:ec:7e:8a:e1	yes		   0.00
   3	de:a3:e6:96:0c:6e	no		   2.32

Note: with "working" I mean that a VM is normally reachable by network 
without packet loss. I checked in multiple times and in other servers 
and in all working cases the the ports inside the vmbrX switch are the 
same for the TAP mac and the VM mac, as expected. When not working the 
VM own mac seems always to be associated to port 1 (the one of the 
bonding interface).

What I find in a "not working" VM is that ARP reply is never received 
(looking with tcpdump run using the console). The arp request are sent, 
and seen in other VM or on the host, but no reply are seen.

Having a working VM is almost casual (or at least I could not find a 
pattern up to now). After stopping and restarting the above working VM I 
got it not working anymore and the port on the bridge changed:

root at lama10:~# brctl showmacs vmbr1 | egrep -i 
'(92:4f:ec:7e:8a:e1|DE:A3:E6:96:0C:6E)'
   3	92:4f:ec:7e:8a:e1	yes		   0.00
   3	92:4f:ec:7e:8a:e1	yes		   0.00
   1	de:a3:e6:96:0c:6e	no		   0.86

What make this behaviour "strange" is that other two identical machines 
with same Proxmox version (they are in cluster with this one, and inside 
a blades rack) are just working fine. And no problem on the cluster 
(like I said, no network problems at all for the server itself).

The only difference on the other two fully working nodes is that their 
bonding is configures as lacp. That was not possible for this one; it 
got loop error messages when configured, so I had to remove that 
configuration to avoid disturbance on the other two nodes, were all 
production VM were migrated and are running whitout problems.

But another standalone server (with the same Proxmox version of all 
other ones) that's outside the blade rack and it's also configured with 
active-passive bonding, is working fine.

So despite the difference in network configuration between all these 
servers I still cannot imagine how the different kind of bonding or the 
use of a different switch can have an impact on this problem. In the 
previous example I cannot ping 404 VM nor from the server itself nor 
from the the other working VM hosted inside the server itself, and this 
kind of traffic is completely internal traffic, done inside vmbr1.

So I'm asking directions about what to search, and where to look to find 
how the ports inside the bridge are allocated, or any other suggestion 
useful to have some light on this issue.

Simone
-- 
Simone Piccardi                                 Truelite Srl
piccardi at truelite.it (email/jabber)             Via Monferrato, 6
Tel. +39-347-1032433                            50142 Firenze
http://www.truelite.it                          Tel. +39-055-7879597


From mailinglists at xaq.nl  Thu May 19 20:58:11 2022
From: mailinglists at xaq.nl (Richard Lucassen)
Date: Thu, 19 May 2022 20:58:11 +0200
Subject: [PVE-User] Strange problem on bridge after upgrade to proxmox 7
In-Reply-To: <mailman.97.1652983686.356.pve-user@lists.proxmox.com>
References: <mailman.97.1652983686.356.pve-user@lists.proxmox.com>
Message-ID: <20220519205811.27302a796725fbc679a33948@xaq.nl>

On Thu, 19 May 2022 19:58:09 +0200
Simone Piccardi via pve-user <pve-user at lists.proxmox.com> wrote:

> Hi, I have a very strange networking problem on a Proxmox server, 
> emerged after upgrading from 6.4 to 7.

I have no idea if this can have something to do with it, but not a very
long time ago I had two Dell R210 servers connected through a simple
failover bond0. The issue I found was that somehow these bond0 devices
on two *different* servers got the *same* fixed MAC address. After some
searching I stumbled upon this:

https://blog.sigterm.se/posts/a-bonding-exercise/

I had some discussion afterward with Patrik and I ended up in adding a
fixed MAC address in the /etc/network/interfaces stanza, e.g.:

hwaddress ether 4a:89:66:60:e4:97

I just want to notify this phenomena because you can get the most weird
behaviour if you have two devices having the same MAC. I tested loading
the bonding on some workstations:

modprobe -v bonding
ip link show bond0

and see what address it gets, it depends on this value:
cat /sys/class/net/bond0/addr_assign_type
which me be different from host to host.

To remove the module:
modprobe -rv bonding

I had no time to dive deeper into this matter, I just worked around it
by adding the "hwaddress ether" in the bond0 stanza. This works fine.

My 2cts,

R.

-- 
richard lucassen
http://contact.xaq.nl/


From alwin at antreich.com  Thu May 19 21:06:28 2022
From: alwin at antreich.com (Alwin Antreich)
Date: Thu, 19 May 2022 21:06:28 +0200
Subject: [PVE-User] Strange problem on bridge after upgrade to proxmox 7
In-Reply-To: <mailman.97.1652983686.356.pve-user@lists.proxmox.com>
References: <mailman.97.1652983686.356.pve-user@lists.proxmox.com>
Message-ID: <E6005306-133F-43CE-A9AE-6B6275EC0F75@antreich.com>

On May 19, 2022 7:58:09 PM GMT+02:00, Simone Piccardi via pve-user <pve-user at lists.proxmox.com> wrote:
>_______________________________________________
>pve-user mailing list
>pve-user at lists.proxmox.com
>https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user

Hi Simone,

Have you seen this section in the upgrade guide?
https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0#Linux_Bridge_MAC-Address_Change

In our case, we had an identical machine-id on two of our hosts and that killed the network for both.

Cheers,
Alwin 


From piccardi at truelite.it  Fri May 20 14:21:39 2022
From: piccardi at truelite.it (Simone Piccardi)
Date: Fri, 20 May 2022 14:21:39 +0200
Subject: [PVE-User] Strange problem on bridge after upgrade to proxmox 7
In-Reply-To: <E6005306-133F-43CE-A9AE-6B6275EC0F75@antreich.com>
References: <mailman.97.1652983686.356.pve-user@lists.proxmox.com>
 <E6005306-133F-43CE-A9AE-6B6275EC0F75@antreich.com>
Message-ID: <561cd0e7-f6d3-71da-f75d-4b5500c9611a@truelite.it>

On 19/05/22 21:06, Alwin Antreich wrote:
> Have you seen this section in the upgrade guide?
> https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0#Linux_Bridge_MAC-Address_Change
> 
Yes, I read that one, and it got me some headache, but not in these 
server. Thanks anyway for the answer.

The upgrade was done fine for all the server of the cluster (the 
external one was installed directly with 7), their machine-id are all 
different and network communication between hosts has no problems.

> In our case, we had an identical machine-id on two of our hosts and that killed the network for both.

The problem I got is just between this specific host and the VM it is 
hosting (also between VM on the same network in other hosts and the ones 
inside this server, but that just because these are unreachable from the 
host itself).

Simone
-- 
Simone Piccardi                                 Truelite Srl
piccardi at truelite.it (email/jabber)             Via Monferrato, 6
Tel. +39-347-1032433                            50142 Firenze
http://www.truelite.it                          Tel. +39-055-7879597


From piccardi at truelite.it  Fri May 20 15:00:30 2022
From: piccardi at truelite.it (Simone Piccardi)
Date: Fri, 20 May 2022 15:00:30 +0200
Subject: [PVE-User] Strange problem on bridge after upgrade to proxmox 7
In-Reply-To: <20220519205811.27302a796725fbc679a33948@xaq.nl>
References: <mailman.97.1652983686.356.pve-user@lists.proxmox.com>
 <20220519205811.27302a796725fbc679a33948@xaq.nl>
Message-ID: <5982355a-14a7-0669-5e1a-5cc1d15e8622@truelite.it>

On 19/05/22 20:58, Richard Lucassen wrote:
> I have no idea if this can have something to do with it, but not a very
> long time ago I had two Dell R210 servers connected through a simple
> failover bond0. The issue I found was that somehow these bond0 devices
> on two *different* servers got the *same* fixed MAC address. After some
> searching I stumbled upon this:
> 
> https://blog.sigterm.se/posts/a-bonding-exercise/
> 
That's a very interesting reading, thank's for the link.

Anyway I checked all bridge mac address on all server, and they are all 
different (they were anyway installed indipendently and have different 
machine-id).


> I had some discussion afterward with Patrik and I ended up in adding a
> fixed MAC address in the /etc/network/interfaces stanza, e.g.:
> 
> hwaddress ether 4a:89:66:60:e4:97
> 

I think this can solve MAC conflicts (I'll try anyway) but I'm not 
seeing any duplicate MAC, and I got problem only between the VM inside a 
bridge and this specific host. The host is perfectly reachable from 
everywere.

What I got is the same problem described here (I found this just this 
morning):

https://forum.proxmox.com/threads/ve-7-1-10-slow-to-forward-arp-replies-over-bridge.106429/

a VM has a port inside the bridge that does not match the one used by 
its tap interface,

I'll investigate the issue explained in this article referenced there:

https://bugs.launchpad.net/neutron/+bug/1738659


Simone
-- 
Simone Piccardi                                 Truelite Srl
piccardi at truelite.it (email/jabber)             Via Monferrato, 6
Tel. +39-347-1032433                            50142 Firenze
http://www.truelite.it                          Tel. +39-055-7879597


From gaio at lilliput.linux.it  Fri May 20 13:24:33 2022
From: gaio at lilliput.linux.it (Marco Gaiarin)
Date: Fri, 20 May 2022 13:24:33 +0200
Subject: [PVE-User] Severe disk corruption: PBS, SATA
In-Reply-To: <mailman.74.1652864024.356.pve-user@lists.proxmox.com>;
 from SmartGate on Sat, May 21, 2022 at 10:06:01AM +0200
References: <ftleli-4kd.ln1@hermione.lilliput.linux.it>
 <mailman.74.1652864024.356.pve-user@lists.proxmox.com>
Message-ID: <fcakli-tan.ln1@hermione.lilliput.linux.it>

Mandi! Eneko Lacunza via pve-user
  In chel di` si favelave...

> I would try changing that sata0 disk to virtio-blk (maybe in a clone VM 
> first). I think squeeze will support it; then try PBS backup again.

Disks migrated to 'Virtio Block'; now we are doing some tests, but seems to
work well. Thanks.


To others: seems is not an ZFS trouble, the same cluster run other VMs
without fuss... anyway thanks.

-- 
  Una volta qualcuno chiese al Mahatma Gandhi cosa ne pensasse della civilt?
  in occidente. ?Credo che sarebbe una buona idea?, rispose.


From gaio at lilliput.linux.it  Fri May 20 13:22:03 2022
From: gaio at lilliput.linux.it (Marco Gaiarin)
Date: Fri, 20 May 2022 13:22:03 +0200
Subject: [PVE-User] Experimenting with bond on a non-LACP switch...
Message-ID: <p7akli-tan.ln1@hermione.lilliput.linux.it>


I'm doing some experimentation on a switch that seems does not support LACP,
even thus claim that; is a Netgear GS724Tv2:

	https://www.downloads.netgear.com/files/GDC/GS724Tv2/enus_ds_gs724t.pdf

data sheet say:

	Port Trunking - Manual as per IEEE802.3ad Link Aggregation

and 'IEEE802.3ad Link Aggregation' is LACP, right?


Anyway, i'm experimenting a bit with other bonding mode, having (un)expected
results and troubles, but in:

	https://pve.proxmox.com/wiki/Network_Configuration#_linux_bond

i've stumble upon that sentence:

	If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode on the bonding interfaces, other modes are unsupported.

What exactly mean?! Thanks.

-- 
  Molti italiani sognavano di vedere Berlusconi in un cellulare,
  prima o poi...			(Stardust?, da i.n.n-a)


From laurentfdumont at gmail.com  Sun May 22 03:29:58 2022
From: laurentfdumont at gmail.com (Laurent Dumont)
Date: Sat, 21 May 2022 21:29:58 -0400
Subject: [PVE-User] Experimenting with bond on a non-LACP switch...
In-Reply-To: <p7akli-tan.ln1@hermione.lilliput.linux.it>
References: <p7akli-tan.ln1@hermione.lilliput.linux.it>
Message-ID: <CAOAKi8wdwhc6OGRjxjCc6UCmP1S5bNqJ0prJQcSiKTSbZFudng@mail.gmail.com>

It's not made very clear from the documentation. I assume there are good
technical reasons why the cluster traffic would be impacted.

Afaik, proxmox leverages corosync which can leverage multicast for the
cluster checks. I don't think it can be badly impacted by LACP but
something to keep in mind.

There is this old thread with a similar discussion :

https://forum.proxmox.com/threads/cluster-lacp.90668/

On Sat, May 21, 2022 at 4:10 AM Marco Gaiarin <gaio at lilliput.linux.it>
wrote:

>
> I'm doing some experimentation on a switch that seems does not support
> LACP,
> even thus claim that; is a Netgear GS724Tv2:
>
>
> https://www.downloads.netgear.com/files/GDC/GS724Tv2/enus_ds_gs724t.pdf
>
> data sheet say:
>
>         Port Trunking - Manual as per IEEE802.3ad Link Aggregation
>
> and 'IEEE802.3ad Link Aggregation' is LACP, right?
>
>
> Anyway, i'm experimenting a bit with other bonding mode, having
> (un)expected
> results and troubles, but in:
>
>         https://pve.proxmox.com/wiki/Network_Configuration#_linux_bond
>
> i've stumble upon that sentence:
>
>         If you intend to run your cluster network on the bonding
> interfaces, then you have to use active-passive mode on the bonding
> interfaces, other modes are unsupported.
>
> What exactly mean?! Thanks.
>
> --
>   Molti italiani sognavano di vedere Berlusconi in un cellulare,
>   prima o poi...                        (Stardust?, da i.n.n-a)
>
>
>
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
>


From wolf at wolfspyre.com  Sun May 22 06:12:28 2022
From: wolf at wolfspyre.com (Wolf Noble)
Date: Sat, 21 May 2022 23:12:28 -0500
Subject: [PVE-User] Experimenting with bond on a non-LACP switch...
In-Reply-To: <p7akli-tan.ln1@hermione.lilliput.linux.it>
References: <p7akli-tan.ln1@hermione.lilliput.linux.it>
Message-ID: <B5D934B8-28D3-4668-8293-7524BC20EA54@wolfspyre.com>


Good Catch Marco!

I'd not seen that when I read through that page, but I just re-read it.... My read is that it can introduce ODD edge-case complications.

my synthesis of this information is  outlined here. I encourage anyone to correct my misunderstandings. 

network abstraction gets complicated QUICKLY.  Network gear vendors implement their support for the different bonding modes in subtly different ways.
Firewalls have their own quirks.

abstractions on top of abstractions on top of abstractions on top of abstractions on top of .... okay you get the point.


we want to avoid asymmetric pathing where possible, because stuff gets quirky and edge-casey quickly. the fewer explicitly supported virtual topologies, the fewer scenarios the engineering teams need to scrutinize the COMPLEX edge case behaviors of, resulting in a better experience for EVERYONE.... heres what I mean:

LACP: 
This is a pretty well known and consistently implemented aggregation mechanism. the behaviors of network interfaces and switching hardware that are involved are pretty consistent. 

This GENERALLY works fine. the only time I've seen it get a little wonky is LACP across switch chassis behavior can be odd..
                       
-------------- next part --------------


When node 2 sends traffic destined for node 1 via lacp_b.1 ... it may traverse the trunked interconnect, it may not. (it depends)


active-backup:  

I talk, and listen on link0... if link 0 goes down, I switchover... I don't really pay attention to linkl1 otherwise.

active-passive:
  I talk on link0. I listen on link0 and link1.

The downside that I've seen here:


Arp caching can get wonky, and packets that SHOULD be directed to node0 link0 
        get directed to node0 link1... or  sometimes packets directed to node0.link0 have a destination mac address of the hwaddr of link1 and so get delivered to link1 ...  

There MAY be some oddities that manifest with this configuration. 

depending on (node scope configuration) sysctl settings, node0 could just ignore those packets, resulting in weird behavior

with the various balance algorithms nodes will see a different hardware addresses for each other, again, this isn't *USUALLY* a problem, but there are still some dragons that lurk within the trunking/bonding code... hardware checksumming can get whacky...  especially when VLANs get mixed in... 

My gut tells me that the main reason for this advise is that using LACP or active/backup provides sufficient durability while introducing as little edge-case wonky as possible, which generally speaking is a GoodThing?? when it comes to intra-cluster-comms.


I could be wrong, so don't take this as gospel... if anyone has a better explanation, or can point out my flawed logic, by all means, chime in! :) 


Hope my understanding HELPS... if it doesn't, throw it away and ignore it ;)

?W


This message created and transmitted using 100% recycled electrons.

> On May 21, 2022, at 03:11, Marco Gaiarin <gaio at lilliput.linux.it> wrote:
> 
> ?
> I'm doing some experimentation on a switch that seems does not support LACP,
> even thus claim that; is a Netgear GS724Tv2:
> 
>    https://www.downloads.netgear.com/files/GDC/GS724Tv2/enus_ds_gs724t.pdf
> 
> data sheet say:
> 
>    Port Trunking - Manual as per IEEE802.3ad Link Aggregation
> 
> and 'IEEE802.3ad Link Aggregation' is LACP, right?
> 
> 
> Anyway, i'm experimenting a bit with other bonding mode, having (un)expected
> results and troubles, but in:
> 
>    https://pve.proxmox.com/wiki/Network_Configuration#_linux_bond
> 
> i've stumble upon that sentence:
> 
>    If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode on the bonding interfaces, other modes are unsupported.
> 
> What exactly mean?! Thanks.
> 
> -- 
>  Molti italiani sognavano di vedere Berlusconi in un cellulare,
>  prima o poi...            (Stardust?, da i.n.n-a)
> 
> 
> 
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
> 

From gaio at lilliput.linux.it  Tue May 24 23:17:12 2022
From: gaio at lilliput.linux.it (Marco Gaiarin)
Date: Tue, 24 May 2022 23:17:12 +0200
Subject: [PVE-User] Experimenting with bond on a non-LACP switch...
In-Reply-To: <CAOAKi8wdwhc6OGRjxjCc6UCmP1S5bNqJ0prJQcSiKTSbZFudng@mail.gmail.com>;
 from SmartGate on Tue, May 24, 2022 at 23:36:01PM +0200
References: <p7akli-tan.ln1@hermione.lilliput.linux.it>
 <CAOAKi8wdwhc6OGRjxjCc6UCmP1S5bNqJ0prJQcSiKTSbZFudng@mail.gmail.com>
Message-ID: <mjuvli-q5g1.ln1@hermione.lilliput.linux.it>

Mandi! Laurent Dumont
  In chel di` si favelave...

> There is this old thread with a similar discussion :
> https://forum.proxmox.com/threads/cluster-lacp.90668/

Apart that i have other clusters where corosync run against LACP bonds
without roubles, i've a little simpler question:

>>         If you intend to run your cluster network on the bonding
>> interfaces, then you have to use active-passive mode on the bonding
>> interfaces, other modes are unsupported.

What is 'active-passive'?! Is the same of 'active-backup'? seems a
terminology inconsistency to me...

-- 
  Se non trovi nessuno vuol dire che siamo scappati alle sei-shell (bash,
  tcsh,csh...)							(Possi)


From wolf at wolfspyre.com  Wed May 25 04:11:09 2022
From: wolf at wolfspyre.com (Wolf Noble)
Date: Tue, 24 May 2022 21:11:09 -0500
Subject: [PVE-User] Experimenting with bond on a non-LACP switch...
In-Reply-To: <mjuvli-q5g1.ln1@hermione.lilliput.linux.it>
References: <mjuvli-q5g1.ln1@hermione.lilliput.linux.it>
Message-ID: <2853D8F3-A64B-43A6-B923-306902AA9949@wolfspyre.com>

I was condensing several (i emit traffic on one interface but listen on all bond members) into ?active-passive?

this: 
https://help.ubuntu.com/community/UbuntuBonding

does a better job explaining the different modes? 

hope that helps?

? ( sorry for the confusion )

W

[= The contents of this message have been written, read, processed, erased, sorted, sniffed, compressed, rewritten, misspelled, overcompensated, lost, found, and most importantly delivered entirely with recycled electrons =]

> On May 24, 2022, at 16:40, Marco Gaiarin <gaio at lilliput.linux.it> wrote:
> 
> ?Mandi! Laurent Dumont
>  In chel di` si favelave...
> 
>> There is this old thread with a similar discussion :
>> https://forum.proxmox.com/threads/cluster-lacp.90668/
> 
> Apart that i have other clusters where corosync run against LACP bonds
> without roubles, i've a little simpler question:
> 
>>>        If you intend to run your cluster network on the bonding
>>> interfaces, then you have to use active-passive mode on the bonding
>>> interfaces, other modes are unsupported.
> 
> What is 'active-passive'?! Is the same of 'active-backup'? seems a
> terminology inconsistency to me...
> 
> -- 
>  Se non trovi nessuno vuol dire che siamo scappati alle sei-shell (bash,
>  tcsh,csh...)                            (Possi)
> 
> 
> 
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
> 


From elacunza at binovo.es  Mon May 30 11:55:51 2022
From: elacunza at binovo.es (Eneko Lacunza)
Date: Mon, 30 May 2022 11:55:51 +0200
Subject: [PVE-User] PVE 7.2 - Avago MegaRAID broken?
In-Reply-To: <mailman.95.1652979066.356.pve-user@lists.proxmox.com>
References: <mailman.94.1652972304.356.pve-user@lists.proxmox.com>
 <20220519183138.1647bed6@rosa.proxmox.com>
 <mailman.95.1652979066.356.pve-user@lists.proxmox.com>
Message-ID: <fe41cdc5-de91-05b0-bbca-cdebb930f744@binovo.es>

Hi,

El 19/5/22 a las 18:50, Eneko Lacunza via pve-user escribi?:
>
> El 19/5/22 a las 18:31, Stoiko Ivanov escribi?:
>>
>>> Today we installed PVE 7.1 (ISO) in a relatively old machine.
>> any more details on what kind of machine this is
>> (CPU generation, if it's an older HP/Dell/Supermicro server or
>> consumerhardware)?
>
> The system is in a customer site, but I'll try to gather detailed data 
> tomorrow. CPU is a Xeon E or E3, I can't recall exact model right now. 
> Server has Asus motherboard; this puts it in consumerserver or 
> something like that I guess :-)

System is:

Asus P10S-C BIOS 4402
Xeon E3-1270 v6 3.8

lspci shows card as:
Broadcom / LSI MegaRAID SAS-3 3008 [Fury]

>
>> Kernel will print lots of messages like
>>
>> "DMAR: DRHD: handling fault status reg 3"
>> "DMAR: [DMA Read NO_PASID] Request device [02:00.0] fault ???? 4b311000
>> [fault reason 0x06] PTE Readaccess is not set.
>> could you please try (in that order, and until one the suggestions fixes
>> the issue):
>> * adding `iommu=pt` to the kernel cmdline
>> * adding `intel_iommu=off` to the kernel cmdline
>> we have updated the known-issues section of the release-notes to suggest
>> this already after a few similar reports with older hardware/unusual
>> setups in our community forum:
>> https://pve.proxmox.com/wiki/Roadmap#7.2-known-issues
>
> Ok, I didn't notice those known issues, will check them next time. I 
> think I will be unable to try this shortly as system is not local, but 
> if I can will report back, thanks.

This fixes the issue (iommu=pt). Thanks a lot for pointing this out.


Regards

Eneko Lacunza
Zuzendari teknikoa | Director t?cnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2? izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


From rightkicktech at gmail.com  Tue May 31 17:17:57 2022
From: rightkicktech at gmail.com (Alex K)
Date: Tue, 31 May 2022 18:17:57 +0300
Subject: [PVE-User] Mount issue of thin LVM on boot
Message-ID: <CABMULtJUs5ckwkBf+fFh2FPy6HvMwE_Tu3pdz7Ya2xMnoYfiqw@mail.gmail.com>

Hi All,

I have created a thin LVM on top the existing *data* thin pool, as follows:

lvcreate -V2T -T pve/data -n vms

The resulting *vms* LVM is a thin LVM volume which then I format and mount
it at boot.
I perform this so as to be able to setup a gluster setup with three nodes
where the bricks will resize in the mountpoint of the thin LVM volume.

I have observed that when the hosts boot randomly, perhaps after a power
cut, they might temporarily lose quorum, which is expected until the nodes
are booted and the quorum is met. On loss of the quorum the mount point of
the thin LVM I have created is not able to mount and I suspect that this is
due to Proxmox not enabling the data thin pool if quorum is not met.

Is this the case? Can someone confirm this or provide any idea/hint what
could be wrong with this approach? I was thinking of checking if creating
the thin LVM on top of a different pool and volume group that Proxmox does
not manage might resolve the issue.

Thanx,
Alex