Parallel VM creation/destruction issue
elacunza at binovo.es
Mon Jul 5 12:36:57 CEST 2021
We have split the BIG 88-node cluster into 6 clusters of 15 nodes each
(there were some spare servers); things seem much better now :)
Sadly, we are seeing some issues when the VDI management system (USD
Enterprise) performs mass destruction and creation of VMs (on the order
of 100s or even 1000s). In a fraction of the clone operations, the
clone fails with the following message:
"Error: clone failed. Failed to change directory to
'/mnt/pve/vdi-prod1/images/103': No such file or directory at
/usr/share/perl5/PVE/Storage/Plugin.pm line 708."
This happens when the destroy for that VMID took place only a few
seconds earlier (5s-14s, for example). When another clone tries to use
that VMID later (as soon as 54s after destruction), it works OK.
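
As a possible workaround on the VDI broker side, recently destroyed
VMIDs could be kept out of circulation for a while. The sketch below is
only an illustration of that idea: the class name VmidCooldown and the
60 s default are assumptions of mine (chosen because reuse after 54s
was seen to work), not anything PVE or the VDI system provides.

```python
import time


class VmidCooldown:
    """Remember when each VMID was destroyed and hold it back for a while.

    Hypothetical helper, not part of PVE. cooldown_s is an assumed value,
    not a measured upper bound for the delay.
    """

    def __init__(self, cooldown_s=60.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self._clock = clock          # injectable so the logic can be tested
        self._destroyed_at = {}      # vmid -> clock value at destruction

    def mark_destroyed(self, vmid):
        """Record that the destroy of this VMID has just been issued."""
        self._destroyed_at[vmid] = self._clock()

    def is_reusable(self, vmid):
        """True if the VMID was never destroyed or its cooldown has passed."""
        destroyed = self._destroyed_at.get(vmid)
        return destroyed is None or self._clock() - destroyed >= self.cooldown_s

    def first_reusable(self, candidates):
        """Return the first candidate VMID that is out of cooldown, or None."""
        for vmid in candidates:
            if self.is_reusable(vmid):
                return vmid
        return None
```

The broker would call mark_destroyed() after each destroy and pick the
target VMID for a clone with first_reusable(). This only avoids the
window; it does not explain or fix the underlying cause.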
PVE version is 6.4, installed from ISO (details below), and storage is
NFS 4.2 with pNFS, with two pairs of NetApp servers in HA.
It seems like a "race condition" is happening, where the node that is
cloning is late to see that the destruction removed the storage
directory (?).
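
One way to check this suspicion would be to measure, on the cloning
node, how long the image directory stays visible after another node has
destroyed the VM. The following is a minimal sketch under that
assumption; the function name seconds_until_absent and its defaults are
made up for illustration. Note that the answer may come from the NFS
client's cache, which is exactly the effect being measured.

```python
import os
import time


def seconds_until_absent(path, timeout_s=120.0, interval_s=0.5):
    """Poll until this node no longer sees `path` as a directory.

    Returns the elapsed seconds when the directory is first seen as gone,
    or None if it is still visible once timeout_s has elapsed.
    Hypothetical diagnostic helper, not part of PVE.
    """
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if not os.path.isdir(path):
            return elapsed
        if elapsed >= timeout_s:
            return None
        time.sleep(interval_s)


if __name__ == "__main__":
    # Path taken from the error message above; start this on one node
    # right when the destroy of VM 103 is issued from another node.
    delay = seconds_until_absent("/mnt/pve/vdi-prod1/images/103")
    print("directory gone after (s):", delay)
```

If the reported delay is regularly in the range of several seconds,
that would support the idea that the removal is seen late by the
client node.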
I have checked "qemu-server.git/PVE/QemuServer.pm:sub destroy_vm" and I
see that the storage disks are freed first and the VM config is removed
after that, which seems quite correct. Could it be that the NFS servers
are a bit "late" propagating the directory removal to the client nodes?
Zuzendari teknikoa | Director técnico
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun