Parallel VM creation/destruction issue

Mon Jul 5 12:36:57 CEST 2021

Hi all,

We have split the BIG 88 node cluster into 6 clusters of 15 nodes each 
(there were some spare servers); now things seem much better :)

Sadly, we are seeing some issues when the VDI management system (USD 
Enterprise) performs mass (in the order of 100s or even 1000s) 
destruction and creation of VMs. In a fraction of the clone operations, 
the clone fails with the following message:

"Error: clone failed. Failed to change directory to 
'/mnt/pve/vdi-prod1/images/103': No such file or directory at 
/usr/share/perl5/PVE/Storage/Plugin.pm line 708."

This happens when the destroy for that VMID ran only some seconds before 
the clone (5s-14s, for example). When another clone tries to use that 
VMID later (as soon as 54s after destruction), it works OK.
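
Since a later clone of the same VMID succeeds, a stopgap on the calling 
side could be to retry a failed clone after a short pause. The following 
is only a minimal sketch of that idea: retry_clone and the callable 
passed to it are hypothetical names, not part of PVE or of the VDI 
management system, and the attempt count and delay are guesses.

```python
import time


def retry_clone(clone_operation, attempts=5, delay=10.0, retry_on=(Exception,)):
    """Call clone_operation() until it succeeds or attempts run out.

    clone_operation is a hypothetical zero-argument callable that performs
    one clone and raises an exception on failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            return clone_operation()
        except retry_on:
            # Give up and re-raise the last error once attempts are exhausted.
            if attempt == attempts:
                raise
            # Wait before trying again, so the storage view has time to settle.
            time.sleep(delay)
```

This hides the symptom only; it does not explain why the directory is 
missing in the first place.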

PVE version is 6.4 ISO (details below), and storage is NFS 4.2 with pNFS 
with two pairs of NetApp servers in HA.

It seems like a "race condition" is happening, where the node that is 
cloning learns late (?) that the destruction removed the storage 
directory.
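
One way to test this hypothesis would be to remove a test directory on 
the storage from one node and measure, on another node, how long the 
directory still appears to exist. A minimal sketch of the measuring 
side follows; wait_until_gone is a hypothetical helper, and the timeout 
and polling interval are arbitrary. It assumes that a plain stat-based 
check on the second node reflects what that NFS client currently 
believes about the directory.

```python
import os
import time


def wait_until_gone(path, timeout=60.0, interval=0.5):
    """Poll until path is no longer seen as a directory.

    Returns the number of seconds waited, or None if the directory is
    still visible when the timeout expires.
    """
    start = time.monotonic()
    while True:
        # os.path.isdir does a stat() on the path as this client sees it.
        if not os.path.isdir(path):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

A lag of several seconds on the second node would point at client-side 
caching or server-side propagation rather than at the destroy code.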

I have checked "qemu-server.git/PVE/QemuServer.pm:sub destroy_vm" and I 
see that the storage disks are freed first and the VM config is removed 
after that, which seems quite correct. Could it be that the NFS servers 
are a bit "late" propagating the directory removal to the client nodes?

Any ideas?

Thanks

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/