[PVE-User] Weird backup hang this night

Eneko Lacunza elacunza at binovo.es
Mon Dec 29 17:58:32 CET 2014


Hi all,

After restoring the affected services, I'm troubleshooting a backup 
hang that occurred last night on a client's cluster.

I have reconstructed the sequence of events, but can't determine the cause of the problem.

This is a 3-node cluster, proxmox-ve-2.6.32: 3.3-139 (running kernel: 
2.6.32-34-pve) 2.6.32-139

We have 2 virtualization nodes (each also running a ceph mon and 1 ceph 
osd), plus one additional node for quorum, additional backups and an 
additional ceph monitor.

Virtualization nodes are IBM x3550M2 1xCPU Intel Xeon E5506, 12GB RAM, 
1xNIC VM net, 1xNIC proxmox net, 2xNIC bond ceph net:
- proxmox1: 2x72G for system, 1xSSD for ceph OSD Intel S3500 300GB
- proxmox2: 2x72G for system, 1xSSD for ceph OSD Crucial M550 256GB, 
2x1TB for backups exported by NFS

All three nodes are connected to the very same network switch (Procurve 
8x1gbit).
Ceph OSDs have an inline journal. I know the Crucial M550 is not a great 
drive for the task, but it was a fast and cheap replacement for an even 
worse Samsung 840 Pro.

Events are as follows:
00:00:00 Backup job starts on proxmox1/proxmox2 to nfs-backup (hosted on 
proxmox2 2x1TB RAID1)
00:00:00 proxmox1 vzdump of firewall2 VM (5GB) hosted on ceph, normally 
takes seconds to complete (not this time)
00:00:03 proxmox2 vzdump of backup VM (5GB system disk, data disk is not 
dumped) hosted on local storage (the very same dir exported by NFS 
nfs-backup), hangs after ~33s at 23%
00:00:46 proxmox2 pvestatd[3288]: WARNING: command 'df -P -B 1 
/mnt/pve/nfs-backup' failed: got timeout [this happens too on regular OK 
backups]
00:00:53 proxmox1/proxmox2/proxmox3 Ceph monitors report heartbeat_map 
is_healthy timeout after 15
00:01:07 proxmox1/proxmox2/proxmox3 Ceph monitors report 1 slow request 
(>30.54s) (read). [Slow requests start to queue (mostly writes)]
00:03:08.59 proxmox2 Ceph OSD.1 assert failed (hit suicide timeout)
00:03:08.63 proxmox2 Ceph OSD.1 Caught signal (aborted)
[...]
08:36:xx Windows 2012 VM1 on proxmox2 responds to Ctrl+Alt+Del
08:36:xx Windows 2012 VM2 on proxmox2 doesn't respond to Ctrl+Alt+Del
08:36:xx Operator tries to stop/start Windows 2012 VM2
08:38:49 proxmox2 kernel: INFO: task ksmd:62 blocked for more than 120 
seconds. (+call trace)
08:38:49 proxmox2 kernel: INFO: task kvm:18654 blocked for more than 120 
seconds. (+call trace)
08:38:49 proxmox2 kernel: INFO: task kvm:18690 blocked for more than 120 
seconds. (+call trace)
08:38:49 proxmox2 kernel: INFO: task task UPID:proxm:1009852 blocked for 
more than 120 seconds. (+call trace, this is the backup task)
08:40:49 proxmox2 kernel: INFO: task ksmd:62 blocked for more than 120 
seconds. (+call trace)
08:4x:00 proxmox2 Operator reset via Linux sysctl
08:45:xx proxmox2 Start
08:45:xx proxmox2 Start all VMs (backup VM fails due to lock, qm unlock 
and start works OK)
08:52:38 proxmox1 Backup task ends. The firewall2 dump took over 8.5h; it 
seems to have unblocked after proxmox2 was reset :)
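
To make the gaps between the key events easier to read, the timeline above can be turned into offsets from the start of the backup job. This is only a minimal sketch: the helper names (parse, elapsed, offsets) and the shortened event labels are made up for illustration, and it assumes all timestamps fall on the same day, as they do in the list above.

```python
from datetime import datetime


def parse(ts):
    """Parse 'HH:MM:SS' or 'HH:MM:SS.ff' as a time on an arbitrary fixed day."""
    fmt = "%H:%M:%S.%f" if "." in ts else "%H:%M:%S"
    return datetime.strptime(ts, fmt)


def elapsed(start, end):
    """Seconds between two same-day timestamps (assumption: no midnight wrap)."""
    return (parse(end) - parse(start)).total_seconds()


# Timestamps copied from the event list; labels are shortened by hand.
EVENTS = [
    ("00:00:00", "backup job starts"),
    ("00:00:53", "ceph monitors: heartbeat_map is_healthy timeout"),
    ("00:01:07", "ceph monitors: first slow request"),
    ("00:03:08.59", "OSD.1 hits suicide timeout"),
    ("08:52:38", "proxmox1 backup task ends"),
]


def offsets(events):
    """Return (seconds since first event, label) for each event."""
    t0 = events[0][0]
    return [(elapsed(t0, ts), label) for ts, label in events]


if __name__ == "__main__":
    for secs, label in offsets(EVENTS):
        print(f"+{secs:9.2f}s  {label}")
```

Read this way, OSD.1 aborts a little over three minutes after the job starts, while the backup task on proxmox1 only finishes almost nine hours later.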

This same backup job worked flawlessly one week ago (22nd Dec). Some 
configs were changed on 19th Dec.

I see the NFS server on proxmox2 hanging. This is not the first time I 
have seen this, and I wouldn't be surprised by it were it not for the 
OSD.1 abort event. I also see ceph RBD operations queueing on OSD.1, 
which finally led to it dropping out of the map and aborting itself.

I can't see how these two events are related, but I don't believe this 
was a coincidence. The NFS network (proxmox net) and the Ceph network 
use different physical ports and logical networks (same VLAN, though).

All disks are on the same IBM ServeRAID-MR10i SAS/SATA Controller:
FW Version         : 1.40.282-1279
BIOS Version       : 2.07.00
WebBIOS Version    : 2.2-22-e_14-Rel
Preboot CLI Version: 01.40-010:#%00008
Boot Block Version : 1.00.00.01-0012

I was able to SSH to proxmox2, and the Proxmox web GUI worked OK too.

If anyone has any suggestions, I'd appreciate it.

Thanks a lot
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
       943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



