[PVE-User] Weird backup hang this night
Eneko Lacunza
elacunza at binovo.es
Mon Dec 29 17:58:32 CET 2014
Hi all,
After restoring the affected services, I'm in the process of troubleshooting
a backup hang that occurred last night on a client's cluster.
I have worked out the sequence of events, but can't determine the cause of
the problem.
This is a 3-node cluster, proxmox-ve-2.6.32: 3.3-139 (running kernel:
2.6.32-34-pve) 2.6.32-139
We have 2 virtualization nodes (each also running a ceph mon and 1 ceph
OSD), plus one additional node for quorum, additional backups and an
additional ceph monitor.
Virtualization nodes are IBM x3550M2 1xCPU Intel Xeon E5506, 12GB RAM,
1xNIC VM net, 1xNIC proxmox net, 2xNIC bond ceph net:
- proxmox1: 2x72G for system, 1xSSD for ceph OSD Intel S3500 300GB
- proxmox2: 2x72G for system, 1xSSD for ceph OSD Crucial M550 256GB,
2x1TB for backups exported by NFS
All three nodes are connected to the very same network switch (Procurve
8x1gbit).
Ceph OSDs have an inline journal. I know the Crucial M550 is not a great
drive for the task, but it was a fast and cheap replacement for an even
worse Samsung 840 Pro.
Events are as follows:
00:00:00 Backup job starts on proxmox1/proxmox2 to nfs-backup (hosted on
proxmox2 2x1TB RAID1)
00:00:00 proxmox1 vzdump of firewall2 VM (5GB) hosted on ceph, normally
takes seconds to complete (not this time)
00:00:03 proxmox2 vzdump of backup VM (5GB system disk, data disk is not
dumped) hosted on local storage (the very same dir exported by NFS
nfs-backup), hangs after ~33s at 23%
00:00:46 proxmox2 pvestatd[3288]: WARNING: command 'df -P -B 1
/mnt/pve/nfs-backup' failed: got timeout [this happens too on regular OK
backups]
00:00:53 proxmox1/proxmox2/proxmox3 Ceph monitors report heartbeat_map
is_healthy timeout after 15
00:01:07 proxmox1/proxmox2/proxmox3 Ceph monitors report 1 slow request
(>30.54s) (read). [Slow requests start to queue (mostly writes)]
00:03:08.59 proxmox2 Ceph OSD.1 assert failed (hit suicide timeout)
00:03:08.63 proxmox2 Ceph OSD.1 Caught signal (aborted)
[...]
08:36:xx Windows 2012 VM1 on proxmox2 responds to Ctrl+Alt+Del
08:36:xx Windows 2012 VM2 on proxmox2 doesn't respond to Ctrl+Alt+Del
08:36:xx Operator tries to stop/start Windows 2012 VM2
08:38:49 proxmox2 kernel: INFO: task ksmd:62 blocked for more than 120
seconds. (+call trace)
08:38:49 proxmox2 kernel: INFO: task kvm:18654 blocked for more than 120
seconds. (+call trace)
08:38:49 proxmox2 kernel: INFO: task kvm:18690 blocked for more than 120
seconds. (+call trace)
08:38:49 proxmox2 kernel: INFO: task task UPID:proxm:1009852 blocked for
more than 120 seconds. (+call trace, this is the backup task)
08:40:49 proxmox2 kernel: INFO: task ksmd:62 blocked for more than 120
seconds. (+call trace)
08:4x:00 proxmox2 Operator reset via Linux sysctl
08:45:xx proxmox2 Start
08:45:xx proxmox2 Start all VMs (backup VM fails due to lock, qm unlock
and start works OK)
08:52:38 proxmox1 Backup task ends. The firewall dump took more than 8.5h;
it seems to have unblocked after proxmox2 was reset :)
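The offsets in the timeline above can be checked with a few lines of Python. This is only a rough sketch: it assumes that the "~33s" hang is counted from the 00:00:03 vzdump start, and that the 00:01:07 log line was written when the slow request had already been pending for 30.54s. The variable names are made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

def at(stamp):
    """Parse an HH:MM:SS timestamp from the timeline (date part is a dummy)."""
    return datetime.strptime(stamp, FMT)

backup_start = at("00:00:00")
# Assumption: the ~33s until the hang is counted from the vzdump start on proxmox2.
vzdump_hang = at("00:00:03") + timedelta(seconds=33)
# Assumption: the monitors logged the request once it had been pending for 30.54s.
slow_req_issued = at("00:01:07") - timedelta(seconds=30.54)
osd_abort = at("00:03:08")
backup_end = at("08:52:38")

print("vzdump on proxmox2 hangs at about:", vzdump_hang.time())
print("slow request issued at about:     ", slow_req_issued.time())
print("backup start -> OSD.1 abort:      ", osd_abort - backup_start)
print("backup start -> proxmox1 task end:", backup_end - backup_start)
```

Under those assumptions, the vzdump hang on proxmox2 and the first slow Ceph request fall within about one second of each other.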
This same backup job worked flawlessly one week ago (22nd Dec). Some
configs were changed on 19th Dec.
I see the NFS server on proxmox2 hanging. This is not the first time I
have seen this, and it wouldn't surprise me if it weren't for the OSD.1
abort event. I also see ceph RBD operations queueing on OSD.1, and this
finally getting it out of the map and making it abort itself.
I can't see how these two events correlate with each other, but I don't
believe this was a coincidence. The networks for NFS (proxmox net) and Ceph
use different physical ports and logical networks (same VLAN though).
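One way to see when the NFS mount stops answering during the next backup run would be to poll the same df call that pvestatd timed out on at 00:00:46, with a deadline. The following is a minimal sketch, not a tested tool: the function name probe, the 5 second default and the return strings are made up for illustration, and only the df command line comes from the log above.

```python
import subprocess

# The command pvestatd reported as timing out in the log above.
DF_CMD = ["df", "-P", "-B", "1", "/mnt/pve/nfs-backup"]

def probe(cmd, timeout_s=5.0):
    """Run cmd and return 'ok', 'error' or 'timeout' (hypothetical helper)."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in uninterruptible I/O may not
        # die right away, so we deliberately do not wait for it here.
        proc.kill()
        return "timeout"
    return "ok" if rc == 0 else "error"
```

Calling probe(DF_CMD) once a minute and logging the result next to the Ceph logs might show whether the mount stalls before or after the slow requests begin.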
All disks are on the same IBM ServeRAID-MR10i SAS/SATA Controller:
FW Version : 1.40.282-1279
BIOS Version : 2.07.00
WebBIOS Version : 2.2-22-e_14-Rel
Preboot CLI Version: 01.40-010:#%00008
Boot Block Version : 1.00.00.01-0012
I was able to SSH to proxmox2, and Proxmox web GUI worked ok too.
If anyone has any suggestions, I'd appreciate it.
Thanks a lot
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es