[PVE-User] DLM bug and quorum device
Laurent LEGENDRE
geminux50 at gmail.com
Thu Jul 12 15:09:43 CEST 2012
Hi,
We are building a 2-node cluster (proxmoxdev1 and proxmoxdev2) with:
- LVM on iSCSI as shared storage
- Dell BMC IPMI cards as fencing devices
- An iSCSI quorum disk
Each server has 2 NICs: one for the storage network (iSCSI), one for user
access and cluster communication (these will be separated onto a third NIC in
the future).
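For context, the quorum disk and its heuristic live in
/etc/cluster/cluster.conf. The excerpt below is only an illustration of the
shape of such a declaration; the label, timings, and ping target are
placeholders, not our actual values:

```xml
<!-- Illustrative excerpt only; label, timings and ping target are placeholders -->
<quorumd interval="1" tko="10" votes="1" label="pvequorum">
  <!-- the heuristic that fails when the user-access NIC is unplugged -->
  <heuristic program="ping -c1 -w1 192.168.1.254" interval="2" score="1" tko="3"/>
</quorumd>
```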
Software versions used :
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-10-pve: 2.6.32-63
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1
Both nodes quorate, live migration works... Now let's run this scenario:
- Unplug the user access NIC on proxmoxdev2
- The heuristic checks fail, proxmoxdev2 is fenced, and resources restart on
proxmoxdev1
- proxmoxdev2 reboots and does NOT quorate. This is normal: the NIC is still
unplugged.
- Replug the NIC and check the logs (detail lines have been removed):
Jul 12 14:02:57 proxmoxdev2 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link
becomes ready
Jul 12 14:03:08 proxmoxdev2 corosync[1589]: [CLM ] CLM CONFIGURATION
CHANGE
Jul 12 14:03:08 proxmoxdev2 corosync[1589]: [TOTEM ] A processor joined
or left the membership and a new membership was formed.
Jul 12 14:03:28 proxmoxdev2 pmxcfs[1473]: [status] notice: node has quorum
Jul 12 14:03:28 proxmoxdev2 corosync[1589]: [MAIN ] Completed service
synchronization, ready to provide service.
Jul 12 14:03:28 proxmoxdev2 pmxcfs[1473]: [dcdb] notice: all data is up to
date
Jul 12 14:03:29 proxmoxdev2 rgmanager[1997]: Quorum formed
Jul 12 14:03:29 proxmoxdev2 kernel: dlm: no local IP address has been set
Jul 12 14:03:29 proxmoxdev2 kernel: dlm: cannot start dlm lowcomms -107
Jul 12 14:03:31 proxmoxdev2 corosync[1589]: [QUORUM] Members[2]: 1 2
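For what it's worth, the -107 in "cannot start dlm lowcomms -107" looks like
the negated Linux errno 107 (ENOTCONN), which would be consistent with the
preceding "no local IP address has been set" line: the DLM's lowcomms layer
has no local cluster address to bind to, so its socket setup fails. A quick
check (the errno numbering is Linux-specific, as on these nodes):

```python
import errno
import os

# -107 in the kernel log is most likely -ENOTCONN: with no local cluster
# address configured, the DLM lowcomms socket setup cannot connect.
print(errno.errorcode[107])         # ENOTCONN
print(os.strerror(errno.ENOTCONN))  # Transport endpoint is not connected
```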
The "kernel: dlm" error lines seem to refer to a known bug already fixed by
Red Hat (rhbz#688154 and rhbz#679274).
Apparently, it is a bad timer check in qdiskd which breaks the quorum votes.
Here's a diff from Red Hat:
https://www.redhat.com/archives/cluster-devel/2011-March/msg00074.html
Another link: http://comments.gmane.org/gmane.linux.redhat.cluster/19598
No services (pvevm) are shown and rgmanager is not running on proxmoxdev2.
Running clustat on proxmoxdev2 returns:
Member Status: Quorate
Member Name ID Status
--------------------------------------------
proxmoxdev1 1 Online
proxmoxdev2 2 Online, Local
/dev/block/8:17 0 Online, Quorum Disk
Running clustat on proxmoxdev1 returns:
Member Status: Quorate
Member Name ID Status
--------------------------------------------
proxmoxdev1 1 Online, Local, rgmanager
proxmoxdev2 2 Online
/dev/block/8:17 0 Online, Quorum Disk
Service Name Owner (Last) State
----------------------------------------------------------
pvevm:100 proxmoxdev1 started
The only way to get back a fully functional 2-node cluster is to manually
restart proxmoxdev2 AFTER replugging the NIC.
Is it really the same bug as the Red Hat one, and is there a workaround in
Proxmox?
Thanks