[pve-devel] pveceph createmon timeout in probing state

Alwin Antreich a.antreich at proxmox.com
Mon Oct 22 10:30:37 CEST 2018


Hi Lorenz,

On Fri, Oct 19, 2018 at 08:13:35PM +0200, Lorenz Schori wrote:
> Hi,
> 
> This is about a 3-node PVE ceph cluster where one monitor failed and a
> 3-day struggle to revive it again.
> 
> First day
> I just waited and hoped that Ceph would do its magic and heal itself.
> Of course I also checked everything else (firewall, network, time),
> moved VMs around and rebooted nodes. Nothing.
> 
> Second day
> I scrapped the failing monitor and attempted to recreate it. Repeatedly.
> 
> pveceph destroymon --exclude-manager 1 <nodename>
> pveceph createmon --exclude-manager 1
> admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
> INFO:ceph-create-keys:ceph-mon admin socket not ready yet.
> INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
> INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
> [...]
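
While createmon sits in this loop, the same state can also be checked
directly via the monitor's admin socket (a minimal sketch, assuming the
default socket path and that the mon id matches the node name):

# shows "state": "probing" plus the peers the new mon knows about
ceph daemon mon.<nodename> mon_status
# or point the ceph CLI at the socket explicitly
ceph --admin-daemon /var/run/ceph/ceph-mon.<nodename>.asok mon_status
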
> 
> This went on until it eventually failed after 10 minutes. I checked
> logs and looked for hints in bug reports, mailing lists and forum posts.
> I lost a couple of hours digging into cephx because of the following
> log lines, which suggest authentication problems:
> 
> 2018-10-19 16:14:16.162125 XXX  0 mon.<leader-name>@0(leader) e4 ms_verify_authorizer bad authorizer from mon <node-ip>:6789/0
> 2018-10-19 16:14:16.162179 XXX  0 -- <leader-ip>:6789/0 >> <node-ip>:6789/0 conn(<con-id> :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer
> 
> The "bad authorizer" line does not imply that the prospective monitor
> actually sent bad credentials. The relevant line is the second one,
> which indicates that the leader node merely responds to the prospective
> monitor with a challenge.
> 
> At least it was clear by now that the prospective monitor was actually
> communicating with the other nodes. But other than that, still nothing.
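
A side note for anyone hitting the same red herring: to rule out cephx
quickly, it is usually enough to compare the key the cluster expects with
the one on disk (a rough sketch; the keyring path assumes the default
cluster name "ceph" and that the mon id matches the node name):

# on any healthy node: the mon. key the cluster expects
ceph auth get mon.
# on the prospective monitor: the key written to its data directory
cat /var/lib/ceph/mon/ceph-<nodename>/keyring
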
> 
> Third day
> I spent a lot of time producing an Ansible script which bumps up log
> levels, captures network traffic and collects logs and pcap files on
> all nodes at the same time. This made it a bit easier to correlate
> debug information. The following log lines are key to understanding
> the problem:
> 
> on leader:
> 2018-10-19 16:14:16.162506 XXX 10 -- <leader-ip>:6789/0 >> <node-ip>:6789/0 conn(<con-id> :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept setting up session_security.
> 2018-10-19 16:14:16.162516 XXX 10 -- <leader-ip>:6789/0 >> <node-ip>:6789/0 conn(<con-id> :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept existing <other-con-id>.gseq 0 <= 2, looks ok
> 2018-10-19 16:14:16.162521 XXX  0 -- <leader-ip>:6789/0 >> <node-ip>:6789/0 conn(<con-id> :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
> 2018-10-19 16:14:16.162533 XXX 10 -- <leader-ip>:6789/0 >> <node-ip>:6789/0 conn(<con-id> :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connection race, existing <other-con-id>.cseq 0 == 0, sending WAIT
> 
> on prospective monitor:
> 2018-10-19 16:14:16.162724 XXX  1 -- <node-ip>:6789/0 >> <leader-ip>:6789/0 conn(<con-id> :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got WAIT (connection race)
> 2018-10-19 16:14:16.162735 XXX  1 -- <node-ip>:6789/0 >> <leader-ip>:6789/0 conn(<con-id> :-1 s=STATE_WAIT pgs=0 cs=0 l=0).process enter wait state, failing
> 
> The leader instructs the prospective monitor to reuse a connection
> which is supposedly already open. However, this is the very first
> attempt of a freshly created monitor to talk to the leader. There
> simply is no connection that could be reused.
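
For reference, the messenger lines quoted above only show up at a raised
debug level. Without a full Ansible setup, one way (a sketch; the output
is very chatty, so turn it back down afterwards) is to bump debug_ms on a
running monitor via its admin socket:

# raise messenger debugging on the current leader
ceph daemon mon.<leader-name> config set debug_ms 10/10
# ... retry pveceph createmon, watch /var/log/ceph/ceph-mon.<leader-name>.log ...
# restore the default afterwards
ceph daemon mon.<leader-name> config set debug_ms 0/5
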
> 
> Interestingly, it is possible to inspect open connections/sessions using
> the following command:
> 
> ceph daemon mon.<leader-name> sessions
> 
> There should be some sessions for OSDs, other monitors and managers. If
> the output lists many sessions which are not obviously related to any
> of these components, this might indicate stale connections. I have no
> idea what circumstances lead to that situation.
This might come from Ceph services that are not responding (properly). It
might well fit your failed MON.
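
For whoever runs into this, a quick way to spot suspicious entries is to
filter the sessions dump for anything that is not a mon, mgr or osd (a
sketch, assuming jq is available and the Luminous-style output, i.e. a
JSON array of "MonSession(...)" strings; newer releases format this
differently):

ceph daemon mon.<leader-name> sessions | jq -r '.[]' | grep -Ev 'MonSession\((mon|mgr|osd)\.'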

> 
> Long story short, in order to persuade the forlorn monitor to rejoin
> its pals, it was enough to simply restart the ceph-mon processes on the
> other nodes shortly before calling pveceph createmon.
Depending on the state in which the MON failed, it might have left the
connection open. AFAIK, destroying the MON will not close any leftover
connections.
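
For the archives, the workaround spelled out as commands (a sketch; the
unit names assume that the mon id equals the node name, and the mons on
the healthy nodes should be restarted one after the other so quorum is
kept):

# on each of the other nodes, one at a time, so quorum is never lost
systemctl restart ceph-mon@<other-nodename>.service
# then, shortly afterwards, on the node with the failed monitor
pveceph createmon --exclude-manager 1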

> 
> I'm not sure whether this merits a bug report. I have no idea where to
> report it in the first place since it is not clear to me if this is a
> problem in PVE, ceph or any of the gazillion cluster components.
A bug report is never wrong; it's just a different form of tracked
discussion. If it isn't a bug, it can always be closed. ;)

But I would suggest, for a start, also asking on the ceph-users list
whether anything like this is known. As the component that failed is a
MON, it is most likely a Ceph issue (if it is an issue at all).

--
Cheers,
Alwin



