[PVE-User] lxc hang situation
Stephan Leemburg
sleemburg at it-functions.nl
Fri Nov 30 16:51:50 CET 2018
Hi @proxmox,
For some months now we have been experiencing frequent 'hang' situations on our
Proxmox nodes.
Today such a situation occurred again, so we took some time to look at the
situation at hand.
The situation 'started' when we did a
pct start 1310
This did not return. Looking at the process list showed the following situation:
21462 ? Ss 0:00 /usr/bin/lxc-start -n 1310
21619 ? Z 0:00 \_ [lxc-start] <defunct>
21758 ? Ss 0:00 [lxc monitor] /var/lib/lxc 1310
24681 ? D 0:00 \_ [lxc monitor] /var/lib/lxc 1310
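For anyone who wants to spot such processes quickly, here is a minimal sketch that lists everything in uninterruptible sleep (state D) together with its wait channel. The function names filter_d_state and list_blocked are made up for this mail, and the ps options assume the procps version of ps as shipped with Debian.

```shell
#!/bin/sh
# Sketch only: filter_d_state and list_blocked are hypothetical helper names.

# Keep only the lines whose second column (STAT) starts with "D",
# i.e. processes in uninterruptible sleep. Reads ps-style output on stdin,
# so it can also be run against saved output.
filter_d_state() {
    awk '$2 ~ /^D/ { print }'
}

# Print PID, state, wait channel and command line of all D-state processes.
# Assumes procps ps; the header line is dropped by the filter because
# "STAT" does not start with "D".
list_blocked() {
    ps -eo pid,stat,wchan:32,args | filter_d_state
}
```

Calling list_blocked on an affected node should show the hanging lxc monitor process, and the WCHAN column gives a first hint of where it is stuck.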
When looking at the wait-channel, the namespaces and the stack of 24681 we
noticed that it was blocked in
[<0>] copy_net_ns+0x
After some more searching, we found with
grep copy_net_ns /proc/[0-9]*/stack
that there were 2 more processes also blocked on copy_net_ns. These were
two ionclean processes in other containers. Killing them (with -9) showed
that the restarted ionclean processes immediately blocked again on copy_net_ns.
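The grep above only prints the matching stack lines. As a rough sketch, the same search can be wrapped so that it prints the PID and command name of each blocked process. The function name find_blocked and its optional second argument (an alternative proc root, handy for trying it out on a copy) are made up for this mail. Reading /proc/<pid>/stack normally requires root.

```shell
#!/bin/sh
# Sketch only: find_blocked is a hypothetical helper name.
# Usage: find_blocked SYMBOL [PROC_ROOT]
# Prints "PID COMM" for every process whose kernel stack mentions SYMBOL.
find_blocked() {
    symbol=$1
    proc_root=${2:-/proc}
    for stack in "$proc_root"/[0-9]*/stack; do
        # Skip unreadable entries and the unexpanded glob when nothing matches.
        [ -r "$stack" ] || continue
        # -F: treat SYMBOL as a fixed string; the process may exit while we look.
        if grep -qF -- "$symbol" "$stack" 2>/dev/null; then
            dir=${stack%/stack}
            pid=${dir##*/}
            comm=$(cat "$dir/comm" 2>/dev/null)
            printf '%s %s\n' "$pid" "${comm:-?}"
        fi
    done
}
```

Running find_blocked copy_net_ns as root would then list the lxc monitor and the other blocked processes in one go.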
The system on which Proxmox is running has 2 Intel(R) Xeon(R) E5-2690 v4
CPUs, each with 14 cores and 28 threads. With multithreading enabled, Proxmox
shows this as 56 CPUs. So real concurrency is possible.
The problem seems like a race condition on some resource. But killing (with -9)
all the processes that are hanging on copy_net_ns does not make the kernel
release the contended resource. After killing all the processes on copy_net_ns
and with no process having a stack showing copy_net_ns, starting a new container
immediately blocks again on copy_net_ns. So only a reboot (as far as we know)
solves this.
We played around with ip li set netns, on the veth devices, etc. but we could
not get the machine out of this situation in any way other than a reboot.
Based on all this we found that in
https://github.com/lxc/lxd/issues/4468
it says that this problem should be solved in kernel 4.17.
We run the latest Proxmox enterprise updates on this machine and its kernel is
PVE 4.15.18-30 (Thu, 15 Nov 2018 13:32:46 +0100)
As the kernel is Ubuntu based, would it be possible to start using the Ubuntu
18.10 kernel, which is 4.18, to get around this problem?
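To check on several nodes whether the running kernel is at least the 4.17 mentioned in the LXD issue, something like the following sketch could be used. The function name version_at_least is made up for this mail, and the comparison assumes GNU sort with the -V (version sort) option, as found on Debian.

```shell
#!/bin/sh
# Sketch only: version_at_least is a hypothetical helper name.
# Usage: version_at_least HAVE WANT
# Succeeds (exit status 0) if version HAVE is greater than or equal to WANT.
# Assumes GNU sort, which provides -V for version ordering.
version_at_least() {
    # The smaller of the two versions sorts first; HAVE >= WANT exactly
    # when that smallest one is WANT.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, version_at_least "$(uname -r)" 4.17 && echo "kernel is 4.17 or newer" prints nothing on a 4.15.18 kernel.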
--
Kind regards,
Stephan Leemburg
IT Functions