From mark at openvs.co.uk Mon Dec 4 19:51:04 2017
From: mark at openvs.co.uk (Mark Adams)
Date: Mon, 4 Dec 2017 18:51:04 +0000
Subject: [PVE-User] HA Fencing
In-Reply-To: <0bc124d5-a080-2e38-af47-bdcf36507bc3@proxmox.com>
References: <0bc124d5-a080-2e38-af47-bdcf36507bc3@proxmox.com>
Message-ID: 

Hi,

On 17 November 2017 at 10:55, Thomas Lamprecht wrote:
> Hi,
>
> On 11/16/2017 07:20 PM, Mark Adams wrote:
> > Hi all,
> >
> > It looks like in newer versions of proxmox, the only fencing type advised
> > is watchdog. Is that the case?
> >
>
> Yes, since PVE 4.0 watchdog fencing is the norm.
> There is a patch set of mine which implements the use of external fence
> device, but it has seen no review. I should probably dust it up, look over
> it and re send it again, it's about time we finally get this feature.
>

I think you should definitely get this feature in - I would even say it is
necessary for an enterprise HA setup?

> > Is it still possible to do PDU fencing as well? This should enable us to
> > be able to fail over faster as the fence will not fail if the machine has
> > no power right?
> >
>
> No, at the moment external fence devices are not integrated.
> You can expect a faster recovery with external fence devices, at least in
> simple setups (i.e., not multiple fence device hierarchy)
>
> cheers,
> Thomas
>

From wolfgang.bucher at netland-mn.de Mon Dec 4 19:52:38 2017
From: wolfgang.bucher at netland-mn.de (Wolfgang Bucher)
Date: Mon, 4 Dec 2017 19:52:38 +0100
Subject: [PVE-User] HA Fencing
Message-ID: 

Thank you very much!

Sent via BlackBerry Hub for Android

From: mark at openvs.co.uk
Sent: 4 December 2017 19:52
To: t.lamprecht at proxmox.com
Cc: pve-user at pve.proxmox.com
Subject: Re: [PVE-User] HA Fencing

Hi,

On 17 November 2017 at 10:55, Thomas Lamprecht wrote:
> Hi,
>
> On 11/16/2017 07:20 PM, Mark Adams wrote:
> > Hi all,
> >
> > It looks like in newer versions of proxmox, the only fencing type advised
> > is watchdog. Is that the case?
> >
>
> Yes, since PVE 4.0 watchdog fencing is the norm.
> There is a patch set of mine which implements the use of external fence
> device, but it has seen no review. I should probably dust it up, look over
> it and re send it again, it's about time we finally get this feature.
>

I think you should definitely get this feature in - I would even say it is
necessary for an enterprise HA setup?

> > Is it still possible to do PDU fencing as well? This should enable us to
> > be able to fail over faster as the fence will not fail if the machine has
> > no power right?
> >
>
> No, at the moment external fence devices are not integrated.
> You can expect a faster recovery with external fence devices, at least in
> simple setups (i.e., not multiple fence device hierarchy)
>
> cheers,
> Thomas
>
_______________________________________________
pve-user mailing list
pve-user at pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user

From t.lamprecht at proxmox.com Tue Dec 5 09:52:41 2017
From: t.lamprecht at proxmox.com (Thomas Lamprecht)
Date: Tue, 5 Dec 2017 09:52:41 +0100
Subject: [PVE-User] HA Fencing
In-Reply-To: 
References: <0bc124d5-a080-2e38-af47-bdcf36507bc3@proxmox.com>
Message-ID: <4fac619c-73e9-f95a-f00c-e5817c932e4b@proxmox.com>

Hi,

On 12/04/2017 07:51 PM, Mark Adams wrote:
> On 17 November 2017 at 10:55, Thomas Lamprecht wrote:
>> On 11/16/2017 07:20 PM, Mark Adams wrote:
>>> Hi all,
>>>
>>> It looks like in newer versions of proxmox, the only fencing type advised
>>> is watchdog. Is that the case?
>>>
>>
>> Yes, since PVE 4.0 watchdog fencing is the norm.
>> There is a patch set of mine which implements the use of external fence
>> device, but it has seen no review. I should probably dust it up, look over
>> it and re send it again, it's about time we finally get this feature.
>>
>
> I think you should definitely get this feature in - I would even say it is
> necessary for an enterprise HA setup?
>

Not really necessary. Watchdog-based fencing is no less secure than
traditional fence devices. In fact, as there's much less to configure and
far fewer protocols involved, I'd say it's the opposite. I.e., you do not
have to fire a command over TCP/IP at a device to fence a node. There are
multiple potential problem points: link problems, high load delaying
fencing, fence devices with a setup that is not well tested, at least not
under failure conditions, ...
A watchdog, which triggers as soon as the node does not pull it up,
independent of link failures or cluster load, is the safer bet here. They
are often the norm in highly secure, critical embedded systems too, not
without reason.
It's the difference between an emergency shutdown button and a dead man's
switch.

Maybe you didn't even mean the reliability standpoint, but that a better
best-case SLA could be possible with fence devices?

But nonetheless, agreeing that we should really get it in. I'll try to pick
up the series before this month ends, after the Cluster over API stuff got
in.

cheers,
Thomas

From mark at openvs.co.uk Tue Dec 5 10:25:50 2017
From: mark at openvs.co.uk (Mark Adams)
Date: Tue, 5 Dec 2017 09:25:50 +0000
Subject: [PVE-User] HA Fencing
In-Reply-To: <4fac619c-73e9-f95a-f00c-e5817c932e4b@proxmox.com>
References: <0bc124d5-a080-2e38-af47-bdcf36507bc3@proxmox.com> <4fac619c-73e9-f95a-f00c-e5817c932e4b@proxmox.com>
Message-ID: 

On 5 December 2017 at 08:52, Thomas Lamprecht wrote:
> Hi,
>
> On 12/04/2017 07:51 PM, Mark Adams wrote:
> > On 17 November 2017 at 10:55, Thomas Lamprecht wrote:
> >> On 11/16/2017 07:20 PM, Mark Adams wrote:
> >>> Hi all,
> >>>
> >>> It looks like in newer versions of proxmox, the only fencing type
> >>> advised is watchdog. Is that the case?
> >>>
> >>
> >> Yes, since PVE 4.0 watchdog fencing is the norm.
> >> There is a patch set of mine which implements the use of external fence
> >> device, but it has seen no review. I should probably dust it up, look
> >> over it and re send it again, it's about time we finally get this
> >> feature.
> >>
> >
> > I think you should definitely get this feature in - I would even say it
> > is necessary for an enterprise HA setup?
> >
>
> Not really necessary. Watchdog-based fencing is no less secure than
> traditional fence devices. In fact, as there's much less to configure and
> far fewer protocols involved, I'd say it's the opposite. I.e., you do not
> have to fire a command over TCP/IP at a device to fence a node. There are
> multiple potential problem points: link problems, high load delaying
> fencing, fence devices with a setup that is not well tested, at least not
> under failure conditions, ...
> A watchdog, which triggers as soon as the node does not pull it up,
> independent of link failures or cluster load, is the safer bet here. They
> are often the norm in highly secure, critical embedded systems too, not
> without reason.
> It's the difference between an emergency shutdown button and a dead man's
> switch.
>
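(The dead-man-switch idea quoted above is easy to picture with the standard
Linux /dev/watchdog interface. The following is a conceptual sketch only,
assuming a free softdog or hardware watchdog device - it is not how PVE
implements this; on a PVE node the watchdog-mux daemon normally holds
/dev/watchdog on behalf of the HA services:

    # conceptual sketch: keep "petting" the watchdog
    while true; do
        echo 1 > /dev/watchdog   # each write resets the watchdog countdown
        sleep 10                 # must stay well below the watchdog timeout
    done

As soon as the writes stop - node hang, kernel panic, the petting process
dying - the timer expires and the machine is reset on its own, with no
network link or fence command involved.)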
AFAIK it's the only way to know for sure that your server has actually been
fenced when it is not contactable by other means, for instance some network
issue on the host.

Yes, the watchdog on the machine that goes offline should fence itself, but
still the only way to know for sure that the machine is dead is to power it
off, right?

> Maybe you didn't even mean the reliability standpoint, but that a better
> best-case SLA could be possible with fence devices?
>

This does make a difference too, it could fail over in seconds with faster
fencing.

> But nonetheless, agreeing that we should really get it in. I'll try to pick
> up the series before this month ends, after the Cluster over API stuff got
> in.
>

Thanks, it would be great to see it in.

> cheers,
> Thomas
>

From t.lamprecht at proxmox.com Tue Dec 5 11:05:11 2017
From: t.lamprecht at proxmox.com (Thomas Lamprecht)
Date: Tue, 5 Dec 2017 11:05:11 +0100
Subject: [PVE-User] HA Fencing
In-Reply-To: 
References: <0bc124d5-a080-2e38-af47-bdcf36507bc3@proxmox.com> <4fac619c-73e9-f95a-f00c-e5817c932e4b@proxmox.com>
Message-ID: 

On 12/05/2017 10:25 AM, Mark Adams wrote:
> On 5 December 2017 at 08:52, Thomas Lamprecht wrote:
>> On 12/04/2017 07:51 PM, Mark Adams wrote:
>>> On 17 November 2017 at 10:55, Thomas Lamprecht wrote:
>>>> On 11/16/2017 07:20 PM, Mark Adams wrote:
>>>>> Hi all,
>>>>>
>>>>> It looks like in newer versions of proxmox, the only fencing type
>>>>> advised is watchdog. Is that the case?
>>>>>
>>>>
>>>> Yes, since PVE 4.0 watchdog fencing is the norm.
>>>> There is a patch set of mine which implements the use of external fence
>>>> device, but it has seen no review. I should probably dust it up, look
>>>> over it and re send it again, it's about time we finally get this feature.
>>>>
>>>
>>> I think you should definitely get this feature in - I would even say it
>>> is necessary for an enterprise HA setup?
>>>
>>
>> Not really necessary. Watchdog-based fencing is no less secure than
>> traditional fence devices. In fact, as there's much less to configure and
>> far fewer protocols involved, I'd say it's the opposite. I.e., you do not
>> have to fire a command over TCP/IP at a device to fence a node. There are
>> multiple potential problem points: link problems, high load delaying
>> fencing, fence devices with a setup that is not well tested, at least not
>> under failure conditions, ...
>> A watchdog, which triggers as soon as the node does not pull it up,
>> independent of link failures or cluster load, is the safer bet here. They
>> are often the norm in highly secure, critical embedded systems too, not
>> without reason.
>> It's the difference between an emergency shutdown button and a dead man's
>> switch.
>>
>
> AFAIK it's the only way to know for sure that your server has actually
> been fenced when it is not contactable by other means, for instance some
> network issue on the host.
>

Both the fence devices and a watchdog can possibly be "wrong", thus we
*always* acquire a cluster-wide lock to ensure that we only do anything
HA-related if we're in the quorate partition and in an OK state.
With the watchdog you know for sure that it released all resources if the
node was out of the quorate partition for a certain time. We then try to
acquire the node's local resource manager lock; only then do we start
recovery of the fenced services. This lock, together with the watchdog,
guarantees us that we do not access the same resource twice.
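(The pieces described above - quorum, the CRM/LRM locks and the watchdog
feeder - can be observed with a few read-only commands on a PVE 5.x node;
this is only a quick reference and the exact output varies by version:

    ha-manager status                       # CRM master, per-node LRM state, HA service states
    pvecm status                            # corosync / quorum view of the cluster
    systemctl status watchdog-mux.service   # the daemon feeding the hardware or softdog watchdog
    journalctl -u pve-ha-crm -u pve-ha-lrm --since "1 hour ago"   # recent fencing/recovery activity

None of these commands change any state.)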
Even if the node now starts up OK again, it won't get its lock immediately
and thus won't start any HA service. Only once the recovery has taken place
and completed can it reintegrate into the cluster and do work again.
If you just power it down with an external fence device it always needs
manual intervention; with the watchdog mechanism you won't need that if the
source of the quorum loss was a temporary switch hiccup or similar - a bit
rare, but not unheard of.

> Yes, the watchdog on the machine that goes offline should fence itself, but
> still the only way to know for sure that the machine is dead is to power it
> off, right?
>

Not necessarily (see above). Also, network fencing is a thing, i.e. cut all
network links related to shared resources (storage, public network, ...).
This allows investigating the still running, but fenced off, node for the
failure reason - if desired.

>
>> Maybe you didn't even mean the reliability standpoint, but that a better
>> best-case SLA could be possible with fence devices?
>>
>
> This does make a difference too, it could fail over in seconds with faster
> fencing.
>

Depends a bit on the fencing devices used; I had some cases where it was
slower than I expected when testing, but yes, still a tad faster than the
"wait for the watchdog+lock" approach.

cheers,
Thomas

Maybe you can find some more information here, if not read already:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html

From mark at openvs.co.uk Tue Dec 5 16:43:44 2017
From: mark at openvs.co.uk (Mark Adams)
Date: Tue, 5 Dec 2017 15:43:44 +0000
Subject: [PVE-User] ZFS Replication
Message-ID: 

I'm just trying out the zfs replication in proxmox, nice work! Just a few
questions..

- Is it possible to change the network that does the replication? (i.e. it
  would be good to use a direct connection with balance-rr for throughput)
- Is it possible to replicate between machines that are not in the same
  cluster?

Both can be easily done via zfs send/recv in the CLI of course, but I wonder
if this is possible through the web interface?

And lastly, what is the correct procedure for using a replicated VM, should
it be needed?

Thanks, Mark

From gilberto.nunes32 at gmail.com Tue Dec 5 17:07:06 2017
From: gilberto.nunes32 at gmail.com (Gilberto Nunes)
Date: Tue, 5 Dec 2017 14:07:06 -0200
Subject: [PVE-User] ZFS Replication
In-Reply-To: 
References: 
Message-ID: 

In my experience, if you set the hosts in /etc/hosts with different IP
addresses, you can use a different network for cluster traffic and
replication.

On 5 Dec 2017 13:44, "Mark Adams" wrote:

I'm just trying out the zfs replication in proxmox, nice work! Just a few
questions..

- Is it possible to change the network that does the replication? (i.e. it
  would be good to use a direct connection with balance-rr for throughput)
- Is it possible to replicate between machines that are not in the same
  cluster?

Both can be easily done via zfs send/recv in the CLI of course, but I wonder
if this is possible through the web interface?

And lastly, what is the correct procedure for using a replicated VM, should
it be needed?
Thanks, Mark

_______________________________________________
pve-user mailing list
pve-user at pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user

From w.link at proxmox.com Wed Dec 6 08:05:05 2017
From: w.link at proxmox.com (Wolfgang Link)
Date: Wed, 6 Dec 2017 08:05:05 +0100 (CET)
Subject: [PVE-User] ZFS Replication
In-Reply-To: 
References: 
Message-ID: <1935809464.5.1512543905883@webmail.proxmox.com>

Hi Mark,

> - Is it possible to change the network that does the replication? (i.e. it
>   would be good to use a direct connection with balance-rr for throughput)

You can change the replication network with the 'migration' option in
datacenter.conf.

> - Is it possible to replicate between machines that are not in the same
>   cluster?

For this task you have to use pve-zsync.

> Both can be easily done via zfs send/recv in the CLI of course, but I wonder
> if this is possible through the web interface?

No, it is not.

From davel at upilab.com Wed Dec 6 17:56:11 2017
From: davel at upilab.com (David Lawley)
Date: Wed, 6 Dec 2017 11:56:11 -0500
Subject: [PVE-User] bridge issue after last update
Message-ID: <55d665ad-2fdc-83fe-36e1-31c8442558c6@upilab.com>

Have a single node server for a test bed sort of..

Applied updates this morning.

Afterward I lost connectivity between the network and bridged VMs

This was good practice as there was no pressure ;)

Anyway, I found that these items had been changed from 0 to 1

bridge-nf-call-arptables
bridge-nf-call-iptables
bridge-nf-call-ip6tables

Not sure how it got changed; I checked it against my production servers

Did something else happen that I missed that might have been part of Prox?
Just not yet clear what conditions would have done it. Or just a one-off
crap shoot that will never happen again?

proxmox-ve: 5.1-30 (running kernel: 4.13.8-3-pve)
pve-manager: 5.1-38 (running version: 5.1-38/1e9bc777)
pve-kernel-4.13.3-1-pve: 4.13.3-2
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.10.17-4-pve: 4.10.17-24
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.10.17-3-pve: 4.10.17-23
pve-kernel-4.10.17-1-pve: 4.10.17-18
libpve-http-server-perl: 2.0-7
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-17
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-22
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-3
pve-container: 2.0-17
pve-firewall: 3.0-4
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9

From andreas at mx20.org Wed Dec 6 18:32:00 2017
From: andreas at mx20.org (Andreas Herrmann)
Date: Wed, 6 Dec 2017 18:32:00 +0100
Subject: [PVE-User] bridge issue after last update
In-Reply-To: <55d665ad-2fdc-83fe-36e1-31c8442558c6@upilab.com>
References: <55d665ad-2fdc-83fe-36e1-31c8442558c6@upilab.com>
Message-ID: 

Hi there,

On 06.12.2017 17:56, David Lawley wrote:
> Have a single node server for a test bed sort of..
>
> Applied updates this morning.
>
> Afterward I lost connectivity between the network and bridged VMs
>
> This was good practice as there was no pressure ;)
>
> Anyway, I found that these items had been changed from 0 to 1
>
> bridge-nf-call-arptables
> bridge-nf-call-iptables
> bridge-nf-call-ip6tables
>
> Not sure how it got changed; I checked it against my production servers

ACK, but the problem is tricky:

/etc/sysctl.d/pve.conf was changed to /etc/sysctl.d/pve.conf/sysctl.conf
and is ignored.

Have a look at Manual page sysctl.conf(5): /etc/sysctl.d/*.conf

Andreas

From andreas at mx20.org Wed Dec 6 18:43:46 2017
From: andreas at mx20.org (Andreas Herrmann)
Date: Wed, 6 Dec 2017 18:43:46 +0100
Subject: [PVE-User] WARNING: Upgrade and Watchdog kills Server in HA-Mode
Message-ID: <5c946c6e-bfa9-7bf5-aa3f-59be6279fdb3@mx20.org>

Hi there,

be warned: the current update may reboot your server if in HA-Mode. It
happened on 2 of 5 servers!

The following packages will be upgraded:
libpve-common-perl (5.0-20 => 5.0-22)
libpve-http-server-perl (2.0-6 => 2.0-7)
libpve-storage-perl (5.0-16 => 5.0-17)
lxc-pve (2.1.0-2 => 2.1.1-2)
lxcfs (2.0.7-pve4 => 2.0.8-1)
pve-cluster (5.0-15 => 5.0-17)
pve-firewall (3.0-3 => 3.0-4)
pve-ha-manager (2.0-3 => 2.0-4)
pve-manager (5.1-36 => 5.1-38)
pve-qemu-kvm (2.9.1-2 => 2.9.1-3)
spiceterm (3.0-4 => 3.0-5)
vncterm (1.5-2 => 1.5-3)

Installing new version of config file /etc/rc.d/init.d/lxcfs ...
Processing triggers for man-db (2.7.6.1-2) ...
Setting up pve-cluster (5.0-17) ...
Removing obsolete conffile /etc/default/pve-cluster ...
Setting up pve-firewall (3.0-4) ...
Setting up lxc-pve (2.1.1-2) ...
Installing new version of config file
/etc/apparmor.d/abstractions/lxc/container-base ...
Setting up libpve-http-server-perl (2.0-7) ...
Setting up libpve-storage-perl (5.0-17) ...
Setting up pve-ha-manager (2.0-4) ...
watchdog-mux.service is a disabled or a static unit, not starting it.
Setting up pve-manager (5.1-38) ...
Installing new version of config file /etc/logrotate.d/pve ...
....
REBOOT

root at nethcn-b1:~# apt-get upgrade
E: dpkg was interrupted, you must manually run 'dpkg --configure -a' to
correct the problem.
root at nethcn-b1:~# dpkg --configure -a
Setting up pve-manager (5.1-38) ...
Processing triggers for libc-bin (2.24-11+deb9u1) ...

Andreas

From davel at upilab.com Wed Dec 6 19:25:35 2017
From: davel at upilab.com (David Lawley)
Date: Wed, 6 Dec 2017 13:25:35 -0500
Subject: [PVE-User] bridge issue after last update
In-Reply-To: 
References: <55d665ad-2fdc-83fe-36e1-31c8442558c6@upilab.com>
Message-ID: <91bed31a-9083-8772-cef6-45808cbf89f5@upilab.com>

On 12/6/2017 12:32 PM, Andreas Herrmann wrote:

OK, got it. I see the area you are talking about

Guess it must be missing it, as fs.aio-max-nr is incorrect too.

sysctl -a is showing fs.aio-max-nr = 65536

pve.conf is supposed to set it to fs.aio-max-nr = 1048576

My install may be botched, it's been inop a few times since it's an older
server that I have had to fall back kernel versions once or twice, since 5.1
has been hit/miss on some older hardware...

> ACK, but the problem is tricky:
>
> /etc/sysctl.d/pve.conf was changed to /etc/sysctl.d/pve.conf/sysctl.conf
> and is ignored.
>
> Have a look at Manual page sysctl.conf(5): /etc/sysctl.d/*.conf
>
> Andreas
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>

From f.gruenbichler at proxmox.com Wed Dec 6 20:38:43 2017
From: f.gruenbichler at proxmox.com (Fabian Grünbichler)
Date: Wed, 6 Dec 2017 20:38:43 +0100
Subject: [PVE-User] bridge issue after last update
In-Reply-To: <91bed31a-9083-8772-cef6-45808cbf89f5@upilab.com>
References: <55d665ad-2fdc-83fe-36e1-31c8442558c6@upilab.com> <91bed31a-9083-8772-cef6-45808cbf89f5@upilab.com>
Message-ID: <20171206193843.e5mfuygx26lybepr@nora.maurer-it.com>

On Wed, Dec 06, 2017 at 01:25:35PM -0500, David Lawley wrote:
>
> On 12/6/2017 12:32 PM, Andreas Herrmann wrote:
>
> OK, got it. I see the area you are talking about
>
> Guess it must be missing it, as fs.aio-max-nr is incorrect too.
>
> sysctl -a is showing fs.aio-max-nr = 65536
>
> pve.conf is supposed to set it to fs.aio-max-nr = 1048576
>
> My install may be botched, it's been inop a few times since it's an older
> server that I have had to fall back kernel versions once or twice, since 5.1
> has been hit/miss on some older hardware...
>
> > ACK, but the problem is tricky:
> >
> > /etc/sysctl.d/pve.conf was changed to /etc/sysctl.d/pve.conf/sysctl.conf
> > and is ignored.
> >
> > Have a look at Manual page sysctl.conf(5): /etc/sysctl.d/*.conf
> >
> > Andreas

That is a bug that slipped through while refactoring the packaging of
pve-cluster; I'll send a patch to pve-devel and updated packages will be
available tomorrow!

From t.lamprecht at proxmox.com Thu Dec 7 08:57:56 2017
From: t.lamprecht at proxmox.com (Thomas Lamprecht)
Date: Thu, 7 Dec 2017 08:57:56 +0100
Subject: [PVE-User] WARNING: Upgrade and Watchdog kills Server in HA-Mode
In-Reply-To: <5c946c6e-bfa9-7bf5-aa3f-59be6279fdb3@mx20.org>
References: <5c946c6e-bfa9-7bf5-aa3f-59be6279fdb3@mx20.org>
Message-ID: <6e4940d4-6c10-f253-7dad-f93959c111fc@proxmox.com>

Hi,

some more information would be great to check this.
First, do you have a daemon(like) service loading sysctl configs on the fly?
If not, we may rule out the sysctl config problem as a trigger for this.

On 12/06/2017 06:43 PM, Andreas Herrmann wrote:
> Hi there,
>
> be warned: the current update may reboot your server if in HA-Mode. It
> happened on 2 of 5 servers!
>
> The following packages will be upgraded:
> libpve-common-perl (5.0-20 => 5.0-22)
> libpve-http-server-perl (2.0-6 => 2.0-7)
> libpve-storage-perl (5.0-16 => 5.0-17)
> lxc-pve (2.1.0-2 => 2.1.1-2)
> lxcfs (2.0.7-pve4 => 2.0.8-1)
> pve-cluster (5.0-15 => 5.0-17)
> pve-firewall (3.0-3 => 3.0-4)

Can you describe your firewall setup a bit?
Do you use Firewall groups?

> pve-ha-manager (2.0-3 => 2.0-4)
> pve-manager (5.1-36 => 5.1-38)
> pve-qemu-kvm (2.9.1-2 => 2.9.1-3)
> spiceterm (3.0-4 => 3.0-5)
> vncterm (1.5-2 => 1.5-3)
>
>
> Installing new version of config file /etc/rc.d/init.d/lxcfs ...
> Processing triggers for man-db (2.7.6.1-2) ...
> Setting up pve-cluster (5.0-17) ...
> Removing obsolete conffile /etc/default/pve-cluster ...
> Setting up pve-firewall (3.0-4) ...
> Setting up lxc-pve (2.1.1-2) ...
> Installing new version of config file
> /etc/apparmor.d/abstractions/lxc/container-base ...
> Setting up libpve-http-server-perl (2.0-7) ...
> Setting up libpve-storage-perl (5.0-17) ...
> Setting up pve-ha-manager (2.0-4) ...
> watchdog-mux.service is a disabled or a static unit, not starting it.
> Setting up pve-manager (5.1-38) ...
> Installing new version of config file /etc/logrotate.d/pve ...
> ....
> REBOOT
>

Did you get some log entries around that time? Or a persistent journal?

thanks,
Thomas

>
> root at nethcn-b1:~# apt-get upgrade
> E: dpkg was interrupted, you must manually run 'dpkg --configure -a' to
> correct the problem.
> root at nethcn-b1:~# dpkg --configure -a
> Setting up pve-manager (5.1-38) ...
> Processing triggers for libc-bin (2.24-11+deb9u1) ...
>
>
> Andreas
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>

From andreas at mx20.org Thu Dec 7 13:08:38 2017
From: andreas at mx20.org (Andreas Herrmann)
Date: Thu, 7 Dec 2017 13:08:38 +0100
Subject: [PVE-User] WARNING: Upgrade and Watchdog kills Server in HA-Mode
In-Reply-To: <6e4940d4-6c10-f253-7dad-f93959c111fc@proxmox.com>
References: <5c946c6e-bfa9-7bf5-aa3f-59be6279fdb3@mx20.org> <6e4940d4-6c10-f253-7dad-f93959c111fc@proxmox.com>
Message-ID: 

Hi,

On 07.12.2017 08:57, Thomas Lamprecht wrote:
> some more information would be great to check this.
> First, do you have a daemon(like) service loading sysctl configs on the
> fly? If not, we may rule out the sysctl config problem as a trigger for
> this.

No. It's a quite new installation from the ISO, not an upgrade from
Proxmox 4, and with very few modifications.

> Can you describe your firewall setup a bit?
> Do you use Firewall groups?

We don't use the Proxmox firewall at all. We have uif-based rules and no
restrictions between the Proxmox hosts:

# Access between the nodes
in+ s=nethcn-b-vl58(4),nethcn-b-vl802(4)

# The two Corosync HA rings
in+ i=coro1 s=nethcn-b-ha1(4)
in+ i=coro2 s=nethcn-b-ha2(4)

# Ceph traffic
in+ i=ceph s=nethcn-b-store(4)

> Did you get some log entries around that time?
> Or a persistent journal?

Some logs are attached. nethcn-b5 rebooted after I restarted services with
needrestart. nethcn-b4 rebooted in the middle of the update. Maybe it is a
problem with communication between watchdog-mux.service and Proxmox. Maybe I
should change to the hardware watchdog provided by the Supermicro X10SRW-F
mainboard.

Andreas
-------------- next part --------------
Dec 6 17:51:08 nethcn-b2 systemd[1]: Created slice User Slice of root.
Dec 6 17:51:08 nethcn-b2 systemd[1]: Starting User Manager for UID 0...
Dec 6 17:51:08 nethcn-b2 systemd[1]: Started Session 2294 of user root.
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Listening on GnuPG cryptographic agent and passphrase cache.
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Reached target Paths.
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Reached target Timers.
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Listening on GnuPG network certificate management daemon.
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Listening on GnuPG cryptographic agent (access for web browsers).
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Reached target Sockets.
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Reached target Basic System.
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Reached target Default.
Dec 6 17:51:08 nethcn-b2 systemd[25841]: Startup finished in 21ms.
Dec 6 17:51:08 nethcn-b2 systemd[1]: Started User Manager for UID 0.
Dec 6 17:51:14 nethcn-b2 systemd[1]: Reloading.
Dec 6 17:51:14 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 15min 25.622828s random time. Dec 6 17:51:14 nethcn-b2 systemd[1]: apt-daily.timer: Adding 6h 6min 27.629758s random time. Dec 6 17:51:14 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 24min 42.371776s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily.timer: Adding 10h 23min 49.731837s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 25min 49.899301s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily.timer: Adding 1h 1min 44.339369s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 53min 41.700970s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily.timer: Adding 4h 19min 32.155871s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 33min 33.939842s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily.timer: Adding 10h 3min 29.743451s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 26min 34.968617s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily.timer: Adding 10h 29min 18.753427s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 28min 47.463310s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily.timer: Adding 1h 32min 44.821502s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 15min 11.470765s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily.timer: Adding 43min 12.485912s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 11min 29.546795s random time. Dec 6 17:51:15 nethcn-b2 systemd[1]: apt-daily.timer: Adding 2h 35min 42.196692s random time. Dec 6 17:51:16 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:17 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 49min 31.062780s random time. Dec 6 17:51:17 nethcn-b2 systemd[1]: apt-daily.timer: Adding 4h 32min 30.982647s random time. Dec 6 17:51:17 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:17 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 45min 39.993857s random time. Dec 6 17:51:17 nethcn-b2 systemd[1]: apt-daily.timer: Adding 3h 8min 26.608575s random time. Dec 6 17:51:17 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:17 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 37min 6.641514s random time. Dec 6 17:51:17 nethcn-b2 systemd[1]: apt-daily.timer: Adding 11h 34min 54.498924s random time. Dec 6 17:51:18 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:18 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 8min 17.506967s random time. Dec 6 17:51:18 nethcn-b2 systemd[1]: apt-daily.timer: Adding 3h 55min 54.889100s random time. Dec 6 17:51:18 nethcn-b2 systemd[1]: Stopping The Proxmox VE cluster filesystem... Dec 6 17:51:18 nethcn-b2 pmxcfs[9987]: [main] notice: teardown filesystem Dec 6 17:51:20 nethcn-b2 pmxcfs[9987]: [main] notice: exit proxmox configuration filesystem (0) Dec 6 17:51:20 nethcn-b2 systemd[1]: Stopped The Proxmox VE cluster filesystem. Dec 6 17:51:20 nethcn-b2 systemd[1]: Starting The Proxmox VE cluster filesystem... 
Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [status] notice: update cluster info (cluster name NETHCN-B, version = 7) Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [status] notice: node has quorum Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [dcdb] notice: members: 1/10104, 2/28566, 3/29106, 4/30188, 5/10652 Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [dcdb] notice: starting data syncronisation Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [dcdb] notice: received sync request (epoch 1/10104/00000011) Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [status] notice: members: 1/10104, 2/28566, 3/29106, 4/30188, 5/10652 Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [status] notice: starting data syncronisation Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [status] notice: received sync request (epoch 1/10104/00000011) Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [dcdb] notice: received all states Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [dcdb] notice: leader is 1/10104 Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [dcdb] notice: synced members: 1/10104, 2/28566, 3/29106, 4/30188, 5/10652 Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [dcdb] notice: all data is up to date Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [status] notice: received all states Dec 6 17:51:20 nethcn-b2 pmxcfs[28566]: [status] notice: all data is up to date Dec 6 17:51:20 nethcn-b2 pve-ha-crm[10842]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:51:20 nethcn-b2 pve-ha-crm[10842]: ipcc_send_rec[2] failed: Connection refused Dec 6 17:51:20 nethcn-b2 pve-ha-crm[10842]: ipcc_send_rec[3] failed: Connection refused Dec 6 17:51:20 nethcn-b2 pve-ha-crm[10842]: ERROR: Connection refused Dec 6 17:51:20 nethcn-b2 pve-ha-crm[10842]: server received shutdown request Dec 6 17:51:20 nethcn-b2 pve-ha-crm[10842]: server stopped Dec 6 17:51:20 nethcn-b2 watchdog-mux[3397]: client did not stop watchdog - disable watchdog updates Dec 6 17:51:20 nethcn-b2 systemd[1]: pve-ha-crm.service: Main process exited, code=exited, status=255/n/a Dec 6 17:51:21 nethcn-b2 systemd[1]: Started The Proxmox VE cluster filesystem. Dec 6 17:51:21 nethcn-b2 systemd[1]: Reloading Proxmox VE firewall. Dec 6 17:51:21 nethcn-b2 systemd[1]: pve-ha-crm.service: Unit entered failed state. Dec 6 17:51:21 nethcn-b2 systemd[1]: pve-ha-crm.service: Failed with result 'exit-code'. Dec 6 17:51:21 nethcn-b2 pve-ha-lrm[13145]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:51:21 nethcn-b2 watchdog-mux[3397]: exit watchdog-mux with active connections Dec 6 17:51:21 nethcn-b2 kernel: [88876.361477] watchdog: watchdog0: watchdog did not stop! Dec 6 17:51:21 nethcn-b2 pve-firewall[28714]: send HUP to 10566 Dec 6 17:51:21 nethcn-b2 pve-firewall[10566]: received signal HUP Dec 6 17:51:21 nethcn-b2 pve-firewall[10566]: server shutdown (restart) Dec 6 17:51:21 nethcn-b2 systemd[1]: Reloaded Proxmox VE firewall. Dec 6 17:51:22 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:22 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 43min 57.561346s random time. Dec 6 17:51:22 nethcn-b2 systemd[1]: apt-daily.timer: Adding 1h 53min 46.711159s random time. Dec 6 17:51:22 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:22 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 23min 666.457ms random time. Dec 6 17:51:22 nethcn-b2 systemd[1]: apt-daily.timer: Adding 5h 40min 48.607339s random time. Dec 6 17:51:22 nethcn-b2 systemd[1]: Stopping Proxmox VE firewall logger... 
Dec 6 17:51:22 nethcn-b2 pvepw-logger[22772]: received terminate request (signal) Dec 6 17:51:22 nethcn-b2 pvepw-logger[22772]: stopping pvefw logger Dec 6 17:51:22 nethcn-b2 pve-firewall[10566]: restarting server Dec 6 17:51:22 nethcn-b2 systemd[1]: Stopped Proxmox VE firewall logger. Dec 6 17:51:22 nethcn-b2 systemd[1]: Starting Proxmox VE firewall logger... Dec 6 17:51:22 nethcn-b2 pvefw-logger[28896]: starting pvefw logger Dec 6 17:51:22 nethcn-b2 systemd[1]: Started Proxmox VE firewall logger. Dec 6 17:51:22 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:22 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 16min 39.381721s random time. Dec 6 17:51:22 nethcn-b2 systemd[1]: apt-daily.timer: Adding 1h 23min 56.458060s random time. Dec 6 17:51:22 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:22 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 53min 22.748230s random time. Dec 6 17:51:22 nethcn-b2 systemd[1]: apt-daily.timer: Adding 4h 43min 27.611334s random time. Dec 6 17:51:22 nethcn-b2 kernel: [88877.490542] audit: type=1400 audit(1512579082.769:14): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/lxc-start" pid=28947 comm="apparmor_parser" Dec 6 17:51:22 nethcn-b2 kernel: [88877.677013] audit: type=1400 audit(1512579082.955:15): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default" pid=28951 comm="apparmor_parser" Dec 6 17:51:22 nethcn-b2 kernel: [88877.693940] audit: type=1400 audit(1512579082.955:16): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default-cgns" pid=28951 comm="apparmor_parser" Dec 6 17:51:22 nethcn-b2 kernel: [88877.711368] audit: type=1400 audit(1512579082.956:17): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default-with-mounting" pid=28951 comm="apparmor_parser" Dec 6 17:51:23 nethcn-b2 kernel: [88877.729675] audit: type=1400 audit(1512579082.956:18): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default-with-nesting" pid=28951 comm="apparmor_parser" Dec 6 17:51:23 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:23 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 8min 32.222778s random time. Dec 6 17:51:23 nethcn-b2 systemd[1]: apt-daily.timer: Adding 9h 48min 39.647146s random time. Dec 6 17:51:23 nethcn-b2 pvestatd[10618]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:51:23 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:24 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 36min 23.950978s random time. Dec 6 17:51:24 nethcn-b2 systemd[1]: apt-daily.timer: Adding 7h 27min 37.724293s random time. Dec 6 17:51:24 nethcn-b2 systemd[1]: Reloading. Dec 6 17:51:24 nethcn-b2 systemd[1]: apt-daily-upgrade.timer: Adding 28min 57.947572s random time. Dec 6 17:51:24 nethcn-b2 systemd[1]: apt-daily.timer: Adding 3h 8min 51.079398s random time. Dec 6 17:51:24 nethcn-b2 systemd[1]: Started Session 2296 of user root. Dec 6 17:51:24 nethcn-b2 systemd[1]: Stopping PVE Local HA Ressource Manager Daemon... 
Dec 6 17:51:24 nethcn-b2 pve-ha-lrm[13145]: received signal TERM Dec 6 17:51:24 nethcn-b2 pve-ha-lrm[13145]: restart LRM, freeze all services -------------- next part -------------- Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [dcdb] notice: members: 1/10104, 2/9987, 3/9969, 4/9839 Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [dcdb] notice: starting data syncronisation Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [status] notice: members: 1/10104, 2/9987, 3/9969, 4/9839 Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [status] notice: starting data syncronisation Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [dcdb] notice: received sync request (epoch 1/10104/00000008) Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [status] notice: received sync request (epoch 1/10104/00000008) Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [dcdb] notice: received all states Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [dcdb] notice: leader is 1/10104 Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [dcdb] notice: synced members: 1/10104, 2/9987, 3/9969, 4/9839 Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [dcdb] notice: all data is up to date Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [status] notice: received all states Dec 6 17:27:33 nethcn-b2 pmxcfs[9987]: [status] notice: all data is up to date Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [dcdb] notice: members: 1/10104, 2/9987, 3/9969, 4/9839, 5/14789 Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [dcdb] notice: starting data syncronisation Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [status] notice: members: 1/10104, 2/9987, 3/9969, 4/9839, 5/14789 Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [status] notice: starting data syncronisation Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [dcdb] notice: received sync request (epoch 1/10104/00000009) Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [status] notice: received sync request (epoch 1/10104/00000009) Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [dcdb] notice: received all states Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [dcdb] notice: leader is 1/10104 Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [dcdb] notice: synced members: 1/10104, 2/9987, 3/9969, 4/9839, 5/14789 Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [dcdb] notice: all data is up to date Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [status] notice: received all states Dec 6 17:27:34 nethcn-b2 pmxcfs[9987]: [status] notice: all data is up to date Dec 6 17:28:00 nethcn-b2 systemd[1]: Starting Proxmox VE replication runner... Dec 6 17:28:01 nethcn-b2 systemd[1]: Started Proxmox VE replication runner. Dec 6 17:28:01 nethcn-b2 CRON[21670]: (root) CMD ( sleep $((RANDOM % 20)); /usr/local/sbin/check_ipmi.sh) Dec 6 17:28:20 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:20Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:28:23 nethcn-b2 corosync[10210]: notice [TOTEM ] A new membership (192.168.112.1:2524) was formed. Members left: 5 Dec 6 17:28:23 nethcn-b2 corosync[10210]: notice [TOTEM ] Failed to receive the leave message. failed: 5 Dec 6 17:28:23 nethcn-b2 corosync[10210]: [TOTEM ] A new membership (192.168.112.1:2524) was formed. Members left: 5 Dec 6 17:28:23 nethcn-b2 corosync[10210]: [TOTEM ] Failed to receive the leave message. 
failed: 5 Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [dcdb] notice: members: 1/10104, 2/9987, 3/9969, 4/9839 Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [dcdb] notice: starting data syncronisation Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [status] notice: members: 1/10104, 2/9987, 3/9969, 4/9839 Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [status] notice: starting data syncronisation Dec 6 17:28:23 nethcn-b2 corosync[10210]: notice [QUORUM] Members[4]: 1 2 3 4 Dec 6 17:28:23 nethcn-b2 corosync[10210]: notice [MAIN ] Completed service synchronization, ready to provide service. Dec 6 17:28:23 nethcn-b2 corosync[10210]: [QUORUM] Members[4]: 1 2 3 4 Dec 6 17:28:23 nethcn-b2 corosync[10210]: [MAIN ] Completed service synchronization, ready to provide service. Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [dcdb] notice: received sync request (epoch 1/10104/0000000A) Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [status] notice: received sync request (epoch 1/10104/0000000A) Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [dcdb] notice: received all states Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [dcdb] notice: leader is 1/10104 Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [dcdb] notice: synced members: 1/10104, 2/9987, 3/9969, 4/9839 Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [dcdb] notice: all data is up to date Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [dcdb] notice: dfsm_deliver_queue: queue length 11 Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [status] notice: received all states Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [status] notice: all data is up to date Dec 6 17:28:23 nethcn-b2 pmxcfs[9987]: [status] notice: dfsm_deliver_queue: queue length 26 Dec 6 17:28:24 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:24Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:28:28 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:28Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:28:32 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:32Z E! 
Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:28:32 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:32.057623 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6815 osd.32 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:12.057620) Dec 6 17:28:32 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:32.057651 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6803 osd.33 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:12.057620) Dec 6 17:28:32 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:32.057658 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6807 osd.34 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:12.057620) Dec 6 17:28:32 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:32.057665 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6827 osd.35 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:12.057620) Dec 6 17:28:32 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:32.057672 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6811 osd.36 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:12.057620) Dec 6 17:28:32 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:32.057681 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6819 osd.37 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:12.057620) Dec 6 17:28:32 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:32.057688 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6823 osd.38 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:12.057620) Dec 6 17:28:32 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:32.057694 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6831 osd.39 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:12.057620) Dec 6 17:28:33 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:33.058175 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6815 osd.32 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:13.058171) Dec 6 17:28:33 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:33.058198 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6803 osd.33 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:13.058171) Dec 6 17:28:33 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:33.058212 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6807 osd.34 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:13.058171) Dec 6 17:28:33 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:33.058224 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6827 osd.35 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:13.058171) Dec 6 17:28:33 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:33.058238 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6811 osd.36 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:13.058171) Dec 6 17:28:33 nethcn-b2 
ceph-osd[10845]: 2017-12-06 17:28:33.058250 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6819 osd.37 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:13.058171) Dec 6 17:28:33 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:33.058263 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6823 osd.38 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:13.058171) Dec 6 17:28:33 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:33.058274 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6831 osd.39 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:13.058171) Dec 6 17:28:33 nethcn-b2 pvestatd[10618]: status update time (9.911 seconds) Dec 6 17:28:34 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:34.058434 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6815 osd.32 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:14.058430) Dec 6 17:28:34 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:34.058444 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6803 osd.33 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:14.058430) Dec 6 17:28:34 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:34.058447 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6807 osd.34 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:14.058430) Dec 6 17:28:34 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:34.058449 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6827 osd.35 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:14.058430) Dec 6 17:28:34 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:34.058454 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6811 osd.36 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:14.058430) Dec 6 17:28:34 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:34.058456 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6819 osd.37 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:14.058430) Dec 6 17:28:34 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:34.058458 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6823 osd.38 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:14.058430) Dec 6 17:28:34 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:34.058460 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6831 osd.39 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:14.058430) Dec 6 17:28:34 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:34.382846 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6815 osd.32 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:14.382840) Dec 6 17:28:34 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:34.382872 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6803 osd.33 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:14.382840) Dec 6 17:28:34 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:34.382880 
7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6807 osd.34 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:14.382840) Dec 6 17:28:34 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:34.382890 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6827 osd.35 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:14.382840) Dec 6 17:28:34 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:34.382899 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6811 osd.36 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:14.382840) Dec 6 17:28:34 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:34.382906 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6819 osd.37 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:14.382840) Dec 6 17:28:34 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:34.382912 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6823 osd.38 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:14.382840) Dec 6 17:28:34 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:34.382918 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6831 osd.39 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:14.382840) Dec 6 17:28:35 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:35.058560 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6815 osd.32 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:15.058555) Dec 6 17:28:35 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:35.058575 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6803 osd.33 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:15.058555) Dec 6 17:28:35 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:35.058578 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6807 osd.34 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:15.058555) Dec 6 17:28:35 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:35.058599 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6827 osd.35 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:15.058555) Dec 6 17:28:35 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:35.058602 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6811 osd.36 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:15.058555) Dec 6 17:28:35 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:35.058604 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6819 osd.37 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:15.058555) Dec 6 17:28:35 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:35.058606 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6823 osd.38 since back 2017-12-06 17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:15.058555) Dec 6 17:28:35 nethcn-b2 ceph-osd[10845]: 2017-12-06 17:28:35.058609 7fbf4f84c700 -1 osd.11 37761 heartbeat_check: no reply from 192.168.112.135:6831 osd.39 since back 2017-12-06 
17:28:11.439837 front 2017-12-06 17:28:11.439837 (cutoff 2017-12-06 17:28:15.058555) Dec 6 17:28:35 nethcn-b2 ceph-osd[11486]: 2017-12-06 17:28:35.202181 7f38ff3a1700 -1 osd.10 37761 heartbeat_check: no reply from 192.168.112.135:6815 osd.32 since back 2017-12-06 17:28:14.631386 front 2017-12-06 17:28:14.631386 (cutoff 2017-12-06 17:28:15.202177) Dec 6 17:28:35 nethcn-b2 ceph-osd[11486]: 2017-12-06 17:28:35.202194 7f38ff3a1700 -1 osd.10 37761 heartbeat_check: no reply from 192.168.112.135:6803 osd.33 since back 2017-12-06 17:28:14.631386 front 2017-12-06 17:28:14.631386 (cutoff 2017-12-06 17:28:15.202177) Dec 6 17:28:35 nethcn-b2 ceph-osd[11486]: 2017-12-06 17:28:35.202199 7f38ff3a1700 -1 osd.10 37761 heartbeat_check: no reply from 192.168.112.135:6807 osd.34 since back 2017-12-06 17:28:14.631386 front 2017-12-06 17:28:14.631386 (cutoff 2017-12-06 17:28:15.202177) Dec 6 17:28:35 nethcn-b2 ceph-osd[11486]: 2017-12-06 17:28:35.202202 7f38ff3a1700 -1 osd.10 37761 heartbeat_check: no reply from 192.168.112.135:6827 osd.35 since back 2017-12-06 17:28:14.631386 front 2017-12-06 17:28:14.631386 (cutoff 2017-12-06 17:28:15.202177) Dec 6 17:28:35 nethcn-b2 ceph-osd[11486]: 2017-12-06 17:28:35.202205 7f38ff3a1700 -1 osd.10 37761 heartbeat_check: no reply from 192.168.112.135:6811 osd.36 since back 2017-12-06 17:28:14.631386 front 2017-12-06 17:28:14.631386 (cutoff 2017-12-06 17:28:15.202177) Dec 6 17:28:35 nethcn-b2 ceph-osd[11486]: 2017-12-06 17:28:35.202207 7f38ff3a1700 -1 osd.10 37761 heartbeat_check: no reply from 192.168.112.135:6819 osd.37 since back 2017-12-06 17:28:14.631386 front 2017-12-06 17:28:14.631386 (cutoff 2017-12-06 17:28:15.202177) Dec 6 17:28:35 nethcn-b2 ceph-osd[11486]: 2017-12-06 17:28:35.202210 7f38ff3a1700 -1 osd.10 37761 heartbeat_check: no reply from 192.168.112.135:6823 osd.38 since back 2017-12-06 17:28:14.631386 front 2017-12-06 17:28:14.631386 (cutoff 2017-12-06 17:28:15.202177) Dec 6 17:28:35 nethcn-b2 ceph-osd[11486]: 2017-12-06 17:28:35.202212 7f38ff3a1700 -1 osd.10 37761 heartbeat_check: no reply from 192.168.112.135:6831 osd.39 since back 2017-12-06 17:28:14.631386 front 2017-12-06 17:28:14.631386 (cutoff 2017-12-06 17:28:15.202177) Dec 6 17:28:35 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:35Z E! InfluxDB Output Error: Post http://influxdb-b1.as6724.net:8086/write?consistency=any&db=noc_nethcn_telegraf: net/http: request canceled (Client.Timeout exceeded while awaiting headers) Dec 6 17:28:35 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:35Z E! 
Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster Dec 6 17:28:35 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:35.383153 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6815 osd.32 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:15.383148) Dec 6 17:28:35 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:35.383170 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6803 osd.33 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:15.383148) Dec 6 17:28:35 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:35.383173 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6807 osd.34 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:15.383148) Dec 6 17:28:35 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:35.383178 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6827 osd.35 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:15.383148) Dec 6 17:28:35 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:35.383180 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6811 osd.36 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:15.383148) Dec 6 17:28:35 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:35.383183 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6819 osd.37 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:15.383148) Dec 6 17:28:35 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:35.383185 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6823 osd.38 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:15.383148) Dec 6 17:28:35 nethcn-b2 ceph-osd[11856]: 2017-12-06 17:28:35.383187 7f308a81d700 -1 osd.14 37761 heartbeat_check: no reply from 192.168.112.135:6831 osd.39 since back 2017-12-06 17:28:13.944993 front 2017-12-06 17:28:13.944993 (cutoff 2017-12-06 17:28:15.383148) Dec 6 17:28:35 nethcn-b2 ceph-osd[11590]: 2017-12-06 17:28:35.471821 7f6a57e74700 -1 osd.12 37761 heartbeat_check: no reply from 192.168.112.135:6815 osd.32 since back 2017-12-06 17:28:14.871165 front 2017-12-06 17:28:14.871165 (cutoff 2017-12-06 17:28:15.471814) Dec 6 17:28:35 nethcn-b2 ceph-osd[11590]: 2017-12-06 17:28:35.471853 7f6a57e74700 -1 osd.12 37761 heartbeat_check: no reply from 192.168.112.135:6803 osd.33 since back 2017-12-06 17:28:14.871165 front 2017-12-06 17:28:14.871165 (cutoff 2017-12-06 17:28:15.471814) Dec 6 17:28:35 nethcn-b2 ceph-osd[11590]: 2017-12-06 17:28:35.471861 7f6a57e74700 -1 osd.12 37761 heartbeat_check: no reply from 192.168.112.135:6807 osd.34 since back 2017-12-06 17:28:14.871165 front 2017-12-06 17:28:14.871165 (cutoff 2017-12-06 17:28:15.471814) Dec 6 17:28:35 nethcn-b2 ceph-osd[11590]: 2017-12-06 17:28:35.471871 7f6a57e74700 -1 osd.12 37761 heartbeat_check: no reply from 192.168.112.135:6827 osd.35 since back 2017-12-06 17:28:14.871165 front 2017-12-06 17:28:14.871165 (cutoff 2017-12-06 17:28:15.471814) Dec 6 17:28:35 nethcn-b2 ceph-osd[11590]: 2017-12-06 17:28:35.471877 7f6a57e74700 -1 osd.12 37761 heartbeat_check: no reply from 192.168.112.135:6811 osd.36 since back 2017-12-06 17:28:14.871165 front 2017-12-06 17:28:14.871165 (cutoff 2017-12-06 17:28:15.471814) Dec 6 17:28:35 nethcn-b2 
ceph-osd[11590]: 2017-12-06 17:28:35.471888 7f6a57e74700 -1 osd.12 37761 heartbeat_check: no reply from 192.168.112.135:6819 osd.37 since back 2017-12-06 17:28:14.871165 front 2017-12-06 17:28:14.871165 (cutoff 2017-12-06 17:28:15.471814) Dec 6 17:28:35 nethcn-b2 ceph-osd[11590]: 2017-12-06 17:28:35.471909 7f6a57e74700 -1 osd.12 37761 heartbeat_check: no reply from 192.168.112.135:6823 osd.38 since back 2017-12-06 17:28:14.871165 front 2017-12-06 17:28:14.871165 (cutoff 2017-12-06 17:28:15.471814) Dec 6 17:28:35 nethcn-b2 ceph-osd[11590]: 2017-12-06 17:28:35.471916 7f6a57e74700 -1 osd.12 37761 heartbeat_check: no reply from 192.168.112.135:6831 osd.39 since back 2017-12-06 17:28:14.871165 front 2017-12-06 17:28:14.871165 (cutoff 2017-12-06 17:28:15.471814) Dec 6 17:28:35 nethcn-b2 kernel: [87510.180709] libceph: osd32 down Dec 6 17:28:35 nethcn-b2 kernel: [87510.184066] libceph: osd33 down Dec 6 17:28:35 nethcn-b2 kernel: [87510.187436] libceph: osd34 down Dec 6 17:28:35 nethcn-b2 kernel: [87510.190854] libceph: osd35 down Dec 6 17:28:35 nethcn-b2 kernel: [87510.194260] libceph: osd36 down Dec 6 17:28:35 nethcn-b2 kernel: [87510.197709] libceph: osd37 down Dec 6 17:28:35 nethcn-b2 kernel: [87510.201060] libceph: osd38 down Dec 6 17:28:35 nethcn-b2 kernel: [87510.204407] libceph: osd39 down Dec 6 17:28:36 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:36Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:28:40 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:40Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:28:44 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:44Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:28:48 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:48Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:28:52 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:52Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:28:56 nethcn-b2 telegraf[29224]: 2017-12-06T16:28:56Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:29:00 nethcn-b2 telegraf[29224]: 2017-12-06T16:29:00Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:29:00 nethcn-b2 systemd[1]: Starting Proxmox VE replication runner... Dec 6 17:29:01 nethcn-b2 systemd[1]: Started Proxmox VE replication runner. Dec 6 17:29:04 nethcn-b2 telegraf[29224]: 2017-12-06T16:29:04Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:29:08 nethcn-b2 telegraf[29224]: 2017-12-06T16:29:08Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:29:12 nethcn-b2 telegraf[29224]: 2017-12-06T16:29:12Z E! Error in plugin [inputs.ceph]: took longer to collect than collection interval (4s) Dec 6 17:29:29 nethcn-b2 nullmailer[914]: Rescanning queue. Dec 6 17:29:55 nethcn-b2 corosync[10210]: notice [TOTEM ] A new membership (192.168.112.1:2528) was formed. Members joined: 5 Dec 6 17:29:55 nethcn-b2 corosync[10210]: [TOTEM ] A new membership (192.168.112.1:2528) was formed. Members joined: 5 Dec 6 17:29:59 nethcn-b2 corosync[10210]: notice [TOTEM ] Retransmit List: 4 Dec 6 17:29:59 nethcn-b2 corosync[10210]: [TOTEM ] Retransmit List: 4 -------------- next part -------------- Dec 6 17:27:20 nethcn-b5 systemd[1]: Created slice User Slice of root. 
Dec 6 17:27:20 nethcn-b5 systemd[1]: Starting User Manager for UID 0... Dec 6 17:27:20 nethcn-b5 systemd[1]: Started Session 2272 of user root. Dec 6 17:27:20 nethcn-b5 systemd[9782]: Listening on GnuPG cryptographic agent (access for web browsers). Dec 6 17:27:20 nethcn-b5 systemd[9782]: Listening on GnuPG network certificate management daemon. Dec 6 17:27:20 nethcn-b5 systemd[9782]: Listening on GnuPG cryptographic agent (ssh-agent emulation). Dec 6 17:27:20 nethcn-b5 systemd[9782]: Reached target Paths. Dec 6 17:27:20 nethcn-b5 systemd[9782]: Reached target Timers. Dec 6 17:27:20 nethcn-b5 systemd[9782]: Listening on GnuPG cryptographic agent and passphrase cache. Dec 6 17:27:20 nethcn-b5 systemd[9782]: Listening on GnuPG cryptographic agent and passphrase cache (restricted). Dec 6 17:27:20 nethcn-b5 systemd[9782]: Reached target Sockets. Dec 6 17:27:20 nethcn-b5 systemd[9782]: Reached target Basic System. Dec 6 17:27:20 nethcn-b5 systemd[9782]: Reached target Default. Dec 6 17:27:20 nethcn-b5 systemd[9782]: Startup finished in 19ms. Dec 6 17:27:20 nethcn-b5 systemd[1]: Started User Manager for UID 0. Dec 6 17:27:24 nethcn-b5 kernel: [88684.296467] FW INVALID STATE: IN=vlan31 OUT= MAC=24:8a:07:20:c5:56:24:8a:07:20:c5:5e:08:00 SRC=192.168.112.131 DST=192.168.112.135 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=27581 DF PROTO=TCP SPT=34568 DPT=6789 WINDOW=0 RES=0x00 RST URGP=0 Dec 6 17:27:30 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:30 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 54min 17.746014s random time. Dec 6 17:27:30 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:30 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 7min 30.184150s random time. Dec 6 17:27:30 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:30 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 9min 42.427373s random time. Dec 6 17:27:30 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:30 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 35min 49.985856s random time. Dec 6 17:27:30 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:30 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 57min 39.588322s random time. Dec 6 17:27:30 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:30 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 41min 14.870258s random time. Dec 6 17:27:30 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:30 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 14min 20.468467s random time. Dec 6 17:27:30 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:30 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 9min 11.475661s random time. Dec 6 17:27:30 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:30 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 38min 19.555617s random time. Dec 6 17:27:31 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:31 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 30min 37.001210s random time. Dec 6 17:27:31 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:32 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 2min 37.078602s random time. Dec 6 17:27:32 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:32 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 40min 55.466580s random time. Dec 6 17:27:32 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:32 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 1min 34.251377s random time. Dec 6 17:27:32 nethcn-b5 systemd[1]: Stopping The Proxmox VE cluster filesystem... 
Dec 6 17:27:32 nethcn-b5 pmxcfs[10077]: [main] notice: teardown filesystem Dec 6 17:27:33 nethcn-b5 pve-ha-crm[11175]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:27:33 nethcn-b5 pve-ha-crm[11175]: ipcc_send_rec[2] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[11175]: ipcc_send_rec[3] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[11175]: ERROR: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[11175]: server received shutdown request Dec 6 17:27:33 nethcn-b5 pve-ha-crm[11175]: server stopped Dec 6 17:27:33 nethcn-b5 systemd[1]: pve-ha-crm.service: Main process exited, code=exited, status=255/n/a Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[1] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[1] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[2] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[2] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[3] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[3] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: Unable to load access control list: Connection refused Dec 6 17:27:33 nethcn-b5 systemd[1]: pve-ha-crm.service: Control process exited, code=exited status=111 Dec 6 17:27:33 nethcn-b5 systemd[1]: pve-ha-crm.service: Unit entered failed state. Dec 6 17:27:33 nethcn-b5 systemd[1]: pve-ha-crm.service: Failed with result 'exit-code'. Dec 6 17:27:34 nethcn-b5 pmxcfs[10077]: [main] notice: exit proxmox configuration filesystem (0) Dec 6 17:27:34 nethcn-b5 systemd[1]: Stopped The Proxmox VE cluster filesystem. Dec 6 17:27:34 nethcn-b5 systemd[1]: Starting The Proxmox VE cluster filesystem... Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [status] notice: update cluster info (cluster name NETHCN-B, version = 7) Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [status] notice: node has quorum Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [dcdb] notice: members: 1/10104, 2/9987, 3/9969, 4/9839, 5/14789 Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [dcdb] notice: starting data syncronisation Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [status] notice: members: 1/10104, 2/9987, 3/9969, 4/9839, 5/14789 Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [status] notice: starting data syncronisation Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [dcdb] notice: received sync request (epoch 1/10104/00000009) Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [status] notice: received sync request (epoch 1/10104/00000009) Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [dcdb] notice: received all states Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [dcdb] notice: leader is 1/10104 Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [dcdb] notice: synced members: 1/10104, 2/9987, 3/9969, 4/9839, 5/14789 Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [dcdb] notice: all data is up to date Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [status] notice: received all states Dec 6 17:27:34 nethcn-b5 pmxcfs[14789]: [status] notice: all data is up to date Dec 6 17:27:35 nethcn-b5 systemd[1]: Started The Proxmox VE cluster filesystem. Dec 6 17:27:35 nethcn-b5 systemd[1]: Reloading Proxmox VE firewall. Dec 6 17:27:35 nethcn-b5 pve-firewall[15692]: send HUP to 10709 Dec 6 17:27:35 nethcn-b5 pve-firewall[10709]: received signal HUP Dec 6 17:27:35 nethcn-b5 pve-firewall[10709]: server shutdown (restart) Dec 6 17:27:35 nethcn-b5 systemd[1]: Reloaded Proxmox VE firewall. 
Dec 6 17:27:35 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:36 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 17min 21.376308s random time. Dec 6 17:27:36 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:36 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 17min 58.242872s random time. Dec 6 17:27:36 nethcn-b5 systemd[1]: Stopping Proxmox VE firewall logger... Dec 6 17:27:36 nethcn-b5 pvepw-logger[24509]: received terminate request (signal) Dec 6 17:27:36 nethcn-b5 pvepw-logger[24509]: stopping pvefw logger Dec 6 17:27:36 nethcn-b5 pve-firewall[10709]: restarting server Dec 6 17:27:36 nethcn-b5 systemd[1]: Stopped Proxmox VE firewall logger. Dec 6 17:27:36 nethcn-b5 systemd[1]: Starting Proxmox VE firewall logger... Dec 6 17:27:36 nethcn-b5 pvefw-logger[15777]: starting pvefw logger Dec 6 17:27:36 nethcn-b5 systemd[1]: Started Proxmox VE firewall logger. Dec 6 17:27:36 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:36 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 26min 56.751953s random time. Dec 6 17:27:36 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:36 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 10min 35.995195s random time. Dec 6 17:27:36 nethcn-b5 kernel: [88695.943725] audit: type=1400 audit(1512577656.632:14): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/lxc-start" pid=15841 comm="apparmor_parser" Dec 6 17:27:36 nethcn-b5 kernel: [88696.136335] audit: type=1400 audit(1512577656.825:15): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default" pid=15845 comm="apparmor_parser" Dec 6 17:27:36 nethcn-b5 kernel: [88696.153706] audit: type=1400 audit(1512577656.825:16): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default-cgns" pid=15845 comm="apparmor_parser" Dec 6 17:27:36 nethcn-b5 kernel: [88696.172585] audit: type=1400 audit(1512577656.825:17): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default-with-mounting" pid=15845 comm="apparmor_parser" Dec 6 17:27:36 nethcn-b5 kernel: [88696.191137] audit: type=1400 audit(1512577656.825:18): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default-with-nesting" pid=15845 comm="apparmor_parser" Dec 6 17:27:37 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:37 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 25min 26.022837s random time. Dec 6 17:27:37 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:37 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 43min 7.212239s random time. Dec 6 17:27:37 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:37 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 45min 48.451806s random time. Dec 6 17:27:37 nethcn-b5 systemd[1]: Stopping PVE Local HA Ressource Manager Daemon... Dec 6 17:27:38 nethcn-b5 pve-ha-lrm[14351]: received signal TERM Dec 6 17:27:38 nethcn-b5 pve-ha-lrm[14351]: restart LRM, freeze all services Dec 6 17:27:38 nethcn-b5 pve-ha-lrm[14351]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:27:39 nethcn-b5 pvestatd[10636]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:27:48 nethcn-b5 pve-ha-lrm[14351]: watchdog closed (disabled) Dec 6 17:27:48 nethcn-b5 pve-ha-lrm[14351]: server stopped Dec 6 17:27:49 nethcn-b5 systemd[1]: Stopped PVE Local HA Ressource Manager Daemon. Dec 6 17:27:49 nethcn-b5 systemd[1]: Starting PVE Cluster Ressource Manager Daemon... 
Dec 6 17:27:49 nethcn-b5 pve-ha-crm[16740]: starting server Dec 6 17:27:49 nethcn-b5 pve-ha-crm[16740]: status change startup => wait_for_quorum Dec 6 17:27:49 nethcn-b5 systemd[1]: Started PVE Cluster Ressource Manager Daemon. Dec 6 17:27:49 nethcn-b5 systemd[1]: Starting PVE Local HA Ressource Manager Daemon... Dec 6 17:27:50 nethcn-b5 pve-ha-lrm[16775]: starting server Dec 6 17:27:50 nethcn-b5 pve-ha-lrm[16775]: status change startup => wait_for_agent_lock Dec 6 17:27:50 nethcn-b5 systemd[1]: Started PVE Local HA Ressource Manager Daemon. Dec 6 17:27:50 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:50 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 42min 4.585055s random time. Dec 6 17:27:50 nethcn-b5 systemd[1]: Stopping PVE Cluster Ressource Manager Daemon... Dec 6 17:27:50 nethcn-b5 pve-ha-crm[16740]: received signal TERM Dec 6 17:27:50 nethcn-b5 pve-ha-crm[16740]: server received shutdown request Dec 6 17:27:54 nethcn-b5 pve-ha-crm[16740]: status change wait_for_quorum => slave Dec 6 17:27:54 nethcn-b5 pve-ha-crm[16740]: server stopped Dec 6 17:27:55 nethcn-b5 systemd[1]: Stopped PVE Cluster Ressource Manager Daemon. Dec 6 17:27:55 nethcn-b5 systemd[1]: Starting PVE Cluster Ressource Manager Daemon... Dec 6 17:27:55 nethcn-b5 pve-ha-crm[17429]: starting server Dec 6 17:27:55 nethcn-b5 pve-ha-crm[17429]: status change startup => wait_for_quorum Dec 6 17:27:55 nethcn-b5 systemd[1]: Started PVE Cluster Ressource Manager Daemon. Dec 6 17:27:56 nethcn-b5 systemd[1]: Reloading. Dec 6 17:27:56 nethcn-b5 systemd[1]: apt-daily-upgrade.timer: Adding 32min 16.418414s random time. Dec 6 17:27:56 nethcn-b5 systemd[1]: Reloading PVE API Daemon. Dec 6 17:27:57 nethcn-b5 pvedaemon[17563]: send HUP to 10872 Dec 6 17:27:57 nethcn-b5 pvedaemon[10872]: received signal HUP Dec 6 17:27:57 nethcn-b5 pvedaemon[10872]: server closing Dec 6 17:27:57 nethcn-b5 pvedaemon[10872]: server shutdown (restart) Dec 6 17:27:57 nethcn-b5 pvedaemon[10874]: worker exit Dec 6 17:27:57 nethcn-b5 pvedaemon[10873]: worker exit Dec 6 17:27:57 nethcn-b5 pvedaemon[10875]: worker exit Dec 6 17:27:57 nethcn-b5 systemd[1]: Reloaded PVE API Daemon. Dec 6 17:27:57 nethcn-b5 systemd[1]: Reloading PVE API Proxy Server. Dec 6 17:27:58 nethcn-b5 pvedaemon[10872]: restarting server Dec 6 17:27:58 nethcn-b5 pvedaemon[10872]: starting 3 worker(s) Dec 6 17:27:58 nethcn-b5 pvedaemon[10872]: worker 17616 started Dec 6 17:27:58 nethcn-b5 pvedaemon[10872]: worker 17617 started Dec 6 17:27:58 nethcn-b5 pvedaemon[10872]: worker 17618 started Dec 6 17:27:58 nethcn-b5 pveproxy[17600]: send HUP to 13530 Dec 6 17:27:58 nethcn-b5 pveproxy[13530]: received signal HUP Dec 6 17:27:58 nethcn-b5 pveproxy[13530]: server closing Dec 6 17:27:58 nethcn-b5 pveproxy[13530]: server shutdown (restart) Dec 6 17:27:58 nethcn-b5 pveproxy[13533]: worker exit Dec 6 17:27:58 nethcn-b5 pveproxy[13532]: worker exit Dec 6 17:27:58 nethcn-b5 pveproxy[13531]: worker exit Dec 6 17:27:58 nethcn-b5 systemd[1]: Reloaded PVE API Proxy Server. Dec 6 17:27:58 nethcn-b5 systemd[1]: Reloading PVE SPICE Proxy Server. Dec 6 17:27:58 nethcn-b5 spiceproxy[17623]: send HUP to 13563 Dec 6 17:27:58 nethcn-b5 spiceproxy[13563]: received signal HUP Dec 6 17:27:58 nethcn-b5 spiceproxy[13563]: server closing Dec 6 17:27:58 nethcn-b5 spiceproxy[13563]: server shutdown (restart) Dec 6 17:27:58 nethcn-b5 spiceproxy[13564]: worker exit Dec 6 17:27:58 nethcn-b5 systemd[1]: Reloaded PVE SPICE Proxy Server. Dec 6 17:27:58 nethcn-b5 systemd[1]: Reloading PVE Status Daemon. 
Dec 6 17:27:58 nethcn-b5 pveproxy[13530]: Using '/etc/pve/local/pveproxy-ssl.pem' as certificate for the web interface. Dec 6 17:27:58 nethcn-b5 pveproxy[13530]: restarting server Dec 6 17:27:58 nethcn-b5 pveproxy[13530]: starting 3 worker(s) Dec 6 17:27:58 nethcn-b5 pveproxy[13530]: worker 17637 started Dec 6 17:27:58 nethcn-b5 pveproxy[13530]: worker 17639 started Dec 6 17:27:58 nethcn-b5 pveproxy[13530]: worker 17640 started Dec 6 17:27:58 nethcn-b5 spiceproxy[13563]: restarting server Dec 6 17:27:58 nethcn-b5 spiceproxy[13563]: starting 1 worker(s) Dec 6 17:27:58 nethcn-b5 spiceproxy[13563]: worker 17644 started Dec 6 17:27:58 nethcn-b5 pvestatd[17634]: send HUP to 10636 Dec 6 17:27:58 nethcn-b5 pvestatd[10636]: received signal HUP Dec 6 17:27:58 nethcn-b5 pvestatd[10636]: server shutdown (restart) Dec 6 17:27:58 nethcn-b5 systemd[1]: Reloaded PVE Status Daemon. Dec 6 17:27:59 nethcn-b5 pvestatd[10636]: restarting server Dec 6 17:28:00 nethcn-b5 systemd[1]: Starting Proxmox VE replication runner... Dec 6 17:28:00 nethcn-b5 pve-ha-lrm[16775]: successfully acquired lock 'ha_agent_nethcn-b5_lock' Dec 6 17:28:00 nethcn-b5 pve-ha-lrm[16775]: watchdog active Dec 6 17:28:00 nethcn-b5 pve-ha-lrm[16775]: status change wait_for_agent_lock => active Dec 6 17:28:00 nethcn-b5 systemd[1]: Started Proxmox VE replication runner. Dec 6 17:28:00 nethcn-b5 pve-ha-crm[17429]: status change wait_for_quorum => slave Dec 6 17:28:01 nethcn-b5 cron[10376]: (*system*pveupdate) RELOAD (/etc/cron.d/pveupdate) Dec 6 17:28:01 nethcn-b5 CRON[18584]: (root) CMD ( sleep $((RANDOM % 20)); /usr/local/sbin/check_ipmi.sh) Dec 6 17:28:03 nethcn-b5 pvedaemon[10872]: worker 10873 finished Dec 6 17:28:03 nethcn-b5 pvedaemon[10872]: worker 10874 finished Dec 6 17:28:03 nethcn-b5 pvedaemon[10872]: worker 10875 finished Dec 6 17:28:03 nethcn-b5 pveproxy[13530]: worker 13531 finished Dec 6 17:28:03 nethcn-b5 pveproxy[13530]: worker 13532 finished Dec 6 17:28:03 nethcn-b5 pveproxy[13530]: worker 13533 finished Dec 6 17:28:03 nethcn-b5 spiceproxy[13563]: worker 13564 finished Dec 6 17:28:06 nethcn-b5 systemd[1]: Stopping LXC Container Monitoring Daemon... Dec 6 17:28:06 nethcn-b5 systemd[1]: Stopped LXC Container Monitoring Daemon. Dec 6 17:28:06 nethcn-b5 systemd[1]: Started LXC Container Monitoring Daemon. Dec 6 17:28:06 nethcn-b5 systemd[1]: Stopping Proxmox VE watchdog multiplexer... Dec 6 17:28:06 nethcn-b5 watchdog-mux[3747]: got terminate request Dec 6 17:28:06 nethcn-b5 watchdog-mux[3747]: exit watchdog-mux with active connections Dec 6 17:28:06 nethcn-b5 systemd[1]: Stopped Proxmox VE watchdog multiplexer. Dec 6 17:28:06 nethcn-b5 systemd[1]: Started Proxmox VE watchdog multiplexer. Dec 6 17:28:06 nethcn-b5 kernel: [88725.955509] watchdog: watchdog0: watchdog did not stop! Dec 6 17:28:06 nethcn-b5 watchdog-mux[18946]: watchdog active - unable to restart watchdog-mux Dec 6 17:28:06 nethcn-b5 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE Dec 6 17:28:06 nethcn-b5 systemd[1]: watchdog-mux.service: Unit entered failed state. Dec 6 17:28:06 nethcn-b5 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'. 
Dec 6 17:28:10 nethcn-b5 pve-ha-lrm[16775]: watchdog update failed - Broken pipe Dec 6 17:29:41 nethcn-b5 systemd-modules-load[1761]: Inserted module 'iscsi_tcp' Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] random: get_random_bytes called from start_kernel+0x42/0x4f3 with crng_init=0 Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] Linux version 4.13.8-3-pve (root at nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18)) #1 SMP PVE 4.13.8-30 (Tue, 5 Dec 2017 13:06:48 +0100) () Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.13.8-3-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs elevator=noop console=tty0 console=ttyS1,115200n8 Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] KERNEL supported cpus: Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] Intel GenuineIntel Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] AMD AuthenticAMD Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] Centaur CentaurHauls Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256 Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format. Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] e820: BIOS-provided physical RAM map: Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000099bff] usable Dec 6 17:29:41 nethcn-b5 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000099c00-0x000000000009ffff] reserved From andreas at mx20.org Thu Dec 7 14:07:41 2017 From: andreas at mx20.org (Andreas Herrmann) Date: Thu, 7 Dec 2017 14:07:41 +0100 Subject: [PVE-User] WARNING: Upgrade and Watchdog kills Server in HA-Mode In-Reply-To: <6e4940d4-6c10-f253-7dad-f93959c111fc@proxmox.com> References: <5c946c6e-bfa9-7bf5-aa3f-59be6279fdb3@mx20.org> <6e4940d4-6c10-f253-7dad-f93959c111fc@proxmox.com> Message-ID: <7d48bde3-5220-d75b-d835-86dd4e4e1bdd@mx20.org> Hi again, On 07.12.2017 08:57, Thomas Lamprecht wrote: > Do you got some log entries around that time? > Or a persistent journal? some more filtered logs about the watchdog are attached. nethcn-b(1|2|5) "crashed" and nethcn-b(3|4) kept online. Ceph monitors are running on nethcn-b(1|3|5). 
Andreas -------------- next part -------------- root at nethcn-b1:~# cat /var/log/syslog.1|egrep watchdog\|ipcc Dec 6 18:33:53 nethcn-b1 pvestatd[10770]: ipcc_send_rec[4] failed: Transport endpoint is not connected Dec 6 18:33:53 nethcn-b1 pvestatd[10770]: ipcc_send_rec[4] failed: Connection refused Dec 6 18:33:53 nethcn-b1 pvestatd[10770]: ipcc_send_rec[4] failed: Connection refused Dec 6 18:33:53 nethcn-b1 pvestatd[10770]: ipcc_send_rec[4] failed: Connection refused Dec 6 18:33:53 nethcn-b1 pvestatd[10770]: ipcc_send_rec[4] failed: Connection refused Dec 6 18:33:56 nethcn-b1 pve-ha-lrm[13875]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 18:33:56 nethcn-b1 pve-ha-lrm[13875]: ipcc_send_rec[2] failed: Connection refused Dec 6 18:33:56 nethcn-b1 pve-ha-lrm[13875]: ipcc_send_rec[3] failed: Connection refused Dec 6 18:33:56 nethcn-b1 watchdog-mux[3565]: client did not stop watchdog - disable watchdog updates Dec 6 18:33:58 nethcn-b1 pve-ha-crm[10964]: ipcc_send_rec[1] failed: Transport endpoint is not connected root at nethcn-b2:~# cat /var/log/syslog.1|egrep watchdog\|ipcc Dec 6 17:46:40 nethcn-b2 pve-ha-crm[10842]: watchdog active Dec 6 17:51:20 nethcn-b2 pve-ha-crm[10842]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:51:20 nethcn-b2 pve-ha-crm[10842]: ipcc_send_rec[2] failed: Connection refused Dec 6 17:51:20 nethcn-b2 pve-ha-crm[10842]: ipcc_send_rec[3] failed: Connection refused Dec 6 17:51:20 nethcn-b2 watchdog-mux[3397]: client did not stop watchdog - disable watchdog updates Dec 6 17:51:21 nethcn-b2 pve-ha-lrm[13145]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:51:21 nethcn-b2 watchdog-mux[3397]: exit watchdog-mux with active connections Dec 6 17:51:21 nethcn-b2 kernel: [88876.361477] watchdog: watchdog0: watchdog did not stop! Dec 6 17:51:23 nethcn-b2 pvestatd[10618]: ipcc_send_rec[1] failed: Transport endpoint is not connected root at nethcn-b3:~# cat /var/log/syslog.1|egrep watchdog\|ipcc Dec 6 17:46:15 nethcn-b3 pveproxy[15923]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:46:15 nethcn-b3 pveproxy[15923]: ipcc_send_rec[2] failed: Connection refused Dec 6 17:46:15 nethcn-b3 pveproxy[15923]: ipcc_send_rec[3] failed: Connection refused Dec 6 17:46:19 nethcn-b3 pvestatd[10805]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:46:20 nethcn-b3 pve-ha-crm[10996]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:46:20 nethcn-b3 pve-ha-lrm[13497]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:46:30 nethcn-b3 pve-ha-lrm[13497]: watchdog closed (disabled) Dec 6 17:46:40 nethcn-b3 pve-ha-crm[10996]: watchdog closed (disabled) Dec 6 17:47:03 nethcn-b3 systemd[1]: Stopping Proxmox VE watchdog multiplexer... Dec 6 17:47:03 nethcn-b3 watchdog-mux[3580]: got terminate request Dec 6 17:47:03 nethcn-b3 watchdog-mux[3580]: clean exit Dec 6 17:47:03 nethcn-b3 systemd[1]: Stopped Proxmox VE watchdog multiplexer. Dec 6 17:47:03 nethcn-b3 systemd[1]: Started Proxmox VE watchdog multiplexer. 
Dec 6 17:47:03 nethcn-b3 watchdog-mux[834]: Watchdog driver 'Software Watchdog', version 0 Dec 6 17:49:21 nethcn-b3 pve-ha-lrm[30589]: watchdog active root at nethcn-b4:~# cat /var/log/syslog.1|egrep watchdog\|ipcc Dec 6 17:37:08 nethcn-b4 pveproxy[12998]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:37:08 nethcn-b4 pveproxy[12998]: ipcc_send_rec[2] failed: Connection refused Dec 6 17:37:08 nethcn-b4 pveproxy[12998]: ipcc_send_rec[3] failed: Connection refused Dec 6 17:37:10 nethcn-b4 pve-ha-lrm[12950]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:37:11 nethcn-b4 pve-ha-crm[10654]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:37:13 nethcn-b4 pvestatd[10424]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:37:20 nethcn-b4 pve-ha-lrm[12950]: watchdog closed (disabled) Dec 6 17:39:01 nethcn-b4 systemd[1]: Stopping Proxmox VE watchdog multiplexer... Dec 6 17:39:01 nethcn-b4 watchdog-mux[3564]: got terminate request Dec 6 17:39:01 nethcn-b4 watchdog-mux[3564]: clean exit Dec 6 17:39:01 nethcn-b4 systemd[1]: Stopped Proxmox VE watchdog multiplexer. Dec 6 17:39:01 nethcn-b4 systemd[1]: Started Proxmox VE watchdog multiplexer. Dec 6 17:39:01 nethcn-b4 watchdog-mux[5395]: Watchdog driver 'Software Watchdog', version 0 Dec 6 17:44:51 nethcn-b4 pve-ha-lrm[31595]: watchdog active Dec 6 17:53:26 nethcn-b4 pve-ha-crm[31896]: watchdog active root at nethcn-b5:/var/log# cat /var/log/syslog.1|egrep watchdog\|ipcc Dec 6 17:27:33 nethcn-b5 pve-ha-crm[11175]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:27:33 nethcn-b5 pve-ha-crm[11175]: ipcc_send_rec[2] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[11175]: ipcc_send_rec[3] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[1] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[1] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[2] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[2] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[3] failed: Connection refused Dec 6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[3] failed: Connection refused Dec 6 17:27:38 nethcn-b5 pve-ha-lrm[14351]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:27:39 nethcn-b5 pvestatd[10636]: ipcc_send_rec[1] failed: Transport endpoint is not connected Dec 6 17:27:48 nethcn-b5 pve-ha-lrm[14351]: watchdog closed (disabled) Dec 6 17:28:00 nethcn-b5 pve-ha-lrm[16775]: watchdog active Dec 6 17:28:06 nethcn-b5 systemd[1]: Stopping Proxmox VE watchdog multiplexer... Dec 6 17:28:06 nethcn-b5 watchdog-mux[3747]: got terminate request Dec 6 17:28:06 nethcn-b5 watchdog-mux[3747]: exit watchdog-mux with active connections Dec 6 17:28:06 nethcn-b5 systemd[1]: Stopped Proxmox VE watchdog multiplexer. Dec 6 17:28:06 nethcn-b5 systemd[1]: Started Proxmox VE watchdog multiplexer. Dec 6 17:28:06 nethcn-b5 kernel: [88725.955509] watchdog: watchdog0: watchdog did not stop! Dec 6 17:28:06 nethcn-b5 watchdog-mux[18946]: watchdog active - unable to restart watchdog-mux Dec 6 17:28:06 nethcn-b5 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE Dec 6 17:28:06 nethcn-b5 systemd[1]: watchdog-mux.service: Unit entered failed state. Dec 6 17:28:06 nethcn-b5 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'. 
Dec 6 17:28:10 nethcn-b5 pve-ha-lrm[16775]: watchdog update failed - Broken pipe From mark at tuxis.nl Sun Dec 10 05:54:19 2017 From: mark at tuxis.nl (Mark Schouten) Date: Sun, 10 Dec 2017 05:54:19 +0100 Subject: [PVE-User] WARNING: Upgrade and Watchdog kills Server in HA-Mode In-Reply-To: <5c946c6e-bfa9-7bf5-aa3f-59be6279fdb3@mx20.org> References: <5c946c6e-bfa9-7bf5-aa3f-59be6279fdb3@mx20.org> Message-ID: <1340E011-9802-4D06-85BC-23242C791025@tuxis.nl> Isn?t this the issue: Setting up pve-ha-manager (2.0-4) ... watchdog-mux.service is a disabled or a static unit, not starting it. Where possibly the service is stopped, but not started again? > On 6 Dec 2017, at 18:43, Andreas Herrmann wrote: > > Setting up pve-ha-manager (2.0-4) ... > watchdog-mux.service is a disabled or a static unit, not starting it. From andreas at mx20.org Sun Dec 10 08:04:18 2017 From: andreas at mx20.org (Andreas Herrmann) Date: Sun, 10 Dec 2017 08:04:18 +0100 Subject: [PVE-User] WARNING: Upgrade and Watchdog kills Server in HA-Mode In-Reply-To: <1340E011-9802-4D06-85BC-23242C791025@tuxis.nl> References: <5c946c6e-bfa9-7bf5-aa3f-59be6279fdb3@mx20.org> <1340E011-9802-4D06-85BC-23242C791025@tuxis.nl> Message-ID: <83098aba-c96d-668c-8a0a-477b327ab594@mx20.org> Hi, the error seems to be fixed with pve-cluster version 5.0-19 https://git.proxmox.com/?p=pve-cluster.git;a=commitdiff;h=02b93019317d2b598fbae808301aeccc6088e9c5 https://git.proxmox.com/?p=pve-cluster.git;a=commitdiff;h=ec826d72c06e6f649b2b19c3341c39abb29b19f9 Andreas On 10.12.2017 05:54, Mark Schouten wrote: > Isn?t this the issue: > > Setting up pve-ha-manager (2.0-4) ... > watchdog-mux.service is a disabled or a static unit, not starting it. > > > Where possibly the service is stopped, but not started again? > > >> On 6 Dec 2017, at 18:43, Andreas Herrmann wrote: >> >> Setting up pve-ha-manager (2.0-4) ... >> watchdog-mux.service is a disabled or a static unit, not starting it. From elacunza at binovo.es Mon Dec 11 09:55:46 2017 From: elacunza at binovo.es (Eneko Lacunza) Date: Mon, 11 Dec 2017 09:55:46 +0100 Subject: [PVE-User] PVE4->PVE5 Live Migration issues In-Reply-To: References: <1cab5238-f913-fed2-1e37-3a5eb657c1d1@coppint.com> Message-ID: <86089592-f9c7-42b9-239a-daa7a22e182e@binovo.es> What we found was that some of our VMs were running already with a non-cirrus VGA. So we had to check each VM's running kvm process, to know wether we had to add vga:cirrus or not. We didn't see this 100%CPU issue though. El 07/12/17 a las 15:56, Florent B escribi?: > Even if migration succeeded with "vga: cirrus", some VM are frozen with > 100%CPU, no console... > _______________________________________________ > pve-user mailing list > pve-user at pve.proxmox.com > https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user -- Zuzendari Teknikoa / Director T?cnico Binovo IT Human Project, S.L. Telf. 943569206 Astigarraga bidea 2, 2? izq. oficina 11; 20180 Oiartzun (Gipuzkoa) www.binovo.es From f.rust at sec.tu-bs.de Mon Dec 11 10:31:55 2017 From: f.rust at sec.tu-bs.de (F.Rust) Date: Mon, 11 Dec 2017 10:31:55 +0100 Subject: [PVE-User] Setup default VM ID starting number Message-ID: <3B6B854E-AF79-4A55-B6BB-3CA643891413@sec.tu-bs.de> Hi all, is it possible to set a different starting number for VM ids? We have different clusters and don?t want to have overlapping vm ids. So it would be great to simply say Cluster 1 start VM-ids at 100 Cluster 2 start VM ids at 1000 ? Admin of Cluster 2 can not see which machine ids on Cluster 1 exist an vice versa. 
But machine images or backups might get mixed up in the SAN.

Best regards,
Frank

From t.lamprecht at proxmox.com Mon Dec 11 10:58:55 2017
From: t.lamprecht at proxmox.com (Thomas Lamprecht)
Date: Mon, 11 Dec 2017 10:58:55 +0100
Subject: [PVE-User] Setup default VM ID starting number
In-Reply-To: <3B6B854E-AF79-4A55-B6BB-3CA643891413@sec.tu-bs.de>
References: <3B6B854E-AF79-4A55-B6BB-3CA643891413@sec.tu-bs.de>
Message-ID: 

Hi,

On 12/11/2017 10:31 AM, F.Rust wrote:
> Hi all,
>
> is it possible to set a different starting number for VM ids?

No, currently not, I'm afraid.

> We have different clusters and don't want to have overlapping VM ids.
> So it would be great to simply say
> Cluster 1 start VM ids at 100
> Cluster 2 start VM ids at 1000
> ...
> Admin of Cluster 2 cannot see which machine ids exist on Cluster 1 and vice versa.
> But machine images or backups might get mixed up in the SAN.

Is there a possibility to declare two different backup endpoints in your setup?
We normally expect that two clusters do not access the same writable storage
at the same path, exactly because of the backup clash possibility and other
shared resource access problems.

cheers,
Thomas

From miguel_3_gonzalez at yahoo.es Mon Dec 11 13:40:28 2017
From: miguel_3_gonzalez at yahoo.es (=?UTF-8?Q?Miguel_Gonz=c3=a1lez?=)
Date: Mon, 11 Dec 2017 13:40:28 +0100
Subject: [PVE-User] sparse and compression
Message-ID: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es>

Dear all,

Is it advisable to use sparse on ZFS pools performance-wise? And compression?
Which kind of compression?

Can I change a zpool to sparse on the fly, or do I need to turn off all VMs
before doing so?

Why does a virtual disk show as 60G when originally it was 36 GB in raw format?

NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool/data/vm-102-disk-1  60.0G  51.3G  20.9G  -

Thanks!

Miguel

From andreas at mx20.org Mon Dec 11 14:16:38 2017
From: andreas at mx20.org (Andreas Herrmann)
Date: Mon, 11 Dec 2017 14:16:38 +0100
Subject: [PVE-User] sparse and compression
In-Reply-To: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es>
References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es>
Message-ID: 

Hi Miguel,

first of all: man zfs!

On 11.12.2017 13:40, Miguel González wrote:
> Is it advisable to use sparse on ZFS pools performance-wise? And
> compression? Which kind of compression?

Sparse or not doesn't matter on SSDs. I would use compression, because it means
fewer reads/writes to the disk and modern CPUs can handle lz4 quite well.

Also keep in mind: a sparse volume only stays sparse if trim/discard is used!

volblocksize is important: ZFS uses 8k as the default for zvols. For ZFS
filesystems, a recordsize of 128K is used.
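As a quick sketch of how to check this (the dataset name is the one from this
thread; the "local-zfs" storage name and the pvesm option are assumptions based
on a default PVE ZFS install):

  # volblocksize is fixed at creation time, so it can only be checked on existing zvols
  zfs get volblocksize,compression,refreservation rpool/data/vm-102-disk-1

  # newly created disks pick up whatever block size is configured on the zfspool storage
  pvesm set local-zfs --blocksize 16k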
Some older test: zpool/vm-zvols/bsize_1k written 10.3G zpool/vm-zvols/bsize_1k logicalused 1.82G zpool/vm-zvols/bsize_4k written 2.60G zpool/vm-zvols/bsize_4k logicalused 1.76G zpool/vm-zvols/bsize_8k written 2.60G zpool/vm-zvols/bsize_8k logicalused 1.78G zpool/vm-zvols/bsize_16k written 1.87G zpool/vm-zvols/bsize_16k logicalused 1.70G zpool/vm-zvols/bsize_32k written 1.87G zpool/vm-zvols/bsize_32k logicalused 1.71G zpool/vm-zvols/bsize_64k written 1.72G zpool/vm-zvols/bsize_64k logicalused 1.72G zpool/vm-zvols/bsize_128k written 1.75G zpool/vm-zvols/bsize_128k logicalused 1.75G > Can I change a zpool to sparse on the fly or do I need to turn off all > VMs before doing so? No, sparse or not is set at creation. > Why a virtual disk shows as 60G when originally It was 36 Gb in raw format? > > NAME USED AVAIL REFER MOUNTPOINT > rpool/data/vm-102-disk-1 60.0G 51.3G 20.9G - Because of blocksizes. Check zfs get all and read theory about ZFS. Here's an example for a non-sparse 50GB Volume for a VM: zpool/vm-zvols/foobar 51.6G 2.19T 34.6G - zpool/vm-zvols/foobar used 51.6G zpool/vm-zvols/foobar referenced 34.6G zpool/vm-zvols/foobar compressratio 1.02x zpool/vm-zvols/foobar volsize 50G zpool/vm-zvols/foobar volblocksize 8K zpool/vm-zvols/foobar compression lz4 zpool/vm-zvols/foobar refreservation 51.6G zpool/vm-zvols/foobar usedbydataset 34.6G zpool/vm-zvols/foobar usedbyrefreservation 17.0G zpool/vm-zvols/foobar refcompressratio 1.02x zpool/vm-zvols/foobar written 34.6G zpool/vm-zvols/foobar logicalused 24.2G zpool/vm-zvols/foobar logicalreferenced 24.2G Andreas From f.gruenbichler at proxmox.com Mon Dec 11 14:17:31 2017 From: f.gruenbichler at proxmox.com (Fabian =?iso-8859-1?Q?Gr=FCnbichler?=) Date: Mon, 11 Dec 2017 14:17:31 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> Message-ID: <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> On Mon, Dec 11, 2017 at 01:40:28PM +0100, Miguel Gonz?lez wrote: > Dear all, > > Is it advisable to use sparse on ZFS pools performance wise? And > compression? Which kind of compression? sparse just tells ZFS to not reserve space, it does not make a difference performance wise. if you do over provision and attempt to use more space than you actuall have, you can corrupt volumes / run into I/O errors though, like with most storages. compression is advisable, it costs (almost) nothing and usually increases performance and saves space. the default (on which is lz4) is fine. > > Can I change a zpool to sparse on the fly or do I need to turn off all > VMs before doing so? sparse will only affect newly created volumes. you can "convert" sparse volumes to fully reserved ones and vice versa manually though. compression only affects data written after it has been enabled, and already written data stays compressed if you turn it off again. if you want to fully switch from compressed to uncompressed or vice versa, you need to re-write all the data. > > Why a virtual disk shows as 60G when originally It was 36 Gb in raw format? > > NAME USED AVAIL REFER MOUNTPOINT > rpool/data/vm-102-disk-1 60.0G 51.3G 20.9G - wild guess - you are using raidz of some kind? ashift is set to 12 / auto-detected? 
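For reference, a minimal sketch of checking and enabling this on the pool named
in this thread (the "local-zfs" storage name and the pvesm option are
assumptions based on a default PVE 5 ZFS setup; the zfs dataset names are the
ones posted above):

  # see what is currently set and how well existing data compresses
  zfs get compression,compressratio rpool/data

  # enable lz4 for everything below rpool/data; only blocks written afterwards are affected
  zfs set compression=lz4 rpool/data

  # have PVE create future disks as sparse (thin) zvols on this storage
  pvesm set local-zfs --sparse 1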
From miguel_3_gonzalez at yahoo.es Mon Dec 11 15:23:34 2017 From: miguel_3_gonzalez at yahoo.es (=?UTF-8?Q?Miguel_Gonz=c3=a1lez?=) Date: Mon, 11 Dec 2017 15:23:34 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> Message-ID: >> Can I change a zpool to sparse on the fly or do I need to turn off all >> VMs before doing so? > > sparse will only affect newly created volumes. you can "convert" sparse > volumes to fully reserved ones and vice versa manually though. > how can I convert manually from non-sparse to sparse? Creating a new zpool and copy disk with dd? Or any other easier way? > >> >> Why a virtual disk shows as 60G when originally It was 36 Gb in raw format? >> >> NAME USED AVAIL REFER MOUNTPOINT >> rpool/data/vm-102-disk-1 60.0G 51.3G 20.9G - > > wild guess - you are using raidz of some kind? ashift is set to 12 / > auto-detected? Yes, raid1 Thanks for your promptly reply! Miguel --- This email has been checked for viruses by AVG. http://www.avg.com From miguel_3_gonzalez at yahoo.es Mon Dec 11 15:27:47 2017 From: miguel_3_gonzalez at yahoo.es (=?UTF-8?Q?Miguel_Gonz=c3=a1lez?=) Date: Mon, 11 Dec 2017 15:27:47 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> Message-ID: > > Also keep in mind: A sparse volume only stays sparse if trim/discard is used! How do I use trim/discard? Has this to be set in guest level, right? > > volblocksize is important: ZFS is using 8k as default. For ZFS filesystem a > recordsize of 128K is used. Can I change recordsize to 128K after creation or do I need to create a new zpool for that? Thanks for your promptly answer! Miguel --- This email has been checked for viruses by AVG. http://www.avg.com From f.gruenbichler at proxmox.com Mon Dec 11 15:29:04 2017 From: f.gruenbichler at proxmox.com (Fabian =?iso-8859-1?Q?Gr=FCnbichler?=) Date: Mon, 11 Dec 2017 15:29:04 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> Message-ID: <20171211142904.biawkzgnntym2vmn@nora.maurer-it.com> On Mon, Dec 11, 2017 at 03:23:34PM +0100, Miguel Gonz?lez wrote: > > >> Can I change a zpool to sparse on the fly or do I need to turn off all > >> VMs before doing so? > > > > sparse will only affect newly created volumes. you can "convert" sparse > > volumes to fully reserved ones and vice versa manually though. > > > > how can I convert manually from non-sparse to sparse? Creating a new > zpool and copy disk with dd? Or any other easier way? (un)set the reservations appropriately. like I said, "sparse" is entirely virtual for ZFS, the only difference is whether the full size is reserved upon creation or not. > >> Why a virtual disk shows as 60G when originally It was 36 Gb in raw format? > >> > >> NAME USED AVAIL REFER MOUNTPOINT > >> rpool/data/vm-102-disk-1 60.0G 51.3G 20.9G - > > > > wild guess - you are using raidz of some kind? ashift is set to 12 / > > auto-detected? > > Yes, raid1 > > Thanks for your promptly reply! raid1 (aka mirror)? or raidZ-1 ? 
those are two very different things ;) From miguel_3_gonzalez at yahoo.es Mon Dec 11 15:34:37 2017 From: miguel_3_gonzalez at yahoo.es (=?UTF-8?Q?Miguel_Gonz=c3=a1lez?=) Date: Mon, 11 Dec 2017 15:34:37 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: <20171211142904.biawkzgnntym2vmn@nora.maurer-it.com> References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> <20171211142904.biawkzgnntym2vmn@nora.maurer-it.com> Message-ID: <79d00b4d-9211-4b1a-d136-7a78454badda@yahoo.es> On 12/11/17 3:29 PM, Fabian Gr?nbichler wrote: > On Mon, Dec 11, 2017 at 03:23:34PM +0100, Miguel Gonz?lez wrote: >> >>>> Can I change a zpool to sparse on the fly or do I need to turn off all >>>> VMs before doing so? >>> >>> sparse will only affect newly created volumes. you can "convert" sparse >>> volumes to fully reserved ones and vice versa manually though. >>> >> >> how can I convert manually from non-sparse to sparse? Creating a new >> zpool and copy disk with dd? Or any other easier way? > > (un)set the reservations appropriately. like I said, "sparse" is > entirely virtual for ZFS, the only difference is whether the full size > is reserved upon creation or not. from Andreas comment maybe i should look more into change blocksize. > >>>> Why a virtual disk shows as 60G when originally It was 36 Gb in raw format? >>>> >>>> NAME USED AVAIL REFER MOUNTPOINT >>>> rpool/data/vm-102-disk-1 60.0G 51.3G 20.9G - >>> >>> wild guess - you are using raidz of some kind? ashift is set to 12 / >>> auto-detected? >> >> Yes, raid1 >> >> Thanks for your promptly reply! > > raid1 (aka mirror)? or raidZ-1 ? those are two very different things ;) from zfs perspective is called mirror-0 (not softraid underneath): zpool status pool: rpool state: ONLINE status: Some supported features are not enabled on the pool. The pool can still be used, but some features are unavailable. action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features. See zpool-features(5) for details. scan: scrub repaired 0B in 0h39m with 0 errors on Sun Dec 10 02:03:43 2017 config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 sda2 ONLINE 0 0 0 sdb2 ONLINE 0 0 0 --- This email has been checked for viruses by AVG. http://www.avg.com From andreas at mx20.org Mon Dec 11 15:35:35 2017 From: andreas at mx20.org (Andreas Herrmann) Date: Mon, 11 Dec 2017 15:35:35 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> Message-ID: Hi, On 11.12.2017 15:27, Miguel Gonz?lez wrote: >> Also keep in mind: A sparse volume only stays sparse if trim/discard is used! > > How do I use trim/discard? Has this to be set in guest level, right? https://pve.proxmox.com/wiki/Qemu_trim/discard_and_virtio_scsi >> volblocksize is important: ZFS is using 8k as default. For ZFS filesystem a >> recordsize of 128K is used. > > Can I change recordsize to 128K after creation or do I need to create a > new zpool for that? 
Play and learn: zfs set volblocksize=16K zpool/vm-zvols/test cannot set property for 'zpool/vm-zvols/test': 'volblocksize' is readonly You really should read 'man zfs' Andreas From andreas at mx20.org Mon Dec 11 15:47:56 2017 From: andreas at mx20.org (Andreas Herrmann) Date: Mon, 11 Dec 2017 15:47:56 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> Message-ID: Hi On 11.12.2017 14:17, Fabian Gr?nbichler wrote: > On Mon, Dec 11, 2017 at 01:40:28PM +0100, Miguel Gonz?lez wrote: >> Why a virtual disk shows as 60G when originally It was 36 Gb in raw format? >> >> NAME USED AVAIL REFER MOUNTPOINT >> rpool/data/vm-102-disk-1 60.0G 51.3G 20.9G - > > wild guess - you are using raidz of some kind? ashift is set to 12 / > auto-detected? No! 'zpool list' will show what is used on disk. zfs list is totally transparent to zpool layout. Have a look at 'zpool get all' for the ashift setting. Example for raidz1 (4x 960GB SSDs): root at foobar:~# zpool list NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT zpool 3.41T 102G 3.31T - 8% 2% 1.00x ONLINE - root at foobar:~# zfs list NAME USED AVAIL REFER MOUNTPOINT zpool 237G 2.17T 140K /zpool zpool ALLOC is smaller than zfs USED in this example. Why? Try to unserstand the difference between 'referenced' and 'used'. My volumes aren't sparse but discard is used. Andreas From miguel_3_gonzalez at yahoo.es Mon Dec 11 16:10:29 2017 From: miguel_3_gonzalez at yahoo.es (=?UTF-8?Q?Miguel_Gonz=c3=a1lez?=) Date: Mon, 11 Dec 2017 16:10:29 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> Message-ID: On 12/11/17 3:47 PM, Andreas Herrmann wrote: > Hi > > On 11.12.2017 14:17, Fabian Gr?nbichler wrote: >> On Mon, Dec 11, 2017 at 01:40:28PM +0100, Miguel Gonz?lez wrote: >>> Why a virtual disk shows as 60G when originally It was 36 Gb in raw format? >>> >>> NAME USED AVAIL REFER MOUNTPOINT >>> rpool/data/vm-102-disk-1 60.0G 51.3G 20.9G - >> >> wild guess - you are using raidz of some kind? ashift is set to 12 / >> auto-detected? > > No! 'zpool list' will show what is used on disk. zfs list is totally > transparent to zpool layout. Have a look at 'zpool get all' for the ashift > setting. > > Example for raidz1 (4x 960GB SSDs): > root at foobar:~# zpool list > NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT > zpool 3.41T 102G 3.31T - 8% 2% 1.00x ONLINE - > > root at foobar:~# zfs list > NAME USED AVAIL REFER MOUNTPOINT > zpool 237G 2.17T 140K /zpool > > zpool ALLOC is smaller than zfs USED in this example. Why? Try to unserstand > the difference between 'referenced' and 'used'. My volumes aren't sparse but > discard is used. I have search around about how to understand those columns. I didn?t find anything on the wiki that explains this. 
This is my zfs list: NAME USED AVAIL REFER MOUNTPOINT rpool 207G 61.9G 104K /rpool rpool/ROOT 6.10G 61.9G 96K /rpool/ROOT rpool/ROOT/pve-1 6.10G 61.9G 6.10G / rpool/data 197G 61.9G 96K /rpool/data rpool/data/vm-100-disk-1 108G 61.9G 108G - rpool/data/vm-102-disk-1 37.1G 77.9G 21.1G - rpool/data/vm-102-disk-2 51.6G 81.8G 31.7G - rpool/swap 4.25G 64.9G 1.25G - If I run zfs get all I get: rpool/data/vm-100-disk-1 written 108G rpool/data/vm-100-disk-1 logicalused 129G rpool/data/vm-100-disk-1 logicalreferenced 129G rpool/data/vm-102-disk-1 written 21.1G rpool/data/vm-102-disk-1 logicalused 27.1G rpool/data/vm-102-disk-1 logicalreferenced 27.1G rpool/data/vm-102-disk-2 written 31.7G rpool/data/vm-102-disk-2 logicalused 36.2G rpool/data/vm-102-disk-2 logicalreferenced 36.2G So even If I?m having 8k blocksize and non-sparse the written data is quite close to the real usage in the guests VMs. All this comes from that I was running out of space when running pve-zsync to perform a copy of the VM in other node. I have found out that snapshots were taken some part of the data (30 Gb). Any way to run a pve-zsync only a day that doesn?t consume snapshots on this machine (Maybe running from the target machine?) Thanks Miguel --- This email has been checked for viruses by AVG. http://www.avg.com From andreas at mx20.org Mon Dec 11 16:22:59 2017 From: andreas at mx20.org (Andreas Herrmann) Date: Mon, 11 Dec 2017 16:22:59 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> Message-ID: <9d630687-41ae-4849-e90f-18a0f25c5a3a@mx20.org> Hi Miguel, On 11.12.2017 16:10, Miguel Gonz?lez wrote: >> Example for raidz1 (4x 960GB SSDs): >> root at foobar:~# zpool list >> NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT >> zpool 3.41T 102G 3.31T - 8% 2% 1.00x ONLINE - >> >> root at foobar:~# zfs list >> NAME USED AVAIL REFER MOUNTPOINT >> zpool 237G 2.17T 140K /zpool >> >> zpool ALLOC is smaller than zfs USED in this example. Why? Try to unserstand >> the difference between 'referenced' and 'used'. My volumes aren't sparse but >> discard is used. > > > I have search around about how to understand those columns. I didn?t > find anything on the wiki that explains this. Why should Proxmox explain the theory of ZFS? Please have a look at 'man zfs'. There you'll find all you need. Andreas From f.gruenbichler at proxmox.com Mon Dec 11 16:37:57 2017 From: f.gruenbichler at proxmox.com (Fabian =?iso-8859-1?Q?Gr=FCnbichler?=) Date: Mon, 11 Dec 2017 16:37:57 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> Message-ID: <20171211153757.fe7pgkh3yndob2wz@nora.maurer-it.com> On Mon, Dec 11, 2017 at 03:47:56PM +0100, Andreas Herrmann wrote: > Hi > > On 11.12.2017 14:17, Fabian Gr?nbichler wrote: > > On Mon, Dec 11, 2017 at 01:40:28PM +0100, Miguel Gonz?lez wrote: > >> Why a virtual disk shows as 60G when originally It was 36 Gb in raw format? > >> > >> NAME USED AVAIL REFER MOUNTPOINT > >> rpool/data/vm-102-disk-1 60.0G 51.3G 20.9G - > > > > wild guess - you are using raidz of some kind? ashift is set to 12 / > > auto-detected? > > No! 'zpool list' will show what is used on disk. zfs list is totally > transparent to zpool layout. Have a look at 'zpool get all' for the ashift > setting. I know. 
in most cases when people are surprised by their zvols taking up more space than expected, it is because they are using raidz and don't know about the interaction between ashift=12, raidz and small volblocksize. > > Example for raidz1 (4x 960GB SSDs): > root at foobar:~# zpool list > NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT > zpool 3.41T 102G 3.31T - 8% 2% 1.00x ONLINE - > > root at foobar:~# zfs list > NAME USED AVAIL REFER MOUNTPOINT > zpool 237G 2.17T 140K /zpool > > zpool ALLOC is smaller than zfs USED in this example. Why? Try to unserstand > the difference between 'referenced' and 'used'. My volumes aren't sparse but > discard is used. your output is pretty worthless, as "REFER" only refers to the pool dataset, and not its children. I do know the difference between used and referenced, which is not (directly) related to discard at all. discard can obviously get your referenced value down ;) see the following for an example where a 10G volume takes more than 10G of space in 'zfs list' output: $ zfs list testpool -r -o name,used,referenced,volsize NAME USED REFER VOLSIZE testpool 14.3G 140K - testpool/test 14.3G 14.3G 10G $ zpool list testpool NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT testpool 39.8G 19.7G 20.1G - 0% 49% 1.00x ONLINE - the only difference between a sparse and non-sparse zvol is whether refreservation is set, which affects the usedbyrefreservation value which in turn (might / probably will) affect the used value. no relation to discard at all. From lindsay.mathieson at gmail.com Mon Dec 11 16:46:36 2017 From: lindsay.mathieson at gmail.com (Lindsay Mathieson) Date: Tue, 12 Dec 2017 01:46:36 +1000 Subject: [PVE-User] pveproxy dying, node unusable Message-ID: <9a8556df-7f6c-255d-1d9e-0ad4619f5f11@gmail.com> I dist-upraded two nodes yesterday. Now both those nodes have multiple unkilliable pveproxy processes. dmesg has many entries of: [50996.416909] INFO: task pveproxy:6798 blocked for more than 120 seconds. [50996.416914]?????? Tainted: P?????????? O 4.4.95-1-pve #1 [50996.416918] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [50996.416922] pveproxy??????? D ffff8809194e3df8 0? 6798????? 1 0x00000004 [50996.416925]? ffff8809194e3df8 ffff880ff6f5ed80 ffff880ff84fe200 ffff880fded5e200 [50996.416927]? ffff8809194e4000 ffff880fc7fb43ac ffff880fded5e200 00000000ffffffff [50996.416929]? ffff880fc7fb43b0 ffff8809194e3e10 ffffffff818643b5 ffff880fc7fb43a8 qm list hangs Node vms do not respond in web gui The node I did not upgrade is fine. -- Lindsay Mathieson From lindsay.mathieson at gmail.com Mon Dec 11 16:50:30 2017 From: lindsay.mathieson at gmail.com (Lindsay Mathieson) Date: Tue, 12 Dec 2017 01:50:30 +1000 Subject: [PVE-User] pveproxy dying, node unusable In-Reply-To: <9a8556df-7f6c-255d-1d9e-0ad4619f5f11@gmail.com> References: <9a8556df-7f6c-255d-1d9e-0ad4619f5f11@gmail.com> Message-ID: <4e32b3c0-ddd5-6579-f521-775c22015e05@gmail.com> Also I was unable to connect to the VM's on those nodes, not even via RDP On 12/12/2017 1:46 AM, Lindsay Mathieson wrote: > > I dist-upraded two nodes yesterday. Now both those nodes have multiple > unkilliable pveproxy processes. dmesg has many entries of: > > [50996.416909] INFO: task pveproxy:6798 blocked for more than 120 > seconds. > [50996.416914]?????? Tainted: P?????????? O 4.4.95-1-pve #1 > [50996.416918] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [50996.416922] pveproxy??????? D ffff8809194e3df8 0? 6798????? 
1 > 0x00000004 > [50996.416925]? ffff8809194e3df8 ffff880ff6f5ed80 ffff880ff84fe200 > ffff880fded5e200 > [50996.416927]? ffff8809194e4000 ffff880fc7fb43ac ffff880fded5e200 > 00000000ffffffff > [50996.416929]? ffff880fc7fb43b0 ffff8809194e3e10 ffffffff818643b5 > ffff880fc7fb43a8 > > > qm list hangs > > Node vms do not respond in web gui > > The node I did not upgrade is fine. > > > -- > Lindsay Mathieson -- Lindsay Mathieson From miguel_3_gonzalez at yahoo.es Mon Dec 11 17:05:17 2017 From: miguel_3_gonzalez at yahoo.es (=?UTF-8?Q?Miguel_Gonz=c3=a1lez?=) Date: Mon, 11 Dec 2017 17:05:17 +0100 Subject: [PVE-User] sparse and compression In-Reply-To: <20171211153757.fe7pgkh3yndob2wz@nora.maurer-it.com> References: <26aaa72a-458d-8c63-c3c9-16830a96e0c3@yahoo.es> <20171211131731.7ixb7wpq6tly3khp@nora.maurer-it.com> <20171211153757.fe7pgkh3yndob2wz@nora.maurer-it.com> Message-ID: > $ zfs list testpool -r -o name,used,referenced,volsize > NAME USED REFER VOLSIZE > testpool 14.3G 140K - > testpool/test 14.3G 14.3G 10G Mine is: zfs list rpool -r -o name,used,referenced,volsize NAME USED REFER VOLSIZE rpool 207G 104K - rpool/ROOT 6.10G 96K - rpool/ROOT/pve-1 6.10G 6.10G - rpool/data 197G 96K - rpool/data/vm-100-disk-1 108G 108G 138G rpool/data/vm-102-disk-1 37.1G 21.3G 36G rpool/data/vm-102-disk-2 51.6G 31.7G 50G rpool/swap 4.25G 1.25G 4G How can I fix this with the minimum downtime? In the three disks I have more than 15 Gb free. Thanks, Miguel --- This email has been checked for viruses by AVG. http://www.avg.com From e.kasper at proxmox.com Mon Dec 11 17:14:37 2017 From: e.kasper at proxmox.com (Emmanuel Kasper) Date: Mon, 11 Dec 2017 17:14:37 +0100 Subject: [PVE-User] pveproxy dying, node unusable In-Reply-To: <4e32b3c0-ddd5-6579-f521-775c22015e05@gmail.com> References: <9a8556df-7f6c-255d-1d9e-0ad4619f5f11@gmail.com> <4e32b3c0-ddd5-6579-f521-775c22015e05@gmail.com> Message-ID: <5e86f27f-4d69-fc6f-8b5c-ec80f94e74ac@proxmox.com> On 12/11/2017 04:50 PM, Lindsay Mathieson wrote: > Also I was unable to connect to the VM's on those nodes, not even via RDP > > On 12/12/2017 1:46 AM, Lindsay Mathieson wrote: >> >> I dist-upraded two nodes yesterday. Now both those nodes have multiple >> unkilliable pveproxy processes. dmesg has many entries of: >> >> ??? [50996.416909] INFO: task pveproxy:6798 blocked for more than 120 >> ??? seconds. >> ??? [50996.416914]?????? Tainted: P?????????? O 4.4.95-1-pve #1 >> ??? [50996.416918] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> ??? disables this message. >> ??? [50996.416922] pveproxy??????? D ffff8809194e3df8 0? 6798????? 1 >> ??? 0x00000004 >> ??? [50996.416925]? ffff8809194e3df8 ffff880ff6f5ed80 ffff880ff84fe200 >> ??? ffff880fded5e200 >> ??? [50996.416927]? ffff8809194e4000 ffff880fc7fb43ac ffff880fded5e200 >> ??? 00000000ffffffff >> ??? [50996.416929]? ffff880fc7fb43b0 ffff8809194e3e10 ffffffff818643b5 >> ??? ffff880fc7fb43a8 >> >> >> qm list hangs >> >> Node vms do not respond in web gui >> >> The node I did not upgrade is fine. Hi Lindsay As a quick check, is the cluster file system mounted on /etc/pve and can you read files there normally ( ie cat /etc/pve/datacenter.cfg working ) ? Are the node storages returning their status properly ? 
(ie pvesm status does not hang) From lindsay.mathieson at gmail.com Mon Dec 11 17:18:42 2017 From: lindsay.mathieson at gmail.com (Lindsay Mathieson) Date: Tue, 12 Dec 2017 02:18:42 +1000 Subject: [PVE-User] pveproxy dying, node unusable In-Reply-To: <5e86f27f-4d69-fc6f-8b5c-ec80f94e74ac@proxmox.com> References: <9a8556df-7f6c-255d-1d9e-0ad4619f5f11@gmail.com> <4e32b3c0-ddd5-6579-f521-775c22015e05@gmail.com> <5e86f27f-4d69-fc6f-8b5c-ec80f94e74ac@proxmox.com> Message-ID: <9e4acabd-ede0-2bf7-ce78-0eb2980ad92d@gmail.com> On 12/12/2017 2:14 AM, Emmanuel Kasper wrote: > Hi Lindsay > As a quick check, is the cluster file system mounted on /etc/pve and can > you read files there normally ( ie cat /etc/pve/datacenter.cfg working ) ? Unfortunately I hard reset both nodes as I needed them up. But a pvecm status showed that quorum was ok and the nodes were marked green in the web gui. /etc/pve was mounted and accessible on the unaffected node. > > Are the node storages returning their status properly ? > (ie pvesm status does not hang) Yes they were (pvesm status). nb. Both nodes are running ok after a reset now. thanks. -- Lindsay Mathieson From davel at upilab.com Wed Dec 13 14:34:20 2017 From: davel at upilab.com (David Lawley) Date: Wed, 13 Dec 2017 08:34:20 -0500 Subject: [PVE-User] netdata anyone? Message-ID: <61f8bd85-2659-795f-3aed-9dd791b9bfb0@upilab.com> Anyone use netdata? https://github.com/firehol/netdata pro/cons, impact on Promox if any. It help me find one issue I was having but was unsure of long term impact. Help me identify "Squeezed" packets on a nic. From daniel at linux-nerd.de Wed Dec 13 22:03:47 2017 From: daniel at linux-nerd.de (Daniel) Date: Wed, 13 Dec 2017 22:03:47 +0100 Subject: [PVE-User] netdata anyone? In-Reply-To: <61f8bd85-2659-795f-3aed-9dd791b9bfb0@upilab.com> References: <61f8bd85-2659-795f-3aed-9dd791b9bfb0@upilab.com> Message-ID: <7C10FA35-95E9-48E9-86FB-4C395381532C@linux-nerd.de> There is now problem. Proxmox is more or less a normal Debian. You can install it as in the docu described. Cheers Daniel Am 13.12.17, 14:35 schrieb "pve-user im Auftrag von David Lawley" : Anyone use netdata? https://github.com/firehol/netdata pro/cons, impact on Promox if any. It help me find one issue I was having but was unsure of long term impact. Help me identify "Squeezed" packets on a nic. _______________________________________________ pve-user mailing list pve-user at pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user From m at plus-plus.su Mon Dec 18 18:57:23 2017 From: m at plus-plus.su (Mikhail) Date: Mon, 18 Dec 2017 20:57:23 +0300 Subject: [PVE-User] Failure to install latest PVE on Debian Stretch Message-ID: Hello, Following this official wiki instruction: https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_Stretch (done that procedure several times already) I'm having problem with new server. It is running clean install of Debian Stretch and I'm trying to put Proxmox using PVE packages install as described in wiki. I'm not sure whether the problem is related to something misconfigured on Stretch itself, or the problem is somewhere with PVE packages, but my installation fails on the following: Setting up pve-firewall (3.0-5) ... Created symlink /etc/systemd/system/multi-user.target.wants/pve-firewall.service ? /lib/systemd/system/pve-firewall.service. insserv: Service pve-cluster has to be enabled to start service pvefw-logger insserv: exiting now! 
update-rc.d: error: insserv rejected the script header dpkg: error processing package pve-firewall (--configure): subprocess installed post-installation script returned error exit status 1 dpkg: dependency problems prevent configuration of qemu-server: qemu-server depends on pve-firewall; however: Package pve-firewall is not configured yet. dpkg: error processing package qemu-server (--configure): dependency problems - leaving unconfigured dpkg: dependency problems prevent configuration of proxmox-ve: proxmox-ve depends on qemu-server; however: Package qemu-server is not configured yet. dpkg: error processing package proxmox-ve (--configure): dependency problems - leaving unconfigured dpkg: dependency problems prevent configuration of pve-manager: pve-manager depends on pve-firewall; however: Package pve-firewall is not configured yet. pve-manager depends on qemu-server (>= 1.1-1); however: Package qemu-server is not configured yet. dpkg: error processing package pve-manager (--configure): dependency problems - leaving unconfigured dpkg: dependency problems prevent configuration of pve-ha-manager: pve-ha-manager depends on qemu-server; however: Package qemu-server is not configured yet. dpkg: error processing package pve-ha-manager (--configure): dependency problems - leaving unconfigured dpkg: dependency problems prevent configuration of pve-container: pve-container depends on pve-ha-manager; however: Package pve-ha-manager is not configured yet. dpkg: error processing package pve-container (--configure): dependency problems - leaving unconfigured Processing triggers for initramfs-tools (0.130) ... update-initramfs: Generating /boot/initrd.img-4.13.13-1-pve I: The initramfs will attempt to resume from /dev/md0 I: (UUID=25b05adb-f12d-40d0-8c68-1bf28e25e9ba) I: Set the RESUME variable to override this. Processing triggers for libc-bin (2.24-11+deb9u1) ... Processing triggers for systemd (232-25+deb9u1) ... Errors were encountered while processing: pve-firewall qemu-server proxmox-ve pve-manager pve-ha-manager pve-container E: Sub-process /usr/bin/dpkg returned an error code (1) root at pve /etc/apt/sources.list.d # I have tried everything, "apt-get -f install", dpkg-reconfigure all PVE packages, remove proxmox-ve postfix open-iscsi packages and doing install again - always failing with the same error. Has anyone else experienced something similar? This has never failed on my history before. Thanks. From dietmar at proxmox.com Mon Dec 18 21:06:51 2017 From: dietmar at proxmox.com (Dietmar Maurer) Date: Mon, 18 Dec 2017 21:06:51 +0100 (CET) Subject: [PVE-User] Failure to install latest PVE on Debian Stretch In-Reply-To: References: Message-ID: <1370055586.64.1513627612111@webmail.proxmox.com> > I'm having problem with new server. It is running clean install of > Debian Stretch and I'm trying to put Proxmox using PVE packages install > as described in wiki. I'm not sure whether the problem is related to > something misconfigured on Stretch itself, or the problem is somewhere > with PVE packages, but my installation fails on the following: > > Setting up pve-firewall (3.0-5) ... > Created symlink > /etc/systemd/system/multi-user.target.wants/pve-firewall.service ? > /lib/systemd/system/pve-firewall.service. > insserv: Service pve-cluster has to be enabled to start service pvefw-logger > insserv: exiting now! We do not support insserv based systems anymore - please use systemd instead. 
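A quick way to check Dietmar's point on an affected host is to confirm that systemd really is PID 1 and that the PVE units are managed through systemctl rather than update-rc.d/insserv. A minimal sketch (standard Debian Stretch tooling assumed; unit names as shipped by the pve-cluster and pve-firewall packages):

    # should print "systemd"; "init" means the host is still booting sysvinit
    ps -p 1 -o comm=

    # the PVE units should show up and be enabled here
    systemctl list-unit-files | grep -E 'pve-(cluster|firewall)'
    systemctl enable pve-cluster
    systemctl start pve-cluster

If PID 1 is not systemd, the package postinst scripts will keep falling back to insserv and fail the same way, so the init system has to be switched first.
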
From davel at upilab.com Mon Dec 18 21:19:38 2017 From: davel at upilab.com (David Lawley) Date: Mon, 18 Dec 2017 15:19:38 -0500 Subject: [PVE-User] sysctl tuning 5.1 Message-ID: Been working on tuning for a 10g network on PV 5.1 These examples for my sysctl.conf file give me errors and I never can really seem to get any to stick net.core.rmem_max=8388608 net.core.wmem_max=8388608 net.core.rmem_default=65536 net.core.wmem_default=65536 net.ipv4.tcp_rmem="4096 87380 8388608" net.ipv4.tcp_wmem="4096 65536 8388608" net.ipv4.tcp_mem="8388608 8388608 8388608" When sysctl -p is ran I get sysctl: setting key "net.ipv4.tcp_rmem": Invalid argument net.ipv4.tcp_rmem = "4096 87380 8388608" sysctl: setting key "net.ipv4.tcp_wmem": Invalid argument net.ipv4.tcp_wmem = "4096 65536 8388608" sysctl: setting key "net.ipv4.tcp_mem": Invalid argument net.ipv4.tcp_mem = "8388608 8388608 8388608" Guess I'm trying to understand what I am missing doing it manually via cli, seems to work. root at pve:/etc# sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608" net.ipv4.tcp_rmem = 4096 87380 8388608 root at pve:/etc# From olivier.benghozi at wifirst.fr Mon Dec 18 22:49:17 2017 From: olivier.benghozi at wifirst.fr (Olivier Benghozi) Date: Mon, 18 Dec 2017 22:49:17 +0100 Subject: [PVE-User] sysctl tuning 5.1 In-Reply-To: References: Message-ID: <677240E5-98F9-44D9-826E-B30C0AB1EA3F@wifirst.fr> Remove the double quotes. > On 18 dec. 2017 at 21:19, David Lawley wrote : > > sysctl: setting key "net.ipv4.tcp_rmem": Invalid argument > net.ipv4.tcp_rmem = "4096 87380 8388608" From IMMO.WETZEL at adtran.com Tue Dec 19 11:29:33 2017 From: IMMO.WETZEL at adtran.com (IMMO WETZEL) Date: Tue, 19 Dec 2017 10:29:33 +0000 Subject: [PVE-User] network restart Message-ID: Hi, PVE 4.4 we observed a few times network card outages. The only way was a network card driver reload. But this leads to a destroyed network setup. A service restart networking.service doenst solved this. It looks like the full network restart isn't done with that. Once I tried to recover the network manually with all the steps which should be done automatically via ip ... It's a hard way if there are 40VMs and two or three different network connections per vm existing. Is there a common tool supported way to bring back all network connections ? Immo Wetzel ADTRAN GmbH Siemensallee 1 17489 Greifswald Germany Phone: +49 3834 5352 823 Mobile: +49 151 147 29 225 Immo.Wetzel at Adtran.com PGP-Fingerprint: 7313 7E88 4E19 AACF 45E9 E74D EFF7 0480 F4CF 6426 http://www.adtran.com Sitz der Gesellschaft: Berlin / Registered office: Berlin Registergericht: Berlin / Commercial registry: Amtsgericht Charlottenburg, HRB 135656 B Gesch?ftsf?hrung / Managing Directors: Roger Shannon, James D. Wilson, Jr., Dr. Eduard Scheiterer From mark at tuxis.nl Tue Dec 19 11:34:54 2017 From: mark at tuxis.nl (Mark Schouten) Date: Tue, 19 Dec 2017 11:34:54 +0100 Subject: [PVE-User] network restart In-Reply-To: References: Message-ID: <1780533.Pdf45Cfqpt@tuxis> Hi, On dinsdag 19 december 2017 10:29:33 CET IMMO WETZEL wrote: > PVE 4.4 > we observed a few times network card outages. The only way was a network > card driver reload. But this leads to a destroyed network setup. A service Sounds like spanning tree, or something like that. What can you do with ip to fix it? -- Kerio Operator in de Cloud? 
https://www.kerioindecloud.nl/ Mark Schouten | Tuxis Internet Engineering KvK: 61527076 | http://www.tuxis.nl/ T: 0318 200208 | info at tuxis.nl -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part. URL: From IMMO.WETZEL at adtran.com Tue Dec 19 12:13:30 2017 From: IMMO.WETZEL at adtran.com (IMMO WETZEL) Date: Tue, 19 Dec 2017 11:13:30 +0000 Subject: [PVE-User] network restart In-Reply-To: <1780533.Pdf45Cfqpt@tuxis> References: , <1780533.Pdf45Cfqpt@tuxis> Message-ID: No spanning tree configured in the whole network Sent from Mobile -------- Original message -------- From: Mark Schouten Date:19/12/2017 12:09 (GMT+01:00) To: PVE User List Subject: Re: [PVE-User] network restart Hi, On dinsdag 19 december 2017 10:29:33 CET IMMO WETZEL wrote: > PVE 4.4 > we observed a few times network card outages. The only way was a network > card driver reload. But this leads to a destroyed network setup. A service Sounds like spanning tree, or something like that. What can you do with ip to fix it? -- Kerio Operator in de Cloud? https://www.kerioindecloud.nl/ Mark Schouten | Tuxis Internet Engineering KvK: 61527076 | http://www.tuxis.nl/ T: 0318 200208 | info at tuxis.nl From davel at upilab.com Tue Dec 19 12:17:43 2017 From: davel at upilab.com (David Lawley) Date: Tue, 19 Dec 2017 06:17:43 -0500 Subject: [PVE-User] sysctl tuning 5.1 In-Reply-To: <677240E5-98F9-44D9-826E-B30C0AB1EA3F@wifirst.fr> References: <677240E5-98F9-44D9-826E-B30C0AB1EA3F@wifirst.fr> Message-ID: <9d1e334c-ea2a-851e-a720-207c44fc1ade@upilab.com> Gives a different error, but I did try it too. Guessing these are not turntable yet in the PVE kernel yet? root at pve:/etc# sysctl -w net.ipv4.tcp_rmem=4096 87380 8388608 net.ipv4.tcp_rmem = 4096 sysctl: "87380" must be of the form name=value sysctl: "8388608" must be of the form name=value On 12/18/2017 4:49 PM, Olivier Benghozi wrote: > Remove the double quotes. > >> On 18 dec. 2017 at 21:19, David Lawley wrote : >> >> sysctl: setting key "net.ipv4.tcp_rmem": Invalid argument >> net.ipv4.tcp_rmem = "4096 87380 8388608" > > _______________________________________________ > pve-user mailing list > pve-user at pve.proxmox.com > https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user > From olivier.benghozi at wifirst.fr Tue Dec 19 12:24:13 2017 From: olivier.benghozi at wifirst.fr (Olivier Benghozi) Date: Tue, 19 Dec 2017 12:24:13 +0100 Subject: [PVE-User] sysctl tuning 5.1 In-Reply-To: <9d1e334c-ea2a-851e-a720-207c44fc1ade@upilab.com> References: <677240E5-98F9-44D9-826E-B30C0AB1EA3F@wifirst.fr> <9d1e334c-ea2a-851e-a720-207c44fc1ade@upilab.com> Message-ID: In your interactive shell you need double quotes. In the .conf file you need to remove the double quotes and leave a space behind and after the equal sign. > Le 19 d?c. 2017 ? 12:17, David Lawley a ?crit : > > Gives a different error, but I did try it too. Guessing these are not turntable yet in the PVE kernel yet? > > root at pve:/etc# sysctl -w net.ipv4.tcp_rmem=4096 87380 8388608 > net.ipv4.tcp_rmem = 4096 > sysctl: "87380" must be of the form name=value > sysctl: "8388608" must be of the form name=value > > > On 12/18/2017 4:49 PM, Olivier Benghozi wrote: >> Remove the double quotes. >>> On 18 dec. 
2017 at 21:19, David Lawley wrote : >>> >>> sysctl: setting key "net.ipv4.tcp_rmem": Invalid argument >>> net.ipv4.tcp_rmem = "4096 87380 8388608" >> _______________________________________________ >> pve-user mailing list >> pve-user at pve.proxmox.com >> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user > > _______________________________________________ > pve-user mailing list > pve-user at pve.proxmox.com > https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user From infolist at schwarz-fr.net Tue Dec 19 12:28:58 2017 From: infolist at schwarz-fr.net (Phil Schwarz) Date: Tue, 19 Dec 2017 12:28:58 +0100 Subject: [PVE-User] Ceph over IP over Infiniband Message-ID: <8647ac76-d5bb-1068-8c6c-a97b29de00c2@schwarz-fr.net> Hi, I'm currently trying to set up a brand new home cluster : - 5 nodes, with each : - 1 HCA Mellanox ConnectX-2 - 1 GB Ethernet (Proxmox 5.1 Network Admin) - 1 CX4 to CX4 cable All together connected to a SDR Flextronics IB Switch. This setup should back a Ceph Luminous (V12.2.2 included in proxmox V5.1) On all nodes, I did: - apt-get infiniband-diags - modprobe mlx4_ib - modprobe ib_ipoib - modprobe ib_umad - ifconfig ib0 IP/MASK On two nodes (tried previously on a single on, same issue), i installed opensm ( The switch doesn't have SM included) : apt-get install opensm /etc/init.d/opensm stop /etc/init.d/opensm start (Necessary to let the daemon create the logfiles) I tailed the logfile and got a "Active&Running" Setup, with "SUBNET UP" Every node is OK regardless to IB Setup : - All ib0 are UP, using ibstat - ibhosts and ibswitches seem to be OK On a node : ibping -S On every other node : ibping -G GID_Of_Previous_Server_Port I got a nice pong reply on every node. Should be happy, but... But i never went further.. Tried to ping each other. No way to get into this (mostly probably) simple issue... Any hint to achieve this task ?? Thanks for all Best regards From davel at upilab.com Tue Dec 19 12:31:45 2017 From: davel at upilab.com (David Lawley) Date: Tue, 19 Dec 2017 06:31:45 -0500 Subject: [PVE-User] sysctl tuning 5.1 In-Reply-To: References: <677240E5-98F9-44D9-826E-B30C0AB1EA3F@wifirst.fr> <9d1e334c-ea2a-851e-a720-207c44fc1ade@upilab.com> Message-ID: Bingo, thanks!! Sometimes you can read too much! On 12/19/2017 6:24 AM, Olivier Benghozi wrote: > In your interactive shell you need double quotes. > In the .conf file you need to remove the double quotes and leave a space behind and after the equal sign. > >> Le 19 d?c. 2017 ? 12:17, David Lawley a ?crit : >> >> Gives a different error, but I did try it too. Guessing these are not turntable yet in the PVE kernel yet? >> >> root at pve:/etc# sysctl -w net.ipv4.tcp_rmem=4096 87380 8388608 >> net.ipv4.tcp_rmem = 4096 >> sysctl: "87380" must be of the form name=value >> sysctl: "8388608" must be of the form name=value >> >> >> On 12/18/2017 4:49 PM, Olivier Benghozi wrote: >>> Remove the double quotes. >>>> On 18 dec. 
2017 at 21:19, David Lawley wrote : >>>> >>>> sysctl: setting key "net.ipv4.tcp_rmem": Invalid argument >>>> net.ipv4.tcp_rmem = "4096 87380 8388608" >>> _______________________________________________ >>> pve-user mailing list >>> pve-user at pve.proxmox.com >>> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user >> >> _______________________________________________ >> pve-user mailing list >> pve-user at pve.proxmox.com >> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user > > _______________________________________________ > pve-user mailing list > pve-user at pve.proxmox.com > https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user > From gilberto.nunes32 at gmail.com Tue Dec 19 12:45:03 2017 From: gilberto.nunes32 at gmail.com (Gilberto Nunes) Date: Tue, 19 Dec 2017 09:45:03 -0200 Subject: [PVE-User] network restart In-Reply-To: References: Message-ID: Hi Can you tell us about your hardware? Mainly the network card and switches..... --- Gilberto Ferreira (47) 3025-5907 (47) 99676-7530 Skype: gilberto.nunes36 2017-12-19 8:29 GMT-02:00 IMMO WETZEL : > Hi, > > PVE 4.4 > we observed a few times network card outages. The only way was a network > card driver reload. But this leads to a destroyed network setup. > A service restart networking.service doenst solved this. It looks like the > full network restart isn't done with that. > Once I tried to recover the network manually with all the steps which > should be done automatically via ip ... > It's a hard way if there are 40VMs and two or three different network > connections per vm existing. > Is there a common tool supported way to bring back all network connections > ? > > Immo Wetzel > > ADTRAN GmbH > Siemensallee 1 > 17489 Greifswald > Germany > > Phone: +49 3834 5352 823 > Mobile: +49 151 147 29 225 > Immo.Wetzel at Adtran.com PGP-Fingerprint: > 7313 7E88 4E19 AACF 45E9 E74D EFF7 0480 F4CF 6426 > http://www.adtran.com > > Sitz der Gesellschaft: Berlin / Registered office: Berlin > Registergericht: Berlin / Commercial registry: Amtsgericht Charlottenburg, > HRB 135656 B > Gesch?ftsf?hrung / Managing Directors: Roger Shannon, James D. Wilson, > Jr., Dr. Eduard Scheiterer > > _______________________________________________ > pve-user mailing list > pve-user at pve.proxmox.com > https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user > From gilberto.nunes32 at gmail.com Tue Dec 19 13:16:30 2017 From: gilberto.nunes32 at gmail.com (Gilberto Nunes) Date: Tue, 19 Dec 2017 10:16:30 -0200 Subject: [PVE-User] network restart In-Reply-To: References: Message-ID: Oh And if you can send the output of dmesg, just when the problem occur! Something like dmesg | tail, to see the lastest message from the kernel log --- Gilberto Ferreira (47) 3025-5907 (47) 99676-7530 Skype: gilberto.nunes36 2017-12-19 9:45 GMT-02:00 Gilberto Nunes : > Hi > > Can you tell us about your hardware? Mainly the network card and > switches..... > > > > --- > Gilberto Ferreira > > (47) 3025-5907 > (47) 99676-7530 > > Skype: gilberto.nunes36 > > > > > 2017-12-19 8:29 GMT-02:00 IMMO WETZEL : > >> Hi, >> >> PVE 4.4 >> we observed a few times network card outages. The only way was a network >> card driver reload. But this leads to a destroyed network setup. >> A service restart networking.service doenst solved this. It looks like >> the full network restart isn't done with that. >> Once I tried to recover the network manually with all the steps which >> should be done automatically via ip ... 
>> It's a hard way if there are 40VMs and two or three different network >> connections per vm existing. >> Is there a common tool supported way to bring back all network >> connections ? >> >> Immo Wetzel >> >> ADTRAN GmbH >> Siemensallee 1 >> 17489 Greifswald >> Germany >> >> Phone: +49 3834 5352 823 >> Mobile: +49 151 147 29 225 >> Immo.Wetzel at Adtran.com PGP-Fingerprint: >> 7313 7E88 4E19 AACF 45E9 E74D EFF7 0480 F4CF 6426 >> http://www.adtran.com >> >> Sitz der Gesellschaft: Berlin / Registered office: Berlin >> Registergericht: Berlin / Commercial registry: Amtsgericht >> Charlottenburg, HRB 135656 B >> Gesch?ftsf?hrung / Managing Directors: Roger Shannon, James D. Wilson, >> Jr., Dr. Eduard Scheiterer >> >> _______________________________________________ >> pve-user mailing list >> pve-user at pve.proxmox.com >> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user >> > > From lindsay.mathieson at gmail.com Tue Dec 19 15:41:12 2017 From: lindsay.mathieson at gmail.com (Lindsay Mathieson) Date: Wed, 20 Dec 2017 00:41:12 +1000 Subject: [PVE-User] pveproxy dying, node unusable In-Reply-To: <5e86f27f-4d69-fc6f-8b5c-ec80f94e74ac@proxmox.com> References: <9a8556df-7f6c-255d-1d9e-0ad4619f5f11@gmail.com> <4e32b3c0-ddd5-6579-f521-775c22015e05@gmail.com> <5e86f27f-4d69-fc6f-8b5c-ec80f94e74ac@proxmox.com> Message-ID: On 12/12/2017 2:14 AM, Emmanuel Kasper wrote: > Hi Lindsay > As a quick check, is the cluster file system mounted on /etc/pve and can > you read files there normally ( ie cat /etc/pve/datacenter.cfg working ) ? > > Are the node storages returning their status properly ? > (ie pvesm status does not hang) Just had this exact same behaviour. multiple unkillable pveproxy processes with the timeout errors in dmesg. Only for the two nodes I upgraded. - cluster file system is fine - pvesm returns all storage ok. - pvecm status is normal - qm list and qm migrate just hang. - can't connect to the webgui on the two ndoes in question. Having to hard reset them as I need them usable again before work starts. -- Lindsay Mathieson From tobias.guth at ecos.de Tue Dec 19 16:24:57 2017 From: tobias.guth at ecos.de (Tobias Guth - ECOS Technology) Date: Tue, 19 Dec 2017 16:24:57 +0100 (CET) Subject: [PVE-User] pveceph dmcrypt Support Message-ID: <006601d378dd$7beb2570$73c17050$@ecos.de> Hello, I was wondering if pveceph supports creation of encrypted osds ? There is nothing in the official documentation mentioning anything about it ? Besides I did not find any information for future releases. It would be nice to have an ceph cluster setup by proxmox, but for production use my requirement is encryption of the osd devices ! Regards Tobi From m at plus-plus.su Tue Dec 19 16:59:57 2017 From: m at plus-plus.su (Mikhail) Date: Tue, 19 Dec 2017 18:59:57 +0300 Subject: [PVE-User] Failure to install latest PVE on Debian Stretch In-Reply-To: <1370055586.64.1513627612111@webmail.proxmox.com> References: <1370055586.64.1513627612111@webmail.proxmox.com> Message-ID: <5b619800-5841-4aaa-a94d-8b4885740c16@plus-plus.su> >> insserv: Service pve-cluster has to be enabled to start service pvefw-logger >> insserv: exiting now! > > We do not support insserv based systems anymore - please use systemd instead. > Thanks for pointing! I removed insserv and reinstalled systemd on a running system and then was able to fix PVE packages issue - it now runs as usual. Cheers. 
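The fix described above presumably boils down to something like the following - the exact package set is an assumption, so review what apt proposes to remove before confirming:

    # drop the sysvinit/insserv tooling and make sure the systemd init is in place
    apt-get purge insserv
    apt-get install --reinstall systemd systemd-sysv

    # finish configuring the PVE packages that were left half-installed
    dpkg --configure -a
    apt-get -f install

A reboot afterwards is a good idea so the node comes up cleanly under systemd before joining or serving anything.
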
From lindsay.mathieson at gmail.com Wed Dec 20 01:13:59 2017 From: lindsay.mathieson at gmail.com (Lindsay Mathieson) Date: Wed, 20 Dec 2017 10:13:59 +1000 Subject: [PVE-User] pveproxy dying, node unusable In-Reply-To: References: <9a8556df-7f6c-255d-1d9e-0ad4619f5f11@gmail.com> <4e32b3c0-ddd5-6579-f521-775c22015e05@gmail.com> <5e86f27f-4d69-fc6f-8b5c-ec80f94e74ac@proxmox.com> Message-ID: On 20/12/2017 12:41 AM, Lindsay Mathieson wrote: > Having to hard reset them as I need them usable again before work starts. And pveproxy hung on both nodes again this morning, this is becoming quite a problem for us. [21360.917460] INFO: task pveproxy:18122 blocked for more than 120 seconds. [21360.917465]?????? Tainted: P?????????? O??? 4.4.95-1-pve #1 [21360.917469] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21360.917473] pveproxy??????? D ffff8807799cbdf8???? 0 18122????? 1 0x00000004 [21360.917476]? ffff8807799cbdf8 ffff880ff114a840 ffff880ff84fc600 ffff880fd9979c00 [21360.917478]? ffff8807799cc000 ffff880fc30143ac ffff880fd9979c00 00000000ffffffff [21360.917480]? ffff880fc30143b0 ffff8807799cbe10 ffffffff818643b5 ffff880fc30143a8 [21360.917482] Call Trace: [21360.917485]? [] schedule+0x35/0x80 [21360.917487]? [] schedule_preempt_disabled+0xe/0x10 [21360.917489]? [] __mutex_lock_slowpath+0xb9/0x130 [21360.917491]? [] mutex_lock+0x1f/0x30 [21360.917493]? [] filename_create+0x7a/0x160 [21360.917495]? [] SyS_mkdir+0x53/0x100 [21360.917497]? [] entry_SYSCALL_64_fastpath+0x16/0x75 Is it possible to rollback the last update? -- Lindsay Mathieson From lindsay.mathieson at gmail.com Wed Dec 20 01:19:01 2017 From: lindsay.mathieson at gmail.com (Lindsay Mathieson) Date: Wed, 20 Dec 2017 10:19:01 +1000 Subject: [PVE-User] pveproxy dying, node unusable In-Reply-To: References: <9a8556df-7f6c-255d-1d9e-0ad4619f5f11@gmail.com> <4e32b3c0-ddd5-6579-f521-775c22015e05@gmail.com> <5e86f27f-4d69-fc6f-8b5c-ec80f94e74ac@proxmox.com> Message-ID: nb. This is with Proxmox 4 On 20/12/2017 10:13 AM, Lindsay Mathieson wrote: > On 20/12/2017 12:41 AM, Lindsay Mathieson wrote: >> Having to hard reset them as I need them usable again before work >> starts. > > And pveproxy hung on both nodes again this morning, this is becoming > quite a problem for us. > > > [21360.917460] INFO: task pveproxy:18122 blocked for more than 120 > seconds. > [21360.917465]?????? Tainted: P?????????? O??? 4.4.95-1-pve #1 > [21360.917469] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [21360.917473] pveproxy??????? D ffff8807799cbdf8???? 0 18122 1 > 0x00000004 > [21360.917476]? ffff8807799cbdf8 ffff880ff114a840 ffff880ff84fc600 > ffff880fd9979c00 > [21360.917478]? ffff8807799cc000 ffff880fc30143ac ffff880fd9979c00 > 00000000ffffffff > [21360.917480]? ffff880fc30143b0 ffff8807799cbe10 ffffffff818643b5 > ffff880fc30143a8 > [21360.917482] Call Trace: > [21360.917485]? [] schedule+0x35/0x80 > [21360.917487]? [] schedule_preempt_disabled+0xe/0x10 > [21360.917489]? [] __mutex_lock_slowpath+0xb9/0x130 > [21360.917491]? [] mutex_lock+0x1f/0x30 > [21360.917493]? [] filename_create+0x7a/0x160 > [21360.917495]? [] SyS_mkdir+0x53/0x100 > [21360.917497]? [] entry_SYSCALL_64_fastpath+0x16/0x75 > > > Is it possible to rollback the last update? 
> -- Lindsay Mathieson From lindsay.mathieson at gmail.com Wed Dec 20 01:33:48 2017 From: lindsay.mathieson at gmail.com (Lindsay Mathieson) Date: Wed, 20 Dec 2017 10:33:48 +1000 Subject: [PVE-User] pveproxy dying, node unusable In-Reply-To: References: <9a8556df-7f6c-255d-1d9e-0ad4619f5f11@gmail.com> <4e32b3c0-ddd5-6579-f521-775c22015e05@gmail.com> <5e86f27f-4d69-fc6f-8b5c-ec80f94e74ac@proxmox.com> Message-ID: <6e69b36a-9cfb-62f4-7501-6026b415d796@gmail.com> On 20/12/2017 10:13 AM, Lindsay Mathieson wrote: > On 20/12/2017 12:41 AM, Lindsay Mathieson wrote: >> Having to hard reset them as I need them usable again before work >> starts. > > And pveproxy hung on both nodes again this morning, this is becoming > quite a problem for us. > ?systemctl status pveproxy ? pveproxy.service - PVE API Proxy Server ?? Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled) ?? Active: failed (Result: timeout) since Wed 2017-12-20 06:49:06 AEST; 3h 44min ago ?Main PID: 4325 (code=exited, status=0/SUCCESS) Dec 20 06:46:06 vng systemd[1]: pveproxy.service start operation timed out. Terminating. Dec 20 06:47:36 vng systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing. Dec 20 06:49:06 vng systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode. Dec 20 06:49:06 vng systemd[1]: Failed to start PVE API Proxy Server. Dec 20 06:49:06 vng systemd[1]: Unit pveproxy.service entered failed state. -- Lindsay Mathieson From f.gruenbichler at proxmox.com Wed Dec 20 08:17:15 2017 From: f.gruenbichler at proxmox.com (Fabian =?iso-8859-1?Q?Gr=FCnbichler?=) Date: Wed, 20 Dec 2017 08:17:15 +0100 Subject: [PVE-User] pveceph dmcrypt Support In-Reply-To: <006601d378dd$7beb2570$73c17050$@ecos.de> References: <006601d378dd$7beb2570$73c17050$@ecos.de> Message-ID: <20171220071715.gcliw2yfzmkr2uzu@nora.maurer-it.com> On Tue, Dec 19, 2017 at 04:24:57PM +0100, Tobias Guth - ECOS Technology wrote: > Hello, > > I was wondering if pveceph supports creation of encrypted osds ? > > There is nothing in the official documentation mentioning anything about > it ? Besides I did not find any information for future releases. > > It would be nice to have an ceph cluster setup by proxmox, but for > production use my requirement is encryption of the osd devices ! > > > > Regards > > Tobi no, it does not (currently / yet). but you should be able to set them up manually, and all of the other pveceph integration stuff should still work (except for destroying the OSDs, which assumes a regular unencrypted GPT / ceph-disk setup). we might re-visit this when looking at ceph-volume integration for the upcoming Mimic release. From t.lamprecht at proxmox.com Wed Dec 20 10:22:52 2017 From: t.lamprecht at proxmox.com (Thomas Lamprecht) Date: Wed, 20 Dec 2017 10:22:52 +0100 Subject: [PVE-User] Proxmox provided container system appliances updated Message-ID: <6f0752ac-646a-b47d-0033-f9f1dcd682b0@proxmox.com> Hi, At the end of last week we updated the container system appliances, hosted on http://download.proxmox.com/images/ As previously, they are available to download through the Proxmox VE webUI storage content panel. Here a quick overview of what changed: New: * Ubuntu Artful (17.10) * Alpine Linux 3.6 * Alpine Linux 3.7 * Fedora 26 * Fedora 27 * openSUSE 42.3 Updated (point release or rolling release): * Debian Stretch (9.0 -> 9.3) * Centos 7 (04 May 2017 (7.3) -> 12 Dec. 2017 (7.4)) * Arch Linux (04 July 2017 -> 12 Dec. 2017) * gentoo (03 May 2017 -> 11 Dec. 
2017) Removed (EOL): * Fedora 24 * Alpine Linux 3.3 Note: Removals are done from the appliances index, they may be still downloaded manually, if needed. cheers, Thomas From gilberto.nunes32 at gmail.com Thu Dec 21 14:25:33 2017 From: gilberto.nunes32 at gmail.com (Gilberto Nunes) Date: Thu, 21 Dec 2017 11:25:33 -0200 Subject: [PVE-User] Proxmox 5.1-40 - LXC Templates just gone!?! Message-ID: Hi guys Where's TurnKey repos??? Cheers --- Gilberto Ferreira (47) 3025-5907 (47) 99676-7530 Skype: gilberto.nunes36 From f.gruenbichler at proxmox.com Thu Dec 21 14:34:45 2017 From: f.gruenbichler at proxmox.com (Fabian =?iso-8859-1?Q?Gr=FCnbichler?=) Date: Thu, 21 Dec 2017 14:34:45 +0100 Subject: [PVE-User] Proxmox 5.1-40 - LXC Templates just gone!?! In-Reply-To: References: Message-ID: <20171221133445.y33dslkp56h665rw@nora.maurer-it.com> On Thu, Dec 21, 2017 at 11:25:33AM -0200, Gilberto Nunes wrote: > Hi guys > > Where's TurnKey repos??? > > Cheers > maybe you need to run "pveam update" ? everything looks OK.. From gilberto.nunes32 at gmail.com Thu Dec 21 14:35:56 2017 From: gilberto.nunes32 at gmail.com (Gilberto Nunes) Date: Thu, 21 Dec 2017 11:35:56 -0200 Subject: [PVE-User] Proxmox 5.1-40 - LXC Templates just gone!?! In-Reply-To: <20171221133445.y33dslkp56h665rw@nora.maurer-it.com> References: <20171221133445.y33dslkp56h665rw@nora.maurer-it.com> Message-ID: Yes... I realize that just a second aftet send the e-mail. Sorry for that! --- Gilberto Ferreira (47) 3025-5907 (47) 99676-7530 Skype: gilberto.nunes36 2017-12-21 11:34 GMT-02:00 Fabian Gr?nbichler : > On Thu, Dec 21, 2017 at 11:25:33AM -0200, Gilberto Nunes wrote: > > Hi guys > > > > Where's TurnKey repos??? > > > > Cheers > > > > maybe you need to run "pveam update" ? everything looks OK.. > > _______________________________________________ > pve-user mailing list > pve-user at pve.proxmox.com > https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user > From tobias.guth at ecos.de Fri Dec 29 08:51:58 2017 From: tobias.guth at ecos.de (Tobias Guth - ECOS Technology) Date: Fri, 29 Dec 2017 08:51:58 +0100 (CET) Subject: [PVE-User] pveceph dmcrypt Support Message-ID: <01db01d38079$db3de0a0$91b9a1e0$@ecos.de> > no, it does not (currently / yet). ... > we might re-visit this when looking at ceph-volume integration for the > upcoming Mimic release. Thanks for your hint. I have setup Ceph within Proxmox (pveceph) and did setup encrypted OSDs with ceph-deploy on my storageboxes. Worked like charm ! Regards Tobi From gbr at majentis.com Fri Dec 29 20:48:55 2017 From: gbr at majentis.com (Gerald Brandt) Date: Fri, 29 Dec 2017 13:48:55 -0600 Subject: [PVE-User] Snapshots not showing in interface Message-ID: Hi, I have a VM with 2 snapshots. The display of snapsots for the VM is blank, so I can't delete the snapshot from there. 
This is a conf file: #Univention Corprorate Server 4.2-3 #Active Directory Domain Server #email server # #UPS Monitoring bootdisk: virtio0 cores: 4 ide2: NAS:iso/systemrescuecd-x86-4.6.0.iso,media=cdrom,size=456342K memory: 4096 name: AD-Mail net0: virtio=36:61:63:36:37:38,bridge=vmbr0 numa: 0 onboot: 1 ostype: l26 parent: update2 smbios1: uuid=2c9872f5-3e0d-4b8b-a080-9abc234d0517 sockets: 2 startup: order=1,up=60,down=30 unused0: NAS:131/vm-131-disk-2.qcow2 virtio0: NAS:131/vm-131-disk-1.qcow2,size=150G [update] bootdisk: virtio0 cores: 4 ide2: NAS:iso/systemrescuecd-x86-4.6.0.iso,media=cdrom,size=456342K memory: 4096 name: AD-Mail net0: virtio=36:61:63:36:37:38,bridge=vmbr0 numa: 0 onboot: 1 ostype: l26 parent: update2 smbios1: uuid=2c9872f5-3e0d-4b8b-a080-9abc234d0517 snaptime: 1514400926 sockets: 2 startup: order=1,up=60,down=30 virtio0: NAS:131/vm-131-disk-1.qcow2,size=150G [update2] #before 4.1.4 to 4.2.3 bootdisk: virtio0 cores: 4 ide2: NAS:iso/systemrescuecd-x86-4.6.0.iso,media=cdrom,size=456342K memory: 4096 name: AD-Mail net0: virtio=36:61:63:36:37:38,bridge=vmbr0 numa: 0 onboot: 1 ostype: l26 parent: update smbios1: uuid=2c9872f5-3e0d-4b8b-a080-9abc234d0517 snaptime: 1514404291 sockets: 2 startup: order=1,up=60,down=30 virtio0: NAS:131/vm-131-disk-1.qcow2,size=150G Any idea why the GUI is blank? Gerald From lindsay.mathieson at gmail.com Sat Dec 30 02:27:12 2017 From: lindsay.mathieson at gmail.com (Lindsay Mathieson) Date: Sat, 30 Dec 2017 11:27:12 +1000 Subject: [PVE-User] Snapshots not showing in interface In-Reply-To: References: Message-ID: <3410dd42-6718-493b-18af-7809cc5d6a10@gmail.com> On 30/12/2017 5:48 AM, Gerald Brandt wrote: > I have a VM with 2 snapshots. The display of snapsots for the VM is > blank, so I can't delete the snapshot from there. > > This is a conf file: update and update2 both have each other as a parent - circular reference. If you don't want to save the snapshots I'd delete them from the conf file and use qemu-img to delete them from the qcow2 image. Once that is done, delete the parent entry from the main part of the conf file. -- Lindsay Mathieson From jagan.p at stackuptech.com Sat Dec 30 09:07:46 2017 From: jagan.p at stackuptech.com (jagan) Date: Sat, 30 Dec 2017 13:37:46 +0530 Subject: [PVE-User] Corosync Totem Re transmit logs - Node not responding Message-ID: <9c7acf0b-544d-4663-ad72-e043a10dbcd5@stackuptech.com> Hi, I am using 2 node cluster with DRBD on PVE 3.4, i have seen huge log entries " corosync[2539]:? [TOTEM ] Retransmit List: 4b493 4b494 4b495 4b496 4b497 4b498 4b499 4b49a" in syslog & corosync log. one cluster node? is freezing frequently not responding (Monitor & keyboard not responding). 2 Nodes are running in production, need your support to resolve the issue. Thanks in advance. From dietmar at proxmox.com Sat Dec 30 09:33:11 2017 From: dietmar at proxmox.com (Dietmar Maurer) Date: Sat, 30 Dec 2017 09:33:11 +0100 (CET) Subject: [PVE-User] Snapshots not showing in interface In-Reply-To: <3410dd42-6718-493b-18af-7809cc5d6a10@gmail.com> References: <3410dd42-6718-493b-18af-7809cc5d6a10@gmail.com> Message-ID: <99258182.3.1514622792402@webmail.proxmox.com> > On December 30, 2017 at 2:27 AM Lindsay Mathieson > wrote: > > > On 30/12/2017 5:48 AM, Gerald Brandt wrote: > > I have a VM with 2 snapshots. The display of snapsots for the VM is > > blank, so I can't delete the snapshot from there. > > > > This is a conf file: > > update and update2 both have each other as a parent - circular reference. 
I wonder how that can happen - did you manually edit the config file?
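Whatever the cause, Lindsay's cleanup suggestion translates roughly into the following, assuming the VM is shut down and that "NAS" is a directory/NFS storage mounted under /mnt/pve/NAS (adjust the path to the actual storage mount, and keep a backup copy of /etc/pve/qemu-server/131.conf before editing):

    # list the internal snapshots actually present in the images
    qemu-img snapshot -l /mnt/pve/NAS/images/131/vm-131-disk-1.qcow2
    qemu-img snapshot -l /mnt/pve/NAS/images/131/vm-131-disk-2.qcow2

    # delete the stale snapshots from the disk that carries them
    qemu-img snapshot -d update  /mnt/pve/NAS/images/131/vm-131-disk-1.qcow2
    qemu-img snapshot -d update2 /mnt/pve/NAS/images/131/vm-131-disk-1.qcow2

Once qemu-img no longer lists them, removing the [update] and [update2] sections together with the "parent: update2" line from /etc/pve/qemu-server/131.conf should leave a clean config, and the snapshot panel in the GUI should match the on-disk state again.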