[PVE-User] [ceph-users] Re: Ceph Usage web and terminal.

Uwe Sauter uwe.sauter.de at gmail.com
Wed Dec 29 12:16:59 CET 2021


Just a feeling, but I'd say that the imbalance in your OSD distribution (one host having many more
disks than the rest) is your problem.

Assuming that your configuration keeps 3 copies of each VM image, the imbalance probably means
that 2 of these 3 copies reside on pve-3111. If that host is unavailable, all VM images with 2
copies on it become unresponsive, too.
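You can verify this from the quoted output below: list the placement groups of the pool and look at their acting sets (a sketch; the pool name vm.pool and the object name are taken from your output, and OSDs 12-21 are the ones on pve-3111):

```shell
# Show each PG of the pool together with its UP and ACTING OSD sets.
# PGs whose acting set contains two OSDs from the 12-21 range have
# two of their copies on pve-3111.
ceph pg ls-by-pool vm.pool

# Spot-check a single object's mapping (example taken from your output)
ceph osd map vm.pool vm.pool.object
```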

Check your failure domain for Ceph and, if necessary, change it from OSD to host. This prevents a
single host from holding multiple copies of a VM image.
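A sketch of how to check and change this with the ceph CLI (the rule name replicated_host is an assumption; the root "default" and the pool name vm.pool are taken from your output):

```shell
# Which CRUSH rule does the pool use, and what is its failure domain?
ceph osd pool get vm.pool crush_rule
ceph osd crush rule dump   # look for "type": "osd" in the chooseleaf/choose step

# Create a replicated rule that separates copies across hosts
ceph osd crush rule create-replicated replicated_host default host

# Switch the pool over; Ceph will then rebalance the misplaced PGs
ceph osd pool set vm.pool crush_rule replicated_host

# While you are at it, verify the replica counts
ceph osd pool get vm.pool size
ceph osd pool get vm.pool min_size
```

Note that the rule change will trigger rebalancing of a significant amount of data, so do it at a quiet time.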


Regards,

	Uwe

On 29.12.21 at 09:36, Сергей Цаболов wrote:
> Hello to all.
> 
> In my case I have a 7-node Proxmox cluster and a working Ceph (ceph version 15.2.15 octopus
> (stable))
> 
> Ceph HEALTH_OK
> 
> ceph -s
>   cluster:
>     id:     9662e3fa-4ce6-41df-8d74-5deaa41a8dde
>     health: HEALTH_OK
> 
>   services:
>     mon: 7 daemons, quorum pve-3105,pve-3107,pve-3108,pve-3103,pve-3101,pve-3111,pve-3109 (age 17h)
>     mgr: pve-3107(active, since 41h), standbys: pve-3109, pve-3103, pve-3105, pve-3101, pve-3111,
> pve-3108
>     mds: cephfs:1 {0=pve-3105=up:active} 6 up:standby
>     osd: 22 osds: 22 up (since 17h), 22 in (since 17h)
> 
>   task status:
> 
>   data:
>     pools:   4 pools, 1089 pgs
>     objects: 1.09M objects, 4.1 TiB
>     usage:   7.7 TiB used, 99 TiB / 106 TiB avail
>     pgs:     1089 active+clean
> 
> ---------------------------------------------------------------------------------------------------------------------
> 
> 
> ceph osd tree
> 
> ID   CLASS  WEIGHT     TYPE NAME            STATUS  REWEIGHT PRI-AFF
>  -1         106.43005  root default
> -13          14.55478      host pve-3101
>  10    hdd    7.27739          osd.10           up   1.00000 1.00000
>  11    hdd    7.27739          osd.11           up   1.00000 1.00000
> -11          14.55478      host pve-3103
>   8    hdd    7.27739          osd.8            up   1.00000 1.00000
>   9    hdd    7.27739          osd.9            up   1.00000 1.00000
>  -3          14.55478      host pve-3105
>   0    hdd    7.27739          osd.0            up   1.00000 1.00000
>   1    hdd    7.27739          osd.1            up   1.00000 1.00000
>  -5          14.55478      host pve-3107
>   2    hdd    7.27739          osd.2            up   1.00000 1.00000
>   3    hdd    7.27739          osd.3            up   1.00000 1.00000
>  -9          14.55478      host pve-3108
>   6    hdd    7.27739          osd.6            up   1.00000 1.00000
>   7    hdd    7.27739          osd.7            up   1.00000 1.00000
>  -7          14.55478      host pve-3109
>   4    hdd    7.27739          osd.4            up   1.00000 1.00000
>   5    hdd    7.27739          osd.5            up   1.00000 1.00000
> -15          19.10138      host pve-3111
>  12    hdd   10.91409          osd.12           up   1.00000 1.00000
>  13    hdd    0.90970          osd.13           up   1.00000 1.00000
>  14    hdd    0.90970          osd.14           up   1.00000 1.00000
>  15    hdd    0.90970          osd.15           up   1.00000 1.00000
>  16    hdd    0.90970          osd.16           up   1.00000 1.00000
>  17    hdd    0.90970          osd.17           up   1.00000 1.00000
>  18    hdd    0.90970          osd.18           up   1.00000 1.00000
>  19    hdd    0.90970          osd.19           up   1.00000 1.00000
>  20    hdd    0.90970          osd.20           up   1.00000 1.00000
>  21    hdd    0.90970          osd.21           up   1.00000 1.00000
> 
> ---------------------------------------------------------------------------------------------------------------
> 
> 
> POOL                               ID  PGS   STORED   OBJECTS USED     %USED  MAX AVAIL
> vm.pool                            2  1024  3.0 TiB  863.31k  6.0 TiB   6.38     44 TiB  (this pool
> holds all the VM disks)
> 
> ---------------------------------------------------------------------------------------------------------------
> 
> 
> ceph osd map vm.pool vm.pool.object
> osdmap e14319 pool 'vm.pool' (2) object 'vm.pool.object' -> pg 2.196f68d5 (2.d5) -> up ([2,4], p2)
> acting ([2,4], p2)
> 
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> pveversion -v
> proxmox-ve: 6.4-1 (running kernel: 5.4.143-1-pve)
> pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
> pve-kernel-helper: 6.4-8
> pve-kernel-5.4: 6.4-7
> pve-kernel-5.4.143-1-pve: 5.4.143-1
> pve-kernel-5.4.106-1-pve: 5.4.106-1
> ceph: 15.2.15-pve1~bpo10
> ceph-fuse: 15.2.15-pve1~bpo10
> corosync: 3.1.2-pve1
> criu: 3.11-3
> glusterfs-client: 5.5-3
> ifupdown: residual config
> ifupdown2: 3.0.0-1+pve4~bpo10
> ksm-control-daemon: 1.3-1
> libjs-extjs: 6.0.1-10
> libknet1: 1.22-pve1~bpo10+1
> libproxmox-acme-perl: 1.1.0
> libproxmox-backup-qemu0: 1.1.0-1
> libpve-access-control: 6.4-3
> libpve-apiclient-perl: 3.1-3
> libpve-common-perl: 6.4-4
> libpve-guest-common-perl: 3.1-5
> libpve-http-server-perl: 3.2-3
> libpve-storage-perl: 6.4-1
> libqb0: 1.0.5-1
> libspice-server1: 0.14.2-4~pve6+1
> lvm2: 2.03.02-pve4
> lxc-pve: 4.0.6-2
> lxcfs: 4.0.6-pve1
> novnc-pve: 1.1.0-1
> proxmox-backup-client: 1.1.13-2
> proxmox-mini-journalreader: 1.1-1
> proxmox-widget-toolkit: 2.6-1
> pve-cluster: 6.4-1
> pve-container: 3.3-6
> pve-docs: 6.4-2
> pve-edk2-firmware: 2.20200531-1
> pve-firewall: 4.1-4
> pve-firmware: 3.3-2
> pve-ha-manager: 3.1-1
> pve-i18n: 2.3-1
> pve-qemu-kvm: 5.2.0-6
> pve-xtermjs: 4.7.0-3
> qemu-server: 6.4-2
> smartmontools: 7.2-pve2
> spiceterm: 3.1-1
> vncterm: 1.6-2
> zfsutils-linux: 2.0.6-pve1~bpo10+1
> 
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> 
> And now my problem:
> 
> All VMs use a single pool (vm.pool) for their disks.
> 
> When node/host pve-3111 is shut down, many VMs on the other nodes/hosts (pve-3107, pve-3105) do
> not shut down but become unreachable on the network.
> 
> After the node/host is back up, Ceph returns to HEALTH_OK and all VMs become reachable on the
> network again (without a reboot).
> 
> Can someone suggest what I should check in Ceph?
> 
> Thanks.
> 
