[PVE-User] Ceph: Some trouble creating OSD with journal on a software raid device...
Alwin Antreich
sysadmin-pve at cognitec.com
Thu Oct 13 12:26:38 CEST 2016
Hello Marco,
On 10/13/2016 12:13 PM, Marco Gaiarin wrote:
>
> I'm a bit confused.
>
> I'm trying to create 4 OSDs on a server, where the OS resides on a
> RAID-1. On the same (couple of) disks there are 4 50MB partitions for
> the journal (the two disks are SSDs).
> Better shown with a command:
I have to ask a more general question here: why are you putting the journal on a RAID1? For better performance and less
complexity the journal should reside on standalone SSDs. The RAID1 limits the speed of the journal; in that case it
would be better to put the journal on the OSD disks themselves.
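For example, if standalone journal SSDs are not an option, simply omitting --journal_dev lets pveceph put the journal
into a second partition on the OSD disk itself (a sketch, same tool you are already using):

  pveceph createosd /dev/sda
  # without --journal_dev, ceph-disk creates the journal partition on /dev/sda itself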
>
> root at vedovanera:~# blkid
> /dev/sdf1: UUID="75103d23-83a6-9f5d-eb1e-f021e729041b" UUID_SUB="70aa73ab-585c-5df1-dfef-bbc847766504" LABEL="vedovanera:0" TYPE="linux_raid_member" PARTUUID="180187f1-01"
> /dev/sdf2: UUID="e21df9d5-3230-f991-d70d-f948704a7594" UUID_SUB="fe105e88-252a-97ab-d543-c2c2a89499d0" LABEL="vedovanera:1" TYPE="linux_raid_member" PARTUUID="180187f1-02"
> /dev/sdf5: UUID="ba35e389-814d-dc29-8818-c9e86d9d8f08" UUID_SUB="66104965-2a53-c142-43fb-1bc35f66bf41" LABEL="vedovanera:2" TYPE="linux_raid_member" PARTUUID="180187f1-05"
> /dev/sdf6: UUID="90778432-b426-51e9-b0a2-48d76ef24364" UUID_SUB="00db7dd7-f0fd-ea54-e52f-0a8725ed7866" LABEL="vedovanera:3" TYPE="linux_raid_member" PARTUUID="180187f1-06"
> /dev/sdf7: UUID="09be7173-4edc-1e14-5e06-dfdcd677943c" UUID_SUB="876a79e0-be59-6153-cede-97aefcdec849" LABEL="vedovanera:4" TYPE="linux_raid_member" PARTUUID="180187f1-07"
> /dev/sdf8: UUID="fd54393a-2969-7f9f-8e29-f4120dc4ab00" UUID_SUB="d576b4c8-dfc5-8ddd-25f2-9b0da0c7241c" LABEL="vedovanera:5" TYPE="linux_raid_member" PARTUUID="180187f1-08"
> /dev/sda: PTUUID="cf6dccb4-4f6f-472a-9f1a-5945de4f1703" PTTYPE="gpt"
> /dev/sdc: PTUUID="3ecc2e48-b12d-4cb1-add8-87f0e611b7e8" PTTYPE="gpt"
> /dev/sde1: UUID="75103d23-83a6-9f5d-eb1e-f021e729041b" UUID_SUB="ab4416c0-a715-ef87-466a-6a58096eb2b9" LABEL="vedovanera:0" TYPE="linux_raid_member" PARTUUID="03210f34-01"
> /dev/sde2: UUID="e21df9d5-3230-f991-d70d-f948704a7594" UUID_SUB="2355caea-4102-7269-38be-22779790c388" LABEL="vedovanera:1" TYPE="linux_raid_member" PARTUUID="03210f34-02"
> /dev/sde5: UUID="ba35e389-814d-dc29-8818-c9e86d9d8f08" UUID_SUB="b3211065-8c5d-3fa5-8f57-2a50ef461a34" LABEL="vedovanera:2" TYPE="linux_raid_member" PARTUUID="03210f34-05"
> /dev/sde6: UUID="90778432-b426-51e9-b0a2-48d76ef24364" UUID_SUB="296a78cf-0e97-62f6-d136-cefb9abffa3e" LABEL="vedovanera:3" TYPE="linux_raid_member" PARTUUID="03210f34-06"
> /dev/sde7: UUID="09be7173-4edc-1e14-5e06-dfdcd677943c" UUID_SUB="36667e33-d801-c114-cb59-8770b66fc98d" LABEL="vedovanera:4" TYPE="linux_raid_member" PARTUUID="03210f34-07"
> /dev/sde8: UUID="fd54393a-2969-7f9f-8e29-f4120dc4ab00" UUID_SUB="b5eac45a-2693-e2c5-3e00-2b8a33658a00" LABEL="vedovanera:5" TYPE="linux_raid_member" PARTUUID="03210f34-08"
> /dev/sdb: PTUUID="000e025c" PTTYPE="dos"
> /dev/md0: UUID="a751e134-b3ed-450c-b694-664d80f07c68" TYPE="ext4"
> /dev/sdd: PTUUID="000b1250" PTTYPE="dos"
> /dev/md1: UUID="8bd0c899-0317-4d20-a781-ff662e92b0b1" TYPE="swap"
> /dev/md2: PTUUID="a7eb14f0-d2f9-4552-8e2d-b5165e654ea8" PTTYPE="gpt"
> /dev/md3: PTUUID="ba4073c3-fab2-41e9-9612-28d28ae6468d" PTTYPE="gpt"
> /dev/md4: PTUUID="c3dfbbfa-28da-4bc8-88fd-b49785e7e212" PTTYPE="gpt"
> /dev/md5: PTUUID="c616bbf8-41f0-4e62-b77f-b0e8eeb624e2" PTTYPE="gpt"
>
> 'md0' is /, 'md1' the swap, md2-5 the cache (journal) partitions, sda-d the disks
> for the OSDs.
>
>
> Proxmox correctly sees the 4 OSD candidate disks, but does not see the
> journal partitions. So I used the command line:
pveceph is a wrapper around the ceph tools, and smartmontools is a dependency of pveceph. An mdadm device doesn't expose
SMART attributes, and this might be why the journal device isn't being shown. But this is more of a guess and should be
verified by someone who knows better.
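A quick way to see this yourself (just a rough check, assuming smartmontools is installed):

  smartctl -i /dev/md2     # md devices expose no SMART data, so this will typically fail
  smartctl -i /dev/sde     # the underlying SSD answers normally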
>
> root at vedovanera:~# pveceph createosd /dev/sda --journal_dev /dev/md2
> command '/sbin/zpool list -HPLv' failed: open3: exec of /sbin/zpool list -HPLv failed at /usr/share/perl5/PVE/Tools.pm line 409.
>
> create OSD on /dev/sda (xfs)
> using device '/dev/md2' for journal
> Caution: invalid backup GPT header, but valid main header; regenerating
> backup header from main header.
>
> ****************************************************************************
> Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
> verification and recovery are STRONGLY recommended.
> ****************************************************************************
> GPT data structures destroyed! You may now partition the disk using fdisk or
> other utilities.
> Creating new GPT entries.
> The operation has completed successfully.
> WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same device as the osd data
> Setting name!
> partNum is 0
> REALLY setting name!
> The operation has completed successfully.
> Setting name!
> partNum is 0
> REALLY setting name!
> The operation has completed successfully.
> meta-data=/dev/sda1 isize=2048 agcount=4, agsize=122094597 blks
> = sectsz=4096 attr=2, projid32bit=1
> = crc=0 finobt=0
> data = bsize=4096 blocks=488378385, imaxpct=5
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
> log =internal log bsize=4096 blocks=238466, version=2
> = sectsz=4096 sunit=1 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> Warning: The kernel is still using the old partition table.
> The new table will be used at the next reboot.
> The operation has completed successfully.
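The 'kernel is still using the old partition table' warning above may be part of the problem here: the freshly created
journal partition on /dev/md2 might not have been visible to the kernel at that point. Re-reading the partition table,
or a reboot, should take care of that, for example:

  partprobe /dev/md2               # ask the kernel to re-read the partition table
  # or, per device:
  blockdev --rereadpt /dev/md2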
>
> So it seems all went well... but the OSD does not show up on the web interface, although it seems
> to be ''counted'' (I have 2 OSDs working on another server):
>
> root at vedovanera:~# ceph -s
> cluster 8794c124-c2ec-4e81-8631-742992159bd6
> health HEALTH_WARN
> 64 pgs degraded
> 64 pgs stale
> 64 pgs stuck degraded
> 64 pgs stuck stale
> 64 pgs stuck unclean
> 64 pgs stuck undersized
> 64 pgs undersized
> noout flag(s) set
> monmap e2: 2 mons at {0=10.27.251.7:6789/0,1=10.27.251.8:6789/0}
> election epoch 6, quorum 0,1 0,1
> osdmap e29: 3 osds: 2 up, 2 in
> flags noout
> pgmap v42: 64 pgs, 1 pools, 0 bytes data, 0 objects
> 67200 kB used, 3724 GB / 3724 GB avail
> 64 stale+active+undersized+degraded
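To see where that third OSD ended up (it is counted in the osdmap, but neither up nor in), ceph osd tree is usually the
quickest check:

  ceph osd tree    # lists all OSDs with their host, weight and up/down status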
>
> The previous command also created a partition (of 5GB) on md2:
>
> root at vedovanera:~# blkid | grep md2
> /dev/md2: PTUUID="a7eb14f0-d2f9-4552-8e2d-b5165e654ea8" PTTYPE="gpt"
> /dev/md2p1: PARTLABEL="ceph journal" PARTUUID="d1ccfdb2-539e-4e6a-ad60-be100304832b"
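The 5GB is, if I'm not mistaken, simply the default 'osd journal size' (5120 MB). If you want ceph-disk to create a
bigger journal partition, set it in /etc/pve/ceph.conf before creating the OSD, for example:

  [osd]
       osd journal size = 20480    # in MB, 20 GB in this example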
>
> Now, if I destroy the OSD:
>
> root at vedovanera:~# pveceph destroyosd 2
> destroy OSD osd.2
> /etc/init.d/ceph: osd.2 not found (/etc/pve/ceph.conf defines mon.0 mon.1, /var/lib/ceph defines )
> command 'setsid service ceph -c /etc/pve/ceph.conf stop osd.2' failed: exit code 1
> Remove osd.2 from the CRUSH map
> Remove the osd.2 authentication key.
> Remove OSD osd.2
> Unmount OSD osd.2 from /var/lib/ceph/osd/ceph-2
> umount: /var/lib/ceph/osd/ceph-2: mountpoint not found
> command 'umount /var/lib/ceph/osd/ceph-2' failed: exit code 32
>
> delete the /dev/md2p1 partition, recreate it (type Linux) at 50GB, zap the sda disk,
> and redo the OSD creation, it works, with a somewhat strange ''warning'':
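Side note on 'zap the sda disk': ceph-disk has a zap subcommand for exactly that, in case you were doing it by hand
with sgdisk/dd:

  ceph-disk zap /dev/sda    # wipes the partition table and ceph signatures from the disk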
>
> root at vedovanera:~# pveceph createosd /dev/sda --journal_dev /dev/md2p1
> command '/sbin/zpool list -HPLv' failed: open3: exec of /sbin/zpool list -HPLv failed at /usr/share/perl5/PVE/Tools.pm line 409.
>
> create OSD on /dev/sda (xfs)
> using device '/dev/md2p1' for journal
> Caution: invalid backup GPT header, but valid main header; regenerating
> backup header from main header.
>
> ****************************************************************************
> Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
> verification and recovery are STRONGLY recommended.
> ****************************************************************************
> GPT data structures destroyed! You may now partition the disk using fdisk or
> other utilities.
> Creating new GPT entries.
> The operation has completed successfully.
> WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same device as the osd data
> WARNING:ceph-disk:Journal /dev/md2p1 was not prepared with ceph-disk. Symlinking directly.
> Setting name!
> partNum is 0
> REALLY setting name!
> The operation has completed successfully.
> meta-data=/dev/sda1 isize=2048 agcount=4, agsize=122094597 blks
> = sectsz=4096 attr=2, projid32bit=1
> = crc=0 finobt=0
> data = bsize=4096 blocks=488378385, imaxpct=5
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
> log =internal log bsize=4096 blocks=238466, version=2
> = sectsz=4096 sunit=1 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> Warning: The kernel is still using the old partition table.
> The new table will be used at the next reboot.
> The operation has completed successfully.
>
> Now the OSD shows up on the PVE web interface, and seems to work as expected.
>
> I've also tried to ''reformat'' the journal, e.g. stop the OSD, flush and recreate it:
>
> root at vedovanera:~# ceph-osd -i 2 --flush-journal
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> 2016-10-13 12:06:14.209250 7ffb7c596880 -1 flushed journal /var/lib/ceph/osd/ceph-2/journal for object store /var/lib/ceph/osd/ceph-2
> root at vedovanera:~# ceph-osd -i 2 --mkjournal
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> 2016-10-13 12:06:45.034323 7f774cef7880 -1 created new journal /var/lib/ceph/osd/ceph-2/journal for object store /var/lib/ceph/osd/ceph-2
>
> The OSD restarts correctly, but I'm still in doubt whether I'm doing something
> wrong...
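To make sure the OSD really uses the journal partition you intended, a quick sanity check (not a full verification):

  ls -l /var/lib/ceph/osd/ceph-2/journal    # the symlink should point to /dev/md2p1
  ceph-disk list                            # shows which journal each OSD data partition uses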
>
>
> Thanks.
>
--
Cheers,
Alwin