[pve-devel] [PATCH manager v2] fix #4631: ceph: osd: create: add osds-per-device

Wolfgang Bumiller w.bumiller at proxmox.com
Wed Aug 23 11:05:24 CEST 2023


On Mon, Aug 21, 2023 at 01:45:54PM +0200, Aaron Lauterer wrote:
> Allows to automatically create multiple OSDs per physical device. The
> main use case are fast NVME drives that would be bottlenecked by a
> single OSD service.
> 
> By using the 'ceph-volume lvm batch' command instead of the 'ceph-volume
> lvm create' for multiple OSDs / device, we don't have to deal with the
> split of the drive ourselves.
> 
> But this means that the parameters to specify a DB or WAL device won't
> work as the 'batch' command doesn't use them. Dedicated DB and WAL
> devices don't make much sense anyway if we place the OSDs on fast NVME
> drives.
> 
> Some other changes to how the command is built were needed as well, as
> the 'batch' command needs the path to the disk as a positional argument,
> not as '--data /dev/sdX'.
> We drop the '--cluster-fsid' paramter because the 'batch' command
> doesn't accept it. The 'create' will fall back to reading it from the
> ceph.conf file.
> 
> Removal of OSDs works as expected without any code changes. As long as
> there are other OSDs on a disk, the VG & PV won't be removed, even if
> 'cleanup' is enabled.
> 
> The '--no-auto' paramter is used to avoid the following deprecation
> warning:
> ```
> --> DEPRECATION NOTICE
> --> You are using the legacy automatic disk sorting behavior
> --> The Pacific release will change the default to --no-auto
> --> passed data devices: 1 physical, 0 LVM
> --> relative data size: 0.3333333333333333
> ```
> 
> Signed-off-by: Aaron Lauterer <a.lauterer at proxmox.com>
> ---
> changes since v1:
> * change parameter type to integer
> * rephrase parameter description to make it clear it is only useful for
> NVMEs
> * change error handling to raise_param_exc
> * add --no-auto param to ceph-volume
> 
>  PVE/API2/Ceph/OSD.pm | 28 ++++++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/PVE/API2/Ceph/OSD.pm b/PVE/API2/Ceph/OSD.pm
> index ded35990..3a51bebc 100644
> --- a/PVE/API2/Ceph/OSD.pm
> +++ b/PVE/API2/Ceph/OSD.pm
> @@ -275,6 +275,13 @@ __PACKAGE__->register_method ({
>  		type => 'string',
>  		description => "Set the device class of the OSD in crush."
>  	    },
> +	    'osds-per-device' => {
> +		optional => 1,
> +		type => 'integer',
> +		minimum => '1',
> +		description => 'OSD services per physical device. Only useful for fast ".
> +		    "NVME devices to utilize their performance better.',
> +	    },
>  	},
>      },
>      returns => { type => 'string' },
> @@ -294,6 +301,15 @@ __PACKAGE__->register_method ({
>  	# extract parameter info and fail if a device is set more than once
>  	my $devs = {};
>  
> +	# allow 'osds-per-device' only without dedicated db and/or wal devs. We cannot specify them with
> +	# 'ceph-volume lvm batch' and they don't make a lot of sense on fast NVMEs anyway.
> +	if ($param->{'osds-per-device'}) {
> +	    for my $type ( qw(db_dev wal_dev) ) {
> +		raise_param_exc({ $type => "canot use 'osds-per-device' parameter with '${type}'" })
> +		    if $param->{$type};
> +	    }
> +	}
> +
>  	my $ceph_conf = cfs_read_file('ceph.conf');
>  
>  	my $osd_network = $ceph_conf->{global}->{cluster_network};
> @@ -364,8 +380,6 @@ __PACKAGE__->register_method ({
>  	my $monstat = $rados->mon_command({ prefix => 'quorum_status' });
>  
>  	die "unable to get fsid\n" if !$monstat->{monmap} || !$monstat->{monmap}->{fsid};

Does it make sense to still keep this line above if we don't use fsid
anymore? (And if not, $monstat then also becomes unused.)

> -	my $fsid = $monstat->{monmap}->{fsid};
> -        $fsid = $1 if $fsid =~ m/^([0-9a-f\-]+)$/;
>  
>  	my $ceph_bootstrap_osd_keyring = PVE::Ceph::Tools::get_config('ceph_bootstrap_osd_keyring');
>  
> @@ -470,7 +484,10 @@ __PACKAGE__->register_method ({
>  		$test_disk_requirements->($disklist);
>  
>  		my $dev_class = $param->{'crush-device-class'};
> -		my $cmd = ['ceph-volume', 'lvm', 'create', '--cluster-fsid', $fsid ];
> +		# create allows for detailed configuration of DB and WAL devices
> +		# batch for easy creation of multiple OSDs (per device)
> +		my $create_mode = $param->{'osds-per-device'} ? 'batch' : 'create';
> +		my $cmd = ['ceph-volume', 'lvm', $create_mode ];
>  		push @$cmd, '--crush-device-class', $dev_class if $dev_class;
>  
>  		my $devname = $devs->{dev}->{name};
> @@ -504,8 +521,11 @@ __PACKAGE__->register_method ({
>  		    push @$cmd, "--block.$type", $part_or_lv;
>  		}
>  
> -		push @$cmd, '--data', $devpath;
> +		push @$cmd, '--data' if $create_mode eq 'create';
> +		push @$cmd, $devpath;

^ not the biggest fan of mixing positional arguments and options - does
`ceph-volume` support `--` to separate positional arguments explicitly?
If so, we should use it.
    ceph-volume lvm create --osds-per-device 3 -- /dev/path
as opposed to
    ceph-volume lvm create /dev/path --osds-per-device 3
that's just bad style and potentially dangerous

>  		push @$cmd, '--dmcrypt' if $param->{encrypted};
> +		push @$cmd, '--osds-per-device', $param->{'osds-per-device'}, '--yes', '--no-auto'
> +		    if $create_mode eq 'batch';
>  
>  		PVE::Diskmanage::wipe_blockdev($devpath);
>  
> -- 
> 2.39.2





More information about the pve-devel mailing list