create rbd pool in autoscaling mode fails with Error ERANGE pgp_num
insatomcat opened this issue
Bug Report
What happened:
ceph-ansible (stable-7.0) is not able to create an rbd pool with pg_autoscale_mode=on.
I get the following error:
Error ERANGE: 'pgp_num' must be greater than 0 and lower or equal than 'pg_num', which in this case is 1
What you expected to happen:
The rbd pool should be created with no error in autoscaling mode.
How to reproduce it (minimal and precise):
Using:
- ceph official rpm package for v17.2.5 (https://download.ceph.com/rpm-17.2.5/el9/)
- RHEL9.
- ceph-ansible stable-7.0.
Using a clients.yml vars file to create an rbd pool:
user_config: true
rbd:
  name: "rbd"
  application: "rbd"
  pg_autoscale_mode: on
  target_size_ratio: 1
pools:
  - "{{ rbd }}"
ceph-ansible fails:
2022-11-19 08:57:07,293 p=24 u=virtu n=ansible | failed: [rhel9-1] (item={'name': 'rbd', 'application': 'rbd', 'pg_autoscale_mode': True, 'target_size_ratio': 1}) => changed=false
  ansible_loop_var: item
  cmd:
  - ceph
  - -n
  - client.admin
  - -k
  - /etc/ceph/ceph.client.admin.keyring
  - --cluster
  - ceph
  - osd
  - pool
  - create
  - rbd
  - replicated
  - --target_size_ratio
  - '1'
  - replicated_rule
  - --expected_num_objects
  - '0'
  - --autoscale-mode
  - 'on'
  delta: '0:00:01.698545'
  end: '2022-11-19 09:57:07.221218'
  item:
    application: rbd
    name: rbd
    pg_autoscale_mode: true
    target_size_ratio: 1
  rc: 2
  start: '2022-11-19 09:57:05.522673'
  stderr: 'Error ERANGE: ''pgp_num'' must be greater than 0 and lower or equal than ''pg_num'', which in this case is 1'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
Maybe it's a Ceph bug?
[root@rhel9-1 ~]# ceph -n client.admin -k /etc/ceph/ceph.client.admin.keyring --cluster ceph osd pool create test --target-size-ratio 0.2 --autoscale-mode=on
Error ERANGE: 'pgp_num' must be greater than 0 and lower or equal than 'pg_num', which in this case is 1
Also note that creating the rbd pool with pg_autoscale_mode=warn and then setting it to "on" afterwards seems to work:
[root@rhel9-1 ~]# ceph -n client.admin -k /etc/ceph/ceph.client.admin.keyring --cluster ceph osd pool create test --autoscale-mode=warn
pool 'test' created
[root@rhel9-1 ~]# ceph osd pool set test pg_autoscale_mode on
set pool 7 pg_autoscale_mode to on
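The same workaround can be expressed on the ceph-ansible side. A minimal sketch (untested), assuming ceph-ansible passes warn through --autoscale-mode the same way it passes on:

user_config: true
rbd:
  name: "rbd"
  application: "rbd"
  pg_autoscale_mode: warn   # create in warn mode to dodge the ERANGE check
  target_size_ratio: 1
pools:
  - "{{ rbd }}"

and then flip the pool to full autoscaling by hand once it exists:

[root@rhel9-1 ~]# ceph osd pool set rbd pg_autoscale_mode on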
Share your group_vars files, inventory and full ceph-ansible log
Environment:
- OS (e.g. from /etc/os-release): RHEL9.1
- Kernel (e.g. uname -a): 5.14.0-70
- Docker version if applicable (e.g. docker version): N/A
- Ansible version (e.g. ansible-playbook --version): 2.12.0
- ceph-ansible version (e.g. git head or tag or stable branch): stable-7.0
- Ceph version (e.g. ceph -v): 17.2.5
Thanks!
Hi,
Indeed, it seems to be a Ceph issue, as doing the same in my test env works:
[root@mon0 /]# ceph osd pool create test --target-size-ratio 0.2 --autoscale-mode=on
pool 'test' created
[root@mon0 /]# ceph osd pool create test1 --target-size-ratio 1 --autoscale-mode=on
pool 'test1' created
[root@mon0 /]# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 240 pgp_num 230 pg_num_target 8 pgp_num_target 8 pg_num_pending 239 autoscale_mode on last_change 111 lfor 0/111/111 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'rbd' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 116 pgp_num 114 pg_num_target 32 pgp_num_target 32 pg_num_pending 115 autoscale_mode on last_change 111 lfor 0/111/111 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'test' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 250 pg_num_target 32 pgp_num_target 32 pg_num_pending 255 autoscale_mode on last_change 111 lfor 0/111/111 flags hashpspool stripe_width 0 target_size_ratio 0.2
pool 4 'test1' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 106 lfor 0/0/98 flags hashpspool stripe_width 0 target_size_ratio 1
For that error to happen, pgp_num must be greater than pg_num (https://github.com/ceph/ceph/blob/f2f5ca51509ee8bd6b66772a02f0f57c68862fd7/src/mon/OSDMonitor.cc#L8012). With autoscale-mode set to on, pg_num is set to 1, which means that in your case pgp_num must end up greater than that value. Have you changed the default value of osd_pool_default_pgp_num in your ceph.conf file?
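For reference, a quick way to confirm which value the monitor is actually applying; a minimal sketch, assuming your mon daemon is named mon.rhel9-1 (the output shown is hypothetical):

# value as set in the conf file, if any
[root@rhel9-1 ~]# grep -i pgp_num /etc/ceph/ceph.conf
osd_pool_default_pgp_num = 128
# value the running mon actually uses
[root@rhel9-1 ~]# ceph config show mon.rhel9-1 osd_pool_default_pgp_num
128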
From the all.yml file you shared:
global:
  osd_pool_default_size: "{{ osd_pool_default_size }}"
  osd_pool_default_min_size: "{{ osd_pool_default_min_size }}"
  osd_pool_default_pg_num: 128
  osd_pool_default_pgp_num: 128
  osd_crush_chooseleaf_type: 1
  mon_osd_min_down_reporters: 1
mon:
  auth_allow_insecure_global_id_reclaim: false
osd:
  osd_min_pg_log_entries: 500
  osd_max_pg_log_entries: 500
  osd memory target: "{{ osd_memory_target }}"
You are forcing pgp_num to 128, but pgp_num cannot be higher than pg_num; this is why you get that error. When using pg_autoscale_mode=on, pools are created with pg_num = 1 (https://docs.ceph.com/en/latest/rados/configuration/pool-pg-config-ref/#confval-osd_pool_default_pg_autoscale_mode).
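Concretely, dropping the two forced defaults from that override block should be enough; a sketch of the corrected global section, assuming the rest of your all.yml stays as shared:

global:
  osd_pool_default_size: "{{ osd_pool_default_size }}"
  osd_pool_default_min_size: "{{ osd_pool_default_min_size }}"
  # osd_pool_default_pg_num / osd_pool_default_pgp_num removed: with
  # pg_autoscale_mode=on new pools start at pg_num = 1, so a forced
  # pgp_num of 128 trips the 'pgp_num > pg_num' ERANGE check
  osd_crush_chooseleaf_type: 1
  mon_osd_min_down_reporters: 1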
Thanks a lot.