create rbd pool in autoscaling mode fails with Error ERANGE pgp_num
insatomcat opened this issue
Bug Report
What happened:
ceph-ansible (stable-7.0) is not able to create an rbd pool with pg_autoscale_mode=on.
I get the following error:
Error ERANGE: 'pgp_num' must be greater than 0 and lower or equal than 'pg_num', which in this case is 1
What you expected to happen:
The rbd pool should be created with no error in autoscaling mode.
How to reproduce it (minimal and precise):
Using:
- ceph official rpm package for v17.2.5 (https://download.ceph.com/rpm-17.2.5/el9/)
- RHEL9.
- ceph-ansible stable-7.0.
Using a clients.yml vars file to create an rbd pool:
user_config: true
rbd:
  name: "rbd"
  application: "rbd"
  pg_autoscale_mode: on
  target_size_ratio: 1
pools:
  - "{{ rbd }}"
ceph-ansible fails:
2022-11-19 08:57:07,293 p=24 u=virtu n=ansible | failed: [rhel9-1] (item={'name': 'rbd', 'application': 'rbd', 'pg_autoscale_mode': True, 'target_size_ratio': 1}) => changed=false
  ansible_loop_var: item
  cmd:
  - ceph
  - -n
  - client.admin
  - -k
  - /etc/ceph/ceph.client.admin.keyring
  - --cluster
  - ceph
  - osd
  - pool
  - create
  - rbd
  - replicated
  - --target_size_ratio
  - '1'
  - replicated_rule
  - --expected_num_objects
  - '0'
  - --autoscale-mode
  - 'on'
  delta: '0:00:01.698545'
  end: '2022-11-19 09:57:07.221218'
  item:
    application: rbd
    name: rbd
    pg_autoscale_mode: true
    target_size_ratio: 1
  rc: 2
  start: '2022-11-19 09:57:05.522673'
  stderr: 'Error ERANGE: ''pgp_num'' must be greater than 0 and lower or equal than ''pg_num'', which in this case is 1'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
Maybe it's a Ceph bug?
[root@rhel9-1 ~]# ceph -n client.admin -k /etc/ceph/ceph.client.admin.keyring --cluster ceph osd pool create test --target-size-ratio 0.2 --autoscale-mode=on
Error ERANGE: 'pgp_num' must be greater than 0 and lower or equal than 'pg_num', which in this case is 1
Also note that creating the rbd pool with pg_autoscale_mode=warn and then setting it to "on" afterwards seems to work:
[root@rhel9-1 ~]# ceph -n client.admin -k /etc/ceph/ceph.client.admin.keyring --cluster ceph osd pool create test --autoscale-mode=warn
pool 'test' created
[root@rhel9-1 ~]# ceph osd pool set test pg_autoscale_mode on
set pool 7 pg_autoscale_mode to on
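The same workaround can be expressed on the ceph-ansible side. A minimal sketch (untested), assuming ceph-ansible passes warn through --autoscale-mode the same way it passes on:

user_config: true
rbd:
  name: "rbd"
  application: "rbd"
  pg_autoscale_mode: warn   # create in warn mode to dodge the ERANGE check
  target_size_ratio: 1
pools:
  - "{{ rbd }}"

and then flip the pool to full autoscaling by hand once it exists:

[root@rhel9-1 ~]# ceph osd pool set rbd pg_autoscale_mode on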
Share your group_vars files, inventory and full ceph-ansible log
Environment:
- OS (e.g. from /etc/os-release): RHEL9.1
- Kernel (e.g. uname -a): 5.14.0-70
- Docker version if applicable (e.g. docker version): N/A
- Ansible version (e.g. ansible-playbook --version): 2.12.0
- ceph-ansible version (e.g. git head or tag or stable branch): stable-7.0
- Ceph version (e.g. ceph -v): 17.2.5
Thanks!
Hi,
Indeed, it seems to be a Ceph issue, as doing the same in my test env works:
[root@mon0 /]# ceph osd pool create test --target-size-ratio 0.2 --autoscale-mode=on
pool 'test' created
[root@mon0 /]# ceph osd pool create test1 --target-size-ratio 1 --autoscale-mode=on
pool 'test1' created
[root@mon0 /]# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 240 pgp_num 230 pg_num_target 8 pgp_num_target 8 pg_num_pending 239 autoscale_mode on last_change 111 lfor 0/111/111 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'rbd' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 116 pgp_num 114 pg_num_target 32 pgp_num_target 32 pg_num_pending 115 autoscale_mode on last_change 111 lfor 0/111/111 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'test' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 250 pg_num_target 32 pgp_num_target 32 pg_num_pending 255 autoscale_mode on last_change 111 lfor 0/111/111 flags hashpspool stripe_width 0 target_size_ratio 0.2
pool 4 'test1' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 106 lfor 0/0/98 flags hashpspool stripe_width 0 target_size_ratio 1
For that error to happen, pgp_num must be greater than pg_num (https://github.com/ceph/ceph/blob/f2f5ca51509ee8bd6b66772a02f0f57c68862fd7/src/mon/OSDMonitor.cc#L8012). With autoscale-mode set to on, pg_num is set to 1, which means that in your case pgp_num must end up greater than that value. Have you changed the default value of osd_pool_default_pgp_num in your ceph.conf file?
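For reference, a quick way to confirm which value the monitor is actually applying; a minimal sketch, assuming your mon daemon is named mon.rhel9-1 (the output shown is hypothetical):

# value as set in the conf file, if any
[root@rhel9-1 ~]# grep -i pgp_num /etc/ceph/ceph.conf
osd_pool_default_pgp_num = 128
# value the running mon actually uses
[root@rhel9-1 ~]# ceph config show mon.rhel9-1 osd_pool_default_pgp_num
128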
From the all.yml file you shared:
global:
  osd_pool_default_size: "{{ osd_pool_default_size }}"
  osd_pool_default_min_size: "{{ osd_pool_default_min_size }}"
  osd_pool_default_pg_num: 128
  osd_pool_default_pgp_num: 128
  osd_crush_chooseleaf_type: 1
  mon_osd_min_down_reporters: 1
mon:
  auth_allow_insecure_global_id_reclaim: false
osd:
  osd_min_pg_log_entries: 500
  osd_max_pg_log_entries: 500
  osd memory target: "{{ osd_memory_target }}"
You are forcing pgp_num to 128, but pgp_num cannot be higher than pg_num; this is why you get that error. When using pg_autoscale_mode=on, pools are created with pg_num = 1 (https://docs.ceph.com/en/latest/rados/configuration/pool-pg-config-ref/#confval-osd_pool_default_pg_autoscale_mode).
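Concretely, dropping the two forced defaults from that override block should be enough; a sketch of the corrected global section, assuming the rest of your all.yml stays as shared:

global:
  osd_pool_default_size: "{{ osd_pool_default_size }}"
  osd_pool_default_min_size: "{{ osd_pool_default_min_size }}"
  # osd_pool_default_pg_num / osd_pool_default_pgp_num removed: with
  # pg_autoscale_mode=on new pools start at pg_num = 1, so a forced
  # pgp_num of 128 trips the 'pgp_num > pg_num' ERANGE check
  osd_crush_chooseleaf_type: 1
  mon_osd_min_down_reporters: 1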
Thanks a lot.