Ceph crush rules explanation for multiroom/racks setup

ceph

I recently started working with Ceph: I inherited one large cluster to maintain and am now building a recovery cluster. Through trial and error I managed to create CRUSH rules that fit my purpose, but I still don't understand the syntax of a CRUSH rule definition. Could someone please explain it (please don't just reference the Ceph docs, since they don't explain this)?

Here is the setup of my production cluster: 20 hosts distributed across 2 rooms, 2 racks in each room, 5 servers per rack, 10 OSDs per host, 200 OSDs in total. Someone wanted a super safe setup, so replication is 2/4 and the rules are (supposedly) defined to replicate to the other room, with 2 copies in each room, 4 in total for every object. Here is the rule:

rule replicated_nvme {
	id 4
	type replicated
	min_size 1
	max_size 100
	step take default class nvme
	step choose firstn 0 type room
	step choose firstn 2 type rack
	step chooseleaf firstn 1 type host
	step emit
}
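One way to read the rule above is as a pipeline: each step takes the buckets produced by the previous step and selects children from each of them. The sketch below is a simplified, deterministic model of that reading, not Ceph's actual algorithm (real CRUSH selects pseudo-randomly by weight and retries on failures). The `Bucket` class, the helper functions and all bucket names are hypothetical; the pool size of 4 comes from the question. It assumes that `firstn 0` means "as many as the pool size" and that `chooseleaf` picks buckets of the given type and then one OSD under each.

```python
from dataclasses import dataclass, field

@dataclass
class Bucket:
    kind: str                      # "root", "room", "rack", "host" or "osd"
    name: str
    children: list = field(default_factory=list)

def descend(bucket, kind):
    """Return all descendants of `bucket` whose kind is `kind`."""
    found = []
    for child in bucket.children:
        if child.kind == kind:
            found.append(child)
        else:
            found.extend(descend(child, kind))
    return found

def choose(inputs, n, kind, pool_size):
    """Model of `step choose firstn n type kind` (first N instead of random)."""
    numrep = n if n > 0 else pool_size + n   # 0 -> pool size, negative -> pool size minus |n|
    out = []
    for bucket in inputs:
        left = pool_size - len(out)          # the step never outputs more than pool_size items
        out.extend(descend(bucket, kind)[:max(0, min(numrep, left))])
    return out

def chooseleaf(inputs, n, kind, pool_size):
    """Model of `step chooseleaf`: choose buckets, then one OSD under each."""
    return [descend(b, "osd")[0] for b in choose(inputs, n, kind, pool_size)]

def build_root(rooms, racks, hosts, osds):
    """Build root -> room -> rack -> host -> osd with made-up names."""
    root, osd_id = Bucket("root", "default"), 0
    for r in range(rooms):
        room = Bucket("room", f"room{r}")
        root.children.append(room)
        for k in range(racks):
            rack = Bucket("rack", f"{room.name}-rack{k}")
            room.children.append(rack)
            for h in range(hosts):
                host = Bucket("host", f"{rack.name}-host{h}")
                rack.children.append(host)
                for _ in range(osds):
                    host.children.append(Bucket("osd", f"osd.{osd_id}"))
                    osd_id += 1
    return root

pool_size = 4                                        # replicas requested by the pool
root = build_root(rooms=2, racks=2, hosts=5, osds=10)
rooms = choose([root], 0, "room", pool_size)         # asks for 4 rooms, only 2 exist
racks = choose(rooms, 2, "rack", pool_size)          # 2 racks in each chosen room
picked = chooseleaf(racks, 1, "host", pool_size)     # 1 host (1 OSD) in each rack
print(len(rooms), len(racks), [o.name for o in picked])
```

Under these assumptions the rule ends up with one OSD in each of the four racks, which is two copies per room.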

On my new cluster I have a smaller setup for testing: just 2 racks with 2 servers in each. I tried this rule, similar to the one above but without the room level:

rule replicated-nvme {
	id 6
	type replicated
	step take default class nvme
	step choose firstn 0 type rack
	step chooseleaf firstn 1 type host
	step emit
}
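Counting how many items each step of this rule can produce helps predict its outcome. The arithmetic below is a rough sketch under two assumptions: `firstn 0` asks for as many items as the pool size, and a positive `firstn N` asks for exactly N items per input bucket. The helper name `resolve_firstn` and the variable names are hypothetical; the topology numbers are those of the test cluster described above.

```python
def resolve_firstn(n, pool_size):
    """Number of items a `firstn n` step asks for per input bucket."""
    return n if n > 0 else max(0, pool_size + n)

pool_size = 4          # replicas requested by the pool
racks_available = 2    # test cluster: 2 racks
hosts_per_rack = 2     # 2 hosts in each rack

# step choose firstn 0 type rack: asks for 4 racks, but only 2 exist
racks_chosen = min(resolve_firstn(0, pool_size), racks_available)

# step chooseleaf firstn 1 type host: one host (hence one OSD) per chosen rack
osds_per_rack = min(resolve_firstn(1, pool_size), hosts_per_rack)

osds_total = min(racks_chosen * osds_per_rack, pool_size)
print(racks_chosen, osds_per_rack, osds_total)
```

With these numbers the rule can supply only 2 OSDs for a pool that wants 4.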

However, this rule doesn't produce the desired result (with replication 2/4, copies should go to the other rack, with each copy on a different server). What I got instead is 2 replicas on servers in different racks, while the 2 additional copies are not created. Ceph reports:

pgs:     4/8 objects degraded (50.000%)
             1 active+undersized+degraded

I can see that only 2 OSDs are used, not 4! So I experimented and changed the rule to this:

rule replicated-nvme {
	id 6
	type replicated
	step take default class nvme
	step choose firstn 0 type rack
	step chooseleaf firstn 0 type host
	step emit
}

This works. Pool PGs are replicated to 4 OSDs across 2 racks (2 OSDs per rack). The only difference is chooseleaf firstn 0 type host instead of chooseleaf firstn 1 type host.

The questions are:

- What is the difference between choose and chooseleaf?
- What is the meaning of the *number* after firstn?
- How is the hierarchy defined for **steps**: what is checked first, and what after?

In short, I would like to know the syntax of CRUSH rules.

Just for clarification: although the production cluster has an even number of hosts per room/rack and even replication rules, the object distribution is not very even; the PG count can differ by up to 10% per OSD. I suspect that the first rule defined above is wrong and that the distribution is more or less equal purely because of the large number of OSDs.
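On the working rule with `chooseleaf firstn 0 type host`: a plausible reading is that each rack is asked for up to pool-size hosts, and the total is cut off once the pool size is reached. The sketch below models only that counting, under the assumption that racks are processed one after another and share a single budget of `pool_size` results; the function name and parameters are hypothetical, and this is not a substitute for testing the real map.

```python
def hosts_taken_per_rack(pool_size, n, racks, hosts_in_rack):
    """Count hosts selected in each rack by `chooseleaf firstn n type host`."""
    numrep = n if n > 0 else pool_size + n    # firstn 0 -> ask for pool_size hosts
    taken = []
    for _ in range(racks):
        left = pool_size - sum(taken)         # results still needed overall
        taken.append(max(0, min(numrep, hosts_in_rack, left)))
    return taken

print(hosts_taken_per_rack(4, 0, racks=2, hosts_in_rack=2))   # test cluster
print(hosts_taken_per_rack(4, 0, racks=2, hosts_in_rack=5))   # racks with 5 hosts
print(hosts_taken_per_rack(4, 2, racks=2, hosts_in_rack=5))   # explicit firstn 2
```

If this model holds, `firstn 0` gives 2 + 2 on the test cluster only because each rack has exactly 2 hosts; with larger racks the first rack could absorb all 4 copies, whereas an explicit `firstn 2` would keep 2 per rack.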

Asked by dotokija (133 rep)

Aug 2, 2024, 08:51 AM
Last activity: Aug 9, 2024, 12:09 PM
