--- title: recovering a rook-ceph cluster author: "0x3bb" date: M08-10-2024 --- I share a _rook-ceph_ cluster between a friend of mine approximately 2000km away. I had reservations whether this would be a good idea at first because of the latency and the fact consumer grade WAN might bring anything that breathes to its knees. However, I'm pleased with how robust it has been under those circumstances. Getting started with _rook-ceph_ is really simple because it orchestrates everything for you. That's convenient, although, if and when you have problems, you can suddenly become an operator of a distributed system that you may know very little about due to the abstraction. During this incident, I found myself in exactly that position, relying heavily on the (great) documentation for both _rook_ and _ceph_ itself. # the beginning In the process of moving nodes, **I accidentally zapped 2 OSDs**. Given that the _"ceph cluster"_ is basically two bedrooms peered together, we were down to the final OSD. This sounds bad, but it was fine: mons were still up, just a matter of removing the old OSDs from the tree and letting replication work its magic. # the real mistake I noticed although the block pool was replicated, we had lost all our RADOS object storage. ![](/b/images/ec-2-1.png) The erasure-coding profile was `k=2, m=1`. That meant we could only lose 2 OSDs, which had already happened. Object storage (which our applications interfaced with via Ceph S3 Gateway) was lost. Nothing critical -- we just needed our CSI volume mounts back online for databases -- packages and other artifacts could easily be restored from backups. Moving on from this, the first thing I did was _"fix"_ the EC configuration to `k=3, m=2`. This would spread the data over 5 OSDs. ``` - dataChunks: 2 - codingChunks: 1 + dataChunks: 3 + codingChunks: 2 ``` Happy with that, I restored a backup of the objects. Everything was working. A few days later, _renovatebot_ arrived with a new PR to bump _rook-ceph_ `1.14.9`. ![](/b/images/rook-ceph-pr.png) Of course I want that -- the number is bigger than the previous number so everything is going to be better than before. # downtime Following the merge, all services on the cluster went down at once. I checked the OSDs which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch of gibberish and decided to check out the GitHub issues. Since nothing is ever my fault, I decided to see who or what was to blame. With no clues, I had still hoped this would be relatively simple to fix. Resigning to actually reading the OSD logs in the `rook-ceph-crashcollector` pods, I saw (but did not understand) the problem: `osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)` ![](/b/images/ec-3-2.png) With the _"fixed"_ configuration, what I had actually done is split the object store pool across _5_ OSDs. We had _3_. Due to the `rook-ceph-operator` being rescheduled from the version bump, the OSD daemons had been reloaded as part of the update procedure and now demanded an answer for the data and coding chunks that simply did not exist. Sure enough, `ceph -s` also reported undersized placement groups. This makes sense as there weren't enough OSDs to split the data. # from bad to suicide ## reverting erasure-coding profile The first attempt I made was to revert the EC profile back to `k=2, m=1`. The OSDs were still in the same state complaining about the erasure-coding profile. ## causing even more damage The second attempt (and in hindsight, a very poor choice) was to zap the other two underlying OSD disks: [ `osd-1`, `osd-2` ]. `zap.sh` ``` DISK="/dev/vdb" # Zap the disk to a fresh, usable state (zap-all is important, b/c MBR has to be clean) sgdisk --zap-all $DISK # Wipe a large portion of the beginning of the disk to remove more LVM metadata that may be present dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync # SSDs may be better cleaned with blkdiscard instead of dd blkdiscard $DISK # Inform the OS of partition table changes partprobe $DISK ``` Perhaps having two other OSDs online would allow me to replicate the healthy pgs without the offending RADOS objects. Sure enough, the 2 new OSDs started. Since `osd-0` with the actual data still wouldn't start, the cluster was still in a broken state. Now down to the last OSD, at this point I knew that I was going to make many, many more mistakes. If I was going to continue I needed to backup the logical volume used by the `osd-0` node before continuing, which I did. ## clutching at mons I switched my focus to a new narrative: _something was wrong with the mons_. They were in quorum but I still couldn't figure out why the now last-surviving OSD was having issues starting. The `mon_host` configuration was correct in `secret/rook-ceph-config`: `mon_host: '[v2:10.50.1.10:3300,v1:10.50.1.10:6789],[v2:10.50.1.11:3300,v1:10.50.1.11:6789],[v2:10.55.1.10:3300,v1:10.55.1.10:6789]'` Nothing had changed with the underlying data on those mons. Maybe there was corruption in the monitor store? The monitors maintain a map of the cluster state: the `osdmap`, the `crushmap`, etc. My theory was: if the cluster map did not have the correct placement groups and other OSD metadata then perhaps replacing it would help. I replaced the data from another mon to the one used for the failing OSD deployment (`store.db`) and scaled up the deployment: `2024-08-04T00:49:47.698+0000 7f12fc78f700 0 mon.s@2(probing) e30 removed from monmap, suicide.` With all data potentially lost and it being almost 1AM, that message was not very reassuring. I did manually change the monmap and inject it back in, but ended up back in the same position. I figured I had done enough experimenting at this point and had to look deeper outside of the deployment. The only meaningful change we had made was the erasure-coding profile. # initial analysis First, I looked back to the OSD logs. They are monstrous, so I focused on the erasure-coding errors: ``` 2024-08-03T18:58:49.845+0000 7f2a7d6da700 1 osd.0 pg_epoch: 8916 pg[12.11s2( v 8915'287 (0'0,8915'287] local-lis/les=8893/8894 n=23 ec=5063/5059 lis/c=8893/8413 les/c/f=8894/8414/0 sis=8916 pruub=9.256847382s) [1,2,NONE]p1(0) r=-1 lpr=8916 pi=[8413,8916)/1 crt=8915'287 mlcod 0'0 unknown NOTIFY pruub 20723.630859375s@ mbc={}] state: transitioning to Stray 2024-08-03T18:58:49.849+0000 7f2a7ced9700 -1G /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h: In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread 7f2a7e6dc700 time 2024-08-03T18:58:49.853351+0000 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0) ``` ``` -2> 2024-08-03T18:59:00.086+0000 7ffa48b48640 5 osd.2 pg_epoch: 8894 pg[12.9(unlocked)] enter Initial ... /src/osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0) ``` I noticed a pattern: all the failing pg IDs were prefixed with `12`. Seeing this, I had concluded: - mons were ok and in quorum; - the `osd-0` daemon fails to start; - other fresh OSDs (`osd-1` and `osd-2`) start fine, this was a data integrity issue confined to `osd-0` (and the previous OSDs had I not nuked them); - the cause was a change in the erasure-coding profile, which happened on only one pool where the chunk distribution was modified; Accepting the loss of the miniscule data on the object storage pool in favor of saving the block storage, I could correct the misconfiguration. # preparation To avoid troubleshoting issues caused from my failed attempts, I decided I would do a clear out of the existing CRDs and just focus first on getting the OSD with the data back online. If I ever got the data back, then I'd probably be conscious of prior misconfiguration and have to do so regardless. - backup the important shit; - clear out the `rook-ceph` namespace; ## backups - the logical volume for `osd-0`, so I can re-attach it and afford mistakes; - `/var/lib/rook` on all nodes, containing mon data; ## removal ### deployments/daemonsets These were the first to go, as I didn't want the `rook-operator` persistently creating Kubernetes objects when I was actively trying to kill them. ### crds Removal of the all `rook-ceph` resources, and their finalizers to protect them from being removed: - `cephblockpoolradosnamespaces` - `cephblockpools` - `cephbucketnotifications` - `cephclients` - `cephclusters` - `cephcosidrivers` - `cephfilesystemmirrors` - `cephfilesystems` - `cephfilesystemsubvolumegroups` - `cephnfses` - `cephobjectrealms` - `cephobjectstores` - `cephobjectstoreusers` - `cephobjectzonegroups` - `cephobjectzones` - `cephrbdmirrors` ### /var/lib/rook I had these backed up for later, but I didn't want them there when the cluster came online. ### osd disks I did not wipe any devices. First, I obviously didn't want to wipe the disk with the data on it. As for the other, now useless OSDs that I had mistakenly created over the old ones; I knew spawning the `rook-operator` would create new OSDs if they didn't belong to an old ceph cluster. This would make troubleshooting `osd-0` more difficult, as I'd now have to consider analysing the status reported from `osd-1` and `osd-2`. ## provisioning Since at this point I only cared about `osd-0` and it was beneficial to have fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count to `1` within the helm `values.yaml`. Following this, I simply reconciled the chart. I noticed the `rook-ceph-operator`, `rook-ceph-mon-a`, `rook-ceph-mgr-a` came online as expected. Because the OSDs were part of an old cluster, I now had a ceph-cluster with no OSDs, as shown in the `rook-ceph-osd-prepare-*` jobs for each node. ``` osd.0: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6" osd.1: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6" osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6" ``` # surgery With less noise and a clean slate, it was time to attempt to fix this mess. - adopt `osd-0` to the new cluster; - remove the corrupted pgs from `osd-0`; - bring up two new OSDs for replication; ## osd-0 I started trying to determine how I would _safely_ remove the offending objects. If that happened, then the OSD would have no issues with the erasure-coding profile since the pgs wouldn't exist, and the OSD daemon should start. - If the placement groups contained only objects created from the _RADOS Object Gateway_, then I can simply remove the pgs. - If, however, the pgs contain both the former _and_ block device objects then it would require careful removal of all non-rdb (block storage) objects as there would be valuable data loss by purging the entire placement groups. Since OSD pools have a `1:N` relationship with pgs, the second scenario seemed unlikely, perhaps impossible. Next, I needed to inspect the OSD somehow, because the existing deployment would continously crash. `kubectl rook-ceph debug start rook-ceph-osd-0` Running this command allowed me to observe the OSD without it actually joining the cluster. The "real" OSD deployment need only be scheduled, but crashing continuously was ok. Once you execute that command, it will scale the OSD daemon down and create a new deployment that mirrors the configuration but _without_ the daemon running in order to perform maintenance. Now in a shell of the debug OSD container, I confirmed these belonged to the object storage pool. ``` [root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS 12.0 0 0 0 0 0 0 0 0 unknown 8h 12.1 0 0 0 0 0 0 0 0 unknown 8h 12.2 0 0 0 0 0 0 0 0 unknown 8h 12.3 0 0 0 0 0 0 0 0 unknown 8h 12.4 0 0 0 0 0 0 0 0 unknown 8h 12.5 0 0 0 0 0 0 0 0 unknown 8h 12.6 0 0 0 0 0 0 0 0 unknown 8h 12.7 0 0 0 0 0 0 0 0 unknown 8h 12.8 0 0 0 0 0 0 0 0 unknown 8h 12.9 0 0 0 0 0 0 0 0 unknown 8h 12.a 0 0 0 0 0 0 0 0 unknown 8h 12.b 0 0 0 0 0 0 0 0 unknown 8h 12.c 0 0 0 0 0 0 0 0 unknown 8h 12.d 0 0 0 0 0 0 0 0 unknown 8h 12.e 0 0 0 0 0 0 0 0 unknown 8h 12.f 0 0 0 0 0 0 0 0 unknown 8h 12.10 0 0 0 0 0 0 0 0 unknown 8h 12.11 0 0 0 0 0 0 0 0 unknown 8h 12.12 0 0 0 0 0 0 0 0 unknown 8h 12.13 0 0 0 0 0 0 0 0 unknown 8h 12.14 0 0 0 0 0 0 0 0 unknown 8h 12.15 0 0 0 0 0 0 0 0 unknown 8h 12.16 0 0 0 0 0 0 0 0 unknown 8h 12.17 0 0 0 0 0 0 0 0 unknown 8h 12.18 0 0 0 0 0 0 0 0 unknown 8h 12.19 0 0 0 0 0 0 0 0 unknown 8h 12.1a 0 0 0 0 0 0 0 0 unknown 8h 12.1b 0 0 0 0 0 0 0 0 unknown 8h ``` Seeing this, I first checked to see how many placement groups prefixed with `12` existed using the actual path to the OSD. ``` [root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep ^12 12.bs1 12.6s1 12.1fs0 12.1ds1 12.15s0 12.16s0 12.11s0 12.12s2 12.0s0 12.17s2 12.4s1 12.9s0 12.19s0 12.cs2 12.13s0 12.14s2 12.3s2 12.1as0 12.1bs2 12.as1 12.1es1 12.1cs2 12.2s2 12.8s1 12.7s2 12.ds0 12.es0 12.fs0 12.18s0 12.1s0 12.5s1 12.10s2 ``` I still needed to be convinced I wasn't removing any valuable data. I inspected a few of them to be sure. ``` [root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 12.10s0 --op list "12.10s0", {"oid" :"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/07/19/071984080b32e2867 f 1ac6ec2b7d2b8724bc5d75e2850b5e7 f20040ee52F55d1.2~e7rYg3S "hash" :1340195137, "max":0, "pool":12, "namespace":"","shard_id":2, "max":0} "12.10s0", {"oid" :"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/9a/82/9a82a64c3a8439c75d8e584181427b073712afd1454747bec3dcb84bcbe19ac5. 2~urbG4nd "hash" :4175566657, "max":0, "pool":12, "namespace":"","shard_id":2, "max":0} "12.10s0", {"oid" :"7d927F08-bd9b-4d4b-bfc1-d331eb216e68.22197937.1__shadow Windows Security Internals.pdf.2~g9stQ9inkWvsTq33S9z5xNEHEgST2H4.1_1","key":"", "snapid":- "shard id":2,"max":0}] ... ``` With this information, I now knew: - the log exceptions matched the pgs that were impacted from the change in the erasure-coding configuration; - `ceph-objectstore.rgw.buckets.data` had a relationship with those pgs where the configuration was changed; - the objects were familiar with the objects in the buckets, e.g. books; Since I _did_ modify the erasure-coding profile this is all starting to make sense. Carefully, the next operation was to remove the offending placement groups. Simply removing the pool wouldn't work, as the OSD daemon not starting meant it would know nothing about this change, and still not have enough chunks to come alive. ``` [root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.17s2 marking collection for removal setting '_remove' omap key finish_remove_pgs 12.17s2_head removing 12.17s2 Remove successful [root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.0s0 marking collection for removal setting '_remove' omap key finish_remove_pgs 12.0s0_head removing 12.0s0 Remove successful ``` I did this for every PG listed above. Once I scaled down the maintenance deployment, I then scaled back `deployment/rook-ceph-osd-0` to start the daemon with (hopefully) agreeable placement groups and thankfully, it had come alive. ``` k get pods -n rook-ceph rook-ceph-osd-0-6f57466c78-bj96p 2/2 Running ``` An eager run of `ceph -s` produced both relief and disappointment. The OSD was up, but the pgs were in an `unknown` state. ``` ceph -s ... data: pools: 12 pools, 169 pgs objects: 0 objects, 0 B usage: 0 B used, 0 B / 0 B avail pgs: 100.000% pgs unknown 169 unknown ``` At this point, I had mentioned to my friend (helping by playing GuildWars 2) that we might be saved. It seemed promising as we at least had `osd-0` running again now that the troublesome pgs were removed. He agreed, and contemplated changing his character's hair colour instead of saving our data. ## mons ### restoring I had `/var/lib/rook` backups from each node with the old mon data. At this point, with the correct number of placement groups and seeing 100% of them remaining in an `unknown` state, it seemed the next step was to restore the mons. I knew from reading the `rook-ceph` docs that if you want to restore the data of a monitor to a new cluster, you have to inject the `monmap` into the old mons `store.db`. Before doing this, I scaled `deployment/rook-ceph-mon-a` down to `0` first. Then, navigating to a directory on my local machine with the backups I ran a container to modify the `monmap` on my local fs. `docker run -it --rm --entrypoint bash -v .:/var/lib/rook rook/ceph:v1.14.9` ``` touch /etc/ceph/ceph.conf cd /var/lib/rook ceph-mon --extract-monmap monmap --mon-data ./mon-q/data monmaptool --rm q monmaptool --rm s monmaptool --rm t ``` Now the old mons `q`, `s` and `t` were removed from the map, I had to add the new cluster mon `rook-ceph-mon-a` created following the new ceph-cluster. ``` monmaptool --addv a '[v2:10.50.1.10:3300,v1:10.50.1.10:6789]` ceph-mon --inject-monmap monmap --mon-data ./mon-q/data exit ``` Shoving it back up to the node `rook-ceph-mon-a` lives on: `scp -r ./mon-q/data/* 3bb@10.50.1.10:/var/lib/rook/mon-a/data/` Rescheduling the deployment and although the mon log output isn't giving me suggestions of suicide, all our pgs still remain in an `unknown` state. ## recovering the mon store It turns out that you can actually recover the mon store. It's not a huge deal so long as your OSDs have data integrity. Scaling the useless `mon-a` down, I copied the existing `mon-a` data onto the `rook-ceph-osd-0` daemon container. Another `osd-0` debug container... `k rook-ceph debug start rook-ceph-osd-0` I rebuilt the mon data, using the existing RocksDB kv store. This would have worked without the backup, but I was interested to see the `osdmaps` trimmed due to the other 2 removed OSDs. ``` [root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-0/ --op update-mon-db --mon-store-path /tmp/mon-a/data/ osd.0 : 3099 osdmaps trimmed, 635 osdmaps added. ``` ``` [root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-authtool /tmp/mon-a/keyring -n mon. --cap mon 'allow *' --gen-key [root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-monstore-tool /tmp/mon-a/data rebuild -- --keyring /tmp/mon-a/keyring 4 rocksdb: [db/flush_job.cc:967] [default] [JOB 3] Level-0 flush table #3433: 62997231 bytes OK 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1722831731454649, "job": 3, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [2, 0, 0, 0, 0, 0, 2], "immutable_memtables": 1} 4 rocksdb: [file/delete_scheduler.cc:74] Deleted file /tmp/mon-a/data/store.db/003433.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1723067397472153, "job": 4, "event": "table_file_deletion", "file_number": 3433} 4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete ``` After copying the now _rebuilt_ `mon-a` store back, and bringing everything up again, the cluster was finally resurrecting. It took some time for the rebalancing and replication to finish, but hours later, `ceph -s` reported a healthy cluster and services resumed being entirely unaware of the chaos that had ensued over the previous few days: ``` cluster: id: 47f25963-57c0-4b3b-9b35-bbf68c09eec6 health: HEALTH_OK services: mon: 3 daemons, quorum a,b,c (age 3h) mgr: b(active, since 8h), standbys: a mds: 1/1 daemons up, 1 hot standby osd: 3 osds: 3 up (since 3h), 3 in (since 3h) rgw: 2 daemons active (2 hosts, 1 zones) data: volumes: 1/1 healthy pools: 12 pools, 169 pgs objects: 4.88k objects, 16 GiB usage: 47 GiB used, 1.1 TiB / 1.2 TiB avail pgs: 169 active+clean io: client: 639 B/s rd, 9.0 KiB/s wr, 1 op/s rd, 1 op/s wr recovery: 834 KiB/s, 6 objects/s ``` It seemed like a miracle, but it is entirely credited to how resilient ceph is built to tolerate that level of abuse. # why _Data appears to be lost_ - ceph OSD daemons fail to start; - the OSDs could not reconstruct the data from chunks; - the `osdmap` referenced a faulty erasure-coding profile; - the monstore `osdmap` still had reference to the above erasure-coding profile; - the erasure-coding profile was changed to a topology impossible to satisfy under the current architecture; - 2 disks were zapped, hitting the ceiling of the failure domain for `ceph-objectstore.rgw.buckets.data_ecprofile`; The monitor `osdmap` still contained the bad EC profile. `ceph-monstore-tool /tmp/mon-bak get osdmap > osdmap.bad` `osdmaptool --dump json osdmap.bad | grep -i profile` ``` "erasure_code_profiles":{ "ceph-objectstore.rgw.buckets.data_ecprofile":{ "crush-device-class":"", "crush-failure-domain":"host", "crush-root":"default", "jerasure-per-chunk-alignment":"false", "k":"3", "m":"2" } } ``` After rebuilding the monstore... `ceph-monstore-tool /tmp/mon-a get osdmap > osdmap.good` `osdmaptool --dump json osdmap.good | grep -i profile` ``` "erasure_code_profiles":{ "ceph-objectstore.rgw.buckets.data_ecprofile":{ "crush-device-class":"", "crush-failure-domain":"host", "crush-root":"default", "jerasure-per-chunk-alignment":"false", "k":"2", "m":"1" } } ``` Therefore, it seems as if I could have attempted to rebuild the monstore first, possibly circumventing the _EC Assert_ errors. The placement groups on `osd-0` were still mapping to 3 OSDs, not 5. ``` [root@ad9e4c6e7343 rook]# osdmaptool --test-map-pgs-dump --pool 12 osdmap osdmaptool: osdmap file 'osdmap' pool 12 pg_num 32 12.0 [2147483647,2147483647,2147483647] -1 12.1 [2147483647,2147483647,2147483647] -1 12.2 [2147483647,2147483647,2147483647] -1 12.3 [2147483647,2147483647,2147483647] -1 12.4 [2147483647,2147483647,2147483647] -1 12.5 [2147483647,2147483647,2147483647] -1 12.6 [2147483647,2147483647,2147483647] -1 12.7 [2147483647,2147483647,2147483647] -1 12.8 [2147483647,2147483647,2147483647] -1 12.9 [2147483647,2147483647,2147483647] -1 12.a [2147483647,2147483647,2147483647] -1 12.b [2147483647,2147483647,2147483647] -1 12.c [2147483647,2147483647,2147483647] -1 12.d [2147483647,2147483647,2147483647] -1 12.e [2147483647,2147483647,2147483647] -1 12.f [2147483647,2147483647,2147483647] -1 12.10 [2147483647,2147483647,2147483647] -1 12.11 [2147483647,2147483647,2147483647] -1 12.12 [2147483647,2147483647,2147483647] -1 12.13 [2147483647,2147483647,2147483647] -1 12.14 [2147483647,2147483647,2147483647] -1 12.15 [2147483647,2147483647,2147483647] -1 12.16 [2147483647,2147483647,2147483647] -1 12.17 [2147483647,2147483647,2147483647] -1 12.18 [2147483647,2147483647,2147483647] -1 12.19 [2147483647,2147483647,2147483647] -1 12.1a [2147483647,2147483647,2147483647] -1 12.1b [2147483647,2147483647,2147483647] -1 12.1c [2147483647,2147483647,2147483647] -1 12.1d [2147483647,2147483647,2147483647] -1 12.1e [2147483647,2147483647,2147483647] -1 12.1f [2147483647,2147483647,2147483647] -1 #osd count first primary c wt wt osd.0 0 0 0 0.488297 1 osd.1 0 0 0 0.488297 1 osd.2 0 0 0 0.195297 1 in 3 avg 0 stddev 0 (-nanx) (expected 0 -nanx)) size 3 32 ``` Since the cluster did not have enough OSDs (wanted 5 with `k=3,m=2`), the rule can be tested against the old crush map, with `--num-rep` representing the required OSDs, i.e. `k+m`: With the original erasure-coding profile (`k+m=3`), everything looks good -- no bad mappings. ``` [root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 3 --show-bad-mappings // healthy ``` With `k+m=5`, though -- or anything greater than `3` OSDs... ``` [root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 5 --show-bad-mappings ... bad mapping rule 20 x 1002 num_rep 5 result [0,2147483647,1,2,2147483647] bad mapping rule 20 x 1003 num_rep 5 result [0,2147483647,2,1,2147483647] bad mapping rule 20 x 1004 num_rep 5 result [1,0,2147483647,2,2147483647] bad mapping rule 20 x 1005 num_rep 5 result [0,1,2147483647,2,2147483647] bad mapping rule 20 x 1006 num_rep 5 result [0,1,2147483647,2,2147483647] bad mapping rule 20 x 1007 num_rep 5 result [0,1,2147483647,2147483647,2] bad mapping rule 20 x 1008 num_rep 5 result [1,2,0,2147483647,2147483647] bad mapping rule 20 x 1009 num_rep 5 result [2,1,0,2147483647,2147483647] bad mapping rule 20 x 1010 num_rep 5 result [0,1,2,2147483647,2147483647] bad mapping rule 20 x 1011 num_rep 5 result [0,2147483647,2,1,2147483647] bad mapping rule 20 x 1012 num_rep 5 result [0,1,2147483647,2,2147483647] bad mapping rule 20 x 1013 num_rep 5 result [0,2,2147483647,1,2147483647] bad mapping rule 20 x 1014 num_rep 5 result [1,0,2147483647,2,2147483647] bad mapping rule 20 x 1015 num_rep 5 result [2,0,2147483647,1,2147483647] bad mapping rule 20 x 1016 num_rep 5 result [1,0,2,2147483647,2147483647] bad mapping rule 20 x 1017 num_rep 5 result [2,0,1,2147483647,2147483647] bad mapping rule 20 x 1018 num_rep 5 result [1,0,2147483647,2147483647,2] bad mapping rule 20 x 1019 num_rep 5 result [0,1,2,2147483647,2147483647] bad mapping rule 20 x 1020 num_rep 5 result [0,2147483647,1,2147483647,2] bad mapping rule 20 x 1021 num_rep 5 result [2,1,0,2147483647,2147483647] bad mapping rule 20 x 1022 num_rep 5 result [1,0,2,2147483647,2147483647] bad mapping rule 20 x 1023 num_rep 5 result [0,2,1,2147483647,2147483647] ``` Mappings were found on _3_ OSDs, but missing the 4th and 5th reference as indicated by the largest 32-bit int (i.e. missing). The object storage data would still have been lost, but it could have made the recovery of the cluster significantly less painful.