---
title: "recovering a rook-ceph cluster when all hope seems to be lost"
author: "0x3bb"
date: 08-10-2024
---

I share a _rook-ceph_ cluster with a friend of mine approximately 2000km
away. I had reservations about whether this would be a good idea at first
because of the latency, and because consumer-grade WAN might bring anything
that breathes to its knees. However, I'm pleased with how robust it has been
under those circumstances.

Getting started with _rook-ceph_ is really simple because it orchestrates
everything for you. That's convenient, although, if and when you have problems,
you can suddenly find yourself operating a distributed system you know very
little about, precisely because of that abstraction.

During this incident, I found myself in exactly that position, relying heavily
on the (great) documentation for both _rook_ and _ceph_ itself.

# the beginning

In the process of moving nodes, **I accidentally zapped 2 OSDs**.

Given that the _"ceph cluster"_ is basically two bedrooms peered together, we
were down to the final OSD.

This sounds bad, but it was fine: the mons were still up, so it was just a
matter of removing the old OSDs from the tree and letting replication work its
magic.

# the real mistake

I noticed that although the block pool was replicated, we had lost all our
RADOS object storage.

![](./images/ec-2-1.png)

The erasure-coding profile was `k=2, m=1`. That meant we could only tolerate
losing 1 OSD -- and I had just zapped 2.

Object storage (which our applications interfaced with via the Ceph S3 gateway)
was lost. Nothing critical -- we just needed our CSI volume mounts back online
for databases -- packages and other artifacts could easily be restored from
backups.

Moving on from this, the first thing I did was _"fix"_ the EC configuration to
`k=3, m=2`. This would spread the data over 5 OSDs.

```
- dataChunks: 2
- codingChunks: 1
+ dataChunks: 3
+ codingChunks: 2
```

Happy with that, I restored a backup of the objects. Everything was working.

A few days later, _renovatebot_ arrived with a new PR to bump _rook-ceph_ to
`1.14.9`.

Of course I want that -- the number is bigger than the previous number, so
everything is going to be better than before.

# downtime

Following the merge, all services on the cluster went down at once. I checked
the OSDs, which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch
of gibberish and decided to check out the GitHub issues. Since nothing is ever
my fault, I wanted to see who or what was to blame.

With no clues, I still hoped this would be relatively simple to fix.

Resigning myself to actually reading the OSD logs in the
`rook-ceph-crashcollector` pods, I saw (but did not understand) the problem:

`osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)`

![](./images/ec-3-2.png)

With the _"fixed"_ configuration, what I had actually done was split the object
store pool across _5_ OSDs. We had _3_.

Because the `rook-ceph-operator` was rescheduled by the version bump, the OSD
daemons had been restarted as part of the update procedure and now demanded
data and coding chunks that simply did not exist. Sure enough, `ceph -s` also
reported undersized placement groups, which makes sense: there weren't enough
OSDs to split the data across.
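In hindsight, what a profile demands versus what the cluster can actually
provide is cheap to check before changing anything. A minimal sketch (the
profile name is the one that shows up in the pool dump later in this post):

```
# the profile spells out k, m and the crush failure domain (host here);
# every pg in the pool wants k+m chunks, each on a distinct failure domain
ceph osd erasure-code-profile get ceph-objectstore.rgw.buckets.data_ecprofile

# ...so count the hosts/OSDs that can actually take one
ceph osd tree
```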
# from bad to suicide

## reverting erasure-coding profile

The first attempt I made was to revert the EC profile back to `k=2, m=1`. The
OSDs stayed in the same state, still complaining about the erasure-coding
profile.

## causing even more damage

The second attempt (and in hindsight, a very poor choice) was to zap the other
two underlying OSD disks:

[ `osd-1`, `osd-2` ].

`zap.sh`

```
DISK="/dev/vdb"

# Zap the disk to a fresh, usable state (zap-all is important, b/c MBR has to be clean)
sgdisk --zap-all $DISK

# Wipe a large portion of the beginning of the disk to remove more LVM metadata that may be present
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync

# SSDs may be better cleaned with blkdiscard instead of dd
blkdiscard $DISK

# Inform the OS of partition table changes
partprobe $DISK
```

Perhaps, I thought, having two other OSDs online would allow the healthy pgs to
replicate without the offending RADOS objects.

Sure enough, the 2 new OSDs started.

Since `osd-0` -- the one with the actual data -- still wouldn't start, the
cluster remained broken.

Now down to the last OSD, I knew I was going to make many, many more mistakes.
If I was going to continue, I needed to back up the logical volume used by
`osd-0` first, which I did.

## clutching at mons

I switched my focus to a new narrative: _something was wrong with the mons_.

They were in quorum, but I still couldn't figure out why the now last-surviving
OSD was having issues starting.

The `mon_host` configuration was correct in `secret/rook-ceph-config`:

`mon_host: '[v2:10.50.1.10:3300,v1:10.50.1.10:6789],[v2:10.50.1.11:3300,v1:10.50.1.11:6789],[v2:10.55.1.10:3300,v1:10.55.1.10:6789]'`

Nothing had changed with the underlying data on those mons. Maybe there was
corruption in the monitor store? The monitors maintain a map of the cluster
state: the `osdmap`, the `crushmap`, etc.

My theory was: if the cluster map did not have the correct placement groups and
other OSD metadata, then perhaps replacing it would help.

I copied the `store.db` data from another mon over the one used by the failing
OSD deployment and scaled the mon deployment back up:

`2024-08-04T00:49:47.698+0000 7f12fc78f700 0 mon.s@2(probing) e30 removed
from monmap, suicide.`

With all data potentially lost and it being almost 1AM, that message was not
very reassuring. I did manually change the monmap and inject it back in, but
ended up back in the same position.

I figured I had done enough experimenting at this point and had to look deeper,
outside of the deployment. The only meaningful change we had made was the
erasure-coding profile.
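For what it's worth, ruling the mons in or out is cheap from anywhere with a
working `ceph` CLI (the rook toolbox pod, for instance), and would have pointed
me away from this rabbit hole sooner:

```
# quorum membership and the current monmap
ceph quorum_status
ceph mon dump

# and whether any daemon has actually filed a crash report
ceph crash ls
```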
# initial analysis

First, I looked back at the OSD logs. They are monstrous, so I focused on the
erasure-coding errors:

```
2024-08-03T18:58:49.845+0000 7f2a7d6da700 1 osd.0 pg_epoch: 8916
pg[12.11s2( v 8915'287 (0'0,8915'287] local-lis/les=8893/8894 n=23 ec=5063/5059
lis/c=8893/8413 les/c/f=8894/8414/0 sis=8916 pruub=9.256847382s)
[1,2,NONE]p1(0) r=-1 lpr=8916 pi=[8413,8916)/1 crt=8915'287 mlcod 0'0 unknown
NOTIFY pruub 20723.630859375s@ mbc={}] state: transitioning to Stray
2024-08-03T18:58:49.849+0000 7f2a7ced9700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h:
In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
7f2a7e6dc700 time 2024-08-03T18:58:49.853351+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h:
34: FAILED ceph_assert(stripe_width % stripe_size == 0)
```

```
-2> 2024-08-03T18:59:00.086+0000 7ffa48b48640 5 osd.2 pg_epoch: 8894 pg[12.9(unlocked)] enter Initial
...
/src/osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)
```

I noticed a pattern: all the failing pg IDs were prefixed with `12`.

Seeing this, I concluded:

 - the mons were ok and in quorum;
 - the `osd-0` daemon failed to start;
 - the other, fresh OSDs (`osd-1` and `osd-2`) started fine, so this was a data
   integrity issue confined to `osd-0` (and the previous OSDs, had I not nuked
   them);
 - the cause was the change to the erasure-coding profile, which affected only
   the one pool where the chunk distribution was modified;

Accepting the loss of the minuscule amount of data on the object storage pool
in favor of saving the block storage, I could correct the misconfiguration.

# preparation

To avoid chasing issues caused by my own failed attempts, I decided to clear
out the existing CRDs and focus first on getting the OSD with the data back
online. If I ever got the data back, I'd be wary of the prior misconfiguration
and would have to do this anyway.

- back up the important shit;
- clear out the `rook-ceph` namespace;

## backups

- the logical volume for `osd-0`, so I could re-attach it and afford mistakes;
- `/var/lib/rook` on all nodes, containing mon data;

## removal

### deployments/daemonsets

These were the first to go, as I didn't want the `rook-operator` persistently
creating Kubernetes objects when I was actively trying to kill them.

### crds

Removal of all the `rook-ceph` resources, along with the finalizers that would
otherwise protect them from being removed:

- `cephblockpoolradosnamespaces`
- `cephblockpools`
- `cephbucketnotifications`
- `cephclients`
- `cephclusters`
- `cephcosidrivers`
- `cephfilesystemmirrors`
- `cephfilesystems`
- `cephfilesystemsubvolumegroups`
- `cephnfses`
- `cephobjectrealms`
- `cephobjectstores`
- `cephobjectstoreusers`
- `cephobjectzonegroups`
- `cephobjectzones`
- `cephrbdmirrors`

### /var/lib/rook

I had these directories backed up for later, but I didn't want them present
when the new cluster came online.

### osd disks

I did not wipe any devices.

First, I obviously didn't want to wipe the disk with the data on it. As for the
other, now-useless OSDs that I had mistakenly created over the old ones: I knew
that if I wiped them, spawning the `rook-operator` would create new OSDs on
them, since they would no longer belong to an old ceph cluster. This would make
troubleshooting `osd-0` more difficult, as I'd then have to consider the status
reported by `osd-1` and `osd-2` as well.
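One trick that makes the "should I wipe this?" decision less scary: the
bluestore label on the device records which OSD and which ceph cluster (`fsid`)
it was created for. Something along these lines (device path as in the zap
script above):

```
# prints osd_uuid, whoami and ceph_fsid straight off the device
ceph-bluestore-tool show-label --dev /dev/vdb
```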
## provisioning

Since at this point I only cared about `osd-0`, and it was beneficial to have
fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count to
`1` within the helm `values.yaml`.

Following this, I simply reconciled the chart.

The `rook-ceph-operator`, `rook-ceph-mon-a` and `rook-ceph-mgr-a` came online
as expected.

Because the OSDs were part of an old cluster, I now had a ceph cluster with no
OSDs, as shown in the `rook-ceph-osd-prepare-*` jobs for each node.

```
osd.0: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
osd.1: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
```

# surgery

With less noise and a clean slate, it was time to attempt to fix this mess.

- adopt `osd-0` into the new cluster;
- remove the corrupted pgs from `osd-0`;
- bring up two new OSDs for replication;

## osd-0

I started by trying to determine how I could _safely_ remove the offending
objects. If I could, the OSD would have no issue with the erasure-coding
profile, since the pgs wouldn't exist, and the OSD daemon should start.

 - If the placement groups contained only objects created by the
   _RADOS Object Gateway_, then I could simply remove the pgs.

 - If, however, the pgs contained both the former _and_ block device objects,
   it would require careful removal of all non-RBD (block storage) objects, as
   purging the entire placement groups would throw away valuable data.

Since pools have a `1:N` relationship with pgs -- each pg belongs to exactly
one pool -- the second scenario seemed unlikely, perhaps impossible.

Next, I needed a way to inspect the OSD, because the existing deployment would
continuously crash.

`kubectl rook-ceph debug start rook-ceph-osd-0`

Running this command allowed me to observe the OSD without it actually joining
the cluster. The "real" OSD deployment only needs to be scheduled -- that it
kept crashing was fine.

Once you execute that command, it scales the OSD daemon down and creates a new
deployment that mirrors its configuration but _without_ the daemon running, in
order to perform maintenance.
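Incidentally, the `12.` prefix on those failing pg IDs maps straight to a pool
ID, which the still-healthy mons can confirm without any debug pods:

```
# pg ids are <pool_id>.<pg_seq>; pool 12 should be the rgw data pool
ceph osd lspools
ceph osd pool ls detail | grep "^pool 12 "
```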
Now in a shell of the debug OSD container, I confirmed the failing pgs belonged
to the object storage pool.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
PG     OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  LOG_DUPS  STATE    SINCE  VERS
12.0   0        0         0          0        0      0            0           0    0         unknown  8h
12.1   0        0         0          0        0      0            0           0    0         unknown  8h
12.2   0        0         0          0        0      0            0           0    0         unknown  8h
12.3   0        0         0          0        0      0            0           0    0         unknown  8h
12.4   0        0         0          0        0      0            0           0    0         unknown  8h
12.5   0        0         0          0        0      0            0           0    0         unknown  8h
12.6   0        0         0          0        0      0            0           0    0         unknown  8h
12.7   0        0         0          0        0      0            0           0    0         unknown  8h
12.8   0        0         0          0        0      0            0           0    0         unknown  8h
12.9   0        0         0          0        0      0            0           0    0         unknown  8h
12.a   0        0         0          0        0      0            0           0    0         unknown  8h
12.b   0        0         0          0        0      0            0           0    0         unknown  8h
12.c   0        0         0          0        0      0            0           0    0         unknown  8h
12.d   0        0         0          0        0      0            0           0    0         unknown  8h
12.e   0        0         0          0        0      0            0           0    0         unknown  8h
12.f   0        0         0          0        0      0            0           0    0         unknown  8h
12.10  0        0         0          0        0      0            0           0    0         unknown  8h
12.11  0        0         0          0        0      0            0           0    0         unknown  8h
12.12  0        0         0          0        0      0            0           0    0         unknown  8h
12.13  0        0         0          0        0      0            0           0    0         unknown  8h
12.14  0        0         0          0        0      0            0           0    0         unknown  8h
12.15  0        0         0          0        0      0            0           0    0         unknown  8h
12.16  0        0         0          0        0      0            0           0    0         unknown  8h
12.17  0        0         0          0        0      0            0           0    0         unknown  8h
12.18  0        0         0          0        0      0            0           0    0         unknown  8h
12.19  0        0         0          0        0      0            0           0    0         unknown  8h
12.1a  0        0         0          0        0      0            0           0    0         unknown  8h
12.1b  0        0         0          0        0      0            0           0    0         unknown  8h
```

Seeing this, I first checked how many placement groups prefixed with `12`
existed, using the actual path to the OSD.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep ^12
12.bs1
12.6s1
12.1fs0
12.1ds1
12.15s0
12.16s0
12.11s0
12.12s2
12.0s0
12.17s2
12.4s1
12.9s0
12.19s0
12.cs2
12.13s0
12.14s2
12.3s2
12.1as0
12.1bs2
12.as1
12.1es1
12.1cs2
12.2s2
12.8s1
12.7s2
12.ds0
12.es0
12.fs0
12.18s0
12.1s0
12.5s1
12.10s2
```

I still needed to be convinced I wasn't removing any valuable data, so I
inspected a few of them to be sure.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 12.10s0 --op list
"12.10s0", {"oid":"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/07/19/071984080b32e2867f1ac6ec2b7d2b8724bc5d75e2850b5e7f20040ee52f55d1.2~e7rYg3S
"hash":1340195137, "max":0, "pool":12, "namespace":"", "shard_id":2, "max":0}

"12.10s0", {"oid":"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/9a/82/9a82a64c3a8439c75d8e584181427b073712afd1454747bec3dcb84bcbe19ac5.2~urbG4nd
"hash":4175566657, "max":0, "pool":12, "namespace":"", "shard_id":2, "max":0}

"12.10s0", {"oid":"7d927f08-bd9b-4d4b-bfc1-d331eb216e68.22197937.1__shadow_Windows Security Internals.pdf.2~g9stQ9inkWvsTq33S9z5xNEHEgST2H4.1_1", "key":"", "snapid":-
"shard_id":2, "max":0}]
...
```

With this information, I now knew:

 - the log exceptions matched the pgs impacted by the change to the
   erasure-coding configuration;
 - those pgs belonged to `ceph-objectstore.rgw.buckets.data`, the pool whose
   configuration had been changed;
 - the object names matched things I recognised from the buckets, e.g. books;

Since I _did_ modify the erasure-coding profile, this was all starting to make
sense.
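Before deleting anything, it's worth knowing that `ceph-objectstore-tool` can
also export individual pgs to files -- a much cheaper safety net than my full
logical-volume backup, and worth doing even for data you believe is expendable:

```
# dump a pg shard to a file before removing it (same path/pgid format as below)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
  --pgid 12.17s2 --op export --file /tmp/12.17s2.export
```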
Carefully, then, the next operation was to remove the offending placement
groups. Simply removing the pool wouldn't work: with the OSD daemon unable to
start, it would never learn about that change, and would still not have enough
chunks to come alive.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.17s2
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 12.17s2_head removing 12.17s2
Remove successful

[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.0s0
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 12.0s0_head removing 12.0s0
Remove successful
```

I did this for every pg listed above. Once I scaled down the maintenance
deployment, I scaled `deployment/rook-ceph-osd-0` back up to start the daemon
with (hopefully) agreeable placement groups -- and thankfully, it came alive.

```
k get pods -n rook-ceph

rook-ceph-osd-0-6f57466c78-bj96p 2/2 Running
```

An eager run of `ceph -s` produced both relief and disappointment. The OSD was
up, but the pgs were in an `unknown` state.

```
ceph -s
...
  data:
    pools:   12 pools, 169 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             169 unknown
```

At this point, I mentioned to my friend (helping by playing Guild Wars 2) that
we might be saved. It seemed promising: we at least had `osd-0` running again
now that the troublesome pgs were removed.

He agreed, and contemplated changing his character's hair colour instead of
saving our data.

## mons

### restoring

I had `/var/lib/rook` backups from each node with the old mon data. At this
point, with the correct number of placement groups but 100% of them stuck in an
`unknown` state, the next step seemed to be restoring the mons.

I knew from reading the `rook-ceph` docs that if you want to restore the data
of a monitor to a new cluster, you have to inject the `monmap` into the old
mon's `store.db`.

Before doing this, I scaled `deployment/rook-ceph-mon-a` down to `0`.

Then, navigating to a directory on my local machine with the backups, I ran a
container to modify the `monmap` on my local fs.

`docker run -it --rm --entrypoint bash -v .:/var/lib/rook rook/ceph:v1.14.9`

```
touch /etc/ceph/ceph.conf
cd /var/lib/rook
ceph-mon --extract-monmap monmap --mon-data ./mon-q/data
monmaptool monmap --rm q
monmaptool monmap --rm s
monmaptool monmap --rm t
```

Now that the old mons `q`, `s` and `t` were removed from the map, I had to add
the mon created for the new ceph cluster, `rook-ceph-mon-a`.

```
monmaptool monmap --addv a '[v2:10.50.1.10:3300,v1:10.50.1.10:6789]'
ceph-mon --inject-monmap monmap --mon-data ./mon-q/data
exit
```

Shoving it back up to the node `rook-ceph-mon-a` lives on:

`scp -r ./mon-q/data/* 3bb@10.50.1.10:/var/lib/rook/mon-a/data/`

After rescheduling the deployment, the mon log output was no longer giving me
suggestions of suicide, but all our pgs still remained in an `unknown` state.

## recovering the mon store

It turns out that you can actually recover the mon store. It's not a huge deal
so long as your OSDs have data integrity.

Scaling the useless `mon-a` down, I copied the existing `mon-a` data onto the
`rook-ceph-osd-0` daemon container.

Another `osd-0` debug container... `k rook-ceph debug start rook-ceph-osd-0`

I rebuilt the mon data, using the existing RocksDB kv store.
This would have worked without the backup, but I was interested to see the
`osdmaps` trimmed as a result of the other 2 removed OSDs.

```
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-0/ --op update-mon-db --mon-store-path /tmp/mon-a/data/
osd.0 : 3099 osdmaps trimmed, 635 osdmaps added.
```

```
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-authtool /tmp/mon-a/keyring -n mon. --cap mon 'allow *' --gen-key

[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-monstore-tool /tmp/mon-a/data rebuild -- --keyring /tmp/mon-a/keyring
4 rocksdb: [db/flush_job.cc:967] [default] [JOB 3] Level-0 flush table #3433: 62997231 bytes OK
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1722831731454649, "job": 3, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [2, 0, 0, 0, 0, 0, 2], "immutable_memtables": 1}
4 rocksdb: [file/delete_scheduler.cc:74] Deleted file /tmp/mon-a/data/store.db/003433.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1723067397472153, "job": 4, "event": "table_file_deletion", "file_number": 3433}
4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
```

After copying the now _rebuilt_ `mon-a` store back, and bringing everything up
again, the cluster was finally resurrecting.

It took some time for the rebalancing and replication to occur, but hours
later, `ceph -s` reported a healthy cluster, and services resumed, entirely
unaware of the chaos that had ensued over the previous few days:

```
  cluster:
    id:     47f25963-57c0-4b3b-9b35-bbf68c09eec6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: b(active, since 8h), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 3h), 3 in (since 3h)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 4.88k objects, 16 GiB
    usage:   47 GiB used, 1.1 TiB / 1.2 TiB avail
    pgs:     169 active+clean

  io:
    client:   639 B/s rd, 9.0 KiB/s wr, 1 op/s rd, 1 op/s wr
    recovery: 834 KiB/s, 6 objects/s
```

It seemed like a miracle, but the credit belongs entirely to how resilient ceph
is built to be -- it tolerated that level of abuse.

# why

_Data appears to be lost_ -- working backwards:

- the ceph OSD daemons failed to start;
- the OSDs could not reconstruct the data from chunks;
- the `osdmap` referenced a faulty erasure-coding profile;
- the mon store's `osdmap` still had a reference to that erasure-coding profile;
- the erasure-coding profile had been changed to a topology impossible to
  satisfy under the current architecture;
- 2 disks had been zapped, hitting the ceiling of the failure domain for
  `ceph-objectstore.rgw.buckets.data_ecprofile`;

The monitor `osdmap` still contained the bad EC profile:

`ceph-monstore-tool /tmp/mon-bak get osdmap > osdmap.bad`

`osdmaptool --dump json osdmap.bad | grep -i profile`

```
"erasure_code_profiles": {
    "ceph-objectstore.rgw.buckets.data_ecprofile": {
        "crush-device-class": "",
        "crush-failure-domain": "host",
        "crush-root": "default",
        "jerasure-per-chunk-alignment": "false",
        "k": "3",
        "m": "2"
    }
}
```
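As an aside: the `crush` map fed to the `crushtool` tests further below can be
pulled out in much the same way -- one of the following should do it, depending
on what you have to hand:

```
# from the osdmap we just dumped...
osdmaptool osdmap.bad --export-crush crush

# ...or straight from the monstore backup
ceph-monstore-tool /tmp/mon-bak get crushmap > crush
```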
After rebuilding the monstore...

`ceph-monstore-tool /tmp/mon-a get osdmap > osdmap.good`

```
"erasure_code_profiles": {
    "ceph-objectstore.rgw.buckets.data_ecprofile": {
        "crush-device-class": "",
        "crush-failure-domain": "host",
        "crush-root": "default",
        "jerasure-per-chunk-alignment": "false",
        "k": "3",
        "m": "1"
    }
}
```

Therefore, it seems as if I could have attempted to rebuild the monstore first,
possibly circumventing the `ceph_assert` failures in `ECUtil.h`. The placement
groups on `osd-0` were still mapping to 3 OSDs, not 5.

```
[root@ad9e4c6e7343 rook]# osdmaptool --test-map-pgs-dump --pool 12 osdmap
osdmaptool: osdmap file 'osdmap'
pool 12 pg_num 32
12.0   [2147483647,2147483647,2147483647]  -1
12.1   [2147483647,2147483647,2147483647]  -1
12.2   [2147483647,2147483647,2147483647]  -1
12.3   [2147483647,2147483647,2147483647]  -1
12.4   [2147483647,2147483647,2147483647]  -1
12.5   [2147483647,2147483647,2147483647]  -1
12.6   [2147483647,2147483647,2147483647]  -1
12.7   [2147483647,2147483647,2147483647]  -1
12.8   [2147483647,2147483647,2147483647]  -1
12.9   [2147483647,2147483647,2147483647]  -1
12.a   [2147483647,2147483647,2147483647]  -1
12.b   [2147483647,2147483647,2147483647]  -1
12.c   [2147483647,2147483647,2147483647]  -1
12.d   [2147483647,2147483647,2147483647]  -1
12.e   [2147483647,2147483647,2147483647]  -1
12.f   [2147483647,2147483647,2147483647]  -1
12.10  [2147483647,2147483647,2147483647]  -1
12.11  [2147483647,2147483647,2147483647]  -1
12.12  [2147483647,2147483647,2147483647]  -1
12.13  [2147483647,2147483647,2147483647]  -1
12.14  [2147483647,2147483647,2147483647]  -1
12.15  [2147483647,2147483647,2147483647]  -1
12.16  [2147483647,2147483647,2147483647]  -1
12.17  [2147483647,2147483647,2147483647]  -1
12.18  [2147483647,2147483647,2147483647]  -1
12.19  [2147483647,2147483647,2147483647]  -1
12.1a  [2147483647,2147483647,2147483647]  -1
12.1b  [2147483647,2147483647,2147483647]  -1
12.1c  [2147483647,2147483647,2147483647]  -1
12.1d  [2147483647,2147483647,2147483647]  -1
12.1e  [2147483647,2147483647,2147483647]  -1
12.1f  [2147483647,2147483647,2147483647]  -1
#osd   count  first  primary  c wt      wt
osd.0  0      0      0        0.488297  1
osd.1  0      0      0        0.488297  1
osd.2  0      0      0        0.195297  1
 in 3
 avg 0 stddev 0 (-nanx) (expected 0 -nanx))
size 3  32
```

Since the cluster did not have enough OSDs (it wanted 5 with `k=3,m=2`), the
rule can be tested against the old crush map, with `--num-rep` representing the
number of OSDs required, i.e. `k+m`.

With the original erasure-coding profile (`k+m=3`), everything looks good -- no
bad mappings.

```
[root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 3 --show-bad-mappings

// healthy
```

With `k+m=5`, though -- or anything requiring more than `3` OSDs...

```
[root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 5 --show-bad-mappings
...
bad mapping rule 20 x 1002 num_rep 5 result [0,2147483647,1,2,2147483647]
bad mapping rule 20 x 1003 num_rep 5 result [0,2147483647,2,1,2147483647]
bad mapping rule 20 x 1004 num_rep 5 result [1,0,2147483647,2,2147483647]
bad mapping rule 20 x 1005 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1006 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1007 num_rep 5 result [0,1,2147483647,2147483647,2]
bad mapping rule 20 x 1008 num_rep 5 result [1,2,0,2147483647,2147483647]
bad mapping rule 20 x 1009 num_rep 5 result [2,1,0,2147483647,2147483647]
bad mapping rule 20 x 1010 num_rep 5 result [0,1,2,2147483647,2147483647]
bad mapping rule 20 x 1011 num_rep 5 result [0,2147483647,2,1,2147483647]
bad mapping rule 20 x 1012 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1013 num_rep 5 result [0,2,2147483647,1,2147483647]
bad mapping rule 20 x 1014 num_rep 5 result [1,0,2147483647,2,2147483647]
bad mapping rule 20 x 1015 num_rep 5 result [2,0,2147483647,1,2147483647]
bad mapping rule 20 x 1016 num_rep 5 result [1,0,2,2147483647,2147483647]
bad mapping rule 20 x 1017 num_rep 5 result [2,0,1,2147483647,2147483647]
bad mapping rule 20 x 1018 num_rep 5 result [1,0,2147483647,2147483647,2]
bad mapping rule 20 x 1019 num_rep 5 result [0,1,2,2147483647,2147483647]
bad mapping rule 20 x 1020 num_rep 5 result [0,2147483647,1,2147483647,2]
bad mapping rule 20 x 1021 num_rep 5 result [2,1,0,2147483647,2147483647]
bad mapping rule 20 x 1022 num_rep 5 result [1,0,2,2147483647,2147483647]
bad mapping rule 20 x 1023 num_rep 5 result [0,2,1,2147483647,2147483647]
```

Mappings were found for _3_ OSDs, but the 4th and 5th references are missing,
as indicated by the largest 32-bit int (i.e. no OSD could be assigned). The
object storage data would still have been lost, but rebuilding the monstore
first could have made the recovery of the cluster significantly less painful.
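If there's a takeaway beyond "don't zap the wrong disks", it's that an
erasure-coding change is cheap to rehearse on a scratch pool before it goes
anywhere near a real one. A rough sketch (profile and pool names invented for
illustration):

```
# rehearse the proposed profile on a throwaway pool first
ceph osd erasure-code-profile set scratch-ec k=3 m=2 crush-failure-domain=host
ceph osd pool create scratch.ec 8 8 erasure scratch-ec

# on a 3-host cluster these pgs never reach active+clean -- the warning I needed
ceph pg ls-by-pool scratch.ec

# clean up (pool deletion requires mon_allow_pool_delete=true)
ceph osd pool rm scratch.ec scratch.ec --yes-i-really-really-mean-it
ceph osd erasure-code-profile rm scratch-ec
```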