---
title: "recovering a rook-ceph cluster when all hope seems to be lost"
author: "0x3bb"
date: 2024-08-10
---

I share a _rook-ceph_ cluster with a friend of mine approximately 2000km
away. I had reservations about whether this would be a good idea at first
because of the latency and the fact that consumer-grade WAN might bring
anything that breathes to its knees. However, I'm pleased with how robust it
has been under those circumstances.

Getting started with _rook-ceph_ is really simple because it orchestrates
everything for you. That's convenient, although if and when you have problems,
you can suddenly become the operator of a distributed system that you may know
very little about due to the abstraction.

During this incident, I found myself in exactly that position, relying heavily
on the (great) documentation for both _rook_ and _ceph_ itself.

# the beginning

In the process of moving nodes, **I accidentally zapped 2 OSDs**.

Given that the _"ceph cluster"_ is basically two bedrooms peered together, we
were down to the final OSD.

This sounds bad, but it was fine: the mons were still up, so it was just a
matter of removing the old OSDs from the tree and letting replication work its
magic.

# the real mistake

I noticed that although the block pool was replicated, we had lost all our
RADOS object storage.



The erasure-coding profile was `k=2, m=1`. That meant we could only tolerate
the loss of a single OSD -- and we had just lost 2.

Object storage (which our applications interfaced with via the Ceph S3
Gateway) was lost. Nothing critical -- we just needed our CSI volume mounts
back online for databases -- packages and other artifacts could easily be
restored from backups.

Moving on from this, the first thing I did was _"fix"_ the EC configuration to
`k=3, m=2`. This would spread the data over 5 OSDs.

```
- dataChunks: 2
- codingChunks: 1
+ dataChunks: 3
+ codingChunks: 2
```

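In hindsight, the profile and its `k+m` requirement could have been
sanity-checked from the toolbox before applying anything -- a minimal sketch,
assuming the toolbox deployment is enabled and using the profile name that
shows up later in the logs:

```
# inspect the EC profile backing the object store data pool
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
  ceph osd erasure-code-profile get ceph-objectstore.rgw.buckets.data_ecprofile

# with crush-failure-domain=host, each of the k+m chunks needs its own host,
# so k=3,m=2 wants 5 OSD hosts -- this cluster had 3
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
```
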
Happy with that, I restored a backup of the objects. Everything was working.

A few days later, _renovatebot_ arrived with a new PR to bump _rook-ceph_ to
`1.14.9`.

Of course I want that -- the number is bigger than the previous number, so
everything is going to be better than before.

# downtime

Following the merge, all services on the cluster went down at once. I checked
the OSDs, which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch
of gibberish and decided to check out the GitHub issues. Since nothing is ever
my fault, I decided to see who or what was to blame.

With no clues, I still hoped this would be relatively simple to fix.

Resigning myself to actually reading the OSD logs in the
`rook-ceph-crashcollector` pods, I saw (but did not understand) the problem:

`osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)`



With the _"fixed"_ configuration, what I had actually done was split the
object store pool across _5_ OSDs. We had _3_.

Due to the `rook-ceph-operator` being rescheduled by the version bump, the OSD
daemons had been restarted as part of the update procedure and now demanded an
answer for data and coding chunks that simply did not exist. Sure enough,
`ceph -s` also reported undersized placement groups.

This made sense, as there weren't enough OSDs to split the data across.

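The pool-to-profile mapping that the assert was complaining about can also be
read directly from ceph -- a small sketch, again via the toolbox:

```
# which pgs are undersized/unknown, and why
ceph health detail

# each pool's replication or erasure-coding setup; the object store data pool
# should list the ceph-objectstore.rgw.buckets.data_ecprofile profile
ceph osd pool ls detail
```
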
# from bad to suicide

## reverting erasure-coding profile

The first attempt I made was to revert the EC profile back to `k=2, m=1`. The
OSDs stayed in the same state, still complaining about the erasure-coding
profile.

## causing even more damage

The second attempt (and in hindsight, a very poor choice) was to zap the other
two underlying OSD disks:

[ `osd-1`, `osd-2` ].

`zap.sh`

```
DISK="/dev/vdb"

# Zap the disk to a fresh, usable state (zap-all is important, b/c MBR has to be clean)
sgdisk --zap-all $DISK

# Wipe a large portion of the beginning of the disk to remove more LVM metadata that may be present
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync

# SSDs may be better cleaned with blkdiscard instead of dd
blkdiscard $DISK

# Inform the OS of partition table changes
partprobe $DISK
```

Perhaps having two other OSDs online would allow me to replicate the healthy
pgs without the offending RADOS objects.

Sure enough, the 2 new OSDs started.

Since `osd-0` -- the one with the actual data -- still wouldn't start, the
cluster was still in a broken state.

Now down to the last OSD, I knew at this point that I was going to make many,
many more mistakes. If I was going to continue, I needed to back up the
logical volume used by `osd-0` first, which I did.

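For the backup itself, a raw copy of the logical volume is enough -- a minimal
sketch, where the LV path is illustrative (check `lvs` for the real
`ceph-*`/`osd-block-*` names on the node):

```
# find the ceph-volume LV backing osd-0 (names below are illustrative)
lvs -o lv_name,vg_name,lv_size

# raw copy of the logical volume to somewhere safe
dd if=/dev/ceph-vg/osd-block-0 of=/mnt/backup/osd-0.img bs=4M status=progress conv=fsync
```
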
## clutching at mons

I switched my focus to a new narrative: _something was wrong with the mons_.

They were in quorum, but I still couldn't figure out why the last surviving
OSD was having issues starting.

The `mon_host` configuration was correct in `secret/rook-ceph-config`:

`mon_host: '[v2:10.50.1.10:3300,v1:10.50.1.10:6789],[v2:10.50.1.11:3300,v1:10.50.1.11:6789],[v2:10.55.1.10:3300,v1:10.55.1.10:6789]'`

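(That value can be read straight out of the secret -- a quick sketch:)

```
kubectl -n rook-ceph get secret rook-ceph-config \
  -o jsonpath='{.data.mon_host}' | base64 -d
```
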
Nothing had changed with the underlying data on those mons. Maybe there was
corruption in the monitor store? The monitors maintain a map of the cluster
state: the `osdmap`, the `crushmap`, etc.

My theory was: if the cluster map did not have the correct placement groups
and other OSD metadata, then perhaps replacing it would help.

I replaced the `store.db` data on the mon associated with the failing OSD
deployment with a copy from another mon and scaled the deployment back up:

`2024-08-04T00:49:47.698+0000 7f12fc78f700 0 mon.s@2(probing) e30 removed from monmap, suicide.`

With all data potentially lost and it being almost 1AM, that message was not
very reassuring. I did manually change the monmap and inject it back in, but
ended up back in the same position.

I figured I had done enough experimenting at this point and had to look
deeper, outside of the deployment. The only meaningful change we had made was
the erasure-coding profile.

# initial analysis

First, I looked back at the OSD logs. They are monstrous, so I focused on the
erasure-coding errors:

```
2024-08-03T18:58:49.845+0000 7f2a7d6da700 1 osd.0 pg_epoch: 8916
pg[12.11s2( v 8915'287 (0'0,8915'287] local-lis/les=8893/8894 n=23 ec=5063/5059
lis/c=8893/8413 les/c/f=8894/8414/0 sis=8916 pruub=9.256847382s)
[1,2,NONE]p1(0) r=-1 lpr=8916 pi=[8413,8916)/1 crt=8915'287 mlcod 0'0 unknown
NOTIFY pruub 20723.630859375s@ mbc={}] state<Start>: transitioning to Stray
2024-08-03T18:58:49.849+0000 7f2a7ced9700 -1G
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h:
In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
7f2a7e6dc700 time 2024-08-03T18:58:49.853351+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h:
34: FAILED ceph_assert(stripe_width % stripe_size == 0)
```

```
-2> 2024-08-03T18:59:00.086+0000 7ffa48b48640 5 osd.2 pg_epoch: 8894 pg[12.9(unlocked)] enter Initial
...
/src/osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)
```

I noticed a pattern: all the failing pg IDs were prefixed with `12`.

Seeing this, I concluded:
- the mons were ok and in quorum;
- the `osd-0` daemon failed to start;
- the other, fresh OSDs (`osd-1` and `osd-2`) started fine, so this was a data
  integrity issue confined to `osd-0` (and to the previous OSDs, had I not
  nuked them);
- the cause was the change to the erasure-coding profile, which affected only
  the one pool whose chunk distribution was modified.

Accepting the loss of the minuscule amount of data on the object storage pool
in favor of saving the block storage, I could correct the misconfiguration.

# preparation

To avoid troubleshooting issues caused by my failed attempts, I decided I
would clear out the existing CRDs and focus first on getting the OSD with the
data back online. If I ever got the data back, I'd be wary of the prior
misconfiguration and would have to do that anyway.

- back up the important shit;
- clear out the `rook-ceph` namespace;

## backups

- the logical volume for `osd-0`, so I could re-attach it and afford mistakes;
- `/var/lib/rook` on all nodes, containing the mon data;

## removal

### deployments/daemonsets

These were the first to go, as I didn't want the `rook-operator` persistently
recreating Kubernetes objects while I was actively trying to delete them.

### crds

Removal of all the `rook-ceph` resources, along with the finalizers that
protect them from being removed (a sketch of that loop follows the list
below):

- `cephblockpoolradosnamespaces`
- `cephblockpools`
- `cephbucketnotifications`
- `cephclients`
- `cephclusters`
- `cephcosidrivers`
- `cephfilesystemmirrors`
- `cephfilesystems`
- `cephfilesystemsubvolumegroups`
- `cephnfses`
- `cephobjectrealms`
- `cephobjectstores`
- `cephobjectstoreusers`
- `cephobjectzonegroups`
- `cephobjectzones`
- `cephrbdmirrors`

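The finalizer removal is the usual `kubectl patch` finalizer-stripping trick
-- a minimal sketch, looping over the CRD kinds above:

```
# strip finalizers so deletion isn't blocked, then delete -- repeat per kind
for kind in cephclusters cephblockpools cephobjectstores cephobjectstoreusers; do
  for r in $(kubectl -n rook-ceph get "$kind" -o name); do
    kubectl -n rook-ceph patch "$r" --type merge -p '{"metadata":{"finalizers":[]}}'
    kubectl -n rook-ceph delete "$r" --wait=false
  done
done
```
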
### /var/lib/rook

I had these backed up for later, but I didn't want them there when the cluster
came online.

### osd disks

I did not wipe any devices.

First, I obviously didn't want to wipe the disk with the data on it. As for
the other, now-useless OSDs that I had mistakenly created over the old ones: I
knew spawning the `rook-operator` would create new OSDs on them only if they
didn't belong to an old ceph cluster.

This would make troubleshooting `osd-0` more difficult, as I'd now have to
consider analysing the status reported by `osd-1` and `osd-2`.

## provisioning

Since at this point I only cared about `osd-0`, and it was beneficial to have
fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count
to `1` within the helm `values.yaml`.

Following this, I simply reconciled the chart.

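If the chart were driven directly with helm rather than through GitOps, the
same change would be a one-liner -- a sketch, assuming the usual `rook-release`
repo and the `cephClusterSpec` values path of the `rook-ceph-cluster` chart:

```
helm upgrade rook-ceph-cluster rook-release/rook-ceph-cluster \
  -n rook-ceph --reuse-values \
  --set cephClusterSpec.mon.count=1
```
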
I noticed that `rook-ceph-operator`, `rook-ceph-mon-a` and `rook-ceph-mgr-a`
came online as expected.

Because the OSDs were part of an old cluster, I now had a ceph cluster with no
OSDs, as shown in the `rook-ceph-osd-prepare-*` jobs for each node.

```
osd.0: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
osd.1: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
```

# surgery

With less noise and a clean slate, it was time to attempt to fix this mess.

- adopt `osd-0` into the new cluster;
- remove the corrupted pgs from `osd-0`;
- bring up two new OSDs for replication;

## osd-0

I started by trying to determine how I could _safely_ remove the offending
objects. If that happened, the OSD would have no issue with the erasure-coding
profile since the pgs wouldn't exist, and the OSD daemon should start.

- If the placement groups contained only objects created by the
  _RADOS Object Gateway_, then I could simply remove the pgs.

- If, however, the pgs contained both the former _and_ block device objects,
  it would require careful removal of all non-rbd (block storage) objects, as
  purging the entire placement groups would lose valuable data.

Since OSD pools have a `1:N` relationship with pgs, the second scenario seemed
unlikely, perhaps impossible.

Next, I needed to inspect the OSD somehow, because the existing deployment
would continuously crash.

`kubectl rook-ceph debug start rook-ceph-osd-0`

Running this command allowed me to observe the OSD without it actually joining
the cluster. The "real" OSD deployment need only be scheduled; it being in a
continuous crash loop was ok.

Once you execute that command, it will scale the OSD daemon down and create a
new deployment that mirrors the configuration but _without_ the daemon
running, in order to perform maintenance.

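The full cycle with the `kubectl rook-ceph` plugin looks roughly like this --
a sketch, with the maintenance deployment name taken from the pod names that
show up below:

```
# scale the real OSD down and start a maintenance copy without the daemon
kubectl rook-ceph debug start rook-ceph-osd-0

# shell into the maintenance pod to run ceph-objectstore-tool and friends
kubectl -n rook-ceph exec -it deploy/rook-ceph-osd-0-maintenance -- bash

# when finished, remove the maintenance deployment and scale the OSD back up
kubectl rook-ceph debug stop rook-ceph-osd-0
```
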
Now in a shell in the debug OSD container, I confirmed the failing pgs
belonged to the object storage pool.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS
12.0 0 0 0 0 0 0 0 0 unknown 8h
12.1 0 0 0 0 0 0 0 0 unknown 8h
12.2 0 0 0 0 0 0 0 0 unknown 8h
12.3 0 0 0 0 0 0 0 0 unknown 8h
12.4 0 0 0 0 0 0 0 0 unknown 8h
12.5 0 0 0 0 0 0 0 0 unknown 8h
12.6 0 0 0 0 0 0 0 0 unknown 8h
12.7 0 0 0 0 0 0 0 0 unknown 8h
12.8 0 0 0 0 0 0 0 0 unknown 8h
12.9 0 0 0 0 0 0 0 0 unknown 8h
12.a 0 0 0 0 0 0 0 0 unknown 8h
12.b 0 0 0 0 0 0 0 0 unknown 8h
12.c 0 0 0 0 0 0 0 0 unknown 8h
12.d 0 0 0 0 0 0 0 0 unknown 8h
12.e 0 0 0 0 0 0 0 0 unknown 8h
12.f 0 0 0 0 0 0 0 0 unknown 8h
12.10 0 0 0 0 0 0 0 0 unknown 8h
12.11 0 0 0 0 0 0 0 0 unknown 8h
12.12 0 0 0 0 0 0 0 0 unknown 8h
12.13 0 0 0 0 0 0 0 0 unknown 8h
12.14 0 0 0 0 0 0 0 0 unknown 8h
12.15 0 0 0 0 0 0 0 0 unknown 8h
12.16 0 0 0 0 0 0 0 0 unknown 8h
12.17 0 0 0 0 0 0 0 0 unknown 8h
12.18 0 0 0 0 0 0 0 0 unknown 8h
12.19 0 0 0 0 0 0 0 0 unknown 8h
12.1a 0 0 0 0 0 0 0 0 unknown 8h
12.1b 0 0 0 0 0 0 0 0 unknown 8h
```

Seeing this, I checked how many placement groups prefixed with `12` existed on
disk, using the actual path to the OSD data.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep ^12
12.bs1
12.6s1
12.1fs0
12.1ds1
12.15s0
12.16s0
12.11s0
12.12s2
12.0s0
12.17s2
12.4s1
12.9s0
12.19s0
12.cs2
12.13s0
12.14s2
12.3s2
12.1as0
12.1bs2
12.as1
12.1es1
12.1cs2
12.2s2
12.8s1
12.7s2
12.ds0
12.es0
12.fs0
12.18s0
12.1s0
12.5s1
12.10s2
```

I still needed to be convinced I wasn't removing any valuable data.
I inspected a few of them to be sure.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 12.10s0 --op list
"12.10s0", {"oid" :"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/07/19/071984080b32e2867 f 1ac6ec2b7d2b8724bc5d75e2850b5e7 f20040ee52F55d1.2~e7rYg3S
"hash" :1340195137, "max":0, "pool":12, "namespace":"","shard_id":2, "max":0}

"12.10s0", {"oid" :"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/9a/82/9a82a64c3a8439c75d8e584181427b073712afd1454747bec3dcb84bcbe19ac5. 2~urbG4nd
"hash" :4175566657, "max":0, "pool":12, "namespace":"","shard_id":2, "max":0}

"12.10s0", {"oid" :"7d927F08-bd9b-4d4b-bfc1-d331eb216e68.22197937.1__shadow Windows Security Internals.pdf.2~g9stQ9inkWvsTq33S9z5xNEHEgST2H4.1_1","key":"", "snapid":-
"shard id":2,"max":0}]
...
```

With this information, I now knew:

- the log exceptions matched the pgs that were impacted by the change in the
  erasure-coding configuration;
- `ceph-objectstore.rgw.buckets.data` -- the pool whose configuration had been
  changed -- was the pool those pgs belonged to;
- the objects themselves looked familiar: they matched the contents of the
  buckets, e.g. books.

Since I _did_ modify the erasure-coding profile, this all started to make
sense.

The next operation was to carefully remove the offending placement groups.
Simply removing the pool wouldn't work: with the OSD daemon unable to start,
it would know nothing about that change and would still not have enough chunks
to come alive.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.17s2
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 12.17s2_head removing 12.17s2
Remove successful

[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.0s0
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 12.0s0_head removing 12.0s0
Remove successful
```

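Rather than running that by hand for each pg, the same thing can be looped --
a sketch, reusing the `list-pgs` output from earlier inside the maintenance
container:

```
# remove every pool-12 pg shard present on this OSD
for pg in $(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep '^12\.'); do
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --type bluestore --force --pgid "$pg" --op remove
done
```
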
I did this for every pg in the list. Once I scaled down the maintenance
deployment, I scaled `deployment/rook-ceph-osd-0` back up to start the daemon
with (hopefully) agreeable placement groups and, thankfully, it came alive.

```
k get pods -n rook-ceph

rook-ceph-osd-0-6f57466c78-bj96p 2/2 Running
```

An eager run of `ceph -s` produced both relief and disappointment. The OSD was up, but the pgs were in an `unknown` state.

```
ceph -s
...
  data:
    pools:   12 pools, 169 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             169 unknown
```

At this point, I had mentioned to my friend (helping by playing GuildWars 2)
that we might be saved. It seemed promising as we at least had `osd-0` running
again now that the troublesome pgs were removed.

He agreed, and contemplated changing his character's hair colour instead of
saving our data.

## mons

### restoring

I had `/var/lib/rook` backups from each node with the old mon data. At this
point, with the correct number of placement groups and seeing 100% of them
remaining in an `unknown` state, it seemed the next step was to restore the
mons.

I knew from reading the `rook-ceph` docs that if you want to restore the data
of a monitor to a new cluster, you have to inject the `monmap` into the old
mon's `store.db`.

Before doing this, I scaled `deployment/rook-ceph-mon-a` down to `0` first.

Then, navigating to a directory on my local machine with the backups, I ran a
container to modify the `monmap` on my local fs.

`docker run -it --rm --entrypoint bash -v .:/var/lib/rook rook/ceph:v1.14.9`

```
touch /etc/ceph/ceph.conf
cd /var/lib/rook
ceph-mon --extract-monmap monmap --mon-data ./mon-q/data
monmaptool monmap --rm q
monmaptool monmap --rm s
monmaptool monmap --rm t
```

Now that the old mons `q`, `s` and `t` were removed from the map, I had to add
the new cluster mon `rook-ceph-mon-a`, created as part of the new
ceph-cluster.

```
monmaptool monmap --addv a '[v2:10.50.1.10:3300,v1:10.50.1.10:6789]'
ceph-mon --inject-monmap monmap --mon-data ./mon-q/data
exit
```

Shoving it back up to the node `rook-ceph-mon-a` lives on:

`scp -r ./mon-q/data/* 3bb@10.50.1.10:/var/lib/rook/mon-a/data/`

I rescheduled the deployment, and although the mon log output was no longer
suggesting suicide, all our pgs still remained in an `unknown` state.

## recovering the mon store

It turns out that you can actually recover the mon store. It's not a huge deal
so long as your OSDs have data integrity.

Scaling the useless `mon-a` down, I copied the existing `mon-a` data onto the
`rook-ceph-osd-0` daemon container.

Another `osd-0` debug container... `k rook-ceph debug start rook-ceph-osd-0`

I rebuilt the mon data using the existing RocksDB kv store.

This would have worked without the backup, but I was interested to see the
`osdmaps` trimmed due to the other 2 removed OSDs.

```
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-0/ --op update-mon-db --mon-store-path /tmp/mon-a/data/
osd.0 : 3099 osdmaps trimmed, 635 osdmaps added.
```

```
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-authtool /tmp/mon-a/keyring -n mon. --cap mon 'allow *' --gen-key

[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-monstore-tool /tmp/mon-a/data rebuild -- --keyring /tmp/mon-a/keyring
4 rocksdb: [db/flush_job.cc:967] [default] [JOB 3] Level-0 flush table #3433: 62997231 bytes OK
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1722831731454649, "job": 3, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [2, 0, 0, 0, 0, 0, 2], "immutable_memtables": 1}
4 rocksdb: [file/delete_scheduler.cc:74] Deleted file /tmp/mon-a/data/store.db/003433.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1723067397472153, "job": 4, "event": "table_file_deletion", "file_number": 3433}
4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
```

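Getting the rebuilt store back under `mon-a` is just file shuffling -- a
sketch from the workstation side, with the maintenance pod name left as a
placeholder:

```
# pull the rebuilt store out of the maintenance pod, push it to the node
# hosting mon-a, then scale the mon back up
kubectl -n rook-ceph cp <osd-0-maintenance-pod>:/tmp/mon-a/data ./mon-a-rebuilt
scp -r ./mon-a-rebuilt/* 3bb@10.50.1.10:/var/lib/rook/mon-a/data/
kubectl -n rook-ceph scale deployment rook-ceph-mon-a --replicas=1
```
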
After copying the now _rebuilt_ `mon-a` store back, and bringing everything up
again, the cluster was finally resurrecting.

It took some time for the rebalancing and replication to occur, but hours
later, `ceph -s` reported a healthy cluster and services resumed being
entirely unaware of the chaos that had ensued over the previous few days:

```
  cluster:
    id:     47f25963-57c0-4b3b-9b35-bbf68c09eec6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: b(active, since 8h), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 3h), 3 in (since 3h)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 4.88k objects, 16 GiB
    usage:   47 GiB used, 1.1 TiB / 1.2 TiB avail
    pgs:     169 active+clean

  io:
    client:   639 B/s rd, 9.0 KiB/s wr, 1 op/s rd, 1 op/s wr
    recovery: 834 KiB/s, 6 objects/s
```

It seemed like a miracle, but the credit belongs entirely to how resilient
ceph is built to be, tolerating that level of abuse.

# why

_Data appears to be lost_

- ceph OSD daemons fail to start;
- the OSDs could not reconstruct the data from chunks;
- the `osdmap` referenced a faulty erasure-coding profile;
- the monstore `osdmap` still had a reference to the above erasure-coding
  profile;
- the erasure-coding profile was changed to a topology impossible to satisfy
  under the current architecture;
- 2 disks were zapped, hitting the ceiling of the failure domain for
  `ceph-objectstore.rgw.buckets.data_ecprofile`;

The monitor `osdmap` still contained the bad EC profile.

`ceph-monstore-tool /tmp/mon-bak get osdmap > osdmap.bad`

`osdmaptool --dump json osdmap.bad | grep -i profile`

```
"erasure_code_profiles":{
   "ceph-objectstore.rgw.buckets.data_ecprofile":{
      "crush-device-class":"",
      "crush-failure-domain":"host",
      "crush-root":"default",
      "jerasure-per-chunk-alignment":"false",
      "k":"3",
      "m":"2"
   }
}
```

After rebuilding the monstore...

`ceph-monstore-tool /tmp/mon-a get osdmap > osdmap.good`

```
"erasure_code_profiles":{
   "ceph-objectstore.rgw.buckets.data_ecprofile":{
      "crush-device-class":"",
      "crush-failure-domain":"host",
      "crush-root":"default",
      "jerasure-per-chunk-alignment":"false",
      "k":"3",
      "m":"1"
   }
}
```

Therefore, it seems as if I could have attempted to rebuild the monstore
first, possibly circumventing the `ECUtil.h` assert errors. The placement
groups on `osd-0` were still mapping to 3 OSDs, not 5.

```
[root@ad9e4c6e7343 rook]# osdmaptool --test-map-pgs-dump --pool 12 osdmap
osdmaptool: osdmap file 'osdmap'
pool 12 pg_num 32
12.0 [2147483647,2147483647,2147483647] -1
12.1 [2147483647,2147483647,2147483647] -1
12.2 [2147483647,2147483647,2147483647] -1
12.3 [2147483647,2147483647,2147483647] -1
12.4 [2147483647,2147483647,2147483647] -1
12.5 [2147483647,2147483647,2147483647] -1
12.6 [2147483647,2147483647,2147483647] -1
12.7 [2147483647,2147483647,2147483647] -1
12.8 [2147483647,2147483647,2147483647] -1
12.9 [2147483647,2147483647,2147483647] -1
12.a [2147483647,2147483647,2147483647] -1
12.b [2147483647,2147483647,2147483647] -1
12.c [2147483647,2147483647,2147483647] -1
12.d [2147483647,2147483647,2147483647] -1
12.e [2147483647,2147483647,2147483647] -1
12.f [2147483647,2147483647,2147483647] -1
12.10 [2147483647,2147483647,2147483647] -1
12.11 [2147483647,2147483647,2147483647] -1
12.12 [2147483647,2147483647,2147483647] -1
12.13 [2147483647,2147483647,2147483647] -1
12.14 [2147483647,2147483647,2147483647] -1
12.15 [2147483647,2147483647,2147483647] -1
12.16 [2147483647,2147483647,2147483647] -1
12.17 [2147483647,2147483647,2147483647] -1
12.18 [2147483647,2147483647,2147483647] -1
12.19 [2147483647,2147483647,2147483647] -1
12.1a [2147483647,2147483647,2147483647] -1
12.1b [2147483647,2147483647,2147483647] -1
12.1c [2147483647,2147483647,2147483647] -1
12.1d [2147483647,2147483647,2147483647] -1
12.1e [2147483647,2147483647,2147483647] -1
12.1f [2147483647,2147483647,2147483647] -1
#osd count first primary c wt wt
osd.0 0 0 0 0.488297 1
osd.1 0 0 0 0.488297 1
osd.2 0 0 0 0.195297 1
in 3
avg 0 stddev 0 (-nanx) (expected 0 -nanx))
size 3 32
```

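The `crush` file used below comes straight out of that same osdmap -- a quick
sketch:

```
# extract the crush map from the osdmap, and decompile it for reading
osdmaptool osdmap --export-crush crush
crushtool -d crush -o crush.txt
```
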
Since the cluster did not have enough OSDs (it wanted 5 with `k=3,m=2`), the
rule can be tested against the old crush map, with `--num-rep` representing
the required OSDs, i.e. `k+m`.

With the original erasure-coding profile (`k+m=3`), everything looks good --
no bad mappings.

```
[root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 3 --show-bad-mappings

// healthy
```

With `k+m=5`, though -- or anything greater than `3` OSDs...

```
[root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 5 --show-bad-mappings
...
bad mapping rule 20 x 1002 num_rep 5 result [0,2147483647,1,2,2147483647]
bad mapping rule 20 x 1003 num_rep 5 result [0,2147483647,2,1,2147483647]
bad mapping rule 20 x 1004 num_rep 5 result [1,0,2147483647,2,2147483647]
bad mapping rule 20 x 1005 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1006 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1007 num_rep 5 result [0,1,2147483647,2147483647,2]
bad mapping rule 20 x 1008 num_rep 5 result [1,2,0,2147483647,2147483647]
bad mapping rule 20 x 1009 num_rep 5 result [2,1,0,2147483647,2147483647]
bad mapping rule 20 x 1010 num_rep 5 result [0,1,2,2147483647,2147483647]
bad mapping rule 20 x 1011 num_rep 5 result [0,2147483647,2,1,2147483647]
bad mapping rule 20 x 1012 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1013 num_rep 5 result [0,2,2147483647,1,2147483647]
bad mapping rule 20 x 1014 num_rep 5 result [1,0,2147483647,2,2147483647]
bad mapping rule 20 x 1015 num_rep 5 result [2,0,2147483647,1,2147483647]
bad mapping rule 20 x 1016 num_rep 5 result [1,0,2,2147483647,2147483647]
bad mapping rule 20 x 1017 num_rep 5 result [2,0,1,2147483647,2147483647]
bad mapping rule 20 x 1018 num_rep 5 result [1,0,2147483647,2147483647,2]
bad mapping rule 20 x 1019 num_rep 5 result [0,1,2,2147483647,2147483647]
bad mapping rule 20 x 1020 num_rep 5 result [0,2147483647,1,2147483647,2]
bad mapping rule 20 x 1021 num_rep 5 result [2,1,0,2147483647,2147483647]
bad mapping rule 20 x 1022 num_rep 5 result [1,0,2,2147483647,2147483647]
bad mapping rule 20 x 1023 num_rep 5 result [0,2,1,2147483647,2147483647]
```

Mappings were found for _3_ OSDs, with the 4th and 5th slots left unfilled, as
indicated by the largest 32-bit int (`2147483647`, i.e. no OSD). The object
storage data would still have been lost, but rebuilding the monstore first
could have made the recovery of the cluster significantly less painful.