diff --git a/2024-08-10-recovering-ceph-cluster.md b/2024-08-10-recovering-ceph-cluster.md
index d4f4f34..f604b21 100644
--- a/2024-08-10-recovering-ceph-cluster.md
+++ b/2024-08-10-recovering-ceph-cluster.md
@@ -1,5 +1,5 @@
 ---
-title: "recovering a rook-ceph cluster when all hope seems to be lost"
+title: recovering a rook-ceph cluster
 author: "0x3bb"
 date: M08-10-2024
 ---
@@ -19,7 +19,6 @@ During this incident, I found myself in exactly that position, relying heavily o
 (great) documentation for both _rook_ and _ceph_ itself.
 
 # the beginning
-
 In the process of moving nodes, **I accidentally zapped 2 OSDs**.
 
 Given that the _"ceph cluster"_ is basically two bedrooms peered together, we
@@ -29,7 +28,6 @@ This sounds bad, but it was fine: mons were still up, just a matter of
 removing the old OSDs from the tree and letting replication work its magic.
 
 # the real mistake
-
 I noticed that although the block pool was replicated, we had lost all our
 RADOS object storage.
@@ -43,7 +41,6 @@ lost.
 Nothing critical -- we just needed our CSI volume mounts back online for
 databases -- packages and other artifacts could easily be restored from
 backups.
-
 Moving on from this, the first thing I did was _"fix"_ the EC configuration
 to `k=3, m=2`. This would spread the data over 5 OSDs.
@@ -62,7 +59,6 @@ Of course I want that -- the number is bigger than the previous number so
 everything is going to be better than before.
 
 # downtime
-
 Following the merge, all services on the cluster went down at once. I checked
 the OSDs which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch
 of gibberish and decided to check out the GitHub issues. Since nothing is ever
@@ -90,13 +86,11 @@ This makes sense as there weren't enough OSDs to split the data.
 # from bad to suicide
 
 ## reverting erasure-coding profile
-
 The first attempt I made was to revert the EC profile back to `k=2, m=1`.
 
 The OSDs were still in the same state, complaining about the erasure-coding
 profile.
 
 ## causing even more damage
-
 The second attempt (and in hindsight, a very poor choice) was to zap the other
 two underlying OSD disks:
@@ -133,7 +127,6 @@ many more mistakes. If I was going to continue I needed to backup the logical
 volume used by the `osd-0` node before continuing, which I did.
 
 ## clutching at mons
-
 I switched my focus to a new narrative: _something was wrong with the mons_.
 
 They were in quorum but I still couldn't figure out why the now last-surviving OSD
@@ -165,7 +158,6 @@ outside of the deployment.
 The only meaningful change we had made was the erasure-coding profile.
 
 # initial analysis
-
 First, I looked back to the OSD logs. They are monstrous, so I focused on the
 erasure-coding errors:
 ```
@@ -204,7 +196,6 @@ Accepting the loss of the miniscule data on the object storage pool in favor of
 saving the block storage, I could correct the misconfiguration.
 
 # preparation
-
 To avoid troubleshooting issues caused by my failed attempts, I decided to
 clear out the existing CRDs and just focus first on getting the
 OSD with the data back online. If I ever got the data back, then I'd probably
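The actual clear-out is mostly finalizer wrangling. A minimal sketch of the pattern -- assuming the default `rook-ceph` namespace and operator deployment name, not the exact commands I ran during the incident:

```
# stop the operator so it can't recreate objects mid-teardown
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

# for every rook CRD type: strip finalizers, then delete the resources
for crd in $(kubectl get crds -o name | grep ceph.rook.io); do
  for cr in $(kubectl -n rook-ceph get "${crd##*/}" -o name 2>/dev/null); do
    kubectl -n rook-ceph patch "$cr" --type merge -p '{"metadata":{"finalizers":[]}}'
    kubectl -n rook-ceph delete "$cr" --wait=false
  done
done
```

Deleting with `--wait=false` keeps the loop from hanging on resources whose finalizers the (now stopped) operator would normally clean up.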
@@ -214,19 +205,16 @@ be conscious of prior misconfiguration and have to do so regardless.
 - clear out the `rook-ceph` namespace;
 
 ## backups
-
 - the logical volume for `osd-0`, so I can re-attach it and afford mistakes;
 - `/var/lib/rook` on all nodes, containing mon data;
 
 ## removal
 
 ### deployments/daemonsets
-
 These were the first to go, as I didn't want the `rook-operator` persistently
 creating Kubernetes objects when I was actively trying to kill them.
 
 ### crds
-
 Removal of all the `rook-ceph` resources, and their finalizers to protect them from being removed:
 
 - `cephblockpoolradosnamespaces`
@@ -247,11 +235,9 @@ Removal of all the `rook-ceph` resources, and their finalizers to protect them f
 - `cephrbdmirrors`
 
 ### /var/lib/rook
-
 I had these backed up for later, but I didn't want them there when the cluster came online.
 
 ### osd disks
-
 I did not wipe any devices.
 
 First, I obviously didn't want to wipe the disk with the data on it. As for the
@@ -264,7 +250,6 @@ analysing the status reported from `osd-1` and `osd-2`.
 
 ## provisioning
-
 Since at this point I only cared about `osd-0` and it was beneficial to have
 fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count
 to `1` within the helm `values.yaml`.
@@ -283,7 +268,6 @@ osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph clus
 ```
 
 # surgery
-
 With less noise and a clean slate, it was time to attempt to fix this mess.
 
 - adopt `osd-0` to the new cluster;
@@ -291,7 +275,6 @@ With less noise and a clean slate, it was time to attempt to fix this mess.
 - bring up two new OSDs for replication;
 
 ## osd-0
-
 I started trying to determine how I would _safely_ remove the offending
 objects. If that happened, then the OSD would have no issues with the
 erasure-coding profile since the pgs wouldn't exist, and the OSD daemon should
@@ -320,10 +303,8 @@ Once you execute that command, it will scale the OSD daemon down and create a
 new deployment that mirrors the configuration but _without_ the daemon running
 in order to perform maintenance.
 
-
 Now in a shell of the debug OSD container, I confirmed these belonged to the
 object storage pool.
-
 ```
 [root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
 PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS
@@ -357,7 +338,6 @@ PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_
 12.1b 0 0 0 0 0 0 0 0 unknown 8h
 ```
-
 Seeing this, I first checked to see how many placement groups prefixed with
 `12` existed using the actual path to the OSD.
 
 ```
@@ -439,13 +419,12 @@ Remove successful
 setting '_remove' omap key
 finish_remove_pgs 12.0s0_head removing 12.0s0
 Remove successful
-``````
+```
 
 I did this for every PG listed above. Once I scaled down the maintenance
 deployment, I then scaled back `deployment/rook-ceph-osd-0` to start the daemon
 with (hopefully) agreeable placement groups and thankfully, it had come alive.
-
 ```
 k get pods -n rook-ceph
@@ -475,7 +454,6 @@ saving our data.
 ## mons
 
 ### restoring
-
 I had `/var/lib/rook` backups from each node with the old mon data. At this
 point, with the correct number of placement groups and seeing 100% of them
 remaining in an `unknown` state, it seemed the next step was to restore the
@@ -519,7 +497,6 @@ Rescheduling the deployment and although the mon log output isn't giving me
 suggestions of suicide, all our pgs still remain in an `unknown` state.
 
 ## recovering the mon store
-
 It turns out that you can actually recover the mon store. It's not a huge deal
 so long as your OSDs have data integrity.
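The upstream docs call this _recovery using OSDs_: every OSD carries enough of the cluster map that the mon store can be rebuilt from the OSDs themselves. A rough sketch of that procedure -- paths are illustrative (a stock ceph layout rather than rook's), and the OSD daemons must be stopped while the tool runs:

```
ms=/tmp/monstore
mkdir -p "$ms"

# collect the cluster map from every OSD on this host
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path "$osd" --no-mon-config \
    --op update-mon-db --mon-store-path "$ms"
done

# rebuild store.db from the collected maps, regenerating auth from the admin keyring
ceph-monstore-tool "$ms" rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring
```

The rebuilt `store.db` then replaces the one in the mon's data directory, and the mon comes back with a view of the cluster reconstructed from what the OSDs knew.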
@@ -584,7 +561,6 @@ It seemed like a miracle, but it is entirely credited to how resilient ceph is
 built to tolerate that level of abuse.
 
 # why
-
 _Data appears to be lost_
 
 - ceph OSD daemons fail to start;