---
title: recovering a rook-ceph cluster
author: "0x3bb"
date: 08-10-2024
---

During this incident, I found myself in exactly that position, relying heavily
on (great) documentation for both _rook_ and _ceph_ itself.

# the beginning

In the process of moving nodes, **I accidentally zapped 2 OSDs**.

Given that the _"ceph cluster"_ is basically two bedrooms peered together, we

This sounds bad, but it was fine: mons were still up, just a matter of removing
the old OSDs from the tree and letting replication work its magic.

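For anyone curious, cleaning dead OSDs out of the tree boils down to something
like the following (the OSD ids here are placeholders, not the ones from this
incident):

```
# mark the dead OSDs out, then purge them from the CRUSH map and auth database
ceph osd out osd.3 osd.4
ceph osd purge 3 --yes-i-really-mean-it
ceph osd purge 4 --yes-i-really-mean-it
```

With the dead OSDs gone from the tree, ceph backfills the surviving replicas
onto the remaining OSDs by itself.
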
# the real mistake

I noticed that although the block pool was replicated, we had lost all our RADOS
object storage.

lost. Nothing critical -- we just needed our CSI volume mounts back online for
databases -- packages and other artifacts could easily be restored from
backups.

Moving on from this, the first thing I did was _"fix"_ the EC configuration to
`k=3, m=2`. This would spread the data over 5 OSDs.

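For reference, defining such a profile in plain ceph looks roughly like this
(the profile name and failure domain are assumptions); note that an existing
erasure-coded pool cannot simply be re-striped to a new `k`/`m` after creation:

```
# a 3+2 profile: 3 data chunks, 2 coding chunks, so every object needs
# 5 chunks placed on 5 separate OSDs
ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=osd
ceph osd erasure-code-profile get ec-3-2
```
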
Of course I want that -- the number is bigger than the previous number so
everything is going to be better than before.

# downtime

Following the merge, all services on the cluster went down at once. I checked
the OSDs, which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch
of gibberish and decided to check out the GitHub issues. Since nothing is ever

This makes sense as there weren't enough OSDs to split the data.

# from bad to suicide

## reverting erasure-coding profile

The first attempt I made was to revert the EC profile back to `k=2, m=1`. The
OSDs were still in the same state, complaining about the erasure-coding
profile.

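Checking what a pool actually references is cheap; something along these lines
(the pool name is the object-store data pool, and the profile name is whatever
the first command reports):

```
# which EC profile does the pool use, and what does that profile contain?
ceph osd pool get ceph-objectstore.rgw.buckets.data erasure_code_profile
ceph osd erasure-code-profile get <profile-name>
```
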
## causing even more damage

The second attempt (and in hindsight, a very poor choice) was to zap the other
two underlying OSD disks:

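With `ceph-volume`, zapping a device looks something like this (purely
illustrative; the device path is a placeholder):

```
# destroy the LVM metadata and data on the device backing the OSD
ceph-volume lvm zap /dev/sdX --destroy
```
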
many more mistakes. If I was going to continue, I needed to back up the logical
volume used by the `osd-0` node first, which I did.

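A raw image of the logical volume is enough for that; something like the
following, with the volume group, LV and destination as placeholders:

```
# take a raw image of the LV backing osd-0 before doing anything else destructive
dd if=/dev/ceph-vg/osd-block-0 of=/mnt/backup/osd-0.img bs=4M status=progress conv=fsync
```
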
## clutching at mons

I switched my focus to a new narrative: _something was wrong with the mons_.

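Sanity-checking that theory is cheap; the usual checks are along these lines:

```
# are the monitors forming a quorum, and who is in it?
ceph mon stat
ceph quorum_status -f json-pretty
```
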
They were in quorum but I still couldn't figure out why the now last-surviving OSD

outside of the deployment. The only meaningful change we had made was the
erasure-coding profile.

# initial analysis

First, I looked back at the OSD logs. They are monstrous, so I focused on the
erasure-coding errors.

Accepting the loss of the minuscule data on the object storage pool in favor of
saving the block storage, I could correct the misconfiguration.

# preparation

To avoid troubleshooting issues caused by my failed attempts, I decided I would
clear out the existing CRDs and just focus first on getting the OSD with the
data back online. If I ever got the data back, then I'd probably be conscious
of prior misconfiguration and have to do so regardless.

- clear out the `rook-ceph` namespace;

## backups

- the logical volume for `osd-0`, so I can re-attach it and afford mistakes;
- `/var/lib/rook` on all nodes, containing mon data;

## removal

### deployments/daemonsets

These were the first to go, as I didn't want the `rook-operator` persistently
creating Kubernetes objects when I was actively trying to kill them.

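In practice that means stopping the operator first and then deleting the
workloads it manages; roughly (resource names are the chart defaults, which may
not match every install):

```
# stop the operator so it can't reconcile things back into existence
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

# then remove the remaining deployments and daemonsets in the namespace
kubectl -n rook-ceph delete deployment,daemonset --all
```
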
### crds

Removal of all the `rook-ceph` resources, along with their finalizers, which
otherwise protect them from being removed (a sketch of the finalizer removal
follows the list):

- `cephblockpoolradosnamespaces`
- `cephrbdmirrors`

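The finalizer stripping is the usual kubectl patch loop; a sketch for one of
the kinds above, repeated for each kind in the list:

```
# drop finalizers so deletion isn't blocked, then delete the resources
for cr in $(kubectl -n rook-ceph get cephrbdmirrors -o name); do
  kubectl -n rook-ceph patch "$cr" --type merge -p '{"metadata":{"finalizers":null}}'
done
kubectl -n rook-ceph delete cephrbdmirrors --all
```
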
### /var/lib/rook

I had these backed up for later, but I didn't want them there when the cluster came online.

### osd disks

I did not wipe any devices.

First, I obviously didn't want to wipe the disk with the data on it. As for the

analysing the status reported from `osd-1` and `osd-2`.

## provisioning

Since at this point I only cared about `osd-0` and it was beneficial to have
fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count to `1`
within the helm `values.yaml`.

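With the `rook-ceph-cluster` chart that is a one-line override; assuming the
usual release and chart names, something like:

```
# run a single mon while recovering, to keep the moving parts to a minimum
helm upgrade rook-ceph-cluster rook-release/rook-ceph-cluster \
  -n rook-ceph --reuse-values \
  --set cephClusterSpec.mon.count=1
```
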
Provisioning then reported:

```
osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster
```

# surgery

With less noise and a clean slate, it was time to attempt to fix this mess.

- adopt `osd-0` to the new cluster;
- bring up two new OSDs for replication;

## osd-0

I started trying to determine how I would _safely_ remove the offending
objects. If that happened, then the OSD would have no issues with the
erasure-coding profile since the pgs wouldn't exist, and the OSD daemon should

Once you execute that command, it will scale the OSD daemon down and create a
new deployment that mirrors the configuration but _without_ the daemon running
in order to perform maintenance.

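Assuming the rook krew plugin was the route used here, entering that
maintenance mode looks roughly like:

```
# scale rook-ceph-osd-0 down and start a matching deployment without the daemon
kubectl rook-ceph debug start rook-ceph-osd-0
```
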
Now, in a shell inside the debug OSD container, I confirmed these belonged to the object storage pool.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS
12.1b 0 0 0 0 0 0 0 0 unknown 8h
```

Seeing this, I first checked how many placement groups prefixed with `12` existed, using the actual path to the OSD.

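Both the listing and the later removal go through `ceph-objectstore-tool`
against the OSD's data path; a rough sketch (the data path and PG id are
illustrative, and `--op remove` is destructive):

```
# list the PGs stored on this OSD that belong to pool 12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep '^12\.'

# remove one of those PGs; repeated for each PG in the pool
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 12.0s0 --op remove --force
```

The tail of the removal output looked like this:
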
```
Remove successful
setting '_remove' omap key
finish_remove_pgs 12.0s0_head removing 12.0s0
Remove successful
```

I did this for every PG listed above. Once I scaled down the maintenance
deployment, I then scaled `deployment/rook-ceph-osd-0` back up to start the
daemon with (hopefully) agreeable placement groups and, thankfully, it came
alive.

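The scaling itself is plain kubectl; roughly (the maintenance deployment name
is inferred from the pod name above):

```
# stop the maintenance deployment, then bring the real OSD daemon back
kubectl -n rook-ceph scale deployment rook-ceph-osd-0-maintenance --replicas=0
kubectl -n rook-ceph scale deployment rook-ceph-osd-0 --replicas=1
```
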
```
k get pods -n rook-ceph
```

saving our data.

## mons

### restoring

I had `/var/lib/rook` backups from each node with the old mon data. At this
point, with the correct number of placement groups and seeing 100% of them
remaining in an `unknown` state, it seemed the next step was to restore the

I rescheduled the deployment, and although the mon log output isn't giving me
suggestions of suicide, all our pgs still remain in an `unknown` state.

## recovering the mon store

It turns out that you can actually recover the mon store. It's not a huge deal
so long as your OSDs have data integrity.

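The ceph docs cover this under "recovery using OSDs": the OSDs carry enough
cluster map data to rebuild a usable mon store. A condensed sketch, with paths
and the keyring location as placeholders:

```
# collect cluster map info from each OSD's data path into a fresh mon store
ms=/tmp/mon-store
mkdir -p "$ms"
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path "$osd" --no-mon-config \
    --op update-mon-db --mon-store-path "$ms"
done

# rebuild the store (cephx needs the admin keyring), then move the result over
# the mon's store.db and restart the mon
ceph-monstore-tool "$ms" rebuild -- --keyring /path/to/admin.keyring
```
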
It seemed like a miracle, but the credit goes entirely to how resilient ceph is,
built to tolerate that level of abuse.

# why

_Data appears to be lost_

- ceph OSD daemons fail to start;