Update 2024-08-10-recovering-ceph-cluster.md

This commit is contained in:
0x3bb 2024-08-10 14:39:00 +00:00
parent 9ee96d37a5
commit 1e55cdd1b0


@ -1,5 +1,5 @@
---
title: "recovering a rook-ceph cluster when all hope seems to be lost"
title: recovering a rook-ceph cluster
author: "0x3bb"
date: M08-10-2024
---
@ -19,7 +19,6 @@ During this incident, I found myself in exactly that position, relying heavily o
(great) documentation for both _rook_ and _ceph_ itself.
# the beginning
In the process of moving nodes, **I accidentally zapped 2 OSDs**.
Given that the _"ceph cluster"_ is basically two bedrooms peered together, we
@ -29,7 +28,6 @@ This sounds bad, but it was fine: mons were still up, just a matter of removing
the old OSDs from the tree and letting replication work its magic.
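For the record, removing dead OSDs from the tree boils down to something like the following from the toolbox pod -- the OSD ids here are placeholders:
```
# mark the zapped OSDs out so ceph stops routing data to them
ceph osd out osd.1 osd.2

# purge wraps `crush remove`, `auth del` and `osd rm` into one step
ceph osd purge 1 --yes-i-really-mean-it
ceph osd purge 2 --yes-i-really-mean-it

# then watch recovery/backfill do its thing
ceph -w
```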
# the real mistake
I noticed that although the block pool was replicated, we had lost all our RADOS
object storage.
@ -43,7 +41,6 @@ lost. Nothing critical -- we just needed our CSI volume mounts back online for
databases -- packages and other artifacts could easily be restored from
backups.
Moving on from this, the first thing I did was _"fix"_ the EC configuration to
`k=3, m=2`. This would spread the data over 5 OSDs.
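At the ceph level, that amounts to defining a new profile, something like this (the profile name is made up). The catch -- which only became obvious later -- is that an existing pool's `k`/`m` are fixed at creation time, so you can't simply point it at a different profile:
```
# define a 3+2 profile
ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host

# sanity-check what was created
ceph osd erasure-code-profile get ec-3-2
```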
@ -62,7 +59,6 @@ Of course I want that -- the number is bigger than the previous number so
everything is going to be better than before.
# downtime
Following the merge, all services on the cluster went down at once. I checked
the OSDs, which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch
of gibberish and decided to check out the GitHub issues. Since nothing is ever
@ -90,13 +86,11 @@ This makes sense as there weren't enough OSDs to split the data.
# from bad to suicide
## reverting erasure-coding profile
The first attempt I made was to revert the EC profile back to `k=2, m=1`. The
OSDs were still in the same state, complaining about the erasure-coding
profile.
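In hindsight that's unsurprising: the placement groups already on disk were presumably still laid out for the 3+2 profile, and flipping the profile definition back doesn't rewrite them. Checking what ceph thinks is in effect looks roughly like this (the profile name is a placeholder):
```
# list known erasure-coding profiles and inspect one
ceph osd erasure-code-profile ls
ceph osd erasure-code-profile get <profile-name>

# show which profile a pool is actually bound to
ceph osd pool get ceph-objectstore.rgw.buckets.data erasure_code_profile
```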
## causing even more damage
The second attempt (and in hindsight, a very poor choice) was to zap the other
two underlying OSD disks:
@ -133,7 +127,6 @@ many more mistakes. If I was going to continue I needed to backup the logical
volume used by the `osd-0` node before continuing, which I did.
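A raw copy of the logical volume is enough for that -- something like the following, with the device path and destination invented for illustration:
```
# find the LV backing osd-0
ceph-volume lvm list

# with the daemon stopped, take a block-level copy of it
dd if=/dev/ceph-block-vg/osd-block-0 of=/mnt/backup/osd-0.img bs=4M status=progress
```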
## clutching at mons
I switched my focus to a new narrative: _something was wrong with the mons_.
They were in quorum but I still couldn't figure out why the now last-surviving OSD
@ -165,7 +158,6 @@ outside of the deployment. The only meaningful change we had made was the
erasure-coding profile.
# initial analysis
First, I looked back at the OSD logs. They are monstrous, so I focused on the erasure-coding errors:
```
@ -204,7 +196,6 @@ Accepting the loss of the miniscule data on the object storage pool in favor of
saving the block storage, I could correct the misconfiguration.
# preparation
To avoid troubleshooting issues caused by my failed attempts, I decided I
would clear out the existing CRDs and just focus first on getting the
OSD with the data back online. If I ever got the data back, then I'd probably
@ -214,19 +205,16 @@ be conscious of prior misconfiguration and have to do so regardless.
- clear out the `rook-ceph` namespace;
## backups
- the logical volume for `osd-0`, so I can re-attach it and afford mistakes;
- `/var/lib/rook` on all nodes, containing mon data;
## removal
### deployments/daemonsets
These were the first to go, as I didn't want the `rook-operator` persistently
creating Kubernetes objects when I was actively trying to kill them.
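Assuming the operator lives in the same namespace under its default name, that's roughly:
```
# take the operator down first so it can't recreate what gets deleted next
kubectl -n rook-ceph delete deployment rook-ceph-operator

# then clear out the workloads it was managing
kubectl -n rook-ceph delete deployment --all
kubectl -n rook-ceph delete daemonset --all
```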
### crds
Removal of all the `rook-ceph` resources, together with the finalizers that would otherwise block their deletion (a rough sketch of the finalizer removal follows this list):
- `cephblockpoolradosnamespaces`
@ -247,11 +235,9 @@ Removal of the all `rook-ceph` resources, and their finalizers to protect them f
- `cephrbdmirrors`
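The finalizer dance, sketched for a single resource (the `cephcluster` name is the chart default, so treat it as an assumption); the same pattern repeats for each type above:
```
# strip the finalizers so deletion can actually complete
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"metadata":{"finalizers":null}}'

# then delete the resource itself
kubectl -n rook-ceph delete cephcluster rook-ceph --wait=false
```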
### /var/lib/rook
I had these backed up for later, but I didn't want them there when the cluster came online.
### osd disks
I did not wipe any devices.
First, I obviously didn't want to wipe the disk with the data on it. As for the
@ -264,7 +250,6 @@ analysing the status reported from `osd-1` and `osd-2`.
## provisioning
Since at this point I only cared about `osd-0` and it was beneficial to have
fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count to `1`
within the helm `values.yaml`.
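With the `rook-ceph-cluster` chart this should just be `cephClusterSpec.mon.count` (treat the exact values path as an assumption), e.g. applied straight from the command line:
```
# equivalent to editing mon.count under cephClusterSpec in values.yaml
helm upgrade rook-ceph-cluster rook-release/rook-ceph-cluster \
  -n rook-ceph --reuse-values --set cephClusterSpec.mon.count=1
```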
@ -283,7 +268,6 @@ osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph clus
```
# surgery
With less noise and a clean slate, it was time to attempt to fix this mess.
- adopt `osd-0` to the new cluster;
@ -291,7 +275,6 @@ With less noise and a clean slate, it was time to attempt to fix this mess.
- bring up two new OSDs for replication;
## osd-0
I started trying to determine how I would _safely_ remove the offending
objects. If that happened, then the OSD would have no issues with the
erasure-coding profile since the pgs wouldn't exist, and the OSD daemon should
@ -320,10 +303,8 @@ Once you execute that command, it will scale the OSD daemon down and create a
new deployment that mirrors the configuration but _without_ the daemon running
in order to perform maintenance.
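The command in question is, presumably, the rook krew plugin's debug helper:
```
# create the maintenance deployment (OSD process not running)
kubectl rook-ceph debug start rook-ceph-osd-0

# ...perform surgery inside the maintenance pod...

# later: remove it and bring the real deployment back
kubectl rook-ceph debug stop rook-ceph-osd-0
```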
Now in a shell of the debug OSD container, I confirmed these belonged to the object storage pool.
```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS
@ -357,7 +338,6 @@ PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_
12.1b 0 0 0 0 0 0 0 0 unknown 8h
```
Seeing this, I first checked how many placement groups prefixed with `12` existed, using the actual path to the OSD.
```
@ -439,13 +419,12 @@ Remove successful
setting '_remove' omap key
finish_remove_pgs 12.0s0_head removing 12.0s0
Remove successful
``````
```
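Output like that comes from `ceph-objectstore-tool` operating directly against the OSD's data path while the daemon is down; the invocation was presumably along these lines, with the PG id taken from the output above and the data path being the usual default:
```
# remove one offending PG shard from the OSD's object store
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
  --pgid 12.0s0 --op remove --force
```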
I did this for every PG listed above. Once I scaled down the maintenance
deployment, I scaled `deployment/rook-ceph-osd-0` back up to start the daemon
with (hopefully) agreeable placement groups, and thankfully, it came alive.
```
k get pods -n rook-ceph
@ -475,7 +454,6 @@ saving our data.
## mons
### restoring
I had `/var/lib/rook` backups from each node with the old mon data. At this
point, with the correct number of placement groups and seeing 100% of them
remaining in an `unknown` state, it seemed the next step was to restore the
@ -519,7 +497,6 @@ Rescheduling the deployment and although the mon log output isn't giving me
suggestions of suicide, all our pgs still remain in an `unknown` state.
## recovering the mon store
It turns out that you can actually recover the mon store. It's not a huge deal
so long as your OSDs have data integrity.
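The procedure comes from ceph's disaster-recovery documentation: rebuild the mon store from the cluster maps every OSD carries. Heavily abridged, and with illustrative paths, it looks something like:
```
# scrape cluster map info from the OSD's store into a fresh mon store
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
  --op update-mon-db --mon-store-path /tmp/mon-store

# rebuild the store, re-injecting keys from the admin keyring
ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /etc/ceph/admin.keyring

# finally, replace the mon's store.db with the rebuilt one and restart the mon
```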
@ -584,7 +561,6 @@ It seemed like a miracle, but it is entirely credited to how resilient ceph is
built to tolerate that level of abuse.
# why
_Data appears to be lost_
- ceph OSD daemons fail to start;