---
title: recovering a rook-ceph cluster
author: "0x3bb"
date: 08-10-2024
---

During this incident, I found myself in exactly that position, relying heavily
on (great) documentation for both _rook_ and _ceph_ itself.

# the beginning

In the process of moving nodes, **I accidentally zapped 2 OSDs**.

Given that the _"ceph cluster"_ is basically two bedrooms peered together, we

This sounds bad, but it was fine: mons were still up, just a matter of removing
the old OSDs from the tree and letting replication work its magic.

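For anyone curious, cleaning dead OSDs out of the tree boils down to something
like the following (the OSD ids here are placeholders, not the ones from this
incident):

```
# mark the dead OSDs out, then purge them from the CRUSH map and auth database
ceph osd out osd.3 osd.4
ceph osd purge 3 --yes-i-really-mean-it
ceph osd purge 4 --yes-i-really-mean-it
```

With the dead OSDs gone from the tree, ceph backfills the surviving replicas
onto the remaining OSDs by itself.
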
# the real mistake

I noticed that although the block pool was replicated, we had lost all our RADOS
object storage.

lost. Nothing critical -- we just needed our CSI volume mounts back online for
databases -- packages and other artifacts could easily be restored from
backups.

Moving on from this, the first thing I did was _"fix"_ the EC configuration to
`k=3, m=2`. This would spread the data over 5 OSDs.

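For reference, defining such a profile in plain ceph looks roughly like this
(the profile name and failure domain are assumptions); note that an existing
erasure-coded pool cannot simply be re-striped to a new `k`/`m` after creation:

```
# a 3+2 profile: 3 data chunks, 2 coding chunks, so every object needs
# 5 chunks placed on 5 separate OSDs
ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=osd
ceph osd erasure-code-profile get ec-3-2
```
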
Of course I want that -- the number is bigger than the previous number so
everything is going to be better than before.

# downtime

Following the merge, all services on the cluster went down at once. I checked
the OSDs, which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch
of gibberish and decided to check out the GitHub issues. Since nothing is ever

This makes sense as there weren't enough OSDs to split the data.

# from bad to suicide

## reverting erasure-coding profile

The first attempt I made was to revert the EC profile back to `k=2, m=1`. The
OSDs were still in the same state, complaining about the erasure-coding
profile.

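Checking what a pool actually references is cheap; something along these lines
(the pool name is the object-store data pool, and the profile name is whatever
the first command reports):

```
# which EC profile does the pool use, and what does that profile contain?
ceph osd pool get ceph-objectstore.rgw.buckets.data erasure_code_profile
ceph osd erasure-code-profile get <profile-name>
```
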
## causing even more damage

The second attempt (and in hindsight, a very poor choice) was to zap the other
two underlying OSD disks:

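With `ceph-volume`, zapping a device looks something like this (purely
illustrative; the device path is a placeholder):

```
# destroy the LVM metadata and data on the device backing the OSD
ceph-volume lvm zap /dev/sdX --destroy
```
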
many more mistakes. If I was going to continue, I needed to back up the logical
volume used by the `osd-0` node first, which I did.

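A raw image of the logical volume is enough for that; something like the
following, with the volume group, LV and destination as placeholders:

```
# take a raw image of the LV backing osd-0 before doing anything else destructive
dd if=/dev/ceph-vg/osd-block-0 of=/mnt/backup/osd-0.img bs=4M status=progress conv=fsync
```
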
## clutching at mons

I switched my focus to a new narrative: _something was wrong with the mons_.

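Sanity-checking that theory is cheap; the usual checks are along these lines:

```
# are the monitors forming a quorum, and who is in it?
ceph mon stat
ceph quorum_status -f json-pretty
```
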
They were in quorum but I still couldn't figure out why the now last-surviving OSD

outside of the deployment. The only meaningful change we had made was the
erasure-coding profile.

# initial analysis

First, I looked back at the OSD logs. They are monstrous, so I focused on the
erasure-coding errors.

Accepting the loss of the minuscule data on the object storage pool in favor of
saving the block storage, I could correct the misconfiguration.

# preparation

To avoid troubleshooting issues caused by my failed attempts, I decided I would
clear out the existing CRDs and just focus first on getting the OSD with the
data back online. If I ever got the data back, then I'd probably be conscious
of prior misconfiguration and have to do so regardless.

- clear out the `rook-ceph` namespace;

## backups

- the logical volume for `osd-0`, so I can re-attach it and afford mistakes;
- `/var/lib/rook` on all nodes, containing mon data;

## removal

### deployments/daemonsets

These were the first to go, as I didn't want the `rook-operator` persistently
creating Kubernetes objects when I was actively trying to kill them.

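In practice that means stopping the operator first and then deleting the
workloads it manages; roughly (resource names are the chart defaults, which may
not match every install):

```
# stop the operator so it can't reconcile things back into existence
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

# then remove the remaining deployments and daemonsets in the namespace
kubectl -n rook-ceph delete deployment,daemonset --all
```
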
### crds

Removal of all the `rook-ceph` resources, along with their finalizers, which
otherwise protect them from being removed (a sketch of the finalizer removal
follows the list):

- `cephblockpoolradosnamespaces`
- `cephrbdmirrors`

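The finalizer stripping is the usual kubectl patch loop; a sketch for one of
the kinds above, repeated for each kind in the list:

```
# drop finalizers so deletion isn't blocked, then delete the resources
for cr in $(kubectl -n rook-ceph get cephrbdmirrors -o name); do
  kubectl -n rook-ceph patch "$cr" --type merge -p '{"metadata":{"finalizers":null}}'
done
kubectl -n rook-ceph delete cephrbdmirrors --all
```
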
### /var/lib/rook

I had these backed up for later, but I didn't want them there when the cluster came online.

### osd disks

I did not wipe any devices.

First, I obviously didn't want to wipe the disk with the data on it. As for the

analysing the status reported from `osd-1` and `osd-2`.

## provisioning

Since at this point I only cared about `osd-0` and it was beneficial to have
fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count to `1`
within the helm `values.yaml`.

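With the `rook-ceph-cluster` chart that is a one-line override; assuming the
usual release and chart names, something like:

```
# run a single mon while recovering, to keep the moving parts to a minimum
helm upgrade rook-ceph-cluster rook-release/rook-ceph-cluster \
  -n rook-ceph --reuse-values \
  --set cephClusterSpec.mon.count=1
```
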
Provisioning then reported:

```
osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster
```

# surgery

With less noise and a clean slate, it was time to attempt to fix this mess.

- adopt `osd-0` to the new cluster;
- bring up two new OSDs for replication;

## osd-0

I started trying to determine how I would _safely_ remove the offending
objects. If that happened, then the OSD would have no issues with the
erasure-coding profile since the pgs wouldn't exist, and the OSD daemon should

Once you execute that command, it will scale the OSD daemon down and create a
new deployment that mirrors the configuration but _without_ the daemon running
in order to perform maintenance.

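Assuming the rook krew plugin was the route used here, entering that
maintenance mode looks roughly like:

```
# scale rook-ceph-osd-0 down and start a matching deployment without the daemon
kubectl rook-ceph debug start rook-ceph-osd-0
```
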
Now, in a shell inside the debug OSD container, I confirmed these belonged to the object storage pool.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS
12.1b 0 0 0 0 0 0 0 0 unknown 8h
```

Seeing this, I first checked how many placement groups prefixed with `12` existed, using the actual path to the OSD.

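Both the listing and the later removal go through `ceph-objectstore-tool`
against the OSD's data path; a rough sketch (the data path and PG id are
illustrative, and `--op remove` is destructive):

```
# list the PGs stored on this OSD that belong to pool 12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep '^12\.'

# remove one of those PGs; repeated for each PG in the pool
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 12.0s0 --op remove --force
```

The tail of the removal output looked like this:
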
```
Remove successful
setting '_remove' omap key
finish_remove_pgs 12.0s0_head removing 12.0s0
Remove successful
```

I did this for every PG listed above. Once I scaled down the maintenance
deployment, I then scaled `deployment/rook-ceph-osd-0` back up to start the
daemon with (hopefully) agreeable placement groups and, thankfully, it came
alive.

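The scaling itself is plain kubectl; roughly (the maintenance deployment name
is inferred from the pod name above):

```
# stop the maintenance deployment, then bring the real OSD daemon back
kubectl -n rook-ceph scale deployment rook-ceph-osd-0-maintenance --replicas=0
kubectl -n rook-ceph scale deployment rook-ceph-osd-0 --replicas=1
```
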
```
k get pods -n rook-ceph
```

saving our data.

## mons

### restoring

I had `/var/lib/rook` backups from each node with the old mon data. At this
point, with the correct number of placement groups and seeing 100% of them
remaining in an `unknown` state, it seemed the next step was to restore the

I rescheduled the deployment, and although the mon log output isn't giving me
suggestions of suicide, all our pgs still remain in an `unknown` state.

## recovering the mon store

It turns out that you can actually recover the mon store. It's not a huge deal
so long as your OSDs have data integrity.

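The ceph docs cover this under "recovery using OSDs": the OSDs carry enough
cluster map data to rebuild a usable mon store. A condensed sketch, with paths
and the keyring location as placeholders:

```
# collect cluster map info from each OSD's data path into a fresh mon store
ms=/tmp/mon-store
mkdir -p "$ms"
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path "$osd" --no-mon-config \
    --op update-mon-db --mon-store-path "$ms"
done

# rebuild the store (cephx needs the admin keyring), then move the result over
# the mon's store.db and restart the mon
ceph-monstore-tool "$ms" rebuild -- --keyring /path/to/admin.keyring
```
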
It seemed like a miracle, but the credit goes entirely to how resilient ceph is,
built to tolerate that level of abuse.

# why

_Data appears to be lost_

- ceph OSD daemons fail to start;