Update 2024-08-10-recovering-ceph-cluster.md

0x3bb 2024-08-10 14:39:00 +00:00
parent 9ee96d37a5
commit 1e55cdd1b0


@@ -1,5 +1,5 @@
---
title: recovering a rook-ceph cluster
author: "0x3bb"
date: M08-10-2024
---
@@ -19,7 +19,6 @@ During this incident, I found myself in exactly that position, relying heavily o
(great) documentation for both _rook_ and _ceph_ itself.
# the beginning
In the process of moving nodes, **I accidentally zapped 2 OSDs**.
Given that the _"ceph cluster"_ is basically two bedrooms peered together, we
@@ -29,7 +28,6 @@ This sounds bad, but it was fine: mons were still up, just a matter of removing
the old OSDs from the tree and letting replication work its magic.
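The cleanup itself isn't shown in this hunk; roughly, it amounts to marking the dead OSDs out and purging them, something like the sketch below (the OSD ids are illustrative):
```
# mark the zapped OSDs out so their data is backfilled elsewhere
ceph osd out osd.1 osd.2
# once recovery settles, remove them from the CRUSH tree, auth and the OSD map
ceph osd purge osd.1 --yes-i-really-mean-it
ceph osd purge osd.2 --yes-i-really-mean-it
```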
# the real mistake
I noticed that although the block pool was replicated, we had lost all our RADOS
object storage.
@@ -43,7 +41,6 @@ lost. Nothing critical -- we just needed our CSI volume mounts back online for
databases -- packages and other artifacts could easily be restored from
backups.
Moving on from this, the first thing I did was _"fix"_ the EC configuration to
`k=3, m=2`. This would spread the data over 5 OSDs.
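For reference, this is roughly what a 3+2 profile amounts to at the ceph level -- a sketch only, with an illustrative profile name; in the post the change goes in through the cluster configuration and lands with the merge mentioned below:
```
# what k=3, m=2 looks like as an erasure-code profile (name illustrative)
ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
ceph osd erasure-code-profile get ec-3-2
# overwriting a profile that existing pools already reference requires --force,
# and it does not re-stripe the data those pools already hold
```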
@@ -62,7 +59,6 @@ Of course I want that -- the number is bigger than the previous number so
everything is going to be better than before.
# downtime
Following the merge, all services on the cluster went down at once. I checked
the OSDs, which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch
of gibberish and decided to check out the GitHub issues. Since nothing is ever
@@ -90,13 +86,11 @@ This makes sense as there weren't enough OSDs to split the data.
# from bad to suicide
## reverting erasure-coding profile
The first attempt I made was to revert the EC profile back to `k=2, m=1`. The
OSDs were still in the same state, complaining about the erasure-coding
profile.
## causing even more damage
The second attempt (and in hindsight, a very poor choice) was to zap the other
two underlying OSD disks:
@@ -133,7 +127,6 @@ many more mistakes. If I was going to continue I needed to backup the logical
volume used by the `osd-0` node before continuing, which I did.
## clutching at mons
I switched my focus to a new narrative: _something was wrong with the mons_.
They were in quorum but I still couldn't figure out why the now last-surviving OSD
@@ -165,7 +158,6 @@ outside of the deployment. The only meaningful change we had made was the
erasure-coding profile.
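A quick way to sanity-check that, given the mons were still answering, is to list the profiles and see which one the EC pool actually references -- a sketch, with the pool name taken from the `pg ls-by-pool` output further down:
```
ceph osd erasure-code-profile ls
# which profile the EC data pool points at
ceph osd pool get ceph-objectstore.rgw.buckets.data erasure_code_profile
```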
# initial analysis
First, I looked back at the OSD logs. They are monstrous, so I focused on the erasure-coding errors:
```
@@ -204,7 +196,6 @@ Accepting the loss of the minuscule data on the object storage pool in favor of
saving the block storage, I could correct the misconfiguration.
# preparation
To avoid troubleshooting issues caused by my failed attempts, I decided I
would clear out the existing CRDs and focus first on getting the
OSD with the data back online. If I ever got the data back, then I'd probably
@@ -214,19 +205,16 @@ be conscious of prior misconfiguration and have to do so regardless.
- clear out the `rook-ceph` namespace;
## backups
- the logical volume for `osd-0`, so I can re-attach it and afford mistakes;
- `/var/lib/rook` on all nodes, containing mon data;
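Roughly what both backups looked like as commands -- a sketch, with illustrative volume group, device and destination paths:
```
# raw image of the osd-0 logical volume so it can be re-attached later
dd if=/dev/ceph-vg/osd-block-0 of=/mnt/backup/osd-0.img bs=4M status=progress
# rook state directory (includes the mon data) from each node
tar -czf /mnt/backup/var-lib-rook-$(hostname).tar.gz /var/lib/rook
```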
## removal
### deployments/daemonsets
These were the first to go, as I didn't want the `rook-operator` persistently
creating Kubernetes objects when I was actively trying to kill them.
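A sketch of that teardown, assuming the default `rook-ceph-operator` deployment name:
```
# take the operator down first so nothing gets recreated mid-cleanup
kubectl -n rook-ceph delete deployment rook-ceph-operator
# then the workloads it manages
kubectl -n rook-ceph delete deployments,daemonsets --all
```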
### crds
Removal of all the `rook-ceph` resources, along with the finalizers protecting them from deletion (a sketch of the finalizer patch follows the list):
- `cephblockpoolradosnamespaces`
@@ -247,11 +235,9 @@ Removal of all the `rook-ceph` resources, and their finalizers to protect them f
- `cephrbdmirrors`
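A sketch of that finalizer dance for one of the kinds above; the same loop repeats for each CRD in the list:
```
# clear finalizers so deletion isn't blocked, then delete the resources
for cr in $(kubectl -n rook-ceph get cephblockpools -o name); do
  kubectl -n rook-ceph patch "$cr" --type merge \
    -p '{"metadata":{"finalizers":[]}}'
done
kubectl -n rook-ceph delete cephblockpools --all --wait=false
```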
### /var/lib/rook
I had these backed up for later, but I didn't want them there when the cluster came online.
### osd disks
I did not wipe any devices.
First, I obviously didn't want to wipe the disk with the data on it. As for the
@@ -264,7 +250,6 @@ analysing the status reported from `osd-1` and `osd-2`.
## provisioning
Since at this point I only cared about `osd-0` and it was beneficial to have
fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count to `1`
within the helm `values.yaml`.
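Something like the following, assuming the upstream `rook-ceph-cluster` chart layout and an illustrative release/repo name:
```
# values.yaml:
#   cephClusterSpec:
#     mon:
#       count: 1
helm upgrade rook-ceph-cluster rook-release/rook-ceph-cluster \
  -n rook-ceph -f values.yaml
```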
@@ -283,7 +268,6 @@ osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph clus
```
# surgery
With less noise and a clean slate, it was time to attempt to fix this mess.
- adopt `osd-0` into the new cluster;
@@ -291,7 +275,6 @@ With less noise and a clean slate, it was time to attempt to fix this mess.
- bring up two new OSDs for replication;
## osd-0
I started trying to determine how I would _safely_ remove the offending
objects. If that happened, then the OSD would have no issues with the
erasure-coding profile since the pgs wouldn't exist, and the OSD daemon should
@@ -320,10 +303,8 @@ Once you execute that command, it will scale the OSD daemon down and create a
new deployment that mirrors the configuration but _without_ the daemon running
in order to perform maintenance.
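The command itself sits in the elided part of this hunk; if it's the rook krew plugin's debug mode, it looks roughly like this:
```
# swap the OSD deployment for a maintenance copy with the daemon stopped
kubectl rook-ceph debug start rook-ceph-osd-0
# ...poke at the OSD's data store while ceph-osd isn't running...
kubectl rook-ceph debug stop rook-ceph-osd-0
```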
Now, in a shell inside the debug OSD container, I confirmed these belonged to the object storage pool.
```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS
@@ -357,7 +338,6 @@ PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_
12.1b 0 0 0 0 0 0 0 0 unknown 8h
```
Seeing this, I first checked how many placement groups prefixed with `12` existed, using the actual path to the OSD.
```
@@ -439,13 +419,12 @@ Remove successful
setting '_remove' omap key
finish_remove_pgs 12.0s0_head removing 12.0s0
Remove successful
```
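The exact invocation is elided above, but per PG it looks roughly like this with `ceph-objectstore-tool`, assuming the default data path inside the maintenance pod:
```
# list the pool 12 PGs present on this OSD, then remove one of them
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep '^12\.'
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
  --pgid 12.0s0 --op remove --force
```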
I did this for every PG listed above. Once I scaled down the maintenance
deployment, I scaled `deployment/rook-ceph-osd-0` back up to start the daemon
with (hopefully) agreeable placement groups and, thankfully, it came alive.
```
k get pods -n rook-ceph
@@ -475,7 +454,6 @@ saving our data.
## mons
### restoring
I had `/var/lib/rook` backups from each node with the old mon data. At this
point, with the correct number of placement groups and seeing 100% of them
remaining in an `unknown` state, it seemed the next step was to restore the
@@ -519,7 +497,6 @@ Rescheduling the deployment and although the mon log output isn't giving me
suggestions of suicide, all our pgs still remain in an `unknown` state.
## recovering the mon store
It turns out that you can actually recover the mon store. It's not a huge deal
so long as your OSDs have data integrity.
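Condensed, the upstream procedure looks roughly like this -- paths, keyring location and the rook mon directory layout are all assumptions:
```
# scrape cluster maps out of the surviving OSD into a scratch mon store
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
  --op update-mon-db --mon-store-path /tmp/mon-store
# rebuild the store, re-seeding auth from an admin keyring
ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /etc/ceph/keyring
# swap it in for the mon's store.db (after backing up the original)
cp -a /tmp/mon-store/store.db /var/lib/rook/mon-a/data/store.db
```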
@@ -584,7 +561,6 @@ It seemed like a miracle, but it is entirely credited to how resilient ceph is
built to tolerate that level of abuse.
# why
_Data appears to be lost_
- ceph OSD daemons fail to start;