---
title: "recovering a rook-ceph cluster when all hope seems to be lost"
author: "0x3bb"
date: 2024-08-10
---

During this incident, I found myself in exactly that position, relying heavily on
(great) documentation for both _rook_ and _ceph_ itself.

# the beginning

In the process of moving nodes, **I accidentally zapped 2 OSDs**.

Given that the _"ceph cluster"_ is basically two bedrooms peered together, we
This sounds bad, but it was fine: mons were still up, just a matter of removing
the old OSDs from the tree and letting replication work its magic.

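For reference, the clean-up for a dead OSD usually looks something like this (a
sketch only; `osd.3` is a placeholder ID, not one of the OSDs actually involved
here):

```
# mark the dead OSD out so ceph starts backfilling its data elsewhere
ceph osd out osd.3

# once it is definitely never coming back, purge it: this removes it from
# the CRUSH map, deletes its auth key and drops it from the OSD map
ceph osd purge 3 --yes-i-really-mean-it

# watch recovery progress
ceph -s
```
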
# the real mistake

I noticed that although the block pool was replicated, we had lost all our RADOS
object storage.

lost. Nothing critical -- we just needed our CSI volume mounts back online for
databases -- packages and other artifacts could easily be restored from
backups.

Moving on from this, the first thing I did was _"fix"_ the EC configuration to
`k=3, m=2`. This would spread the data over 5 OSDs.

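For context, this is roughly what such a profile looks like at the ceph level
(a sketch; the profile name and failure domain are placeholders):

```
# k data chunks + m coding chunks: each object is spread across k+m = 5 OSDs,
# and the pool can survive losing any m = 2 of them
ceph osd erasure-code-profile set example-ec-profile \
  k=3 m=2 crush-failure-domain=host

# inspect the result
ceph osd erasure-code-profile get example-ec-profile
```
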
Of course I want that -- the number is bigger than the previous number so
everything is going to be better than before.

# downtime

Following the merge, all services on the cluster went down at once. I checked
the OSDs, which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch
of gibberish and decided to check out the GitHub issues. Since nothing is ever
This makes sense as there weren't enough OSDs to split the data.

# from bad to suicide

## reverting erasure-coding profile

The first attempt I made was to revert the EC profile back to `k=2, m=1`. The
OSDs were still in the same state, complaining about the erasure-coding
profile.
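
Verifying which profile a pool is actually using is cheap while the mons are
reachable (a sketch; the pool name may differ from yours):

```
# which EC profile is the object-store data pool actually using?
pool=ceph-objectstore.rgw.buckets.data
profile=$(ceph osd pool get "$pool" erasure_code_profile | awk '{print $2}')

# and what k/m does that profile really define?
ceph osd erasure-code-profile get "$profile"
```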

## causing even more damage

The second attempt (and in hindsight, a very poor choice) was to zap the other
two underlying OSD disks:

many more mistakes. If I was going to continue I needed to back up the logical
volume used by the `osd-0` node before continuing, which I did.

## clutching at mons

I switched my focus to a new narrative: _something was wrong with the mons_.

They were in quorum but I still couldn't figure out why the now last-surviving OSD
outside of the deployment. The only meaningful change we had made was the
erasure-coding profile.

# initial analysis

First, I looked back to the OSD logs. They are monstrous, so I focused on the erasure-coding errors:

Accepting the loss of the minuscule data on the object storage pool in favor of
saving the block storage, I could correct the misconfiguration.

# preparation

To avoid troubleshooting issues caused by my failed attempts, I decided I
would do a clear-out of the existing CRDs and just focus first on getting the
OSD with the data back online. If I ever got the data back, then I'd probably
be conscious of prior misconfiguration and have to do so regardless.

- clear out the `rook-ceph` namespace;

## backups

- the logical volume for `osd-0`, so I can re-attach it and afford mistakes (sketched below);
- `/var/lib/rook` on all nodes, containing mon data;

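Roughly what those two backups amounted to (a sketch; the volume group, LV name
and destination paths are placeholders):

```
# raw copy of the osd-0 logical volume, so it can be restored or re-attached
dd if=/dev/ceph-vg/osd-0-block of=/backup/osd-0-block.img bs=4M status=progress

# mon data from each node
tar -czf "/backup/$(hostname)-var-lib-rook.tar.gz" /var/lib/rook
```
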
## removal

### deployments/daemonsets

These were the first to go, as I didn't want the `rook-operator` persistently
creating Kubernetes objects when I was actively trying to kill them.

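Something along these lines (a sketch; deployment names are the chart defaults
and may differ):

```
# stop the operator first so nothing gets reconciled back into existence
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

# then take out the remaining workloads
kubectl -n rook-ceph delete deployments,daemonsets --all
```
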
### crds

Removal of all the `rook-ceph` resources, along with their finalizers, which would otherwise protect them from being removed (a sketch follows the list):

- `cephblockpoolradosnamespaces`
- `cephrbdmirrors`

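The finalizer removal was roughly this shape for each CRD above (a sketch,
shown for `cephblockpools` only):

```
# drop finalizers so deletion doesn't hang forever, then delete the resource
for cr in $(kubectl -n rook-ceph get cephblockpools -o name); do
  kubectl -n rook-ceph patch "$cr" --type merge -p '{"metadata":{"finalizers":null}}'
  kubectl -n rook-ceph delete "$cr" --wait=false
done
```
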
### /var/lib/rook

I had these backed up for later, but I didn't want them there when the cluster came online.

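On each node that meant something like (a sketch; the `.old` suffix is
arbitrary):

```
# keep the earlier backup, but move the live directory out of rook's way
mv /var/lib/rook /var/lib/rook.old
```
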
### osd disks

I did not wipe any devices.

First, I obviously didn't want to wipe the disk with the data on it. As for the
analysing the status reported from `osd-1` and `osd-2`.

## provisioning

Since at this point I only cared about `osd-0` and it was beneficial to have
fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count to `1`
within the helm `values.yaml`.

```
osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster
```

# surgery

With less noise and a clean slate, it was time to attempt to fix this mess.

- adopt `osd-0` to the new cluster;
- bring up two new OSDs for replication;

## osd-0

I started trying to determine how I would _safely_ remove the offending
objects. If that happened, then the OSD would have no issues with the
erasure-coding profile since the pgs wouldn't exist, and the OSD daemon should
Once you execute that command, it will scale the OSD daemon down and create a
new deployment that mirrors the configuration but _without_ the daemon running,
in order to perform maintenance.

Now in a shell of the debug OSD container, I confirmed these belonged to the object storage pool.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS
12.1b 0 0 0 0 0 0 0 0 unknown 8h
```

Seeing this, I first checked to see how many placement groups prefixed with `12` existed, using the actual path to the OSD.

```
setting '_remove' omap key
finish_remove_pgs 12.0s0_head removing 12.0s0
Remove successful
```

I did this for every PG listed above. Once I scaled down the maintenance
deployment, I then scaled back `deployment/rook-ceph-osd-0` to start the daemon
with (hopefully) agreeable placement groups and, thankfully, it came alive.

```
k get pods -n rook-ceph
```

saving our data.

## mons

### restoring

I had `/var/lib/rook` backups from each node with the old mon data. At this
point, with the correct number of placement groups and seeing 100% of them
remaining in an `unknown` state, it seemed the next step was to restore the
After rescheduling the deployment, and although the mon log output isn't giving me
suggestions of suicide, all our pgs still remain in an `unknown` state.

## recovering the mon store

It turns out that you can actually recover the mon store. It's not a huge deal
so long as your OSDs have data integrity.

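The upstream ceph docs describe rebuilding the mon store from the OSDs; the
core of it is roughly the following (a sketch only; paths, the mon name and the
keyring location are placeholders, and in a rook cluster this runs from the
debug/maintenance pods):

```
# gather cluster map data from every OSD into a fresh mon store
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path "$osd" --no-mon-config \
    --op update-mon-db --mon-store-path /tmp/mon-store
done

# rebuild the store, feeding it a keyring that holds the mon. and admin keys
ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /tmp/keyring

# swap the rebuilt store.db into the mon's data directory (after backing it up)
mv /var/lib/ceph/mon/ceph-a/store.db /var/lib/ceph/mon/ceph-a/store.db.corrupted
cp -r /tmp/mon-store/store.db /var/lib/ceph/mon/ceph-a/store.db
```
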
It seemed like a miracle, but the credit belongs entirely to how resilient ceph is:
it is built to tolerate that level of abuse.

# why

_Data appears to be lost_

- ceph OSD daemons fail to start;