---
title: "recovering a rook-ceph cluster when all hope seems to be lost"
author: "0x3bb"
date: 2024-08-10
---

I share a _rook-ceph_ cluster with a friend of mine approximately 2000km
away. I had reservations about whether this would be a good idea at first
because of the latency and the fact that consumer-grade WAN might bring
anything that breathes to its knees. However, I'm pleased with how robust it
has been under those circumstances.

Getting started with _rook-ceph_ is really simple because it orchestrates
everything for you. That's convenient, although if and when you have problems,
you can suddenly become the operator of a distributed system that you may know
very little about due to the abstraction.

During this incident, I found myself in exactly that position, relying heavily
on the (great) documentation for both _rook_ and _ceph_ itself.

# the beginning

In the process of moving nodes, **I accidentally zapped 2 OSDs**.

Given that the _"ceph cluster"_ is basically two bedrooms peered together, we
were down to the final OSD.

This sounds bad, but it was fine: the mons were still up, so it was just a
matter of removing the old OSDs from the tree and letting replication work its
magic.

# the real mistake

I noticed that although the block pool was replicated, we had lost all our
RADOS object storage.



The erasure-coding profile was `k=2, m=1`. That meant we could only tolerate
the loss of a single OSD -- and we had just lost 2.

Object storage (which our applications interfaced with via the Ceph S3
Gateway) was lost. Nothing critical -- we just needed our CSI volume mounts
back online for databases -- packages and other artifacts could easily be
restored from backups.

Moving on from this, the first thing I did was _"fix"_ the EC configuration to
`k=3, m=2`. This would spread the data over 5 OSDs.

```
- dataChunks: 2
- codingChunks: 1
+ dataChunks: 3
+ codingChunks: 2
```

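In hindsight, the profile and its `k+m` requirement could have been
sanity-checked from the toolbox before applying anything -- a minimal sketch,
assuming the toolbox deployment is enabled and using the profile name that
shows up later in the logs:

```
# inspect the EC profile backing the object store data pool
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
  ceph osd erasure-code-profile get ceph-objectstore.rgw.buckets.data_ecprofile

# with crush-failure-domain=host, each of the k+m chunks needs its own host,
# so k=3,m=2 wants 5 OSD hosts -- this cluster had 3
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
```
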
Happy with that, I restored a backup of the objects. Everything was working.

A few days later, _renovatebot_ arrived with a new PR to bump _rook-ceph_ to
`1.14.9`.

Of course I want that -- the number is bigger than the previous number, so
everything is going to be better than before.

# downtime

Following the merge, all services on the cluster went down at once. I checked
the OSDs, which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch
of gibberish and decided to check out the GitHub issues. Since nothing is ever
my fault, I decided to see who or what was to blame.

With no clues, I still hoped this would be relatively simple to fix.

Resigning myself to actually reading the OSD logs in the
`rook-ceph-crashcollector` pods, I saw (but did not understand) the problem:

`osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)`



With the _"fixed"_ configuration, what I had actually done was split the
object store pool across _5_ OSDs. We had _3_.

Due to the `rook-ceph-operator` being rescheduled by the version bump, the OSD
daemons had been restarted as part of the update procedure and now demanded an
answer for data and coding chunks that simply did not exist. Sure enough,
`ceph -s` also reported undersized placement groups.

This made sense, as there weren't enough OSDs to split the data across.

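The pool-to-profile mapping that the assert was complaining about can also be
read directly from ceph -- a small sketch, again via the toolbox:

```
# which pgs are undersized/unknown, and why
ceph health detail

# each pool's replication or erasure-coding setup; the object store data pool
# should list the ceph-objectstore.rgw.buckets.data_ecprofile profile
ceph osd pool ls detail
```
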
# from bad to suicide

## reverting erasure-coding profile

The first attempt I made was to revert the EC profile back to `k=2, m=1`. The
OSDs stayed in the same state, still complaining about the erasure-coding
profile.

## causing even more damage

The second attempt (and in hindsight, a very poor choice) was to zap the other
two underlying OSD disks:

[ `osd-1`, `osd-2` ].

`zap.sh`

```
DISK="/dev/vdb"

# Zap the disk to a fresh, usable state (zap-all is important, b/c MBR has to be clean)
sgdisk --zap-all $DISK

# Wipe a large portion of the beginning of the disk to remove more LVM metadata that may be present
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync

# SSDs may be better cleaned with blkdiscard instead of dd
blkdiscard $DISK

# Inform the OS of partition table changes
partprobe $DISK
```

Perhaps having two other OSDs online would allow me to replicate the healthy
pgs without the offending RADOS objects.

Sure enough, the 2 new OSDs started.

Since `osd-0` -- the one with the actual data -- still wouldn't start, the
cluster was still in a broken state.

Now down to the last OSD, I knew at this point that I was going to make many,
many more mistakes. If I was going to continue, I needed to back up the
logical volume used by `osd-0` first, which I did.

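For the backup itself, a raw copy of the logical volume is enough -- a minimal
sketch, where the LV path is illustrative (check `lvs` for the real
`ceph-*`/`osd-block-*` names on the node):

```
# find the ceph-volume LV backing osd-0 (names below are illustrative)
lvs -o lv_name,vg_name,lv_size

# raw copy of the logical volume to somewhere safe
dd if=/dev/ceph-vg/osd-block-0 of=/mnt/backup/osd-0.img bs=4M status=progress conv=fsync
```
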
## clutching at mons

I switched my focus to a new narrative: _something was wrong with the mons_.

They were in quorum, but I still couldn't figure out why the last surviving
OSD was having issues starting.

The `mon_host` configuration was correct in `secret/rook-ceph-config`:

`mon_host: '[v2:10.50.1.10:3300,v1:10.50.1.10:6789],[v2:10.50.1.11:3300,v1:10.50.1.11:6789],[v2:10.55.1.10:3300,v1:10.55.1.10:6789]'`

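(That value can be read straight out of the secret -- a quick sketch:)

```
kubectl -n rook-ceph get secret rook-ceph-config \
  -o jsonpath='{.data.mon_host}' | base64 -d
```
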
Nothing had changed with the underlying data on those mons. Maybe there was
corruption in the monitor store? The monitors maintain a map of the cluster
state: the `osdmap`, the `crushmap`, etc.

My theory was: if the cluster map did not have the correct placement groups
and other OSD metadata, then perhaps replacing it would help.

I replaced the `store.db` data on the mon associated with the failing OSD
deployment with a copy from another mon and scaled the deployment back up:

`2024-08-04T00:49:47.698+0000 7f12fc78f700 0 mon.s@2(probing) e30 removed from monmap, suicide.`

With all data potentially lost and it being almost 1AM, that message was not
very reassuring. I did manually change the monmap and inject it back in, but
ended up back in the same position.

I figured I had done enough experimenting at this point and had to look
deeper, outside of the deployment. The only meaningful change we had made was
the erasure-coding profile.

# initial analysis

First, I looked back at the OSD logs. They are monstrous, so I focused on the
erasure-coding errors:

```
2024-08-03T18:58:49.845+0000 7f2a7d6da700 1 osd.0 pg_epoch: 8916
pg[12.11s2( v 8915'287 (0'0,8915'287] local-lis/les=8893/8894 n=23 ec=5063/5059
lis/c=8893/8413 les/c/f=8894/8414/0 sis=8916 pruub=9.256847382s)
[1,2,NONE]p1(0) r=-1 lpr=8916 pi=[8413,8916)/1 crt=8915'287 mlcod 0'0 unknown
NOTIFY pruub 20723.630859375s@ mbc={}] state<Start>: transitioning to Stray
2024-08-03T18:58:49.849+0000 7f2a7ced9700 -1G
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h:
In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
7f2a7e6dc700 time 2024-08-03T18:58:49.853351+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h:
34: FAILED ceph_assert(stripe_width % stripe_size == 0)
```

```
-2> 2024-08-03T18:59:00.086+0000 7ffa48b48640 5 osd.2 pg_epoch: 8894 pg[12.9(unlocked)] enter Initial
...
/src/osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)
```

I noticed a pattern: all the failing pg IDs were prefixed with `12`.

Seeing this, I concluded:
- the mons were ok and in quorum;
- the `osd-0` daemon failed to start;
- the other, fresh OSDs (`osd-1` and `osd-2`) started fine, so this was a data
  integrity issue confined to `osd-0` (and to the previous OSDs, had I not
  nuked them);
- the cause was the change to the erasure-coding profile, which affected only
  the one pool whose chunk distribution was modified.

Accepting the loss of the minuscule amount of data on the object storage pool
in favor of saving the block storage, I could correct the misconfiguration.

# preparation

To avoid troubleshooting issues caused by my failed attempts, I decided I
would clear out the existing CRDs and focus first on getting the OSD with the
data back online. If I ever got the data back, I'd be wary of the prior
misconfiguration and would have to do that anyway.

- back up the important shit;
- clear out the `rook-ceph` namespace;

## backups

- the logical volume for `osd-0`, so I could re-attach it and afford mistakes;
- `/var/lib/rook` on all nodes, containing the mon data;

## removal

### deployments/daemonsets

These were the first to go, as I didn't want the `rook-operator` persistently
recreating Kubernetes objects while I was actively trying to delete them.

### crds

Removal of all the `rook-ceph` resources, along with the finalizers that
protect them from being removed (a sketch of that loop follows the list
below):

- `cephblockpoolradosnamespaces`
- `cephblockpools`
- `cephbucketnotifications`
- `cephclients`
- `cephclusters`
- `cephcosidrivers`
- `cephfilesystemmirrors`
- `cephfilesystems`
- `cephfilesystemsubvolumegroups`
- `cephnfses`
- `cephobjectrealms`
- `cephobjectstores`
- `cephobjectstoreusers`
- `cephobjectzonegroups`
- `cephobjectzones`
- `cephrbdmirrors`

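The finalizer removal is the usual `kubectl patch` finalizer-stripping trick
-- a minimal sketch, looping over the CRD kinds above:

```
# strip finalizers so deletion isn't blocked, then delete -- repeat per kind
for kind in cephclusters cephblockpools cephobjectstores cephobjectstoreusers; do
  for r in $(kubectl -n rook-ceph get "$kind" -o name); do
    kubectl -n rook-ceph patch "$r" --type merge -p '{"metadata":{"finalizers":[]}}'
    kubectl -n rook-ceph delete "$r" --wait=false
  done
done
```
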
### /var/lib/rook

I had these backed up for later, but I didn't want them there when the cluster
came online.

### osd disks

I did not wipe any devices.

First, I obviously didn't want to wipe the disk with the data on it. As for
the other, now-useless OSDs that I had mistakenly created over the old ones: I
knew spawning the `rook-operator` would create new OSDs on them only if they
didn't belong to an old ceph cluster.

This would make troubleshooting `osd-0` more difficult, as I'd now have to
consider analysing the status reported by `osd-1` and `osd-2`.

## provisioning

Since at this point I only cared about `osd-0`, and it was beneficial to have
fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count
to `1` within the helm `values.yaml`.

Following this, I simply reconciled the chart.

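If the chart were driven directly with helm rather than through GitOps, the
same change would be a one-liner -- a sketch, assuming the usual `rook-release`
repo and the `cephClusterSpec` values path of the `rook-ceph-cluster` chart:

```
helm upgrade rook-ceph-cluster rook-release/rook-ceph-cluster \
  -n rook-ceph --reuse-values \
  --set cephClusterSpec.mon.count=1
```
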
I noticed that `rook-ceph-operator`, `rook-ceph-mon-a` and `rook-ceph-mgr-a`
came online as expected.

Because the OSDs were part of an old cluster, I now had a ceph cluster with no
OSDs, as shown in the `rook-ceph-osd-prepare-*` jobs for each node.

```
osd.0: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
osd.1: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
```

# surgery

With less noise and a clean slate, it was time to attempt to fix this mess.

- adopt `osd-0` into the new cluster;
- remove the corrupted pgs from `osd-0`;
- bring up two new OSDs for replication;

## osd-0

I started by trying to determine how I could _safely_ remove the offending
objects. If that happened, the OSD would have no issue with the erasure-coding
profile since the pgs wouldn't exist, and the OSD daemon should start.

- If the placement groups contained only objects created by the
  _RADOS Object Gateway_, then I could simply remove the pgs.

- If, however, the pgs contained both the former _and_ block device objects,
  it would require careful removal of all non-rbd (block storage) objects, as
  purging the entire placement groups would lose valuable data.

Since OSD pools have a `1:N` relationship with pgs, the second scenario seemed
unlikely, perhaps impossible.

Next, I needed to inspect the OSD somehow, because the existing deployment
would continuously crash.

`kubectl rook-ceph debug start rook-ceph-osd-0`

Running this command allowed me to observe the OSD without it actually joining
the cluster. The "real" OSD deployment need only be scheduled; it being in a
continuous crash loop was ok.

Once you execute that command, it will scale the OSD daemon down and create a
new deployment that mirrors the configuration but _without_ the daemon
running, in order to perform maintenance.

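The full cycle with the `kubectl rook-ceph` plugin looks roughly like this --
a sketch, with the maintenance deployment name taken from the pod names that
show up below:

```
# scale the real OSD down and start a maintenance copy without the daemon
kubectl rook-ceph debug start rook-ceph-osd-0

# shell into the maintenance pod to run ceph-objectstore-tool and friends
kubectl -n rook-ceph exec -it deploy/rook-ceph-osd-0-maintenance -- bash

# when finished, remove the maintenance deployment and scale the OSD back up
kubectl rook-ceph debug stop rook-ceph-osd-0
```
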
Now in a shell in the debug OSD container, I confirmed the failing pgs
belonged to the object storage pool.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS
12.0 0 0 0 0 0 0 0 0 unknown 8h
12.1 0 0 0 0 0 0 0 0 unknown 8h
12.2 0 0 0 0 0 0 0 0 unknown 8h
12.3 0 0 0 0 0 0 0 0 unknown 8h
12.4 0 0 0 0 0 0 0 0 unknown 8h
12.5 0 0 0 0 0 0 0 0 unknown 8h
12.6 0 0 0 0 0 0 0 0 unknown 8h
12.7 0 0 0 0 0 0 0 0 unknown 8h
12.8 0 0 0 0 0 0 0 0 unknown 8h
12.9 0 0 0 0 0 0 0 0 unknown 8h
12.a 0 0 0 0 0 0 0 0 unknown 8h
12.b 0 0 0 0 0 0 0 0 unknown 8h
12.c 0 0 0 0 0 0 0 0 unknown 8h
12.d 0 0 0 0 0 0 0 0 unknown 8h
12.e 0 0 0 0 0 0 0 0 unknown 8h
12.f 0 0 0 0 0 0 0 0 unknown 8h
12.10 0 0 0 0 0 0 0 0 unknown 8h
12.11 0 0 0 0 0 0 0 0 unknown 8h
12.12 0 0 0 0 0 0 0 0 unknown 8h
12.13 0 0 0 0 0 0 0 0 unknown 8h
12.14 0 0 0 0 0 0 0 0 unknown 8h
12.15 0 0 0 0 0 0 0 0 unknown 8h
12.16 0 0 0 0 0 0 0 0 unknown 8h
12.17 0 0 0 0 0 0 0 0 unknown 8h
12.18 0 0 0 0 0 0 0 0 unknown 8h
12.19 0 0 0 0 0 0 0 0 unknown 8h
12.1a 0 0 0 0 0 0 0 0 unknown 8h
12.1b 0 0 0 0 0 0 0 0 unknown 8h
```

Seeing this, I checked how many placement groups prefixed with `12` existed on
disk, using the actual path to the OSD data.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep ^12
12.bs1
12.6s1
12.1fs0
12.1ds1
12.15s0
12.16s0
12.11s0
12.12s2
12.0s0
12.17s2
12.4s1
12.9s0
12.19s0
12.cs2
12.13s0
12.14s2
12.3s2
12.1as0
12.1bs2
12.as1
12.1es1
12.1cs2
12.2s2
12.8s1
12.7s2
12.ds0
12.es0
12.fs0
12.18s0
12.1s0
12.5s1
12.10s2
```

I still needed to be convinced I wasn't removing any valuable data.
I inspected a few of them to be sure.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 12.10s0 --op list
"12.10s0", {"oid" :"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/07/19/071984080b32e2867 f 1ac6ec2b7d2b8724bc5d75e2850b5e7 f20040ee52F55d1.2~e7rYg3S
"hash" :1340195137, "max":0, "pool":12, "namespace":"","shard_id":2, "max":0}

"12.10s0", {"oid" :"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/9a/82/9a82a64c3a8439c75d8e584181427b073712afd1454747bec3dcb84bcbe19ac5. 2~urbG4nd
"hash" :4175566657, "max":0, "pool":12, "namespace":"","shard_id":2, "max":0}

"12.10s0", {"oid" :"7d927F08-bd9b-4d4b-bfc1-d331eb216e68.22197937.1__shadow Windows Security Internals.pdf.2~g9stQ9inkWvsTq33S9z5xNEHEgST2H4.1_1","key":"", "snapid":-
"shard id":2,"max":0}]
...
```

With this information, I now knew:

- the log exceptions matched the pgs that were impacted by the change in the
  erasure-coding configuration;
- `ceph-objectstore.rgw.buckets.data` -- the pool whose configuration had been
  changed -- was the pool those pgs belonged to;
- the objects themselves looked familiar: they matched the contents of the
  buckets, e.g. books.

Since I _did_ modify the erasure-coding profile, this all started to make
sense.

The next operation was to carefully remove the offending placement groups.
Simply removing the pool wouldn't work: with the OSD daemon unable to start,
it would know nothing about that change and would still not have enough chunks
to come alive.

```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.17s2
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 12.17s2_head removing 12.17s2
Remove successful

[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.0s0
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 12.0s0_head removing 12.0s0
Remove successful
```

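Rather than running that by hand for each pg, the same thing can be looped --
a sketch, reusing the `list-pgs` output from earlier inside the maintenance
container:

```
# remove every pool-12 pg shard present on this OSD
for pg in $(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep '^12\.'); do
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --type bluestore --force --pgid "$pg" --op remove
done
```
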
I did this for every pg in the list. Once I scaled down the maintenance
deployment, I scaled `deployment/rook-ceph-osd-0` back up to start the daemon
with (hopefully) agreeable placement groups and, thankfully, it came alive.

```
k get pods -n rook-ceph

rook-ceph-osd-0-6f57466c78-bj96p 2/2 Running
```

An eager run of `ceph -s` produced both relief and disappointment. The OSD was up, but the pgs were in an `unknown` state.

```
ceph -s
...
  data:
    pools:   12 pools, 169 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             169 unknown
```

At this point, I had mentioned to my friend (helping by playing GuildWars 2)
that we might be saved. It seemed promising as we at least had `osd-0` running
again now that the troublesome pgs were removed.

He agreed, and contemplated changing his character's hair colour instead of
saving our data.

## mons

### restoring

I had `/var/lib/rook` backups from each node with the old mon data. At this
point, with the correct number of placement groups and seeing 100% of them
remaining in an `unknown` state, it seemed the next step was to restore the
mons.

I knew from reading the `rook-ceph` docs that if you want to restore the data
of a monitor to a new cluster, you have to inject the `monmap` into the old
mon's `store.db`.

Before doing this, I scaled `deployment/rook-ceph-mon-a` down to `0` first.

Then, navigating to a directory on my local machine with the backups, I ran a
container to modify the `monmap` on my local fs.

`docker run -it --rm --entrypoint bash -v .:/var/lib/rook rook/ceph:v1.14.9`

```
touch /etc/ceph/ceph.conf
cd /var/lib/rook
ceph-mon --extract-monmap monmap --mon-data ./mon-q/data
monmaptool monmap --rm q
monmaptool monmap --rm s
monmaptool monmap --rm t
```

Now that the old mons `q`, `s` and `t` were removed from the map, I had to add
the new cluster mon `rook-ceph-mon-a`, created as part of the new
ceph-cluster.

```
monmaptool monmap --addv a '[v2:10.50.1.10:3300,v1:10.50.1.10:6789]'
ceph-mon --inject-monmap monmap --mon-data ./mon-q/data
exit
```

Shoving it back up to the node `rook-ceph-mon-a` lives on:

`scp -r ./mon-q/data/* 3bb@10.50.1.10:/var/lib/rook/mon-a/data/`

I rescheduled the deployment, and although the mon log output was no longer
suggesting suicide, all our pgs still remained in an `unknown` state.

## recovering the mon store

It turns out that you can actually recover the mon store. It's not a huge deal
so long as your OSDs have data integrity.

Scaling the useless `mon-a` down, I copied the existing `mon-a` data onto the
`rook-ceph-osd-0` daemon container.

Another `osd-0` debug container... `k rook-ceph debug start rook-ceph-osd-0`

I rebuilt the mon data using the existing RocksDB kv store.

This would have worked without the backup, but I was interested to see the
`osdmaps` trimmed due to the other 2 removed OSDs.

```
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-0/ --op update-mon-db --mon-store-path /tmp/mon-a/data/
osd.0 : 3099 osdmaps trimmed, 635 osdmaps added.
```

```
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-authtool /tmp/mon-a/keyring -n mon. --cap mon 'allow *' --gen-key

[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-monstore-tool /tmp/mon-a/data rebuild -- --keyring /tmp/mon-a/keyring
4 rocksdb: [db/flush_job.cc:967] [default] [JOB 3] Level-0 flush table #3433: 62997231 bytes OK
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1722831731454649, "job": 3, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [2, 0, 0, 0, 0, 0, 2], "immutable_memtables": 1}
4 rocksdb: [file/delete_scheduler.cc:74] Deleted file /tmp/mon-a/data/store.db/003433.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1723067397472153, "job": 4, "event": "table_file_deletion", "file_number": 3433}
4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
```

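Getting the rebuilt store back under `mon-a` is just file shuffling -- a
sketch from the workstation side, with the maintenance pod name left as a
placeholder:

```
# pull the rebuilt store out of the maintenance pod, push it to the node
# hosting mon-a, then scale the mon back up
kubectl -n rook-ceph cp <osd-0-maintenance-pod>:/tmp/mon-a/data ./mon-a-rebuilt
scp -r ./mon-a-rebuilt/* 3bb@10.50.1.10:/var/lib/rook/mon-a/data/
kubectl -n rook-ceph scale deployment rook-ceph-mon-a --replicas=1
```
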
After copying the now _rebuilt_ `mon-a` store back, and bringing everything up
again, the cluster was finally resurrecting.

It took some time for the rebalancing and replication to occur, but hours
later, `ceph -s` reported a healthy cluster and services resumed being
entirely unaware of the chaos that had ensued over the previous few days:

```
  cluster:
    id:     47f25963-57c0-4b3b-9b35-bbf68c09eec6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: b(active, since 8h), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 3h), 3 in (since 3h)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 4.88k objects, 16 GiB
    usage:   47 GiB used, 1.1 TiB / 1.2 TiB avail
    pgs:     169 active+clean

  io:
    client:   639 B/s rd, 9.0 KiB/s wr, 1 op/s rd, 1 op/s wr
    recovery: 834 KiB/s, 6 objects/s
```

It seemed like a miracle, but the credit belongs entirely to how resilient
ceph is built to be, tolerating that level of abuse.

# why

_Data appears to be lost_

- ceph OSD daemons fail to start;
- the OSDs could not reconstruct the data from chunks;
- the `osdmap` referenced a faulty erasure-coding profile;
- the monstore `osdmap` still had a reference to the above erasure-coding
  profile;
- the erasure-coding profile was changed to a topology impossible to satisfy
  under the current architecture;
- 2 disks were zapped, hitting the ceiling of the failure domain for
  `ceph-objectstore.rgw.buckets.data_ecprofile`;

The monitor `osdmap` still contained the bad EC profile.

`ceph-monstore-tool /tmp/mon-bak get osdmap > osdmap.bad`

`osdmaptool --dump json osdmap.bad | grep -i profile`

```
"erasure_code_profiles":{
   "ceph-objectstore.rgw.buckets.data_ecprofile":{
      "crush-device-class":"",
      "crush-failure-domain":"host",
      "crush-root":"default",
      "jerasure-per-chunk-alignment":"false",
      "k":"3",
      "m":"2"
   }
}
```

After rebuilding the monstore...

`ceph-monstore-tool /tmp/mon-a get osdmap > osdmap.good`

```
"erasure_code_profiles":{
   "ceph-objectstore.rgw.buckets.data_ecprofile":{
      "crush-device-class":"",
      "crush-failure-domain":"host",
      "crush-root":"default",
      "jerasure-per-chunk-alignment":"false",
      "k":"3",
      "m":"1"
   }
}
```

Therefore, it seems as if I could have attempted to rebuild the monstore
first, possibly circumventing the `ECUtil.h` assert errors. The placement
groups on `osd-0` were still mapping to 3 OSDs, not 5.

```
[root@ad9e4c6e7343 rook]# osdmaptool --test-map-pgs-dump --pool 12 osdmap
osdmaptool: osdmap file 'osdmap'
pool 12 pg_num 32
12.0 [2147483647,2147483647,2147483647] -1
12.1 [2147483647,2147483647,2147483647] -1
12.2 [2147483647,2147483647,2147483647] -1
12.3 [2147483647,2147483647,2147483647] -1
12.4 [2147483647,2147483647,2147483647] -1
12.5 [2147483647,2147483647,2147483647] -1
12.6 [2147483647,2147483647,2147483647] -1
12.7 [2147483647,2147483647,2147483647] -1
12.8 [2147483647,2147483647,2147483647] -1
12.9 [2147483647,2147483647,2147483647] -1
12.a [2147483647,2147483647,2147483647] -1
12.b [2147483647,2147483647,2147483647] -1
12.c [2147483647,2147483647,2147483647] -1
12.d [2147483647,2147483647,2147483647] -1
12.e [2147483647,2147483647,2147483647] -1
12.f [2147483647,2147483647,2147483647] -1
12.10 [2147483647,2147483647,2147483647] -1
12.11 [2147483647,2147483647,2147483647] -1
12.12 [2147483647,2147483647,2147483647] -1
12.13 [2147483647,2147483647,2147483647] -1
12.14 [2147483647,2147483647,2147483647] -1
12.15 [2147483647,2147483647,2147483647] -1
12.16 [2147483647,2147483647,2147483647] -1
12.17 [2147483647,2147483647,2147483647] -1
12.18 [2147483647,2147483647,2147483647] -1
12.19 [2147483647,2147483647,2147483647] -1
12.1a [2147483647,2147483647,2147483647] -1
12.1b [2147483647,2147483647,2147483647] -1
12.1c [2147483647,2147483647,2147483647] -1
12.1d [2147483647,2147483647,2147483647] -1
12.1e [2147483647,2147483647,2147483647] -1
12.1f [2147483647,2147483647,2147483647] -1
#osd count first primary c wt wt
osd.0 0 0 0 0.488297 1
osd.1 0 0 0 0.488297 1
osd.2 0 0 0 0.195297 1
in 3
avg 0 stddev 0 (-nanx) (expected 0 -nanx))
size 3 32
```

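The `crush` file used below comes straight out of that same osdmap -- a quick
sketch:

```
# extract the crush map from the osdmap, and decompile it for reading
osdmaptool osdmap --export-crush crush
crushtool -d crush -o crush.txt
```
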
Since the cluster did not have enough OSDs (it wanted 5 with `k=3,m=2`), the
rule can be tested against the old crush map, with `--num-rep` representing
the required OSDs, i.e. `k+m`.

With the original erasure-coding profile (`k+m=3`), everything looks good --
no bad mappings.

```
[root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 3 --show-bad-mappings

// healthy
```

With `k+m=5`, though -- or anything greater than `3` OSDs...

```
[root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 5 --show-bad-mappings
...
bad mapping rule 20 x 1002 num_rep 5 result [0,2147483647,1,2,2147483647]
bad mapping rule 20 x 1003 num_rep 5 result [0,2147483647,2,1,2147483647]
bad mapping rule 20 x 1004 num_rep 5 result [1,0,2147483647,2,2147483647]
bad mapping rule 20 x 1005 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1006 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1007 num_rep 5 result [0,1,2147483647,2147483647,2]
bad mapping rule 20 x 1008 num_rep 5 result [1,2,0,2147483647,2147483647]
bad mapping rule 20 x 1009 num_rep 5 result [2,1,0,2147483647,2147483647]
bad mapping rule 20 x 1010 num_rep 5 result [0,1,2,2147483647,2147483647]
bad mapping rule 20 x 1011 num_rep 5 result [0,2147483647,2,1,2147483647]
bad mapping rule 20 x 1012 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1013 num_rep 5 result [0,2,2147483647,1,2147483647]
bad mapping rule 20 x 1014 num_rep 5 result [1,0,2147483647,2,2147483647]
bad mapping rule 20 x 1015 num_rep 5 result [2,0,2147483647,1,2147483647]
bad mapping rule 20 x 1016 num_rep 5 result [1,0,2,2147483647,2147483647]
bad mapping rule 20 x 1017 num_rep 5 result [2,0,1,2147483647,2147483647]
bad mapping rule 20 x 1018 num_rep 5 result [1,0,2147483647,2147483647,2]
bad mapping rule 20 x 1019 num_rep 5 result [0,1,2,2147483647,2147483647]
bad mapping rule 20 x 1020 num_rep 5 result [0,2147483647,1,2147483647,2]
bad mapping rule 20 x 1021 num_rep 5 result [2,1,0,2147483647,2147483647]
bad mapping rule 20 x 1022 num_rep 5 result [1,0,2,2147483647,2147483647]
bad mapping rule 20 x 1023 num_rep 5 result [0,2,1,2147483647,2147483647]
```

Mappings were found for _3_ OSDs, with the 4th and 5th slots left unfilled, as
indicated by the largest 32-bit int (`2147483647`, i.e. no OSD). The object
storage data would still have been lost, but rebuilding the monstore first
could have made the recovery of the cluster significantly less painful.