---
title: recovering a rook-ceph cluster
author: "0x3bb"
date: 2024-08-10
---
I share a _rook-ceph_ cluster with a friend of mine approximately 2000km away. I had reservations about whether this would be a good idea at first, because of the latency and the fact that consumer-grade WAN might bring anything that breathes to its knees. However, I'm pleased with how robust it has been under those circumstances.

Getting started with _rook-ceph_ is really simple because it orchestrates everything for you. That's convenient, although if and when you have problems, you can suddenly become the operator of a distributed system you may know very little about, due to the abstraction.

During this incident, I found myself in exactly that position, relying heavily on the (great) documentation for both _rook_ and _ceph_ itself.
# the beginning

In the process of moving nodes, **I accidentally zapped 2 OSDs**.

Given that the _"ceph cluster"_ is basically two bedrooms peered together, we were down to the final OSD.

This sounds bad, but it was fine: the mons were still up, so it was just a matter of removing the old OSDs from the tree and letting replication work its magic.
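For reference, clearing a dead OSD out of the tree boils down to something like this (a sketch with the plain ceph CLI; `osd.1` is just an example id, and rook also ships its own purge job for the same thing):

```
# stop placing data on the dead OSD
ceph osd out osd.1

# remove it from the CRUSH map, delete its auth entry and free its id
ceph osd purge osd.1 --yes-i-really-mean-it
```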
# the real mistake

I noticed that although the block pool was replicated, we had lost all our RADOS object storage.

![](/b/images/ec-2-1.png)

The erasure-coding profile was `k=2, m=1`. That meant we could only tolerate losing a single OSD -- and we had just lost two.
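For context, this is the kind of thing the ceph CLI will show you for that profile (the profile name below is the one rook generated for the object store data pool, as seen later in this post):

```
ceph osd erasure-code-profile ls
ceph osd erasure-code-profile get ceph-objectstore.rgw.buckets.data_ecprofile
# k=2 data chunks + m=1 coding chunk: 3 OSDs used per object, only 1 loss tolerated
```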
Object storage (which our applications interfaced with via the Ceph S3 gateway) was lost. Nothing critical -- we just needed our CSI volume mounts back online for the databases -- packages and other artifacts could easily be restored from backups.

Moving on from this, the first thing I did was _"fix"_ the EC configuration to `k=3, m=2`. This would spread the data over 5 OSDs.
```
- dataChunks: 2
- codingChunks: 1
+ dataChunks: 3
+ codingChunks: 2
```
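The sanity check I skipped here is embarrassingly small -- compare `k+m` against the number of OSDs (and hosts, since the failure domain is `host`) actually in the cluster. A sketch:

```
# k=3, m=2  =>  every object needs 5 chunks on 5 separate hosts
ceph osd tree
ceph osd ls | wc -l   # we had 3
```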
Happy with that, I restored a backup of the objects. Everything was working.

A few days later, _renovatebot_ arrived with a new PR to bump _rook-ceph_ to `1.14.9`.

![](/b/images/rook-ceph-pr.png)

Of course I want that -- the number is bigger than the previous number, so everything is going to be better than before.
# downtime

Following the merge, all services on the cluster went down at once. I checked the OSDs, which were in `CrashLoopBackOff`. Inspecting the logs, I saw a bunch of gibberish and decided to check out the GitHub issues.

Since nothing is ever my fault, I decided to see who or what was to blame.

With no clues, I still hoped this would be relatively simple to fix.

Resigning myself to actually reading the OSD logs in the `rook-ceph-crashcollector` pods, I saw (but did not understand) the problem:

`osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)`

![](/b/images/ec-3-2.png)
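If you end up in the same spot, the previous container logs and ceph's crash module are the quickest routes to that assert (a sketch; the deployment and container names follow rook's usual conventions):

```
# logs from the last crashed OSD container
kubectl -n rook-ceph logs deploy/rook-ceph-osd-0 -c osd --previous

# crash reports gathered by the crashcollector pods
ceph crash ls
ceph crash info <crash-id>
```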
With the _"fixed"_ configuration, what I had actually done was split the object store pool across _5_ OSDs. We had _3_.

Because the `rook-ceph-operator` was rescheduled by the version bump, the OSD daemons had been restarted as part of the update procedure, and they now demanded an answer for data and coding chunks that simply did not exist. Sure enough, `ceph -s` also reported undersized placement groups.

This makes sense, as there weren't enough OSDs to split the data across.
# from bad to suicide

## reverting erasure-coding profile

The first attempt I made was to revert the EC profile back to `k=2, m=1`. The OSDs stayed in the same state, still complaining about the erasure-coding profile.
## causing even more damage

The second attempt (and in hindsight, a very poor choice) was to zap the two other underlying OSD disks:

[ `osd-1`, `osd-2` ].

`zap.sh`
```
DISK="/dev/vdb"

# Zap the disk to a fresh, usable state (zap-all is important, b/c MBR has to be clean)
sgdisk --zap-all $DISK

# Wipe a large portion of the beginning of the disk to remove more LVM metadata that may be present
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync

# SSDs may be better cleaned with blkdiscard instead of dd
blkdiscard $DISK

# Inform the OS of partition table changes
partprobe $DISK
```
Perhaps having two other OSDs online would allow me to replicate the healthy pgs without the offending RADOS objects.

Sure enough, the 2 new OSDs started.

But since `osd-0` -- the one with the actual data -- still wouldn't start, the cluster remained in a broken state.

Now down to the last OSD, I knew at this point that I was going to make many, many more mistakes. If I was going to continue, I needed to back up the logical volume used by `osd-0` first, which I did.
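The backup itself was nothing clever -- a raw copy of the logical volume is enough. Roughly (the VG/LV names here are placeholders for whatever `ceph-volume` created on that node):

```
lvs   # find the ceph-* volume group / logical volume backing osd-0

# stream a raw image of the LV to another machine
dd if=/dev/ceph-vg/osd-block-lv bs=4M status=progress | ssh backup-host 'cat > osd-0.img'
```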
## clutching at mons

I switched my focus to a new narrative: _something was wrong with the mons_.

They were in quorum, but I still couldn't figure out why the now last-surviving OSD was having issues starting.

The `mon_host` configuration was correct in `secret/rook-ceph-config`:

`mon_host: '[v2:10.50.1.10:3300,v1:10.50.1.10:6789],[v2:10.50.1.11:3300,v1:10.50.1.11:6789],[v2:10.55.1.10:3300,v1:10.55.1.10:6789]'`
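(The value is base64-encoded inside the secret; something like this pulls it out.)

```
kubectl -n rook-ceph get secret rook-ceph-config \
  -o jsonpath='{.data.mon_host}' | base64 -d
```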
Nothing had changed with the underlying data on those mons. Maybe there was corruption in the monitor store? The monitors maintain a map of the cluster state: the `osdmap`, the `crushmap`, etc.
My theory was: if the cluster map did not have the correct placement groups and other OSD metadata, then perhaps replacing it would help.

I replaced the `store.db` data on the mon used by the failing OSD deployment with a copy from another mon, and scaled the deployment back up:

`2024-08-04T00:49:47.698+0000 7f12fc78f700 0 mon.s@2(probing) e30 removed from monmap, suicide.`

With all data potentially lost and it being almost 1AM, that message was not very reassuring. I did manually change the monmap and inject it back in, but ended up back in the same position.

I figured I had done enough experimenting at this point and had to look deeper, outside of the deployment. The only meaningful change we had made was the erasure-coding profile.
# initial analysis

First, I looked back at the OSD logs. They are monstrous, so I focused on the erasure-coding errors:
```
2024-08-03T18:58:49.845+0000 7f2a7d6da700 1 osd.0 pg_epoch: 8916
pg[12.11s2( v 8915'287 (0'0,8915'287] local-lis/les=8893/8894 n=23 ec=5063/5059
lis/c=8893/8413 les/c/f=8894/8414/0 sis=8916 pruub=9.256847382s)
[1,2,NONE]p1(0) r=-1 lpr=8916 pi=[8413,8916)/1 crt=8915'287 mlcod 0'0 unknown
NOTIFY pruub 20723.630859375s@ mbc={}] state<Start>: transitioning to Stray
2024-08-03T18:58:49.849+0000 7f2a7ced9700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h:
In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
7f2a7e6dc700 time 2024-08-03T18:58:49.853351+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h:
34: FAILED ceph_assert(stripe_width % stripe_size == 0)
```

```
-2> 2024-08-03T18:59:00.086+0000 7ffa48b48640 5 osd.2 pg_epoch: 8894 pg[12.9(unlocked)] enter Initial
...
/src/osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)
```
I noticed a pattern: all the failing pg IDs were prefixed with `12`.

Seeing this, I concluded:

- the mons were ok and in quorum;
- the `osd-0` daemon would not start;
- other, fresh OSDs (`osd-1` and `osd-2`) started fine, so this was a data integrity issue confined to `osd-0` (and to the previous OSDs, had I not nuked them);
- the cause was the change to the erasure-coding profile, which affected only the one pool where the chunk distribution was modified;

Accepting the loss of the minuscule amount of data on the object storage pool in favor of saving the block storage, I could correct the misconfiguration.
# preparation

To avoid troubleshooting issues caused by my earlier failed attempts, I decided I would clear out the existing CRDs and focus first on getting the OSD with the data back online. If I ever got the data back, I'd still be conscious of the prior misconfiguration and would have to do this regardless.

- backup the important shit;
- clear out the `rook-ceph` namespace;
## backups

- the logical volume for `osd-0`, so I could re-attach it and afford mistakes;
- `/var/lib/rook` on all nodes, containing the mon data;

## removal

### deployments/daemonsets

These were the first to go, as I didn't want the `rook-operator` persistently re-creating Kubernetes objects while I was actively trying to kill them.
### crds

Next, removal of all the `rook-ceph` resources, along with the finalizers that protect them from being removed (see the sketch after this list):

- `cephblockpoolradosnamespaces`
- `cephblockpools`
- `cephbucketnotifications`
- `cephclients`
- `cephclusters`
- `cephcosidrivers`
- `cephfilesystemmirrors`
- `cephfilesystems`
- `cephfilesystemsubvolumegroups`
- `cephnfses`
- `cephobjectrealms`
- `cephobjectstores`
- `cephobjectstoreusers`
- `cephobjectzonegroups`
- `cephobjectzones`
- `cephrbdmirrors`
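In practice that was a loop of deleting each resource and stripping its finalizers so the deletes don't hang forever -- roughly like this, using `cephclusters` as the example and assuming the usual `rook-ceph` cluster name; the same pattern applies to the rest of the list:

```
# clear finalizers, then delete without waiting on the (now absent) operator
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"metadata":{"finalizers":[]}}'
kubectl -n rook-ceph delete cephcluster rook-ceph --wait=false
```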
### /var/lib/rook

I had these backed up for later, but I didn't want them present when the new cluster came online.

### osd disks

I did not wipe any devices.

First, I obviously didn't want to wipe the disk with the data on it. As for the other, now useless OSDs that I had mistakenly created over the old ones: I knew that the spawned `rook-operator` would only create new OSDs on them if they didn't belong to an old ceph cluster.

New OSDs there would have made troubleshooting `osd-0` more difficult, as I'd then have to consider the status reported from `osd-1` and `osd-2` as well.
## provisioning

Since at this point I only cared about `osd-0`, and it was beneficial to have fewer moving parts to work with, I changed the `rook-ceph-cluster` mon count to `1` within the helm `values.yaml`.
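The chart gets reconciled for us, but the change is equivalent to something like this (a sketch; `cephClusterSpec.mon.count` is the chart value I mean, and the release/repo names are assumptions):

```
helm upgrade rook-ceph-cluster rook-release/rook-ceph-cluster \
  -n rook-ceph --reuse-values \
  --set cephClusterSpec.mon.count=1
```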
Following this, I simply reconciled the chart.

The `rook-ceph-operator`, `rook-ceph-mon-a` and `rook-ceph-mgr-a` came online as expected.

Because the OSDs were part of an old cluster, I now had a ceph cluster with no OSDs, as shown in the `rook-ceph-osd-prepare-*` jobs for each node:
```
osd.0: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
osd.1: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
```
# surgery

With less noise and a clean slate, it was time to attempt to fix this mess:

- adopt `osd-0` into the new cluster;
- remove the corrupted pgs from `osd-0`;
- bring up two new OSDs for replication;
## osd-0

I started by trying to determine how I would _safely_ remove the offending objects. If that happened, the OSD would have no issues with the erasure-coding profile since the pgs wouldn't exist, and the OSD daemon should start.

- If the placement groups contained only objects created by the _RADOS Object Gateway_, then I could simply remove the pgs.

- If, however, the pgs contained both the former _and_ block device objects, then it would require careful removal of all non-rbd (block storage) objects, as purging the entire placement groups would mean losing valuable data.

Since OSD pools have a `1:N` relationship with pgs, the second scenario seemed unlikely, perhaps impossible.
Next, I needed to inspect the OSD somehow, because the existing deployment would continuously crash.

`kubectl rook-ceph debug start rook-ceph-osd-0`

Running this command allowed me to observe the OSD without it actually joining the cluster. The "real" OSD deployment need only be scheduled; it crashing continuously was ok.

Once you execute that command, it scales the OSD daemon down and creates a new deployment that mirrors the configuration but runs _without_ the daemon, so that maintenance can be performed.
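(The inverse, once you're done poking around, restores the real deployment -- same plugin:)

```
kubectl rook-ceph debug stop rook-ceph-osd-0
```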
Now, in a shell inside the debug OSD container, I confirmed that the failing pgs belonged to the object storage pool.
```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS
12.0 0 0 0 0 0 0 0 0 unknown 8h
12.1 0 0 0 0 0 0 0 0 unknown 8h
12.2 0 0 0 0 0 0 0 0 unknown 8h
12.3 0 0 0 0 0 0 0 0 unknown 8h
12.4 0 0 0 0 0 0 0 0 unknown 8h
12.5 0 0 0 0 0 0 0 0 unknown 8h
12.6 0 0 0 0 0 0 0 0 unknown 8h
12.7 0 0 0 0 0 0 0 0 unknown 8h
12.8 0 0 0 0 0 0 0 0 unknown 8h
12.9 0 0 0 0 0 0 0 0 unknown 8h
12.a 0 0 0 0 0 0 0 0 unknown 8h
12.b 0 0 0 0 0 0 0 0 unknown 8h
12.c 0 0 0 0 0 0 0 0 unknown 8h
12.d 0 0 0 0 0 0 0 0 unknown 8h
12.e 0 0 0 0 0 0 0 0 unknown 8h
12.f 0 0 0 0 0 0 0 0 unknown 8h
12.10 0 0 0 0 0 0 0 0 unknown 8h
12.11 0 0 0 0 0 0 0 0 unknown 8h
12.12 0 0 0 0 0 0 0 0 unknown 8h
12.13 0 0 0 0 0 0 0 0 unknown 8h
12.14 0 0 0 0 0 0 0 0 unknown 8h
12.15 0 0 0 0 0 0 0 0 unknown 8h
12.16 0 0 0 0 0 0 0 0 unknown 8h
12.17 0 0 0 0 0 0 0 0 unknown 8h
12.18 0 0 0 0 0 0 0 0 unknown 8h
12.19 0 0 0 0 0 0 0 0 unknown 8h
12.1a 0 0 0 0 0 0 0 0 unknown 8h
12.1b 0 0 0 0 0 0 0 0 unknown 8h
```
Seeing this, I first checked how many placement groups prefixed with `12` existed, using the actual path to the OSD.
```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep ^12
12.bs1
12.6s1
12.1fs0
12.1ds1
12.15s0
12.16s0
12.11s0
12.12s2
12.0s0
12.17s2
12.4s1
12.9s0
12.19s0
12.cs2
12.13s0
12.14s2
12.3s2
12.1as0
12.1bs2
12.as1
12.1es1
12.1cs2
12.2s2
12.8s1
12.7s2
12.ds0
12.es0
12.fs0
12.18s0
12.1s0
12.5s1
12.10s2
```
I still needed to be convinced I wasn't removing any valuable data, so I inspected a few of them to be sure.
```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 12.10s0 --op list
"12.10s0", {"oid":"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/07/19/071984080b32e2867f1ac6ec2b7d2b8724bc5d75e2850b5e7f20040ee52F55d1.2~e7rYg3S
"hash":1340195137, "max":0, "pool":12, "namespace":"", "shard_id":2, "max":0}

"12.10s0", {"oid":"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/9a/82/9a82a64c3a8439c75d8e584181427b073712afd1454747bec3dcb84bcbe19ac5.2~urbG4nd
"hash":4175566657, "max":0, "pool":12, "namespace":"", "shard_id":2, "max":0}

"12.10s0", {"oid":"7d927F08-bd9b-4d4b-bfc1-d331eb216e68.22197937.1__shadow_Windows Security Internals.pdf.2~g9stQ9inkWvsTq33S9z5xNEHEgST2H4.1_1", "key":"", "snapid":-
"shard_id":2, "max":0}]
...
```
With this information, I now knew:

- the log exceptions matched the pgs that were impacted by the change in the erasure-coding configuration;
- `ceph-objectstore.rgw.buckets.data` -- the pool whose configuration I had changed -- owned those pgs;
- the objects themselves were familiar: they matched what was in the buckets, e.g. books;

Since I _did_ modify the erasure-coding profile, this was all starting to make sense.
Carefully, the next operation was to remove the offending placement groups. Simply removing the pool wouldn't work: with the OSD daemon not starting, it would know nothing about that change and would still not have enough chunks to come alive.
```
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.17s2
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 12.17s2_head removing 12.17s2
Remove successful

[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.0s0
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 12.0s0_head removing 12.0s0
Remove successful
```
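Rather than doing that 32 times by hand, the same flags loop cleanly over the earlier `list-pgs` output (a sketch, run inside the maintenance container):

```
for pg in $(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep ^12); do
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op remove --type bluestore --force --pgid "$pg"
done
```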
I did this for every PG listed above. Once I scaled down the maintenance deployment, I scaled `deployment/rook-ceph-osd-0` back up to start the daemon with (hopefully) agreeable placement groups -- and thankfully, it came alive.
```
k get pods -n rook-ceph

rook-ceph-osd-0-6f57466c78-bj96p 2/2 Running
```
An eager run of `ceph -s` produced both relief and disappointment. The OSD was up, but the pgs were in an `unknown` state.
```
ceph -s
...
  data:
    pools:   12 pools, 169 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             169 unknown
```
At this point, I had mentioned to my friend (helping by playing GuildWars 2) that we might be saved. It seemed promising, as we at least had `osd-0` running again now that the troublesome pgs were removed.

He agreed, and contemplated changing his character's hair colour instead of saving our data.
## mons

### restoring

I had `/var/lib/rook` backups from each node with the old mon data. At this point, with the correct number of placement groups but 100% of them stuck in an `unknown` state, it seemed the next step was to restore the mons.

I knew from reading the `rook-ceph` docs that if you want to restore the data of a monitor to a new cluster, you have to inject the `monmap` into the old mon's `store.db`.

Before doing this, I scaled `deployment/rook-ceph-mon-a` down to `0`.
Then, navigating to a directory on my local machine containing the backups, I ran a container to modify the `monmap` on my local fs:

`docker run -it --rm --entrypoint bash -v "$PWD":/var/lib/rook rook/ceph:v1.14.9`
```
touch /etc/ceph/ceph.conf
cd /var/lib/rook
ceph-mon --extract-monmap monmap --mon-data ./mon-q/data
monmaptool monmap --rm q
monmaptool monmap --rm s
monmaptool monmap --rm t
```
Now that the old mons `q`, `s` and `t` were removed from the map, I had to add the new cluster's mon, `rook-ceph-mon-a`, created when the new ceph cluster was provisioned.
```
monmaptool monmap --addv a '[v2:10.50.1.10:3300,v1:10.50.1.10:6789]'
ceph-mon --inject-monmap monmap --mon-data ./mon-q/data
exit
```
Shoving it back up to the node `rook-ceph-mon-a` lives on:

`scp -r ./mon-q/data/* 3bb@10.50.1.10:/var/lib/rook/mon-a/data/`

I rescheduled the deployment, and although the mon log output was no longer suggesting suicide, all our pgs still remained in an `unknown` state.
## recovering the mon store

It turns out that you can actually recover the mon store. It's not a huge deal so long as your OSDs have data integrity.

Scaling the useless `mon-a` down, I copied the existing `mon-a` data onto the `rook-ceph-osd-0` daemon container.

Another `osd-0` debug container... `k rook-ceph debug start rook-ceph-osd-0`

I then rebuilt the mon data using the existing RocksDB kv store.

This would have worked without the backup, but I was interested to see the `osdmaps` trimmed due to the other 2 removed OSDs.
```
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-0/ --op update-mon-db --mon-store-path /tmp/mon-a/data/
osd.0 : 3099 osdmaps trimmed, 635 osdmaps added.
```
```
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-authtool /tmp/mon-a/keyring -n mon. --cap mon 'allow *' --gen-key

[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-monstore-tool /tmp/mon-a/data rebuild -- --keyring /tmp/mon-a/keyring
4 rocksdb: [db/flush_job.cc:967] [default] [JOB 3] Level-0 flush table #3433: 62997231 bytes OK
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1722831731454649, "job": 3, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [2, 0, 0, 0, 0, 0, 2], "immutable_memtables": 1}
4 rocksdb: [file/delete_scheduler.cc:74] Deleted file /tmp/mon-a/data/store.db/003433.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1723067397472153, "job": 4, "event": "table_file_deletion", "file_number": 3433}
4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
```
After copying the now _rebuilt_ `mon-a` store back and bringing everything up again, the cluster was finally resurrecting.

It took some time for the rebalancing and replication to finish, but hours later `ceph -s` reported a healthy cluster, and services resumed, entirely unaware of the chaos that had ensued over the previous few days:
```
  cluster:
    id:     47f25963-57c0-4b3b-9b35-bbf68c09eec6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: b(active, since 8h), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 3h), 3 in (since 3h)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 4.88k objects, 16 GiB
    usage:   47 GiB used, 1.1 TiB / 1.2 TiB avail
    pgs:     169 active+clean

  io:
    client:   639 B/s rd, 9.0 KiB/s wr, 1 op/s rd, 1 op/s wr
    recovery: 834 KiB/s, 6 objects/s
```
It seemed like a miracle, but the credit belongs entirely to ceph, which is built to be resilient enough to tolerate that level of abuse.
# why

_Data appears to be lost_

- ceph OSD daemons fail to start;
- the OSDs could not reconstruct the data from chunks;
- the `osdmap` referenced a faulty erasure-coding profile;
- the monstore `osdmap` still held a reference to that erasure-coding profile;
- the erasure-coding profile had been changed to a topology impossible to satisfy with the current architecture;
- 2 disks were zapped, hitting the ceiling of the failure domain for `ceph-objectstore.rgw.buckets.data_ecprofile`;
The monitor `osdmap` still contained the bad EC profile:

`ceph-monstore-tool /tmp/mon-bak get osdmap > osdmap.bad`

`osdmaptool --dump json osdmap.bad | grep -i profile`
```
"erasure_code_profiles": {
    "ceph-objectstore.rgw.buckets.data_ecprofile": {
        "crush-device-class": "",
        "crush-failure-domain": "host",
        "crush-root": "default",
        "jerasure-per-chunk-alignment": "false",
        "k": "3",
        "m": "2"
    }
}
```
After rebuilding the monstore...

`ceph-monstore-tool /tmp/mon-a get osdmap > osdmap.good`

`osdmaptool --dump json osdmap.good | grep -i profile`
```
"erasure_code_profiles": {
    "ceph-objectstore.rgw.buckets.data_ecprofile": {
        "crush-device-class": "",
        "crush-failure-domain": "host",
        "crush-root": "default",
        "jerasure-per-chunk-alignment": "false",
        "k": "3",
        "m": "1"
    }
}
```
Therefore, it seems as if I could have attempted to rebuild the monstore first, possibly circumventing the _EC assert_ errors entirely. The placement groups on `osd-0` were still mapping to 3 OSDs, not 5.
```
[root@ad9e4c6e7343 rook]# osdmaptool --test-map-pgs-dump --pool 12 osdmap
osdmaptool: osdmap file 'osdmap'
pool 12 pg_num 32
12.0 [2147483647,2147483647,2147483647] -1
12.1 [2147483647,2147483647,2147483647] -1
12.2 [2147483647,2147483647,2147483647] -1
12.3 [2147483647,2147483647,2147483647] -1
12.4 [2147483647,2147483647,2147483647] -1
12.5 [2147483647,2147483647,2147483647] -1
12.6 [2147483647,2147483647,2147483647] -1
12.7 [2147483647,2147483647,2147483647] -1
12.8 [2147483647,2147483647,2147483647] -1
12.9 [2147483647,2147483647,2147483647] -1
12.a [2147483647,2147483647,2147483647] -1
12.b [2147483647,2147483647,2147483647] -1
12.c [2147483647,2147483647,2147483647] -1
12.d [2147483647,2147483647,2147483647] -1
12.e [2147483647,2147483647,2147483647] -1
12.f [2147483647,2147483647,2147483647] -1
12.10 [2147483647,2147483647,2147483647] -1
12.11 [2147483647,2147483647,2147483647] -1
12.12 [2147483647,2147483647,2147483647] -1
12.13 [2147483647,2147483647,2147483647] -1
12.14 [2147483647,2147483647,2147483647] -1
12.15 [2147483647,2147483647,2147483647] -1
12.16 [2147483647,2147483647,2147483647] -1
12.17 [2147483647,2147483647,2147483647] -1
12.18 [2147483647,2147483647,2147483647] -1
12.19 [2147483647,2147483647,2147483647] -1
12.1a [2147483647,2147483647,2147483647] -1
12.1b [2147483647,2147483647,2147483647] -1
12.1c [2147483647,2147483647,2147483647] -1
12.1d [2147483647,2147483647,2147483647] -1
12.1e [2147483647,2147483647,2147483647] -1
12.1f [2147483647,2147483647,2147483647] -1
#osd count first primary c wt wt
osd.0 0 0 0 0.488297 1
osd.1 0 0 0 0.488297 1
osd.2 0 0 0 0.195297 1
in 3
avg 0 stddev 0 (-nanx) (expected 0 -nanx))
size 3 32
```
Since the cluster did not have enough OSDs (it wanted 5 with `k=3,m=2`), the rule could be tested against the old crush map, with `--num-rep` representing the number of required OSDs, i.e. `k+m`.
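(The `crush` file used below was presumably extracted from that same recovered osdmap; something along these lines:)

```
# pull the crush map out of the osdmap so crushtool can test the rule
osdmaptool osdmap --export-crush crush
```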
With the original erasure-coding profile (`k+m=3`), everything looks good -- no bad mappings.
```
[root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 3 --show-bad-mappings

// healthy
```
With `k+m=5`, though -- or anything needing more than the `3` OSDs we had...
```
[root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 5 --show-bad-mappings
...
bad mapping rule 20 x 1002 num_rep 5 result [0,2147483647,1,2,2147483647]
bad mapping rule 20 x 1003 num_rep 5 result [0,2147483647,2,1,2147483647]
bad mapping rule 20 x 1004 num_rep 5 result [1,0,2147483647,2,2147483647]
bad mapping rule 20 x 1005 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1006 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1007 num_rep 5 result [0,1,2147483647,2147483647,2]
bad mapping rule 20 x 1008 num_rep 5 result [1,2,0,2147483647,2147483647]
bad mapping rule 20 x 1009 num_rep 5 result [2,1,0,2147483647,2147483647]
bad mapping rule 20 x 1010 num_rep 5 result [0,1,2,2147483647,2147483647]
bad mapping rule 20 x 1011 num_rep 5 result [0,2147483647,2,1,2147483647]
bad mapping rule 20 x 1012 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1013 num_rep 5 result [0,2,2147483647,1,2147483647]
bad mapping rule 20 x 1014 num_rep 5 result [1,0,2147483647,2,2147483647]
bad mapping rule 20 x 1015 num_rep 5 result [2,0,2147483647,1,2147483647]
bad mapping rule 20 x 1016 num_rep 5 result [1,0,2,2147483647,2147483647]
bad mapping rule 20 x 1017 num_rep 5 result [2,0,1,2147483647,2147483647]
bad mapping rule 20 x 1018 num_rep 5 result [1,0,2147483647,2147483647,2]
bad mapping rule 20 x 1019 num_rep 5 result [0,1,2,2147483647,2147483647]
bad mapping rule 20 x 1020 num_rep 5 result [0,2147483647,1,2147483647,2]
bad mapping rule 20 x 1021 num_rep 5 result [2,1,0,2147483647,2147483647]
bad mapping rule 20 x 1022 num_rep 5 result [1,0,2,2147483647,2147483647]
bad mapping rule 20 x 1023 num_rep 5 result [0,2,1,2147483647,2147483647]
```
Mappings were found for _3_ OSDs, but the 4th and 5th placements were missing, indicated by the largest 32-bit int (CRUSH's way of saying there is no OSD to map). The object storage data would still have been lost, but rebuilding the monstore first could have made the recovery of the cluster significantly less painful.