28 KiB
title | author | date |
---|---|---|
recovering a rook-ceph cluster | 0x3bb | M08-10-2024 |
I share a rook-ceph cluster between a friend of mine approximately 2000km away. I had reservations whether this would be a good idea at first because of the latency and the fact consumer grade WAN might bring anything that breathes to its knees. However, I'm pleased with how robust it has been under those circumstances.
Getting started with rook-ceph is really simple because it orchestrates everything for you. That's convenient, although, if and when you have problems, you can suddenly become an operator of a distributed system that you may know very little about due to the abstraction.
During this incident, I found myself in exactly that position, relying heavily on the (great) documentation for both rook and ceph itself.
the beginning
In the process of moving nodes, I accidentally zapped 2 OSDs.
Given that the "ceph cluster" is basically two bedrooms peered together, we were down to the final OSD.
This sounds bad, but it was fine: mons were still up, just a matter of removing the old OSDs from the tree and letting replication work its magic.
the real mistake
I noticed although the block pool was replicated, we had lost all our RADOS object storage.
The erasure-coding profile was k=2, m=1
. That meant we could only lose 2
OSDs, which had already happened.
Object storage (which our applications interfaced with via Ceph S3 Gateway) was lost. Nothing critical -- we just needed our CSI volume mounts back online for databases -- packages and other artifacts could easily be restored from backups.
Moving on from this, the first thing I did was "fix" the EC configuration to
k=3, m=2
. This would spread the data over 5 OSDs.
- dataChunks: 2
- codingChunks: 1
+ dataChunks: 3
+ codingChunks: 2
Happy with that, I restored a backup of the objects. Everything was working.
A few days later, renovatebot arrived with a new PR to bump rook-ceph 1.14.9
.
Of course I want that -- the number is bigger than the previous number so everything is going to be better than before.
downtime
Following the merge, all services on the cluster went down at once. I checked
the OSDs which were in CrashLoopBackOff
. Inspecting the logs, I saw a bunch
of gibberish and decided to check out the GitHub issues. Since nothing is ever
my fault, I decided to see who or what was to blame.
With no clues, I had still hoped this would be relatively simple to fix.
Resigning to actually reading the OSD logs in the rook-ceph-crashcollector
pods, I saw (but did not understand) the problem:
osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)
With the "fixed" configuration, what I had actually done is split the object store pool across 5 OSDs. We had 3.
Due to the rook-ceph-operator
being rescheduled from the version bump, the
OSD daemons had been reloaded as part of the update procedure and now demanded
an answer for the data and coding chunks that simply did not exist. Sure enough,
ceph -s
also reported undersized placement groups.
This makes sense as there weren't enough OSDs to split the data.
from bad to suicide
reverting erasure-coding profile
The first attempt I made was to revert the EC profile back to k=2, m=1
. The
OSDs were still in the same state complaining about the erasure-coding
profile.
causing even more damage
The second attempt (and in hindsight, a very poor choice) was to zap the other two underlying OSD disks:
[ osd-1
, osd-2
].
zap.sh
DISK="/dev/vdb"
# Zap the disk to a fresh, usable state (zap-all is important, b/c MBR has to be clean)
sgdisk --zap-all $DISK
# Wipe a large portion of the beginning of the disk to remove more LVM metadata that may be present
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
# SSDs may be better cleaned with blkdiscard instead of dd
blkdiscard $DISK
# Inform the OS of partition table changes
partprobe $DISK
Perhaps having two other OSDs online would allow me to replicate the healthy pgs without the offending RADOS objects.
Sure enough, the 2 new OSDs started.
Since the osd-0
with the actual data still wouldn't start, the cluster was
still in a broken state.
Now down to the last OSD, at this point I knew that I was going to make many,
many more mistakes. If I was going to continue I needed to backup the logical
volume used by the osd-0
node before continuing, which I did.
clutching at mons
I switched my focus to a new narrative: something was wrong with the mons.
They were in quorum but I still couldn't figure out why the now last-surviving OSD was having issues starting.
The mon_host
configuration was correct in secret/rook-ceph-config
:
mon_host: '[v2:10.50.1.10:3300,v1:10.50.1.10:6789],[v2:10.50.1.11:3300,v1:10.50.1.11:6789],[v2:10.55.1.10:3300,v1:10.55.1.10:6789]'
Nothing had changed with the underlying data on those mons. Maybe there was
corruption in the monitor store? The monitors maintain a map of the cluster
state: the osdmap
, the crushmap
, etc.
My theory was: if the cluster map did not have the correct placement groups and other OSD metadata then perhaps replacing it would help.
I replaced the data from another mon to the one used for the failing OSD
deployment (store.db
) and scaled up the deployment:
2024-08-04T00:49:47.698+0000 7f12fc78f700 0 mon.s@2(probing) e30 removed from monmap, suicide.
With all data potentially lost and it being almost 1AM, that message was not very reassuring. I did manually change the monmap and inject it back in, but ended up back in the same position.
I figured I had done enough experimenting at this point and had to look deeper outside of the deployment. The only meaningful change we had made was the erasure-coding profile.
initial analysis
First, I looked back to the OSD logs. They are monstrous, so I focused on the erasure-coding errors:
2024-08-03T18:58:49.845+0000 7f2a7d6da700 1 osd.0 pg_epoch: 8916
pg[12.11s2( v 8915'287 (0'0,8915'287] local-lis/les=8893/8894 n=23 ec=5063/5059
lis/c=8893/8413 les/c/f=8894/8414/0 sis=8916 pruub=9.256847382s)
[1,2,NONE]p1(0) r=-1 lpr=8916 pi=[8413,8916)/1 crt=8915'287 mlcod 0'0 unknown
NOTIFY pruub 20723.630859375s@ mbc={}] state<Start>: transitioning to Stray
2024-08-03T18:58:49.849+0000 7f2a7ced9700 -1G
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h:
In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
7f2a7e6dc700 time 2024-08-03T18:58:49.853351+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osd/ECUtil.h:
34: FAILED ceph_assert(stripe_width % stripe_size == 0)
-2> 2024-08-03T18:59:00.086+0000 7ffa48b48640 5 osd.2 pg_epoch: 8894 pg[12.9(unlocked)] enter Initial
...
/src/osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)
I noticed a pattern: all the failing pg IDs were prefixed with 12
.
Seeing this, I had concluded:
- mons were ok and in quorum;
- the
osd-0
daemon fails to start; - other fresh OSDs (
osd-1
andosd-2
) start fine, this was a data integrity issue confined toosd-0
(and the previous OSDs had I not nuked them); - the cause was a change in the erasure-coding profile, which happened on only one pool where the chunk distribution was modified;
Accepting the loss of the miniscule data on the object storage pool in favor of saving the block storage, I could correct the misconfiguration.
preparation
To avoid troubleshoting issues caused from my failed attempts, I decided I would do a clear out of the existing CRDs and just focus first on getting the OSD with the data back online. If I ever got the data back, then I'd probably be conscious of prior misconfiguration and have to do so regardless.
- backup the important shit;
- clear out the
rook-ceph
namespace;
backups
- the logical volume for
osd-0
, so I can re-attach it and afford mistakes; /var/lib/rook
on all nodes, containing mon data;
removal
deployments/daemonsets
These were the first to go, as I didn't want the rook-operator
persistently
creating Kubernetes objects when I was actively trying to kill them.
crds
Removal of the all rook-ceph
resources, and their finalizers to protect them from being removed:
cephblockpoolradosnamespaces
cephblockpools
cephbucketnotifications
cephclients
cephclusters
cephcosidrivers
cephfilesystemmirrors
cephfilesystems
cephfilesystemsubvolumegroups
cephnfses
cephobjectrealms
cephobjectstores
cephobjectstoreusers
cephobjectzonegroups
cephobjectzones
cephrbdmirrors
/var/lib/rook
I had these backed up for later, but I didn't want them there when the cluster came online.
osd disks
I did not wipe any devices.
First, I obviously didn't want to wipe the disk with the data on it. As for the
other, now useless OSDs that I had mistakenly created over the old ones; I knew
spawning the rook-operator
would create new OSDs if they didn't belong to an
old ceph cluster.
This would make troubleshooting osd-0
more difficult, as I'd now have to consider
analysing the status reported from osd-1
and osd-2
.
provisioning
Since at this point I only cared about osd-0
and it was beneficial to have
fewer moving parts to work with, I changed the rook-ceph-cluster
mon count to 1
within the helm values.yaml
.
Following this, I simply reconciled the chart.
I noticed the rook-ceph-operator
, rook-ceph-mon-a
, rook-ceph-mgr-a
came online as expected.
Because the OSDs were part of an old cluster, I now had a ceph-cluster with no
OSDs, as shown in the rook-ceph-osd-prepare-*
jobs for each node.
osd.0: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
osd.1: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
osd.2: "cd427c63-b43f-40cb-99a4-7f58af25d624" belonging to a different ceph cluster "47f25963-57c0-4b3b-9b35-bbf68c09eec6"
surgery
With less noise and a clean slate, it was time to attempt to fix this mess.
- adopt
osd-0
to the new cluster; - remove the corrupted pgs from
osd-0
; - bring up two new OSDs for replication;
osd-0
I started trying to determine how I would safely remove the offending objects. If that happened, then the OSD would have no issues with the erasure-coding profile since the pgs wouldn't exist, and the OSD daemon should start.
-
If the placement groups contained only objects created from the RADOS Object Gateway, then I can simply remove the pgs.
-
If, however, the pgs contain both the former and block device objects then it would require careful removal of all non-rdb (block storage) objects as there would be valuable data loss by purging the entire placement groups.
Since OSD pools have a 1:N
relationship with pgs, the second scenario seemed
unlikely, perhaps impossible.
Next, I needed to inspect the OSD somehow, because the existing deployment would continously crash.
kubectl rook-ceph debug start rook-ceph-osd-0
Running this command allowed me to observe the OSD without it actually joining the cluster. The "real" OSD deployment need only be scheduled, but crashing continously was ok.
Once you execute that command, it will scale the OSD daemon down and create a new deployment that mirrors the configuration but without the daemon running in order to perform maintenance.
Now in a shell of the debug OSD container, I confirmed these belonged to the object storage pool.
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph pg ls-by-pool ceph-objectstore.rgw.buckets.data
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERS
12.0 0 0 0 0 0 0 0 0 unknown 8h
12.1 0 0 0 0 0 0 0 0 unknown 8h
12.2 0 0 0 0 0 0 0 0 unknown 8h
12.3 0 0 0 0 0 0 0 0 unknown 8h
12.4 0 0 0 0 0 0 0 0 unknown 8h
12.5 0 0 0 0 0 0 0 0 unknown 8h
12.6 0 0 0 0 0 0 0 0 unknown 8h
12.7 0 0 0 0 0 0 0 0 unknown 8h
12.8 0 0 0 0 0 0 0 0 unknown 8h
12.9 0 0 0 0 0 0 0 0 unknown 8h
12.a 0 0 0 0 0 0 0 0 unknown 8h
12.b 0 0 0 0 0 0 0 0 unknown 8h
12.c 0 0 0 0 0 0 0 0 unknown 8h
12.d 0 0 0 0 0 0 0 0 unknown 8h
12.e 0 0 0 0 0 0 0 0 unknown 8h
12.f 0 0 0 0 0 0 0 0 unknown 8h
12.10 0 0 0 0 0 0 0 0 unknown 8h
12.11 0 0 0 0 0 0 0 0 unknown 8h
12.12 0 0 0 0 0 0 0 0 unknown 8h
12.13 0 0 0 0 0 0 0 0 unknown 8h
12.14 0 0 0 0 0 0 0 0 unknown 8h
12.15 0 0 0 0 0 0 0 0 unknown 8h
12.16 0 0 0 0 0 0 0 0 unknown 8h
12.17 0 0 0 0 0 0 0 0 unknown 8h
12.18 0 0 0 0 0 0 0 0 unknown 8h
12.19 0 0 0 0 0 0 0 0 unknown 8h
12.1a 0 0 0 0 0 0 0 0 unknown 8h
12.1b 0 0 0 0 0 0 0 0 unknown 8h
Seeing this, I first checked to see how many placement groups prefixed with 12
existed using the actual path to the OSD.
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs | grep ^12
12.bs1
12.6s1
12.1fs0
12.1ds1
12.15s0
12.16s0
12.11s0
12.12s2
12.0s0
12.17s2
12.4s1
12.9s0
12.19s0
12.cs2
12.13s0
12.14s2
12.3s2
12.1as0
12.1bs2
12.as1
12.1es1
12.1cs2
12.2s2
12.8s1
12.7s2
12.ds0
12.es0
12.fs0
12.18s0
12.1s0
12.5s1
12.10s2
I still needed to be convinced I wasn't removing any valuable data. I inspected a few of them to be sure.
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 12.10s0 --op list
"12.10s0", {"oid" :"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/07/19/071984080b32e2867 f 1ac6ec2b7d2b8724bc5d75e2850b5e7 f20040ee52F55d1.2~e7rYg3S
"hash" :1340195137, "max":0, "pool":12, "namespace":"","shard_id":2, "max":0}
"12.10s0", {"oid" :"7d92708-bd9b-4d4b-bfc1-d331eb216e68.21763481.3__shadow_packages/9a/82/9a82a64c3a8439c75d8e584181427b073712afd1454747bec3dcb84bcbe19ac5. 2~urbG4nd
"hash" :4175566657, "max":0, "pool":12, "namespace":"","shard_id":2, "max":0}
"12.10s0", {"oid" :"7d927F08-bd9b-4d4b-bfc1-d331eb216e68.22197937.1__shadow Windows Security Internals.pdf.2~g9stQ9inkWvsTq33S9z5xNEHEgST2H4.1_1","key":"", "snapid":-
"shard id":2,"max":0}]
...
With this information, I now knew:
- the log exceptions matched the pgs that were impacted from the change in the erasure-coding configuration;
ceph-objectstore.rgw.buckets.data
had a relationship with those pgs where the configuration was changed;- the objects were familiar with the objects in the buckets, e.g. books;
Since I did modify the erasure-coding profile this is all starting to make sense.
Carefully, the next operation was to remove the offending placement groups. Simply removing the pool wouldn't work, as the OSD daemon not starting meant it would know nothing about this change, and still not have enough chunks to come alive.
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.17s2
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 12.17s2_head removing 12.17s2
Remove successful
[root@rook-ceph-osd-0-maintenance-686bbf69cc-5bcmj ceph]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op remove --type bluestore --force --pgid 12.0s0
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 12.0s0_head removing 12.0s0
Remove successful
I did this for every PG listed above. Once I scaled down the maintenance
deployment, I then scaled back deployment/rook-ceph-osd-0
to start the daemon
with (hopefully) agreeable placement groups and thankfully, it had come alive.
k get pods -n rook-ceph
rook-ceph-osd-0-6f57466c78-bj96p 2/2 Running
An eager run of ceph -s
produced both relief and disappointment. The OSD was up, but the pgs were in an unknown
state.
ceph -s
...
data:
pools: 12 pools, 169 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs: 100.000% pgs unknown
169 unknown
At this point, I had mentioned to my friend (helping by playing GuildWars 2)
that we might be saved. It seemed promising as we at least had osd-0
running
again now that the troublesome pgs were removed.
He agreed, and contemplated changing his character's hair colour instead of saving our data.
mons
restoring
I had /var/lib/rook
backups from each node with the old mon data. At this
point, with the correct number of placement groups and seeing 100% of them
remaining in an unknown
state, it seemed the next step was to restore the
mons.
I knew from reading the rook-ceph
docs that if you want to restore the data
of a monitor to a new cluster, you have to inject the monmap
into the old
mons store.db
.
Before doing this, I scaled deployment/rook-ceph-mon-a
down to 0
first.
Then, navigating to a directory on my local machine with the backups I ran a
container to modify the monmap
on my local fs.
docker run -it --rm --entrypoint bash -v .:/var/lib/rook rook/ceph:v1.14.9
touch /etc/ceph/ceph.conf
cd /var/lib/rook
ceph-mon --extract-monmap monmap --mon-data ./mon-q/data
monmaptool --rm q
monmaptool --rm s
monmaptool --rm t
Now the old mons q
, s
and t
were removed from the map, I had to add the
new cluster mon rook-ceph-mon-a
created following the new
ceph-cluster.
monmaptool --addv a '[v2:10.50.1.10:3300,v1:10.50.1.10:6789]`
ceph-mon --inject-monmap monmap --mon-data ./mon-q/data
exit
Shoving it back up to the node rook-ceph-mon-a
lives on:
scp -r ./mon-q/data/* 3bb@10.50.1.10:/var/lib/rook/mon-a/data/
Rescheduling the deployment and although the mon log output isn't giving me
suggestions of suicide, all our pgs still remain in an unknown
state.
recovering the mon store
It turns out that you can actually recover the mon store. It's not a huge deal so long as your OSDs have data integrity.
Scaling the useless mon-a
down, I copied the existing mon-a
data onto the
rook-ceph-osd-0
daemon container.
Another osd-0
debug container... k rook-ceph debug start rook-ceph-osd-0
I rebuilt the mon data, using the existing RocksDB kv store.
This would have worked without the backup, but I was interested to see the
osdmaps
trimmed due to the other 2 removed OSDs.
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-0/ --op update-mon-db --mon-store-path /tmp/mon-a/data/
osd.0 : 3099 osdmaps trimmed, 635 osdmaps added.
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-authtool /tmp/mon-a/keyring -n mon. --cap mon 'allow *' --gen-key
[root@he-prod-k3s-controlplane-ch-a-1 ceph]# ceph-monstore-tool /tmp/mon-a/data rebuild -- --keyring /tmp/mon-a/keyring
4 rocksdb: [db/flush_job.cc:967] [default] [JOB 3] Level-0 flush table #3433: 62997231 bytes OK
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1722831731454649, "job": 3, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [2, 0, 0, 0, 0, 0, 2], "immutable_memtables": 1}
4 rocksdb: [file/delete_scheduler.cc:74] Deleted file /tmp/mon-a/data/store.db/003433.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1723067397472153, "job": 4, "event": "table_file_deletion", "file_number": 3433}
4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
After copying the now rebuilt mon-a
store back, and bringing everything up
again, the cluster was finally resurrecting.
It took some time for the rebalancing and replication to occur, but hours
later, ceph -s
reported a healthy cluster and services resumed being entirely
unaware of the chaos that had ensued over the previous few days:
cluster:
id: 47f25963-57c0-4b3b-9b35-bbf68c09eec6
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 3h)
mgr: b(active, since 8h), standbys: a
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 3h), 3 in (since 3h)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 169 pgs
objects: 4.88k objects, 16 GiB
usage: 47 GiB used, 1.1 TiB / 1.2 TiB avail
pgs: 169 active+clean
io:
client: 639 B/s rd, 9.0 KiB/s wr, 1 op/s rd, 1 op/s wr
recovery: 834 KiB/s, 6 objects/s
It seemed like a miracle, but it is entirely credited to how resilient ceph is built to tolerate that level of abuse.
why
Data appears to be lost
- ceph OSD daemons fail to start;
- the OSDs could not reconstruct the data from chunks;
- the
osdmap
referenced a faulty erasure-coding profile; - the monstore
osdmap
still had reference to the above erasure-coding profile; - the erasure-coding profile was changed to a topology impossible to satisfy under the current architecture;
- 2 disks were zapped, hitting the ceiling of the failure domain for
ceph-objectstore.rgw.buckets.data_ecprofile
;
The monitor osdmap
still contained the bad EC profile.
ceph-monstore-tool /tmp/mon-bak get osdmap > osdmap.bad
osdmaptool --dump json osdmap.bad | grep -i profile
"erasure_code_profiles":{
"ceph-objectstore.rgw.buckets.data_ecprofile":{
"crush-device-class":"",
"crush-failure-domain":"host",
"crush-root":"default",
"jerasure-per-chunk-alignment":"false",
"k":"3",
"m":"2"
}
}
After rebuilding the monstore...
ceph-monstore-tool /tmp/mon-a get osdmap > osdmap.good
"erasure_code_profiles":{
"ceph-objectstore.rgw.buckets.data_ecprofile":{
"crush-device-class":"",
"crush-failure-domain":"host",
"crush-root":"default",
"jerasure-per-chunk-alignment":"false",
"k":"3",
"m":"1"
}
}
Therefore, it seems as if I could have attempted to rebuild the monstore first,
possibly circumventing the ECAssert
errors. The placement groups on osd-0
were
still mapping to 3 OSDs, not 5.
[root@ad9e4c6e7343 rook]# osdmaptool --test-map-pgs-dump --pool 12 osdmap
osdmaptool: osdmap file 'osdmap'
pool 12 pg_num 32
12.0 [2147483647,2147483647,2147483647] -1
12.1 [2147483647,2147483647,2147483647] -1
12.2 [2147483647,2147483647,2147483647] -1
12.3 [2147483647,2147483647,2147483647] -1
12.4 [2147483647,2147483647,2147483647] -1
12.5 [2147483647,2147483647,2147483647] -1
12.6 [2147483647,2147483647,2147483647] -1
12.7 [2147483647,2147483647,2147483647] -1
12.8 [2147483647,2147483647,2147483647] -1
12.9 [2147483647,2147483647,2147483647] -1
12.a [2147483647,2147483647,2147483647] -1
12.b [2147483647,2147483647,2147483647] -1
12.c [2147483647,2147483647,2147483647] -1
12.d [2147483647,2147483647,2147483647] -1
12.e [2147483647,2147483647,2147483647] -1
12.f [2147483647,2147483647,2147483647] -1
12.10 [2147483647,2147483647,2147483647] -1
12.11 [2147483647,2147483647,2147483647] -1
12.12 [2147483647,2147483647,2147483647] -1
12.13 [2147483647,2147483647,2147483647] -1
12.14 [2147483647,2147483647,2147483647] -1
12.15 [2147483647,2147483647,2147483647] -1
12.16 [2147483647,2147483647,2147483647] -1
12.17 [2147483647,2147483647,2147483647] -1
12.18 [2147483647,2147483647,2147483647] -1
12.19 [2147483647,2147483647,2147483647] -1
12.1a [2147483647,2147483647,2147483647] -1
12.1b [2147483647,2147483647,2147483647] -1
12.1c [2147483647,2147483647,2147483647] -1
12.1d [2147483647,2147483647,2147483647] -1
12.1e [2147483647,2147483647,2147483647] -1
12.1f [2147483647,2147483647,2147483647] -1
#osd count first primary c wt wt
osd.0 0 0 0 0.488297 1
osd.1 0 0 0 0.488297 1
osd.2 0 0 0 0.195297 1
in 3
avg 0 stddev 0 (-nanx) (expected 0 -nanx))
size 3 32
Since the cluster did not have enough OSDs (wanted 5 with k=3,m=2
), the rule
can be tested against the old crush map, with --num-rep
representing the
required OSDs, i.e. k+m
:
With the original erasure-coding profile (k+m=3
), everything looks good -- no
bad mappings.
[root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 3 --show-bad-mappings
// healthy
With k+m=5
, though -- or anything great than 3
OSDs...
[root@ad9e4c6e7343 rook]# crushtool -i crush --test --num-rep 5 --show-bad-mappings
...
bad mapping rule 20 x 1002 num_rep 5 result [0,2147483647,1,2,2147483647]
bad mapping rule 20 x 1003 num_rep 5 result [0,2147483647,2,1,2147483647]
bad mapping rule 20 x 1004 num_rep 5 result [1,0,2147483647,2,2147483647]
bad mapping rule 20 x 1005 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1006 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1007 num_rep 5 result [0,1,2147483647,2147483647,2]
bad mapping rule 20 x 1008 num_rep 5 result [1,2,0,2147483647,2147483647]
bad mapping rule 20 x 1009 num_rep 5 result [2,1,0,2147483647,2147483647]
bad mapping rule 20 x 1010 num_rep 5 result [0,1,2,2147483647,2147483647]
bad mapping rule 20 x 1011 num_rep 5 result [0,2147483647,2,1,2147483647]
bad mapping rule 20 x 1012 num_rep 5 result [0,1,2147483647,2,2147483647]
bad mapping rule 20 x 1013 num_rep 5 result [0,2,2147483647,1,2147483647]
bad mapping rule 20 x 1014 num_rep 5 result [1,0,2147483647,2,2147483647]
bad mapping rule 20 x 1015 num_rep 5 result [2,0,2147483647,1,2147483647]
bad mapping rule 20 x 1016 num_rep 5 result [1,0,2,2147483647,2147483647]
bad mapping rule 20 x 1017 num_rep 5 result [2,0,1,2147483647,2147483647]
bad mapping rule 20 x 1018 num_rep 5 result [1,0,2147483647,2147483647,2]
bad mapping rule 20 x 1019 num_rep 5 result [0,1,2,2147483647,2147483647]
bad mapping rule 20 x 1020 num_rep 5 result [0,2147483647,1,2147483647,2]
bad mapping rule 20 x 1021 num_rep 5 result [2,1,0,2147483647,2147483647]
bad mapping rule 20 x 1022 num_rep 5 result [1,0,2,2147483647,2147483647]
bad mapping rule 20 x 1023 num_rep 5 result [0,2,1,2147483647,2147483647]
Mappings were found on 3 OSDs, but missing the 4th and 5th reference as indicated by the largest 32-bit int (i.e. missing). The object storage data would still have been lost, but it could have made the recovery of the cluster significantly less painful.