# Ceph and Deep-Scrubs

## Introduction

I manage a 192TB Ceph cluster consisting mostly of spinning rust. It's a weird cluster with lots of PGs, as it stores millions and millions of tiny files in radosgw. To ensure all the PGs can be recovered in under a day, the pool's PG count is way higher than traditional recommendations - but that also means deep scrubbing is becoming a bit problematic (lots of PGs per OSD, and spinning rust is slow).

So here's how to correctly configure the deep scrub interval... (it's not documented anywhere)

## How Deep Scrubbing Works

So, Ceph has two scrubbing operations. A normal scrub and a **deep**-scrub.

Normal scrubs are quick since they only check the consistency of a given PG's metadata (not the PG's correctness). From my mailing list travels, Ceph developers have not observed any performance impact from these quick scrubs - so it's easy to ignore them (but don't).

Then there's deep scrubbing, which is designed to catch bit-rot (aka, the correctness of the PGs). It's naturally a data-intensive operation. On my SAS drives, I see Ceph deep-scrubbing at about 20MB/s. For a 50GB PG, that's 42 minutes of deep scrubbing - in my case, eating 40% of my IOPS. It's not inconceivable that every OSD in the cluster could be deep-scrubbing something at the same time.
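The 42-minute figure is simple arithmetic, sketched below so you can plug in your own numbers. The function name `deep_scrub_minutes` is made up for this example, and it assumes decimal units (1 GB = 1000 MB) and a steady scrub rate, which real drives won't give you.

```python
def deep_scrub_minutes(pg_size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to read one PG end to end at a steady scrub rate."""
    pg_size_mb = pg_size_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    return pg_size_mb / rate_mb_per_s / 60


# The numbers from this post: a 50GB PG scrubbed at about 20MB/s.
minutes = deep_scrub_minutes(pg_size_gb=50, rate_mb_per_s=20)
print(f"{minutes:.1f} minutes per deep scrub")
```

Multiply that by the number of PGs per OSD and you get a rough feel for how much of the week an OSD spends deep-scrubbing.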

Thankfully, Ceph has several knobs we can tweak to reduce this impact:

`osd_deep_scrub_interval`

Default: `604800` (7 days)

This one seems like an obvious first thing to look at, and you would be right. But it's a little nuanced.

The value is the number of seconds before a deep-scrub is mandated (bypassing load restrictions). That said, deep-scrubs will **still** happen before this interval - see `osd_deep_scrub_randomize_ratio`.

`osd_scrub_min_interval`

Default: `86400` (1 day)

The *normal* interval for normal scrubs and a percentage of deep-scrubs (see `osd_deep_scrub_randomize_ratio`).

To me, when I see a min/max interval, I think something is going to be randomly executed between those two temporal lines. No, that's not how Ceph uses this option.

Ceph delays normal scrubs (never deep-scrubs) if the load of the node is too high (defaults to 0.5 load, normalized for CPU count), but normally schedules normal scrubs at `random(osd_scrub_min_interval, osd_scrub_min_interval + (osd_scrub_min_interval * osd_scrub_interval_randomize_ratio))`. This means that with the default `osd_scrub_min_interval` value of 1 day, normal scrubs will be spread between 1 day and 1.5 days.
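Here's the scheduling window from the formula above as a small Python sketch. It only mirrors the expression as written in this post, it is not Ceph's actual scheduler code, and the function names `scrub_window` and `next_scrub_delay` are made up for illustration.

```python
import random


def scrub_window(min_interval: float, randomize_ratio: float) -> tuple[float, float]:
    """Earliest and latest delay (in seconds) before the next normal scrub."""
    return min_interval, min_interval + min_interval * randomize_ratio


def next_scrub_delay(min_interval: float, randomize_ratio: float,
                     rng: random.Random) -> float:
    """Pick a delay uniformly inside the window, like random(low, high)."""
    low, high = scrub_window(min_interval, randomize_ratio)
    return rng.uniform(low, high)


# Defaults from this post: osd_scrub_min_interval=86400, randomize ratio=0.5.
low, high = scrub_window(86400, 0.5)
print(f"window: {low / 86400} to {high / 86400} days")
```

Note that `osd_scrub_max_interval` never appears in the calculation, which is the whole point of this section.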

`osd_scrub_max_interval`

Default: `604800` (7 days)

The *mandatory* interval for normal scrubs. See `osd_scrub_min_interval`.

`osd_scrub_interval_randomize_ratio`

Default: `0.5` (50%)

A ratio used to spread out scrubs between `osd_scrub_min_interval` and `osd_scrub_min_interval * (1 + osd_scrub_interval_randomize_ratio)`. The option `osd_scrub_max_interval` has no impact on this window of possible times.

`osd_deep_scrub_randomize_ratio`

Default: `0.15` (15%)

Completely unrelated to `osd_scrub_interval_randomize_ratio` and not documented in the official docs, but it's been around since 2015 (Hammer). I found it in this PR:

> Add the option `osd_deep_scrub_randomize_ratio` which defines the rate at which scrubs will randomly turn into deep scrubs.

You might ask why you would want normal, quick scrubs that normally happen once a week, to magically become deep-scrubs. I definitely did...

It makes sense if you consider what `osd_deep_scrub_interval` means: after *this* value, *mandate* a deep-scrub. Without some randomization, you'd end up in a position where all your deep-scrubs run at the same time. You'll also notice that there's no version of `osd_scrub_interval_randomize_ratio` for deep-scrubs. So, at the end of the day, Ceph is using the existing plumbing of the normal scrub to prevent the thundering herd problem.

The default value of 15% will turn 15% of normal scrubs into deep-scrubs. Meaning, with the default `osd_scrub_min_interval`, 15% of the cluster's deep-scrubs will execute every 1.25 days on average. The math gets a little complicated after that (at least for someone who's mostly forgotten statistics), but I *think* the work is generally spread out in my experience.
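If the statistics are fuzzy, a quick simulation can stand in for the math. This is a toy model only. It assumes each PG gets a normal scrub somewhere in the randomized window, each of those has a `deep_ratio` chance of being promoted to a deep scrub, and the first scrub past `osd_deep_scrub_interval` is always deep. It ignores load-based delays and per-OSD scrub limits entirely, and `simulate_deep_scrubs` is a made-up name.

```python
import random

DAY = 86400


def simulate_deep_scrubs(pgs: int, seed: int = 0,
                         min_interval: float = 1 * DAY,
                         interval_ratio: float = 0.5,
                         deep_ratio: float = 0.15,
                         deep_interval: float = 7 * DAY) -> tuple[float, float]:
    """Return (mean days until the first deep scrub, fraction forced by the deadline)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    total_days = 0.0
    forced = 0
    for _ in range(pgs):
        elapsed = 0.0
        while True:
            # The next normal scrub lands somewhere in the randomized window.
            elapsed += rng.uniform(min_interval, min_interval * (1 + interval_ratio))
            if elapsed >= deep_interval:
                forced += 1  # deadline passed: this scrub has to be deep
                break
            if rng.random() < deep_ratio:
                break  # a normal scrub was randomly promoted to a deep scrub
        total_days += elapsed / DAY
    return total_days / pgs, forced / pgs


mean_days, forced_fraction = simulate_deep_scrubs(pgs=20000)
print(f"mean days to deep scrub: {mean_days:.2f}, forced by deadline: {forced_fraction:.0%}")
```

Playing with `deep_ratio` and `min_interval` here shows how much of the deep-scrub work gets spread out ahead of the deadline versus piling up at it.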

So that means `osd_deep_scrub_interval` **and** `osd_scrub_min_interval` are both important for deep-scrubbing.

## How to Set a Deep-Scrub Interval

Note that this setting goes in the `global` namespace, not the `osd` namespace. This is important because the monitors emit the PG_NOT_DEEP_SCRUBBED warning based on this OSD setting, so it needs to match between the `osd` and `mon` namespaces, or just use `global`.
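As a sketch of what setting the interval in `global` might look like, the snippet below builds (without running) a `ceph config set` command line. The 14-day value and the helper name `ceph_config_set` are illustrative assumptions, not recommendations from this post. Pick an interval that your drives can actually keep up with.

```python
import shlex

DAY = 86400


def ceph_config_set(who: str, option: str, value: int) -> str:
    """Build (not run) a `ceph config set <who> <option> <value>` command line."""
    return shlex.join(["ceph", "config", "set", who, option, str(value)])


# Hypothetical target: deep-scrub every PG at least once every 14 days.
print(ceph_config_set("global", "osd_deep_scrub_interval", 14 * DAY))
```

Using `global` as the target means the OSDs and the monitors read the same value, so the warning threshold and the actual scrub schedule stay in sync.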

I've also seen docs/been told I need to restart the monitors and OSDs after changing these settings - that doesn't seem correct these days with the central Ceph configuration store.