Add Compactor.MaxLookback Option for Limiting Blocks Loaded in Compaction Cycles #10585
Conversation
I'm thinking about this new config setting, ops-wise. The primary concern we've talked about with this new setting is around block upload and/or backfills: if we set max-lookback (a server-level option) to X days, then any tenant-level block upload with a block ID over X days old will go unnoticed. (My cursory scan of the block upload code says that it will honor whatever block ID is given on import, rather than creating new $now ULIDs.)
Two specific bad things I'm thinking about are:
- footguns around forgetting to adjust max-lookback in preparation for uploads
- operating a large multitenant Mimir cluster where one tenant needs to upload data from 1 year ago. This would force us to set max-lookback to 1y for all tenants and probably blow up compactor resources.
What do you think of a slightly different impl: using IterWithAttributes to learn each block's mtime and using that as the filtering device instead. Then (at least, from my 30,000-foot view) a tenant can perform a block upload and the server doesn't have to be specially reconfigured to match the data.
RE my 2nd point: if the attributes thing doesn't work out, I think we should consider doing something like one of these:
Agreed. It doesn't make sense to have this as a server-level option. I did some research and it seems it could be configured per-tenant if the option was defined in limits.go with an override (ref: docs). As you mentioned in the second comment, we'd probably need some additional work on limits.
I believe both
```go
require.Contains(t, actualMetas, olderULID)
})

t.Run("should return no block metas when fetcher lookback is set short", func(t *testing.T) {
```
Since ulid uses a funky and specific timestamp format, how about a test case that verifies that we're playing nicely with it? Something like three blocks at -3G, -2G, -1G, where G is near the timestamp granularity, and verify that a max-lookback of -1.5G does the right thing.
ulid is precise down to 1ms (see: docs). It's possible that we will fail to filter out blocks that are less than 1ms older than the threshold (e.g. 168h0m0.0009s), but we would never filter out blocks that are just barely within the threshold, because of ulid's precision. Small illustrative example below.
```go
package main

import (
	"fmt"
	"time"

	"github.com/oklog/ulid/v2"
)

func main() {
	ns := time.Nanosecond
	tdel, olddel, newdel := -1000000*ns, -1500000*ns, -500000*ns // -1ms, -1.5ms, -0.5ms
	now := time.Now()
	// nil entropy is fine here; we only care about the timestamp bits.
	minAllowed, _ := ulid.New(ulid.Timestamp(now.Add(tdel)), nil)  // threshold: 1ms ago
	justOlder, _ := ulid.New(ulid.Timestamp(now.Add(olddel)), nil) // 1.5ms old
	justNewer, _ := ulid.New(ulid.Timestamp(now.Add(newdel)), nil) // 0.5ms old
	// can be (-1, 0) OR (0, 1); fetcher skips if `id.Compare(threshold)` == -1
	fmt.Println(justOlder.Compare(minAllowed), justNewer.Compare(minAllowed))
}
```
I can't see an easy way to test this in fetcher_test.go without getting very flaky results. I just added a test that checks that we will skip a block that's 1s beyond the threshold.
Sounds good! Thanks.
LGTM!
What this PR does

As described in #7664, a Mimir block's `meta.json` file can become large if its sources list grows due to intermediate blocks from split groups, compactor merge shards, or re-compactions due to OOO blocks. When a bucket contains many blocks with many sources each, this may cause unexpectedly high resource usage on the compactor.

This PR adds an experimental option, `-compactor.max-lookback`, which can reduce the CPU time and memory spent reading blocks' metadata files by instructing the compactor to read metadata only for the blocks that have been uploaded within the `max-lookback` period. If set, this value should be well above the maximum of `-compactor.block-ranges` (e.g. `168h`) so as not to interfere with the regular compaction process.

This PR also adds a metric, `cortex_compactor_meta_blocks_synced`, which describes the number of blocks discovered in object storage and labels them with their `state`, i.e. whether they were loaded, marked for deletion, skipped because they were outside the lookback period, etc.

Which issue(s) this PR fixes or relates to
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.