-
Notifications
You must be signed in to change notification settings - Fork 602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CORE-8686 crash_tracker: remove old crash files #24972
CORE-8686 crash_tracker: remove old crash files #24972
Conversation
CI test resultstest results on build#61347
test results on build#61447
test results on build#61516
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm just a couple questions/comments
src/v/crash_tracker/recorder.cc
Outdated
continue; | ||
} | ||
|
||
crash_files.emplace_back(entry.path(), entry.last_write_time()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we trust the timestamp on the file? Or should we read the file and check its timestamp? I'm wondering if you could just reuse record::get_recorded_crash()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have refactored this now to reuse get_recorded_crashes()
. To be able to handle malformed crash files, I am using sorting the crash files by recorded_crash::timestamp()
which uses the timestamp in the file if the file is valid and falls back to the last_write_time
if the file could not be parsed.
ss::future<> recorder::start() { | ||
co_await ensure_crashdir_exists(); | ||
co_await remove_old_crashfiles(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to make a comment that maybe we should clean up on a certain frequency rather than whenever RP reboots... but these files only get created when RP crashes so yeah this is fine :)
4c98bbb
to
97aaece
Compare
force-push: address code review feedback (reuse record::get_recorded_crash().) |
Extract some functions before adding another to `start()`. Pure refactor, no behaviour change.
The broker will retain the 50 most recent crash files and delete all older crashes on startup. This is to ensure that crash report files do not fill up the disk. Note that we remove files before creating them to ensure cleanup succeeds before we create new files. This means that technically we will have up to 51 crash reports on disk (50 kept + 1 created). This is acceptable since 50 is a loose limit.
Implement a simple test to verify that crash reports are getting removed on `start()` when their number exceeds `crash_files_to_keep`.
97aaece
to
e7d5780
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized my request for the value_or
was wrong. Wasn't fully awake/processing when I suggested that.
The broker will retain the 50 most recent crash files and delete all older crashes on startup. This is to ensure that crash report files do not fill up the disk.
Note that we remove files before creating them to ensure cleanup succeeds before we create new files. This means that technically we will have up to 51 crash reports on disk (50 kept + 1 created). This is acceptable since 50 is a loose limit.
A fixture test is added to verify this behaviour.
Fixes https://redpandadata.atlassian.net/browse/CORE-8686
Backports Required
Release Notes