-
Notifications
You must be signed in to change notification settings - Fork 599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CORE-8619 crash_tracker: print recorded crash reports #24883
base: dev
Are you sure you want to change the base?
CORE-8619 crash_tracker: print recorded crash reports #24883
Conversation
Defines it under `${datadir}/crash_reports` to store crash reports. It needs to be special-cased in the `compute_storage.py` script to ensure it isn't mistaken for a log file directory.
Define a custom exception for `crash_loop_limit_reached` to allow pattern matching for it in a follow up commit. We specifically do not want to record `crash_loop_limit_reached` errors as crashes because they are generally not informative crashes. Recording them as real crashes would build up garbage on disk, which would lead to real crash logs expiring on disk earlier.
Introduces the core stateful writer which is used to write out `crash_description` serde objects to crash report files. It is able to write out the crash reports in an async-signal-safe way by: * using only async signal safe syscalls * pre-allocating all the memory necessary to construct, serialize and write our `crash_description` objects to disk * using only non-allocating methods The `prepared_writer` is hooked up to the `recorder` in a later commit of this PR. Ref: https://man7.org/linux/man-pages/man7/signal-safety.7.html
Implement the logic to choose a unique crash log file name, and hook up the code for preparing the crash recording code and the crash file cleanup code on clean shutdown.
Implement the recording of startup exceptions into crash files.
Add some simple ducktape tests to assert that crash reports are being generater (or not generated) as expected. The contents of the crash files are going to be validated as a follow up once the contents are actually printed to logs.
The limiter uses node configs, so this should be included in the list of dependencies. This was missed from cmake only. It is already present in bazel.
Implement the logic of formatting a list of crashes to a string. This is going to be printed to the logs when the crash loop limit is reached. See the added unit tests for what the output is going to look like.
... when the crash loop limit is reached.
Verify that information about recorded startup exceptions is correctly printed to the logs when the crash loop limit is reached. Through this we test that both: * Startup crash information is correctly recorded * The recorded crash information is correctly printed
CI test resultstest results on build#61040
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of questions/nits
const auto opt_stacktrace = cd.stacktrace.c_str(); | ||
const auto has_stacktrace = strlen(opt_stacktrace) > 0; | ||
if (has_stacktrace) { | ||
fmt::print(os, " Backtrace: {}.", opt_stacktrace); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use cd.stacktrace.size()
?
const auto opt_add_info = cd.addition_info.c_str(); | ||
const auto has_add_info = strlen(opt_add_info) > 0; | ||
if (has_add_info) { | ||
fmt::print(os, " {}", opt_add_info); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question
for (const auto& entry : | ||
std::filesystem::directory_iterator(crash_report_dir)) { | ||
if (!entry.path().string().ends_with(crash_report_suffix)) { | ||
// Filter only for crash files | ||
continue; | ||
} | ||
|
||
auto buf = co_await read_fully(entry.path()); | ||
try { | ||
auto crash_desc = serde::from_iobuf<crash_description>( | ||
std::move(buf)); | ||
result.emplace_back(std::move(crash_desc)); | ||
} catch (const serde::serde_exception&) { | ||
vlog( | ||
ctlog.warn, | ||
"Ignoring malformed crash report file {}", | ||
entry.path()); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Can this possibly return an unsorted list of reports? Or does directory_iterator
guarantee some sort of ordering?
Note: this PR stacks on #24854. Please only review the last 4 commits because the first 6 are going to be rebased away once the base PR merges.
The PR implements reading recorded crash reports and extending the crash loop limit reached log message with a description of the recorded crashes.
To limit the length of the output, up to the 5 earliest and 5 latest recorded crashes are printed only. A unit test is added to test this logic and the output format.
A ducktape test is added to verify that information about recorded startup exceptions is correctly printed to the logs when the crash loop limit is reached. Through this we test that both:
Fixes: https://redpandadata.atlassian.net/browse/CORE-8619
Backports Required
Release Notes