Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-8619 crash_tracker: print recorded crash reports #24883

Open
wants to merge 10 commits into
base: dev
Choose a base branch
from

Conversation

pgellert
Copy link
Contributor

Note: this PR stacks on #24854. Please only review the last 4 commits because the first 6 are going to be rebased away once the base PR merges.

The PR implements reading recorded crash reports and extending the crash loop limit reached log message with a description of the recorded crashes.

To limit the length of the output, up to the 5 earliest and 5 latest recorded crashes are printed only. A unit test is added to test this logic and the output format.

A ducktape test is added to verify that information about recorded startup exceptions is correctly printed to the logs when the crash loop limit is reached. Through this we test that both:

  • Startup crash information is correctly recorded
  • The recorded crash information is correctly printed

Fixes: https://redpandadata.atlassian.net/browse/CORE-8619

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

Defines it under `${datadir}/crash_reports` to store crash reports.

It needs to be special-cased in the `compute_storage.py` script to
ensure it isn't mistaken for a log file directory.
Define a custom exception for `crash_loop_limit_reached` to allow
pattern matching for it in a follow up commit.

We specifically do not want to record `crash_loop_limit_reached` errors
as crashes because they are generally not informative crashes. Recording
them as real crashes would build up garbage on disk, which would lead to
real crash logs expiring on disk earlier.
Introduces the core stateful writer which is used to write out
`crash_description` serde objects to crash report files. It is able to
write out the crash reports in an async-signal-safe way by:
 * using only async signal safe syscalls
 * pre-allocating all the memory necessary to construct, serialize and
   write our `crash_description` objects to disk
 * using only non-allocating methods

The `prepared_writer` is hooked up to the `recorder` in a later commit
of this PR.

Ref: https://man7.org/linux/man-pages/man7/signal-safety.7.html
Implement the logic to choose a unique crash log file name, and hook up
the code for preparing the crash recording code and the crash file
cleanup code on clean shutdown.
Implement the recording of startup exceptions into crash files.
Add some simple ducktape tests to assert that crash reports are being
generater (or not generated) as expected.

The contents of the crash files are going to be validated as a follow up
once the contents are actually printed to logs.
The limiter uses node configs, so this should be included in the list of
dependencies.

This was missed from cmake only. It is already present in bazel.
Implement the logic of formatting a list of crashes to a string. This is
going to be printed to the logs when the crash loop limit is reached.

See the added unit tests for what the output is going to look like.
... when the crash loop limit is reached.
Verify that information about recorded startup exceptions is correctly
printed to the logs when the crash loop limit is reached. Through this
we test that both:
 * Startup crash information is correctly recorded
 * The recorded crash information is correctly printed
@pgellert pgellert requested a review from a team January 22, 2025 11:15
@pgellert pgellert self-assigned this Jan 22, 2025
@pgellert pgellert requested review from oleiman and removed request for a team January 22, 2025 11:15
@pgellert pgellert requested a review from a team as a code owner January 22, 2025 11:15
@pgellert pgellert requested review from a team and IoannisRP and removed request for oleiman and a team January 22, 2025 11:16
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#61040
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61040#01948e05-41a8-41be-9919-fb59395d528a FLAKY 1/2

Copy link
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of questions/nits

Comment on lines +19 to +23
const auto opt_stacktrace = cd.stacktrace.c_str();
const auto has_stacktrace = strlen(opt_stacktrace) > 0;
if (has_stacktrace) {
fmt::print(os, " Backtrace: {}.", opt_stacktrace);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use cd.stacktrace.size()?

Comment on lines +25 to +29
const auto opt_add_info = cd.addition_info.c_str();
const auto has_add_info = strlen(opt_add_info) > 0;
if (has_add_info) {
fmt::print(os, " {}", opt_add_info);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question

Comment on lines +149 to +167
for (const auto& entry :
std::filesystem::directory_iterator(crash_report_dir)) {
if (!entry.path().string().ends_with(crash_report_suffix)) {
// Filter only for crash files
continue;
}

auto buf = co_await read_fully(entry.path());
try {
auto crash_desc = serde::from_iobuf<crash_description>(
std::move(buf));
result.emplace_back(std::move(crash_desc));
} catch (const serde::serde_exception&) {
vlog(
ctlog.warn,
"Ignoring malformed crash report file {}",
entry.path());
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Can this possibly return an unsorted list of reports? Or does directory_iterator guarantee some sort of ordering?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants