CORE-8619 crash_tracker: print recorded crash reports #24883

pgellert · 2025-01-22T11:15:55Z

Note: this PR stacks on #24854. Please only review the last 4 commits because the first 6 are going to be rebased away once the base PR merges.

This PR implements reading the recorded crash reports and extends the "crash loop limit reached" log message with a description of the recorded crashes.

To limit the length of the output, at most the 5 earliest and 5 latest recorded crashes are printed. A unit test is added to cover this logic and the output format.
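A minimal sketch of the truncation described above, for illustration only. The name `describe_crashes`, the `limit` parameter, and the line format are hypothetical and are not the PR's actual code or output; the input is assumed to be already sorted from earliest to latest.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not taken from the PR. Formats crash descriptions
// that are assumed to be sorted from earliest to latest, keeping at most
// `limit` entries from each end of the list.
std::string describe_crashes(
  const std::vector<std::string>& crashes, std::size_t limit = 5) {
    std::string out;
    const std::size_t n = crashes.size();
    const bool truncate = n > 2 * limit;
    for (std::size_t i = 0; i < n; ++i) {
        if (truncate && i >= limit && i < n - limit) {
            // Middle of the list: emit a single marker, then skip.
            if (i == limit) {
                out += "... (" + std::to_string(n - 2 * limit)
                       + " omitted)\n";
            }
            continue;
        }
        out += "Crash #" + std::to_string(i + 1) + ": " + crashes[i] + "\n";
    }
    return out;
}
```

With 12 recorded crashes and the default limit, this prints entries 1 to 5, one marker line for the 2 omitted entries, and entries 8 to 12.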

A ducktape test is added to verify that information about recorded startup exceptions is correctly printed to the logs when the crash loop limit is reached. Through this we test that both:

* Startup crash information is correctly recorded
* The recorded crash information is correctly printed

Fixes: https://redpandadata.atlassian.net/browse/CORE-8619

Backports Required

Release Notes

none


Define a custom exception for `crash_loop_limit_reached` so that a follow-up commit can pattern match on it. We specifically do not want to record `crash_loop_limit_reached` errors as crashes because they are generally not informative. Recording them as real crashes would build up garbage on disk, which would cause real crash logs to expire from disk earlier.
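A minimal sketch of how a dedicated exception type lets the recording path skip this case. The class body, its message, and the helper `should_record_crash` are assumptions made for illustration; the PR's actual definitions may differ.

```cpp
#include <exception>
#include <stdexcept>

// Hypothetical shape of the exception; only the name comes from the PR.
class crash_loop_limit_reached : public std::runtime_error {
public:
    crash_loop_limit_reached()
      : std::runtime_error("Crash loop detected") {}
};

// Hypothetical helper: decides whether a startup failure should be written
// to a crash report. Crash-loop-limit errors are filtered out by type.
bool should_record_crash(std::exception_ptr eptr) {
    if (!eptr) {
        return false; // nothing was thrown
    }
    try {
        std::rethrow_exception(eptr);
    } catch (const crash_loop_limit_reached&) {
        return false; // not an informative crash, do not record
    } catch (...) {
        return true; // any other startup exception is recorded
    }
}
```

Matching on the type rather than on the message text keeps the filter robust if the wording of the log message changes.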

Introduces the core stateful writer which is used to write out `crash_description` serde objects to crash report files. It is able to write out the crash reports in an async-signal-safe way by:

* using only async-signal-safe syscalls
* pre-allocating all the memory necessary to construct, serialize and write our `crash_description` objects to disk
* using only non-allocating methods

The `prepared_writer` is hooked up to the `recorder` in a later commit of this PR.

Ref: https://man7.org/linux/man-pages/man7/signal-safety.7.html
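A rough sketch of the approach outlined above, under stated assumptions: `prepared_writer_sketch`, its method names, and the 4096-byte buffer are hypothetical, and a raw byte copy stands in for serde serialization. It is not the PR's `prepared_writer`. The point is only that everything that may allocate happens up front, while the recording path touches a fixed buffer and calls `write(2)` and `fsync(2)`, both of which appear on the async-signal-safe list in the referenced man page.

```cpp
#include <array>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical stand-in for the writer described above.
class prepared_writer_sketch {
public:
    prepared_writer_sketch() = default;
    prepared_writer_sketch(const prepared_writer_sketch&) = delete;
    prepared_writer_sketch& operator=(const prepared_writer_sketch&) = delete;
    ~prepared_writer_sketch() {
        if (_fd >= 0) {
            ::close(_fd);
        }
    }

    // Called during normal startup, outside of any signal handler:
    // opens the file so that no path handling is needed later.
    bool prepare(const char* path) {
        _fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        return _fd >= 0;
    }

    // Intended to be callable from a signal handler: no allocation,
    // only memcpy into the pre-allocated buffer, write(2) and fsync(2).
    bool record(const char* msg, std::size_t len) noexcept {
        if (_fd < 0) {
            return false;
        }
        if (len > _buf.size()) {
            len = _buf.size(); // truncate instead of allocating
        }
        std::memcpy(_buf.data(), msg, len);
        std::size_t written = 0;
        while (written < len) {
            const ssize_t r = ::write(
              _fd, _buf.data() + written, len - written);
            if (r <= 0) {
                return false; // sketch: treat any short-circuit as failure
            }
            written += static_cast<std::size_t>(r);
        }
        return ::fsync(_fd) == 0;
    }

private:
    int _fd{-1};
    std::array<char, 4096> _buf{}; // memory reserved ahead of time
};
```

Truncating oversized input is one possible design choice here; it trades completeness of the report for the guarantee that the recording path never needs more memory than was reserved.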


Verify that information about recorded startup exceptions is correctly printed to the logs when the crash loop limit is reached. Through this we test that both:

* Startup crash information is correctly recorded
* The recorded crash information is correctly printed

vbotbuildovich · 2025-01-22T15:03:01Z

CI test results

test results on build#61040

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61040#01948e05-41a8-41be-9919-fb59395d528a	FLAKY	1/2

michael-redpanda

couple of questions/nits

michael-redpanda · 2025-01-22T15:02:49Z

src/v/crash_tracker/types.cc

+    const auto opt_stacktrace = cd.stacktrace.c_str();
+    const auto has_stacktrace = strlen(opt_stacktrace) > 0;
+    if (has_stacktrace) {
+        fmt::print(os, " Backtrace: {}.", opt_stacktrace);
+    }


Why not use `cd.stacktrace.size()`?
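The thread does not answer this question. As background only, here is one situation in which the two checks can differ, offered as an assumption rather than as the author's reasoning: if a string is sized ahead of time and filled in place, `size()` reports the reserved length while `strlen` on `c_str()` reports the length up to the first NUL. The sketch uses `std::string` as a stand-in for whatever type the field really is, and `make_prefilled` and `has_content` are hypothetical names.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical: builds a string whose storage is sized up front and then
// filled in place, leaving the unused tail as NUL bytes.
std::string make_prefilled(const char* text, std::size_t capacity) {
    std::string s(capacity, '\0');
    // Copies at most `capacity` bytes; std::string keeps its own
    // terminator after the last element, so c_str() stays NUL-terminated.
    std::strncpy(s.data(), text, capacity);
    return s;
}

// The check used in the diff above, applied to the stand-in type.
bool has_content(const std::string& s) {
    return std::strlen(s.c_str()) > 0;
}
```

For a pre-sized string holding no text, `size()` is non-zero while `has_content` returns false; for an ordinary string the two checks agree.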

michael-redpanda · 2025-01-22T15:03:02Z

src/v/crash_tracker/types.cc

+    const auto opt_add_info = cd.addition_info.c_str();
+    const auto has_add_info = strlen(opt_add_info) > 0;
+    if (has_add_info) {
+        fmt::print(os, " {}", opt_add_info);
+    }


same question

michael-redpanda · 2025-01-22T15:19:43Z

src/v/crash_tracker/recorder.cc

+    for (const auto& entry :
+         std::filesystem::directory_iterator(crash_report_dir)) {
+        if (!entry.path().string().ends_with(crash_report_suffix)) {
+            // Filter only for crash files
+            continue;
+        }
+
+        auto buf = co_await read_fully(entry.path());
+        try {
+            auto crash_desc = serde::from_iobuf<crash_description>(
+              std::move(buf));
+            result.emplace_back(std::move(crash_desc));
+        } catch (const serde::serde_exception&) {
+            vlog(
+              ctlog.warn,
+              "Ignoring malformed crash report file {}",
+              entry.path());
+        }
+    }


question: Can this possibly return an unsorted list of reports? Or does `directory_iterator` guarantee some sort of ordering?
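For reference, the C++ standard leaves the iteration order of `std::filesystem::directory_iterator` unspecified, so a caller that needs a particular order has to sort explicitly. A minimal sketch of doing so, assuming a hypothetical helper `list_crash_files_sorted` that is not part of the PR. Sorting by path yields chronological order only if the file names happen to sort by creation time, which this excerpt does not confirm.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <string_view>
#include <vector>

// Returns true if `name` ends with `suffix` (written out by hand so the
// sketch does not depend on C++20's ends_with).
bool has_suffix(const std::string& name, std::string_view suffix) {
    return name.size() >= suffix.size()
           && std::equal(suffix.rbegin(), suffix.rend(), name.rbegin());
}

// Hypothetical helper: collects entries under `dir` whose path ends with
// `suffix` and returns them in a deterministic, sorted order.
std::vector<std::filesystem::path> list_crash_files_sorted(
  const std::filesystem::path& dir, std::string_view suffix) {
    std::vector<std::filesystem::path> files;
    for (const auto& entry : std::filesystem::directory_iterator(dir)) {
        if (has_suffix(entry.path().string(), suffix)) {
            files.push_back(entry.path());
        }
    }
    // directory_iterator gives no ordering guarantee, so sort explicitly.
    std::sort(files.begin(), files.end());
    return files;
}
```

An alternative would be to sort the deserialized reports on a timestamp stored inside each report, if the report type carries one.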

pgellert added 10 commits January 21, 2025 18:10

config: define crash_reports directory

1fae3d0

Defines it under `${datadir}/crash_reports` to store crash reports. It needs to be special-cased in the `compute_storage.py` script to ensure it isn't mistaken for a log file directory.

crash_tracker: prepare and release crash log file

8a5ef27

Implement the logic to choose a unique crash log file name, and hook up the code for preparing the crash recording code and the crash file cleanup code on clean shutdown.

crash_tracker: record startup exception crashes

452e456

Implement the recording of startup exceptions into crash files.

dt: test crash report file creation and cleanup

cfae5ae

Add some simple ducktape tests to assert that crash reports are being generated (or not generated) as expected. The contents of the crash files are going to be validated as a follow-up once the contents are actually printed to logs.

crash_tracker: add config as cmake dependency

b8706a0

The limiter uses node configs, so this should be included in the list of dependencies. This was missed from cmake only. It is already present in bazel.

crash_tracker: implement describing crashes

41b19e2

Implement the logic of formatting a list of crashes to a string. This is going to be printed to the logs when the crash loop limit is reached. See the added unit tests for what the output is going to look like.

crash_tracker: gather and print recorded crashes

4af1ca6

... when the crash loop limit is reached.

pgellert requested a review from a team January 22, 2025 11:15

pgellert self-assigned this Jan 22, 2025

pgellert requested review from oleiman and removed request for a team January 22, 2025 11:15

pgellert requested a review from a team as a code owner January 22, 2025 11:15

pgellert requested review from a team and IoannisRP and removed request for oleiman and a team January 22, 2025 11:16

github-actions bot added area/build area/redpanda labels Jan 22, 2025

pgellert requested a review from michael-redpanda January 22, 2025 11:16

michael-redpanda reviewed Jan 22, 2025
