Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-8618 crash_tracker: record uncaught startup exceptions #24854

Open
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

pgellert
Copy link
Contributor

@pgellert pgellert commented Jan 17, 2025

This PR adds functionality to record crashes caused by an exception during startup as crash reports to disk. This is the first step to implement the core crash recording project which aims to persist information about redpanda broker crashes to disk to help debugging.

The PR implements the core functionality of preparing state to allow writing crash reports in an async-signal-safe way, so that this code can be reused in signal handlers in follow-up PRs.

The crash reports are going to be stored under the data directory. The filenames are constructed as: ${datadir}/crash_reports/${broker_start_time_ms}_${random}.crash.

Reports are serialized using serde and iobuf (with pre-reserved) memory, and the implementation relies on their allocation-free behaviour to ensure that writing crash files in signal handlers is reliable even in OOM scenarios. (This will become important in a follow-up PR when the signal handlers are actually hooked up.)

To write to the crash reports, we need to use the low-level open() function here instead of the seastar API or higher-level C++ primitives because we need to be able to manipulate the file using async-signal-safe, allocation-free functions inside signal handlers. Ref: https://man7.org/linux/man-pages/man7/signal-safety.7.html

Fixes https://redpandadata.atlassian.net/browse/CORE-8618

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@pgellert pgellert requested a review from a team January 17, 2025 15:47
@pgellert pgellert self-assigned this Jan 17, 2025
@pgellert pgellert requested review from michael-redpanda and a team and removed request for a team and michael-redpanda January 17, 2025 15:47
@pgellert pgellert requested a review from a team as a code owner January 17, 2025 15:47
@pgellert pgellert requested review from BenPope, michael-redpanda and IoannisRP and removed request for a team January 17, 2025 15:47
@pgellert pgellert marked this pull request as draft January 17, 2025 15:48
@pgellert pgellert marked this pull request as ready for review January 17, 2025 15:48
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jan 17, 2025

CI test results

test results on build#60910
test_id test_kind job_url test_status passed
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli ducktape https://buildkite.com/redpanda/redpanda/builds/60910#01947552-5d4b-4150-bcdb-b7b1bb0fd156 FLAKY 1/2
test results on build#60955
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/60955#01948420-a364-4301-adf9-f2249f270189 FLAKY 1/2
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/60955#01948420-a364-4464-a871-097ffe11e6c2 FLAKY 1/2
test results on build#61003
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61003#01948a54-d3a4-4eaf-b3e8-9ebafd176099 FLAKY 1/2

@pgellert pgellert force-pushed the crashlog/record-startup-exc branch from 1dc3a86 to 09ccff8 Compare January 20, 2025 13:15
@pgellert
Copy link
Contributor Author

force-push: rebase to pgellert:crashlog/define-api to update the first 4 commits of this stacked PR with the latest changes

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is looking pretty cool!

src/v/config/node_config.h Show resolved Hide resolved
src/v/crash_tracker/prepared_writer.cc Outdated Show resolved Hide resolved
Comment on lines 60 to 62
_prepared_cd.crash_message.resize(4096, '\0');
_prepared_cd.stacktrace.resize(4096, '\0');
_prepared_cd.addition_info.resize(4094, '\0');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\0 is the default value to fill with using resize(size)

// Create the crash recorder file
auto f = co_await ss::open_file_dma(
_crash_report_file_name.c_str(),
ss::open_flags::create | ss::open_flags::rw);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truncate
exclusive

_crash_report_file_name.c_str(),
ss::open_flags::create | ss::open_flags::rw);
co_await f.close();
co_await ss::sync_directory(_crash_report_file_name.parent_path().string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you protecting against a certain scenario by syncing the directory here? a comment would probably be helpful since it's likely important?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this comment:

Sync the parent dir to ensure that the new file is observable to the ::open() call below and to later restarts of the process

Let me know if that explains it well enough.

Copy link
Member

@dotnwat dotnwat Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't hurt anything, but you don't really need to do the sync for these cases--it's only good for if the actual node crashes.

for (const auto& frag : _serde_output) {
auto written = ::write(_fd, frag.get(), frag.size());
if (written == -1) {
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here?

Copy link
Contributor Author

@pgellert pgellert Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't expect us to be able to continue to write to the file descriptor if an earlier ::write() call failed. Am I missing something here? But I also don't see any problems with continuing to try to write (the man page doesn't say that this would be a problem), so I've changed the code to continue to try to write here.

Comment on lines +141 to +142
co_await ss::sync_directory(
_crash_report_file_name.parent_path().string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sync is probably unnecessary, but maybe you could provide a little context on what you're trying to protect against?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the comment I've added:

Sync the parent dir to ensure that later restarts of the process cannot observe this file

Let me know if it explains it well enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

visibility is handled by the kernel. process restarts don't need the sync. it doesn't hurt anything, except performance if you were doing many syncs.

"{}_{}{}", time_now, random_int, crash_report_suffix);
if (co_await ss::file_exists(try_name.string())) {
// Try again in the rare case of a collision
co_await ss::sleep(1ms);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the sleep for? you're not going to try again with the same filename, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is not needed. I've removed it now

Comment on lines +79 to +142
std::lock_guard<ss::util::spinlock> g(_writer_lock);
co_await _writer.release();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you comment on the use of a spinlock here compared to a blocking mutex? generally i would think that holding a spinlock for anything other than the tiniest of cpu-bound critical sections (and release here is performing I/O) would not perform as well as a blocking mutex where the kernel can put a blocking thread to sleep and understand the chain of dependencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this comment now:

The writer has shared state, so accessing it from multiple threads
is guarded by this lock. We are using the async-signal safe
ss::util::spinlock instead of more efficient locking mechanisms like the
seastar primitives or std::mutex to ensure that we can safely use the
lock in a singal handler context.

This lock will be used in signal handlers in a follow-up PR so it needs to be async-signal safe. The pthread mutex isn't: https://man7.org/linux/man-pages/man3/pthread_mutex_lock.3.html.
The standard seastar concurrency primitives are also not available in the signal handler because the signal handler runs outside of the reactor. ss::util::spinlock is specifically an async-signal safe spinlock that seastar uses in its signal handlers: https://github.com/scylladb/seastar/blob/c3f33c116ecde6f37cea6f656e1dfed2a4cd26a2/include/seastar/util/spinlock.hh#L85-L88

I am not worried about the performance overhead of the spinlock because all of the code paths where the lock is used are outside of performance-critical sections (startup, shutdown, mid-crash), and I don't expect the lock to be contended except for some highly unlikely scenarios (eg. segfault while already recording a failure).

// If the recorder is already locked, give up to prevent possible deadlocks.
// For example, an assertion failure while recording a segfault may lead to
// a deadlock if we unconditionally waited for the lock here.
std::unique_lock<ss::util::spinlock> g(_writer_lock, std::try_to_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here. why a spinlock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered here: #24854 (comment)

src/v/crash_tracker/prepared_writer.cc Outdated Show resolved Hide resolved
src/v/crash_tracker/prepared_writer.cc Outdated Show resolved Hide resolved
src/v/crash_tracker/prepared_writer.cc Outdated Show resolved Hide resolved
src/v/crash_tracker/prepared_writer.cc Outdated Show resolved Hide resolved
src/v/crash_tracker/prepared_writer.cc Outdated Show resolved Hide resolved
src/v/crash_tracker/prepared_writer.cc Outdated Show resolved Hide resolved
src/v/crash_tracker/prepared_writer.cc Show resolved Hide resolved
@pgellert pgellert force-pushed the crashlog/record-startup-exc branch from 09ccff8 to c7e2ba2 Compare January 21, 2025 16:10
@pgellert
Copy link
Contributor Author

force-push: address code review feedback

Defines it under `${datadir}/crash_reports` to store crash reports.

It needs to be special-cased in the `compute_storage.py` script to
ensure it isn't mistaken for a log file directory.
Define a custom exception for `crash_loop_limit_reached` to allow
pattern matching for it in a follow up commit.

We specifically do not want to record `crash_loop_limit_reached` errors
as crashes because they are generally not informative crashes. Recording
them as real crashes would build up garbage on disk, which would lead to
real crash logs expiring on disk earlier.
Introduces the core stateful writer which is used to write out
`crash_description` serde objects to crash report files. It is able to
write out the crash reports in an async-signal-safe way by:
 * using only async signal safe syscalls
 * pre-allocating all the memory necessary to construct, serialize and
   write our `crash_description` objects to disk
 * using only non-allocating methods

The `prepared_writer` is hooked up to the `recorder` in a later commit
of this PR.

Ref: https://man7.org/linux/man-pages/man7/signal-safety.7.html
Implement the logic to choose a unique crash log file name, and hook up
the code for preparing the crash recording code and the crash file
cleanup code on clean shutdown.
Implement the recording of startup exceptions into crash files.
Add some simple ducktape tests to assert that crash reports are being
generater (or not generated) as expected.

The contents of the crash files are going to be validated as a follow up
once the contents are actually printed to logs.
@pgellert pgellert force-pushed the crashlog/record-startup-exc branch from c7e2ba2 to cfae5ae Compare January 21, 2025 18:11
@pgellert
Copy link
Contributor Author

force-push: rebased to dev now that the base PR (#24828) merged

Copy link
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just one remaining question

src/v/crash_tracker/recorder.cc Show resolved Hide resolved

void prepared_writer::write() {
vassert(_state == state::filled, "Unexpected state: {}", _state);
_state = state::written;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this transition correct if try_write_crash() fails? Should failure be represented in the state machine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this transition correct if try_write_crash() fails?

Yes, it's correct though I get that the naming is a bit misleading. Maybe consumed would be a better name to represent this state.

Should failure be represented in the state machine?

We could add a separate failure state in the state machine but at the moment it is identical to the written state so I think it's simpler as is.

Comment on lines +90 to +91
cd.stacktrace.begin() + pos,
cd.stacktrace.size() - pos,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: You could use std::next/prev if there's a lint complaint about pointer arithmetic here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thank you! Let me carry this onto a follow-up PR unless any more changes come up for this PR.

while (written < frag.size()) {
auto res = ::write(
_fd, frag.get() + written, frag.size() - written);
if (res == -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you consider retrying on EINTR? it's niche, but still possible. and you did state that you were avoiding seastar I/O because you cared about signal handling...

auto f = co_await ss::open_file_dma(
_crash_report_file_name.c_str(),
ss::open_flags::create | ss::open_flags::rw | ss::open_flags::truncate
| ss::open_flags::exclusive);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my original comment about using exclusive here was actually a note to myself to revisit my comment after finishing the review, but i forgot. the problem with exclusive is that this open will fail if the file already exists on disk. is that going to be an issue, or is exclusive behavior desired? my hunch is that you may not want it, but the note to myself was just to ask if you wanted it.

Comment on lines +141 to +142
co_await ss::sync_directory(
_crash_report_file_name.parent_path().string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

visibility is handled by the kernel. process restarts don't need the sync. it doesn't hurt anything, except performance if you were doing many syncs.

@dotnwat
Copy link
Member

dotnwat commented Jan 22, 2025

@pgellert the note about exclusive flag is probably the importnat one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants