Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-8685 crash_tracker: crash signal handlers for crash recording #24959

Merged

Conversation

pgellert
Copy link
Contributor

@pgellert pgellert commented Jan 28, 2025

This PR implements installing custom signal handlers for SIGSEGV and SIGABRT signals in redpanda using the sigaction system call. These signal handlers implement recording information about the crash for later debugging.

Seastar has support for handling signals on the reactor. However, seastar's ss::engine::handle_signal is designed for asynchronously handling user signals (e.g., SIGUSR1, SIGINT, SIGHUP) within the reactor framework. This approach is unsuitable for crash signals where the reactor won't run again after the signal handler is done. In such cases, synchronous signal handlers are required to perform minimal, safe actions (e.g., writing out crash data) immediately upon signal reception, without relying on the reactor's asynchronous scheduling.

This commit introduces synchronous, one-shot signal handlers for the main crash signals (SIGSEGV, SIGABRT, SIGILL). The implementation closely mirrors Seastar's existing signal handling mechanism for the above crash signals with the following difference:

  • Wrapping the existing signal handler: The original signal handler is retrieved and stored before installing redpanda's handler, ensuring it can be restored and triggered after our handler executes. This is to continue to call seastar's handlers as well after our handler.

  • Crash handler action: While seastar's signal handlers print to stderr, our signal handlers instead write a serde-serialized crash description to an open file descriptor.

Note that for SIGILL seastar did not have a custom handler, which means that while for SIGSEGV and SIGABRT the original handler is seastar's, for SIGILL it is the default signal handler SIG_DFL which just terminated the process without printing to stderr. A SIGILL handler is introduced to capture vassert() crashes, which (among other places) uses the __builtin_trap() primitive leading to a SIGILL signal being raised.

Fixes https://redpandadata.atlassian.net/browse/CORE-8685

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@pgellert
Copy link
Contributor Author

/ci-repeat 1

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jan 28, 2025

CI test results

test results on build#61284
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61284#0194ac86-9219-4f34-957a-f8b765bc683e FLAKY 1/4
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/61284#0194ac86-9219-4f34-957a-f8b765bc683e FLAKY 1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/61284#0194ac86-9217-422a-aa20-3d0d299bd95a FLAKY 1/2
test results on build#61407
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61407#0194b8c8-0101-414b-b36d-f8252018eaf9 FLAKY 1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/61407#0194b8cd-3f10-4c9e-a453-a5246f4bbb8b FLAKY 1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/61407#0194b8cd-3f10-4c9e-a453-a5246f4bbb8b FLAKY 1/3
test results on build#61578
test_id test_kind job_url test_status passed
gtest_raft_rpunit.gtest_raft_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61578#0194d27c-d78d-4c27-8614-9428f0cb186a FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61578#0194d2d8-b986-43f8-80ef-5f5b11d4fc63 FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC ducktape https://buildkite.com/redpanda/redpanda/builds/61578#0194d2dc-3501-439c-8225-3db4aa222dfc FLAKY 1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=2.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/61578#0194d2dc-3501-439c-8225-3db4aa222dfc FLAKY 1/2

@pgellert pgellert force-pushed the crashlog/crash-signal-recording branch from 5710b4a to 008974e Compare January 28, 2025 15:39
@pgellert pgellert marked this pull request as ready for review January 28, 2025 15:40
@pgellert pgellert requested review from a team, IoannisRP, michael-redpanda, dotnwat, travisdowns and rockwotj and removed request for a team January 28, 2025 15:40
Copy link
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks nice!

src/v/crash_tracker/recorder.cc Outdated Show resolved Hide resolved
src/v/redpanda/application.cc Outdated Show resolved Hide resolved
Comment on lines 1080 to 1084
void sigsegv_action() {
crash_tracker::get_recorder().record_crash_sighandler(SIGSEGV);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed that in seastar it ignores SIGSEGV when SEASTAR_ASAN_ENABLED is true: https://github.com/scylladb/seastar/blob/dfc804c506655f32650d44fa2f3dafe30414bcac/src/core/reactor.cc#L3998-L4004

Not sure if we need to handle that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I saw this. ASAN installs its own sighandler for sigsegv and prints a bunch of useful info (including the backtrace) when there is a segfault. I think seastar doesn't handle sigsegv when ASAN is enabled because they would overwrite the ASAN handler. And even if they wrapped the ASAN handler instead of overwriting it, their handler would just print the same/similar error message including the backtrace again, which is useless/annoying when the ASAN handler prints the same info.

For redpanda, I think we should capture crash records regardless of whether ASAN is enabled or not. When ASAN is enabled, redpanda's sighandler will wrap the ASAN signal handler instead of seastar's. Which is why I believe we don't have to special case for ASAN here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should capture crash records regardless of whether ASAN is enabled or not. When ASAN is enabled, redpanda's sighandler will wrap the ASAN signal handler instead of seastar's. Which is why I believe we don't have to special case for ASAN here.

Have you tested this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is important to test that this works, ideally as a checked in test.

Copy link
Contributor Author

@pgellert pgellert Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested this? ... ideally as a checked in test.

Yes, the ducktape test I've added in this PR (crash_loop_checks_test.py::CrashLoopChecksTest.test_crash_report_with_signal{signo=Signals.SIGSEGV}) tests this, specifically in the debug-mode run of ducktape where ASAN is enabled.

...
Recorded crash reason to crash file.
AddressSanitizer:DEADLYSIGNAL
=================================================================
==885==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000375 (pc 0x7f32d0a5eac0 bp 0x7ffc4fe0f750 sp 0x7ffc4fe0f730 T0)
==885==The signal is caused by a READ memory access.
==885==Hint: address points to the zero page.
    #0 0x7f32d0a5eac0 in std::__1::vector<cluster::node_status, std::__1::allocator<cluster::node_status> >::capacity[abi:ne180100]() const /home/gellertnagy/code/redpanda/vbuild/llvm/3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff/install/bin/../include/c++/v1/vector:598:35
...
==885==ABORTING

src/v/redpanda/application.cc Outdated Show resolved Hide resolved
tests/rptest/tests/crash_loop_checks_test.py Outdated Show resolved Hide resolved
tests/rptest/tests/crash_loop_checks_test.py Outdated Show resolved Hide resolved

struct sigaction sa;
sa.sa_sigaction = [](int sig, siginfo_t*, void*) {
std::lock_guard<ss::util::spinlock> g(lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not reentrant. I've deadlocked seastar because of the spinlock they have where printing the stacktrace causes another segfault (this can happen if the stack is corrupted). Hung processes really suck.

Can we remove the need for the spinlock? Can we either make it thread local and check for the lock being held first? Or if not thread local then can we check if we are re-entering using a thread local?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not reentrant. Can we remove the need for the spinlock? Can we either make it thread local and check for the lock being held first? Or if not thread local then can we check if we are re-entering using a thread local?

Yes, we can replace the lock with a std::atomic<bool> handled that we CAS on to decide whether to call Func() or not. I didn't do it initially because I wanted to deviate as little as possible from the existing battle-tested seastar sighandler implementation. But I agree that there is a possibility of a deadlock in the seastar implementation and it is a simple change, so I've made the change now.

this can happen if the stack is corrupted

Do you have an example of how this can happen by any chance? I'm curious + maybe we could add a test for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example of how this can happen by any chance

I think it's possible if there is a crash while unwinding the stack, I know this used to be an issue when initially developing the Wasm VM into seastar, but I forget the exact scenario now. I do think it had something to do with us not unblocking SIGILL however, or maybe it was aborts inside the Wasm VM? Sorry not recalling what's going on.

Comment on lines 1090 to 1098
void install_sighandlers() {
install_oneshot_signal_handler<SIGSEGV, sigsegv_action>();
install_oneshot_signal_handler<SIGABRT, sigabrt_action>();
install_oneshot_signal_handler<SIGILL, sigill_action>();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things, outside of shard 0 SIGILL is blocked by default without Wasm on. WebAssembly removes the sigmask, but only if it's on.

co_await ss::smp::invoke_on_all([] {
// wasmtime needs some signals for it's handling, make sure we
// unblock them.
auto mask = ss::make_empty_sigset_mask();
sigaddset(&mask, SIGSEGV);
sigaddset(&mask, SIGILL);
sigaddset(&mask, SIGFPE);
ss::throw_pthread_error(::pthread_sigmask(SIG_UNBLOCK, &mask, nullptr));
});

Have you tested this with Wasm enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outside of shard 0 SIGILL is blocked by default without Wasm on

Good catch, thank you, I did not notice this. I've now added an ducktape test to test with crash signals on both shard 0 and shard 1. These would fail before, so to ensure that SIGILL and SIGABRT signals can be recorded even outside of shard 0, I've changed install_oneshot_signal_handler to unblock the signal on all reactor threads similar to how ss::engine::handle_signal() does it: https://github.com/redpanda-data/seastar/blob/ec806998fbf62386ee4b81ac27358b40335839ca/src/core/reactor.cc#L723-L743

WebAssembly removes the sigmask, but only if it's on. Have you tested this with Wasm enabled?

The wasm ducktape tests test this I believe + I've now done some manual testing and crash recording is working the same with/without wasm enabled (now that I correctly set up the signal masks). Do you have any pointers to why the wasm engine needs these signals unblocked? I'd like to better understand if there is any edge case we should consider here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any pointers

It's a bit involved and in the weeds of the VM implementation, but the code is here: https://github.com/bytecodealliance/wasmtime/blob/main/crates/wasmtime/src/runtime/vm/sys/unix/signals.rs

They use it to handle things like out of bounds memory access and stack overflow inside the VM without needing explicit assembly to do the bounds checking.

Comment on lines 1080 to 1084
void sigsegv_action() {
crash_tracker::get_recorder().record_crash_sighandler(SIGSEGV);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should capture crash records regardless of whether ASAN is enabled or not. When ASAN is enabled, redpanda's sighandler will wrap the ASAN signal handler instead of seastar's. Which is why I believe we don't have to special case for ASAN here.

Have you tested this?

@pgellert pgellert force-pushed the crashlog/crash-signal-recording branch from 008974e to f400776 Compare January 30, 2025 18:39
@pgellert
Copy link
Contributor Author

force-push: address code review feedback

  • single sigaction call to install new and retrieve old handler
  • avoid unreachable code with enum class recorded_signo
  • avoid ss::util::spin_lock with std::atomic<bool>
  • Test: with crash signals on non-shard 0 threads
  • Test: add messages to asserts

struct sigaction sa;
sa.sa_sigaction = [](int sig, siginfo_t*, void*) {
// Reinstall the original signal handler
auto r = ::sigaction(sig, &original_sa, nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were not going to crash the process right after this, this has issues overriding the Wasm signal handling (actually I wonder if there is a potential races of reinstalling this then another thread hitting raising a signal).

Yeah thinking more, I actually don't think we want to drop the lock, I think we just need to protect against re-entry. If this function is called concurrently we will probably do the wrong thing or not let the crash handling finish it's work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were not going to crash the process right after this

We are going to crash the process right after this. The signal is re-raised for the current head as the last step of the signal handler, which ensures that we will not return to the main execution of the thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a potential races of reinstalling this then another thread hitting raising a signal. ... I actually don't think we want to drop the lock, I think we just need to protect against re-entry. If this function is called concurrently we will probably do the wrong thing or not let the crash handling finish it's work.

It comes down to how we want to handle the case of concurrent signals on multiple threads. We could:

  1. Spin waiting for the first signal handler to complete. This is what the lock-based implementation does if we install the signal handler after calling Func().
  2. Terminate the process, possibly before a crash report has been written. This is what the current implementation does with installing the signal handler first and then executing Func().

We cannot return from the signal handlers without terminating, since the program cannot continue from where it left off. So we either have to terminate the process or spin when handling a second signal concurrently.

I believe both 1 or 2 are viable. The pros/cons I see are:

  • 1 is what seastar does currently, and we haven't had problems with it.
  • But with 1 there is technically a risk of deadlock. If two concurrent signals are raised on the same core, then the later signal could sping and starve the earlier one and cause a deadlock. Note that these signals must arrive on the same core but different threads (eg. a reactor thread and a non-reactor thread). They cannot arrive on the same thread because the [sa.sa](http://sa.sa/)_mask parameter of the signal handler blocks concurrent signals to the same.
  • 2 is what this PR implements today. It is async signal safe, but in this highly unlikely scenario of concurrent crashes on multiple cores it may not record a crash report.

I am personally happy with either approach because both are safe (2 is clearly safe while 1 is as safe as the status quo seastar signal handler).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a diff that both allows the crash handler to finish but also prevents deadlock: https://gist.github.com/rockwotj/46f9a6236b5d09221fa69d9964bd68e0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks yeah that's essentially what I meant by option (2) above. I agree that that seems the best approach because it allows for the crash recorder to complete before terminating the process even in the case of concurrent signals.

From your gist I didn't adopt the thread_local changes, because, as we discussed on Slack, the sa_mask parameter passed to the sigaction call when we install the signal handler ensures that all signal delivery is blocked on the pthread that is currently handling a signal.

sigfillset(&sa.sa_mask);

@pgellert pgellert requested a review from rockwotj February 3, 2025 14:04
src/v/redpanda/application.cc Outdated Show resolved Hide resolved
struct sigaction sa;
sa.sa_sigaction = [](int sig, siginfo_t*, void*) {
// Reinstall the original signal handler
auto r = ::sigaction(sig, &original_sa, nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a diff that both allows the crash handler to finish but also prevents deadlock: https://gist.github.com/rockwotj/46f9a6236b5d09221fa69d9964bd68e0

This method is going to be used in signal handler functions to record
information about the crash.

Its implementation is async signal safe to ensure that it is safe to be
called in a signal handler.

Ref: https://man7.org/linux/man-pages/man7/signal-safety.7.html
This commit implements installing custom signal handlers for SIGSEGV and
SIGABRT signals in redpanda using the `sigaction` system call. These
signal handlers implement recording information about the crash for
later debugging.

Seastar has support for handling signals on the reactor. However,
seastar's `ss::engine::handle_signal` is designed for asynchronously
handling user signals (e.g., SIGUSR1, SIGINT, SIGHUP) within the reactor
framework. This approach is unsuitable for crash signals where the
reactor won't run again after the signal handler is done. In such cases,
synchronous signal handlers are required to perform minimal, safe
actions (e.g., writing out crash data) immediately upon signal
reception, without relying on the reactor's asynchronous scheduling.

This commit introduces synchronous, one-shot signal handlers for the
main crash signals (SIGSEGV, SIGABRT, SIGILL). The implementation
closely mirrors Seastar's existing signal handling mechanism for the
above crash signals with the following difference:

- Wrapping the existing signal handler: The original signal handler is
  retrieved and stored before installing redpanda's handler, ensuring it
  can be restored and triggered after our handler executes. This is to
  continue to call seastar's handlers as well after our handler.

- Crash handler action: While seastar's signal handlers print to stderr,
  our signal handlers instead write a serde-serialized crash
  description to an open file descriptor.

- SIGILL handler: for SIGILL seastar does not have a custom handler,
  which means that while for SIGSEGV and SIGABRT the original handler is
  seastar's, for SIGILL it is the default signal handler SIG_DFL which
  just terminates the process without printing to stderr. A SIGILL
  handler is introduced to capture `vassert()` crashes, which (among
  other places) uses the `__builtin_trap()` primitive leading to a
  SIGILL signal being raised.

Seastar's current crash signal handlers (v25.1.x-pre): https://github.com/redpanda-data/seastar/blob/ec806998fbf62386ee4b81ac27358b40335839ca/src/core/reactor.cc#L4169-L4207
Allow sending signals to specific shards/threads of redpanda. This is to
help with testing when a crash (eg. segfault) occurs on a shard of
redpanda other than shard 0.
This adds a test to verify that crash reports captured while handling
crash signals (segfaults, sigabrt, sigill) are correctly being reported.
It also verifies that the generated crash reports can be read back later
and are correctly printed when the crash loop limit is reached.

This requires exposing `_tolerate_crashes` from `RedpandaService` to
prevent test failures on crash-detecting checks, since for this test,
the crashes are expected.
@pgellert pgellert force-pushed the crashlog/crash-signal-recording branch from f400776 to c2e59b7 Compare February 4, 2025 19:41
@pgellert
Copy link
Contributor Author

pgellert commented Feb 4, 2025

force-push:

  • Rebase to dev to fix merge conflicts
  • Move the signal handler installation code from application.cc to crash_tracker/signals.cc
  • Move back to the ss::util::spinlock-based implementation based on this discussion above

@rockwotj rockwotj self-requested a review February 4, 2025 19:46
Copy link
Contributor

@rockwotj rockwotj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for working through this with me!

@pgellert pgellert merged commit b586de8 into redpanda-data:dev Feb 5, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants