CORE-8685 crash_tracker: crash signal handlers for crash recording #24959

pgellert · 2025-01-28T08:27:51Z

This PR implements installing custom signal handlers for SIGSEGV and SIGABRT signals in redpanda using the sigaction system call. These signal handlers implement recording information about the crash for later debugging.

Seastar has support for handling signals on the reactor. However, seastar's ss::engine::handle_signal is designed for asynchronously handling user signals (e.g., SIGUSR1, SIGINT, SIGHUP) within the reactor framework. This approach is unsuitable for crash signals where the reactor won't run again after the signal handler is done. In such cases, synchronous signal handlers are required to perform minimal, safe actions (e.g., writing out crash data) immediately upon signal reception, without relying on the reactor's asynchronous scheduling.

This commit introduces synchronous, one-shot signal handlers for the main crash signals (SIGSEGV, SIGABRT, SIGILL). The implementation closely mirrors Seastar's existing signal handling mechanism for the above crash signals with the following difference:

Wrapping the existing signal handler: The original signal handler is retrieved and stored before installing redpanda's handler, ensuring it can be restored and triggered after our handler executes. This is to continue to call seastar's handlers as well after our handler.
Crash handler action: While seastar's signal handlers print to stderr, our signal handlers instead write a serde-serialized crash description to an open file descriptor.

Note that for SIGILL seastar did not have a custom handler, which means that while for SIGSEGV and SIGABRT the original handler is seastar's, for SIGILL it is the default signal handler SIG_DFL which just terminated the process without printing to stderr. A SIGILL handler is introduced to capture vassert() crashes, which (among other places) uses the __builtin_trap() primitive leading to a SIGILL signal being raised.

Fixes https://redpandadata.atlassian.net/browse/CORE-8685

Backports Required

Release Notes

none

pgellert · 2025-01-28T09:00:47Z

/ci-repeat 1

vbotbuildovich · 2025-01-28T12:58:30Z

CI test results

test results on build#61284

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61284#0194ac86-9219-4f34-957a-f8b765bc683e	FLAKY	1/4
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/61284#0194ac86-9219-4f34-957a-f8b765bc683e	FLAKY	1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic	ducktape	https://buildkite.com/redpanda/redpanda/builds/61284#0194ac86-9217-422a-aa20-3d0d299bd95a	FLAKY	1/2

test results on build#61407

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61407#0194b8c8-0101-414b-b36d-f8252018eaf9	FLAKY	1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS	ducktape	https://buildkite.com/redpanda/redpanda/builds/61407#0194b8cd-3f10-4c9e-a453-a5246f4bbb8b	FLAKY	1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic	ducktape	https://buildkite.com/redpanda/redpanda/builds/61407#0194b8cd-3f10-4c9e-a453-a5246f4bbb8b	FLAKY	1/3

test results on build#61578

test_id	test_kind	job_url	test_status	passed
gtest_raft_rpunit.gtest_raft_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61578#0194d27c-d78d-4c27-8614-9428f0cb186a	FLAKY	1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61578#0194d2d8-b986-43f8-80ef-5f5b11d4fc63	FLAKY	1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC	ducktape	https://buildkite.com/redpanda/redpanda/builds/61578#0194d2dc-3501-439c-8225-3db4aa222dfc	FLAKY	1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=2.cloud_storage_type=CloudStorageType.ABS	ducktape	https://buildkite.com/redpanda/redpanda/builds/61578#0194d2dc-3501-439c-8225-3db4aa222dfc	FLAKY	1/2

michael-redpanda

looks nice!

src/v/crash_tracker/recorder.cc

src/v/redpanda/application.cc

michael-redpanda · 2025-01-29T17:17:04Z

src/v/redpanda/application.cc

+void sigsegv_action() {
+    crash_tracker::get_recorder().record_crash_sighandler(SIGSEGV);
+}


Noticed that in seastar it ignores SIGSEGV when SEASTAR_ASAN_ENABLED is true: https://github.com/scylladb/seastar/blob/dfc804c506655f32650d44fa2f3dafe30414bcac/src/core/reactor.cc#L3998-L4004

Not sure if we need to handle that.

Yeah, I saw this. ASAN installs its own sighandler for sigsegv and prints a bunch of useful info (including the backtrace) when there is a segfault. I think seastar doesn't handle sigsegv when ASAN is enabled because they would overwrite the ASAN handler. And even if they wrapped the ASAN handler instead of overwriting it, their handler would just print the same/similar error message including the backtrace again, which is useless/annoying when the ASAN handler prints the same info.

For redpanda, I think we should capture crash records regardless of whether ASAN is enabled or not. When ASAN is enabled, redpanda's sighandler will wrap the ASAN signal handler instead of seastar's. Which is why I believe we don't have to special case for ASAN here.

I think we should capture crash records regardless of whether ASAN is enabled or not. When ASAN is enabled, redpanda's sighandler will wrap the ASAN signal handler instead of seastar's. Which is why I believe we don't have to special case for ASAN here.

Have you tested this?

Yes, it is important to test that this works, ideally as a checked in test.

Have you tested this? ... ideally as a checked in test.

Yes, the ducktape test I've added in this PR (crash_loop_checks_test.py::CrashLoopChecksTest.test_crash_report_with_signal{signo=Signals.SIGSEGV}) tests this, specifically in the debug-mode run of ducktape where ASAN is enabled.

... Recorded crash reason to crash file. AddressSanitizer:DEADLYSIGNAL ================================================================= ==885==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000375 (pc 0x7f32d0a5eac0 bp 0x7ffc4fe0f750 sp 0x7ffc4fe0f730 T0) ==885==The signal is caused by a READ memory access. ==885==Hint: address points to the zero page. #0 0x7f32d0a5eac0 in std::__1::vector<cluster::node_status, std::__1::allocator<cluster::node_status> >::capacity[abi:ne180100]() const /home/gellertnagy/code/redpanda/vbuild/llvm/3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff/install/bin/../include/c++/v1/vector:598:35 ... ==885==ABORTING

src/v/redpanda/application.cc

tests/rptest/tests/crash_loop_checks_test.py

rockwotj · 2025-01-29T21:57:57Z

src/v/redpanda/application.cc

+
+    struct sigaction sa;
+    sa.sa_sigaction = [](int sig, siginfo_t*, void*) {
+        std::lock_guard<ss::util::spinlock> g(lock);


This is not reentrant. I've deadlocked seastar because of the spinlock they have where printing the stacktrace causes another segfault (this can happen if the stack is corrupted). Hung processes really suck.

Can we remove the need for the spinlock? Can we either make it thread local and check for the lock being held first? Or if not thread local then can we check if we are re-entering using a thread local?

This is not reentrant. Can we remove the need for the spinlock? Can we either make it thread local and check for the lock being held first? Or if not thread local then can we check if we are re-entering using a thread local?

Yes, we can replace the lock with a std::atomic<bool> handled that we CAS on to decide whether to call Func() or not. I didn't do it initially because I wanted to deviate as little as possible from the existing battle-tested seastar sighandler implementation. But I agree that there is a possibility of a deadlock in the seastar implementation and it is a simple change, so I've made the change now.

this can happen if the stack is corrupted

Do you have an example of how this can happen by any chance? I'm curious + maybe we could add a test for it.

Do you have an example of how this can happen by any chance

I think it's possible if there is a crash while unwinding the stack, I know this used to be an issue when initially developing the Wasm VM into seastar, but I forget the exact scenario now. I do think it had something to do with us not unblocking SIGILL however, or maybe it was aborts inside the Wasm VM? Sorry not recalling what's going on.

rockwotj · 2025-01-29T22:00:17Z

src/v/redpanda/application.cc

+void install_sighandlers() {
+    install_oneshot_signal_handler<SIGSEGV, sigsegv_action>();
+    install_oneshot_signal_handler<SIGABRT, sigabrt_action>();
+    install_oneshot_signal_handler<SIGILL, sigill_action>();
+}


Two things, outside of shard 0 SIGILL is blocked by default without Wasm on. WebAssembly removes the sigmask, but only if it's on.

redpanda/src/v/wasm/wasmtime.cc

Lines 1410 to 1418 in 2c68bb9

co_await ss::smp::invoke_on_all([] {

// wasmtime needs some signals for it's handling, make sure we

// unblock them.

auto mask = ss::make_empty_sigset_mask();

sigaddset(&mask, SIGSEGV);

sigaddset(&mask, SIGILL);

sigaddset(&mask, SIGFPE);

ss::throw_pthread_error(::pthread_sigmask(SIG_UNBLOCK, &mask, nullptr));

});

Have you tested this with Wasm enabled?

outside of shard 0 SIGILL is blocked by default without Wasm on

Good catch, thank you, I did not notice this. I've now added an ducktape test to test with crash signals on both shard 0 and shard 1. These would fail before, so to ensure that SIGILL and SIGABRT signals can be recorded even outside of shard 0, I've changed install_oneshot_signal_handler to unblock the signal on all reactor threads similar to how ss::engine::handle_signal() does it: https://github.com/redpanda-data/seastar/blob/ec806998fbf62386ee4b81ac27358b40335839ca/src/core/reactor.cc#L723-L743

WebAssembly removes the sigmask, but only if it's on. Have you tested this with Wasm enabled?

The wasm ducktape tests test this I believe + I've now done some manual testing and crash recording is working the same with/without wasm enabled (now that I correctly set up the signal masks). Do you have any pointers to why the wasm engine needs these signals unblocked? I'd like to better understand if there is any edge case we should consider here.

any pointers

It's a bit involved and in the weeds of the VM implementation, but the code is here: https://github.com/bytecodealliance/wasmtime/blob/main/crates/wasmtime/src/runtime/vm/sys/unix/signals.rs

They use it to handle things like out of bounds memory access and stack overflow inside the VM without needing explicit assembly to do the bounds checking.

rockwotj · 2025-01-29T22:02:04Z

src/v/redpanda/application.cc

+void sigsegv_action() {
+    crash_tracker::get_recorder().record_crash_sighandler(SIGSEGV);
+}


I think we should capture crash records regardless of whether ASAN is enabled or not. When ASAN is enabled, redpanda's sighandler will wrap the ASAN signal handler instead of seastar's. Which is why I believe we don't have to special case for ASAN here.

Have you tested this?

pgellert · 2025-01-30T18:42:01Z

force-push: address code review feedback

single sigaction call to install new and retrieve old handler
avoid unreachable code with enum class recorded_signo
avoid ss::util::spin_lock with std::atomic<bool>
Test: with crash signals on non-shard 0 threads
Test: add messages to asserts

rockwotj · 2025-01-31T15:39:20Z

src/v/redpanda/application.cc

+    struct sigaction sa;
+    sa.sa_sigaction = [](int sig, siginfo_t*, void*) {
+        // Reinstall the original signal handler
+        auto r = ::sigaction(sig, &original_sa, nullptr);


If we were not going to crash the process right after this, this has issues overriding the Wasm signal handling (actually I wonder if there is a potential races of reinstalling this then another thread hitting raising a signal).

Yeah thinking more, I actually don't think we want to drop the lock, I think we just need to protect against re-entry. If this function is called concurrently we will probably do the wrong thing or not let the crash handling finish it's work.

If we were not going to crash the process right after this

We are going to crash the process right after this. The signal is re-raised for the current head as the last step of the signal handler, which ensures that we will not return to the main execution of the thread.

I wonder if there is a potential races of reinstalling this then another thread hitting raising a signal. ... I actually don't think we want to drop the lock, I think we just need to protect against re-entry. If this function is called concurrently we will probably do the wrong thing or not let the crash handling finish it's work.

It comes down to how we want to handle the case of concurrent signals on multiple threads. We could:

Spin waiting for the first signal handler to complete. This is what the lock-based implementation does if we install the signal handler after calling Func().

Terminate the process, possibly before a crash report has been written. This is what the current implementation does with installing the signal handler first and then executing Func().

We cannot return from the signal handlers without terminating, since the program cannot continue from where it left off. So we either have to terminate the process or spin when handling a second signal concurrently.

I believe both 1 or 2 are viable. The pros/cons I see are:

1 is what seastar does currently, and we haven't had problems with it.

But with 1 there is technically a risk of deadlock. If two concurrent signals are raised on the same core, then the later signal could sping and starve the earlier one and cause a deadlock. Note that these signals must arrive on the same core but different threads (eg. a reactor thread and a non-reactor thread). They cannot arrive on the same thread because the [sa.sa](http://sa.sa/)_mask parameter of the signal handler blocks concurrent signals to the same.

2 is what this PR implements today. It is async signal safe, but in this highly unlikely scenario of concurrent crashes on multiple cores it may not record a crash report.

I am personally happy with either approach because both are safe (2 is clearly safe while 1 is as safe as the status quo seastar signal handler).

Here is a diff that both allows the crash handler to finish but also prevents deadlock: https://gist.github.com/rockwotj/46f9a6236b5d09221fa69d9964bd68e0

Thanks yeah that's essentially what I meant by option (2) above. I agree that that seems the best approach because it allows for the crash recorder to complete before terminating the process even in the case of concurrent signals.

From your gist I didn't adopt the thread_local changes, because, as we discussed on Slack, the sa_mask parameter passed to the sigaction call when we install the signal handler ensures that all signal delivery is blocked on the pthread that is currently handling a signal.

redpanda/src/v/crash_tracker/signals.cc

Line 57 in c2e59b7

sigfillset(&sa.sa_mask);

src/v/redpanda/application.cc

rockwotj · 2025-02-03T14:58:06Z

src/v/redpanda/application.cc

+    struct sigaction sa;
+    sa.sa_sigaction = [](int sig, siginfo_t*, void*) {
+        // Reinstall the original signal handler
+        auto r = ::sigaction(sig, &original_sa, nullptr);


Here is a diff that both allows the crash handler to finish but also prevents deadlock: https://gist.github.com/rockwotj/46f9a6236b5d09221fa69d9964bd68e0

This method is going to be used in signal handler functions to record information about the crash. Its implementation is async signal safe to ensure that it is safe to be called in a signal handler. Ref: https://man7.org/linux/man-pages/man7/signal-safety.7.html

This commit implements installing custom signal handlers for SIGSEGV and SIGABRT signals in redpanda using the `sigaction` system call. These signal handlers implement recording information about the crash for later debugging. Seastar has support for handling signals on the reactor. However, seastar's `ss::engine::handle_signal` is designed for asynchronously handling user signals (e.g., SIGUSR1, SIGINT, SIGHUP) within the reactor framework. This approach is unsuitable for crash signals where the reactor won't run again after the signal handler is done. In such cases, synchronous signal handlers are required to perform minimal, safe actions (e.g., writing out crash data) immediately upon signal reception, without relying on the reactor's asynchronous scheduling. This commit introduces synchronous, one-shot signal handlers for the main crash signals (SIGSEGV, SIGABRT, SIGILL). The implementation closely mirrors Seastar's existing signal handling mechanism for the above crash signals with the following difference: - Wrapping the existing signal handler: The original signal handler is retrieved and stored before installing redpanda's handler, ensuring it can be restored and triggered after our handler executes. This is to continue to call seastar's handlers as well after our handler. - Crash handler action: While seastar's signal handlers print to stderr, our signal handlers instead write a serde-serialized crash description to an open file descriptor. - SIGILL handler: for SIGILL seastar does not have a custom handler, which means that while for SIGSEGV and SIGABRT the original handler is seastar's, for SIGILL it is the default signal handler SIG_DFL which just terminates the process without printing to stderr. A SIGILL handler is introduced to capture `vassert()` crashes, which (among other places) uses the `__builtin_trap()` primitive leading to a SIGILL signal being raised. Seastar's current crash signal handlers (v25.1.x-pre): https://github.com/redpanda-data/seastar/blob/ec806998fbf62386ee4b81ac27358b40335839ca/src/core/reactor.cc#L4169-L4207

... for later reuse.

Allow sending signals to specific shards/threads of redpanda. This is to help with testing when a crash (eg. segfault) occurs on a shard of redpanda other than shard 0.

This adds a test to verify that crash reports captured while handling crash signals (segfaults, sigabrt, sigill) are correctly being reported. It also verifies that the generated crash reports can be read back later and are correctly printed when the crash loop limit is reached. This requires exposing `_tolerate_crashes` from `RedpandaService` to prevent test failures on crash-detecting checks, since for this test, the crashes are expected.

pgellert · 2025-02-04T19:43:35Z

force-push:

Rebase to dev to fix merge conflicts
Move the signal handler installation code from application.cc to crash_tracker/signals.cc
Move back to the ss::util::spinlock-based implementation based on this discussion above

rockwotj

LGTM. Thanks for working through this with me!

pgellert self-assigned this Jan 28, 2025

github-actions bot added area/build area/redpanda labels Jan 28, 2025

pgellert force-pushed the crashlog/crash-signal-recording branch from 5710b4a to 008974e Compare January 28, 2025 15:39

pgellert marked this pull request as ready for review January 28, 2025 15:40

pgellert requested review from a team, IoannisRP, michael-redpanda, dotnwat, travisdowns and rockwotj and removed request for a team January 28, 2025 15:40

michael-redpanda reviewed Jan 29, 2025

View reviewed changes

rockwotj reviewed Jan 29, 2025

View reviewed changes

pgellert force-pushed the crashlog/crash-signal-recording branch from 008974e to f400776 Compare January 30, 2025 18:39

pgellert requested review from michael-redpanda and rockwotj January 30, 2025 18:42

rockwotj reviewed Jan 31, 2025

View reviewed changes

pgellert requested a review from rockwotj February 3, 2025 14:04

rockwotj reviewed Feb 3, 2025

View reviewed changes

pgellert added 5 commits February 4, 2025 19:07

dt/crash_tracker: factor out expect_crash_count

4f3ec48

... for later reuse.

dt/redpanda: allow sending signals to threads

262a146

Allow sending signals to specific shards/threads of redpanda. This is to help with testing when a crash (eg. segfault) occurs on a shard of redpanda other than shard 0.

pgellert force-pushed the crashlog/crash-signal-recording branch from f400776 to c2e59b7 Compare February 4, 2025 19:41

rockwotj self-requested a review February 4, 2025 19:46

rockwotj approved these changes Feb 4, 2025

View reviewed changes

michael-redpanda approved these changes Feb 4, 2025

View reviewed changes

pgellert merged commit b586de8 into redpanda-data:dev Feb 5, 2025
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CORE-8685 crash_tracker: crash signal handlers for crash recording #24959

CORE-8685 crash_tracker: crash signal handlers for crash recording #24959

pgellert commented Jan 28, 2025 •

edited

Loading

pgellert commented Jan 28, 2025

vbotbuildovich commented Jan 28, 2025 •

edited

Loading

michael-redpanda left a comment

michael-redpanda Jan 29, 2025

pgellert Jan 29, 2025

rockwotj Jan 29, 2025

travisdowns Jan 30, 2025

pgellert Jan 30, 2025 •

edited

Loading

rockwotj Jan 29, 2025

pgellert Jan 30, 2025

rockwotj Jan 31, 2025

rockwotj Jan 29, 2025

pgellert Jan 30, 2025

rockwotj Jan 31, 2025

rockwotj Jan 29, 2025

pgellert commented Jan 30, 2025

rockwotj Jan 31, 2025

pgellert Feb 3, 2025

pgellert Feb 3, 2025

rockwotj Feb 3, 2025

pgellert Feb 4, 2025

rockwotj Feb 3, 2025

pgellert commented Feb 4, 2025

rockwotj left a comment

	co_await ss::smp::invoke_on_all([] {
	// wasmtime needs some signals for it's handling, make sure we
	// unblock them.
	auto mask = ss::make_empty_sigset_mask();
	sigaddset(&mask, SIGSEGV);
	sigaddset(&mask, SIGILL);
	sigaddset(&mask, SIGFPE);
	ss::throw_pthread_error(::pthread_sigmask(SIG_UNBLOCK, &mask, nullptr));
	});

CORE-8685 crash_tracker: crash signal handlers for crash recording #24959

CORE-8685 crash_tracker: crash signal handlers for crash recording #24959

Conversation

pgellert commented Jan 28, 2025 • edited Loading

Backports Required

Release Notes

pgellert commented Jan 28, 2025

vbotbuildovich commented Jan 28, 2025 • edited Loading

CI test results

michael-redpanda left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pgellert Jan 30, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pgellert commented Jan 30, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pgellert commented Feb 4, 2025

rockwotj left a comment

Choose a reason for hiding this comment

pgellert commented Jan 28, 2025 •

edited

Loading

vbotbuildovich commented Jan 28, 2025 •

edited

Loading

pgellert Jan 30, 2025 •

edited

Loading