CORE-8686 crash_tracker: remove old crash files #24972

pgellert · 2025-01-29T12:33:33Z

The broker will retain the 50 most recent crash files and delete all older crashes on startup. This is to ensure that crash report files do not fill up the disk.

Note that we remove files before creating them to ensure cleanup succeeds before we create new files. This means that technically we will have up to 51 crash reports on disk (50 kept + 1 created). This is acceptable since 50 is a loose limit.

A fixture test is added to verify this behaviour.

Fixes https://redpandadata.atlassian.net/browse/CORE-8686

Backports Required

Release Notes

none

vbotbuildovich · 2025-01-29T16:07:33Z

CI test results

test results on build#61347

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61347#0194b252-7266-4b5d-a16f-ee0ea2f98f96	FLAKY	1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/61347#0194b257-4304-474e-920a-b44926000731	FLAKY	1/2
rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_topic_policy_change.cloud_storage_enabled=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61347#0194b257-4304-4337-92b0-a4a24e744877	FLAKY	1/2
rptest.tests.rpk_registry_test.RpkRegistryTest.test_produce_consume_proto	ducktape	https://buildkite.com/redpanda/redpanda/builds/61347#0194b252-7266-4b5d-a16f-ee0ea2f98f96	FLAKY	1/2
storage_single_thread_rpunit.storage_single_thread_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61347#0194b20e-b20f-436d-892c-28a796077553	FLAKY	1/2

test results on build#61447

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61447#0194bda2-8b5f-47e8-9fb2-d2b9646da93f	FLAKY	1/3
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61447#0194bda7-f2ce-4452-a59a-8513e9640208	FLAKY	1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61447#0194bda2-8b60-48c0-acbe-de77928648a8	FLAKY	1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61447#0194bda7-f2ca-4c0d-bb3a-51c814b543c9	FLAKY	1/2
rptest.tests.datalake.datalake_dlq_test.DatalakeDLQTest.test_dlq_table_for_mixed_records.cloud_storage_type=CloudStorageType.S3.query_engine=QueryEngineType.SPARK.filesystem_catalog_mode=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61447#0194bda2-8b5f-4f33-be9b-1d679930d5a8	FLAKY	1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS	ducktape	https://buildkite.com/redpanda/redpanda/builds/61447#0194bda7-f2ca-4c0d-bb3a-51c814b543c9	FLAKY	1/2
storage_single_thread_rpunit.storage_single_thread_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61447#0194bd5e-411a-41c0-b2b2-15c2d2025f44	FLAKY	1/2

test results on build#61516

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61516#0194ccd6-5808-4417-935c-7796592dc406	FLAKY	1/6
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61516#0194ccd6-5807-44a7-bac8-edbcd90a8919	FLAKY	1/2
rptest.tests.datalake.simple_connect_test.RedpandaConnectIcebergTest.test_translating_avro_serialized_records.cloud_storage_type=CloudStorageType.S3.scenario=simple	ducktape	https://buildkite.com/redpanda/redpanda/builds/61516#0194ccba-c23a-4ed5-bb55-647e3b47d3fd	FLAKY	1/2
rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True.cloud_storage_type=CloudStorageType.ABS	ducktape	https://buildkite.com/redpanda/redpanda/builds/61516#0194ccd6-5807-459e-b1fe-5e7798d95c0f	FLAKY	1/3
rptest.tests.full_disk_test.FullDiskReclaimTest.test_full_disk_triggers_gc	ducktape	https://buildkite.com/redpanda/redpanda/builds/61516#0194ccba-c23a-4ed5-bb55-647e3b47d3fd	FLAKY	1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic	ducktape	https://buildkite.com/redpanda/redpanda/builds/61516#0194ccd6-5807-459e-b1fe-5e7798d95c0f	FLAKY	1/3

michael-redpanda

lgtm just a couple questions/comments

michael-redpanda · 2025-01-29T19:22:03Z

src/v/crash_tracker/recorder.cc

+            continue;
+        }
+
+        crash_files.emplace_back(entry.path(), entry.last_write_time());


can we trust the timestamp on the file? Or should we read the file and check its timestamp? I'm wondering if you could just reuse record::get_recorded_crash().

I have refactored this now to reuse get_recorded_crashes(). To be able to handle malformed crash files, I am using sorting the crash files by recorded_crash::timestamp() which uses the timestamp in the file if the file is valid and falls back to the last_write_time if the file could not be parsed.

michael-redpanda · 2025-01-29T19:57:39Z

src/v/crash_tracker/recorder.cc

 ss::future<> recorder::start() {
    co_await ensure_crashdir_exists();
+    co_await remove_old_crashfiles();


I was going to make a comment that maybe we should clean up on a certain frequency rather than whenever RP reboots... but these files only get created when RP crashes so yeah this is fine :)

pgellert · 2025-01-31T17:14:57Z

force-push: address code review feedback (reuse record::get_recorded_crash().)

src/v/crash_tracker/recorder.cc

Extract some functions before adding another to `start()`. Pure refactor, no behaviour change.

The broker will retain the 50 most recent crash files and delete all older crashes on startup. This is to ensure that crash report files do not fill up the disk. Note that we remove files before creating them to ensure cleanup succeeds before we create new files. This means that technically we will have up to 51 crash reports on disk (50 kept + 1 created). This is acceptable since 50 is a loose limit.

Implement a simple test to verify that crash reports are getting removed on `start()` when their number exceeds `crash_files_to_keep`.

pgellert · 2025-02-03T15:38:28Z

force-push:

rebase to dev to fix merge conflicts
Use value_or

michael-redpanda

I realized my request for the value_or was wrong. Wasn't fully awake/processing when I suggested that.

pgellert requested a review from a team January 29, 2025 12:33

pgellert self-assigned this Jan 29, 2025

pgellert requested review from aanthony-rp and michael-redpanda and removed request for a team January 29, 2025 12:33

github-actions bot added area/build area/redpanda labels Jan 29, 2025

michael-redpanda reviewed Jan 29, 2025

View reviewed changes

pgellert force-pushed the crashlog/remove-old-files branch from 4c98bbb to 97aaece Compare January 31, 2025 17:14

pgellert requested a review from michael-redpanda January 31, 2025 17:17

michael-redpanda previously approved these changes Feb 3, 2025

View reviewed changes

src/v/crash_tracker/recorder.cc Outdated Show resolved Hide resolved

pgellert added 3 commits February 3, 2025 14:59

crash_tracker: refactor recorder::start()

fa8a914

Extract some functions before adding another to `start()`. Pure refactor, no behaviour change.

crash_tracker/test: validate file cleanup

e7d5780

Implement a simple test to verify that crash reports are getting removed on `start()` when their number exceeds `crash_files_to_keep`.

pgellert dismissed michael-redpanda’s stale review via e7d5780 February 3, 2025 15:37

pgellert force-pushed the crashlog/remove-old-files branch from 97aaece to e7d5780 Compare February 3, 2025 15:37

pgellert requested a review from michael-redpanda February 3, 2025 15:38

michael-redpanda approved these changes Feb 3, 2025

View reviewed changes

pgellert merged commit 94b85cf into redpanda-data:dev Feb 3, 2025
17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CORE-8686 crash_tracker: remove old crash files #24972

CORE-8686 crash_tracker: remove old crash files #24972

pgellert commented Jan 29, 2025

vbotbuildovich commented Jan 29, 2025 •

edited

Loading

michael-redpanda left a comment

michael-redpanda Jan 29, 2025

pgellert Jan 31, 2025

michael-redpanda Jan 29, 2025

pgellert commented Jan 31, 2025

pgellert commented Feb 3, 2025

michael-redpanda left a comment

CORE-8686 crash_tracker: remove old crash files #24972

CORE-8686 crash_tracker: remove old crash files #24972

Conversation

pgellert commented Jan 29, 2025

Backports Required

Release Notes

vbotbuildovich commented Jan 29, 2025 • edited Loading

CI test results

michael-redpanda left a comment

Choose a reason for hiding this comment

michael-redpanda Jan 29, 2025

Choose a reason for hiding this comment

pgellert Jan 31, 2025

Choose a reason for hiding this comment

michael-redpanda Jan 29, 2025

Choose a reason for hiding this comment

pgellert commented Jan 31, 2025

pgellert commented Feb 3, 2025

michael-redpanda left a comment

Choose a reason for hiding this comment

vbotbuildovich commented Jan 29, 2025 •

edited

Loading