Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-8686 crash_tracker: remove old crash files #24972

Merged
merged 3 commits into from
Feb 3, 2025

Conversation

pgellert
Copy link
Contributor

The broker will retain the 50 most recent crash files and delete all older crashes on startup. This is to ensure that crash report files do not fill up the disk.

Note that we remove files before creating them to ensure cleanup succeeds before we create new files. This means that technically we will have up to 51 crash reports on disk (50 kept + 1 created). This is acceptable since 50 is a loose limit.

A fixture test is added to verify this behaviour.

Fixes https://redpandadata.atlassian.net/browse/CORE-8686

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@pgellert pgellert requested a review from a team January 29, 2025 12:33
@pgellert pgellert self-assigned this Jan 29, 2025
@pgellert pgellert requested review from aanthony-rp and michael-redpanda and removed request for a team January 29, 2025 12:33
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jan 29, 2025

CI test results

test results on build#61347
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61347#0194b252-7266-4b5d-a16f-ee0ea2f98f96 FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/61347#0194b257-4304-474e-920a-b44926000731 FLAKY 1/2
rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_topic_policy_change.cloud_storage_enabled=True ducktape https://buildkite.com/redpanda/redpanda/builds/61347#0194b257-4304-4337-92b0-a4a24e744877 FLAKY 1/2
rptest.tests.rpk_registry_test.RpkRegistryTest.test_produce_consume_proto ducktape https://buildkite.com/redpanda/redpanda/builds/61347#0194b252-7266-4b5d-a16f-ee0ea2f98f96 FLAKY 1/2
storage_single_thread_rpunit.storage_single_thread_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61347#0194b20e-b20f-436d-892c-28a796077553 FLAKY 1/2
test results on build#61447
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61447#0194bda2-8b5f-47e8-9fb2-d2b9646da93f FLAKY 1/3
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61447#0194bda7-f2ce-4452-a59a-8513e9640208 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61447#0194bda2-8b60-48c0-acbe-de77928648a8 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61447#0194bda7-f2ca-4c0d-bb3a-51c814b543c9 FLAKY 1/2
rptest.tests.datalake.datalake_dlq_test.DatalakeDLQTest.test_dlq_table_for_mixed_records.cloud_storage_type=CloudStorageType.S3.query_engine=QueryEngineType.SPARK.filesystem_catalog_mode=False ducktape https://buildkite.com/redpanda/redpanda/builds/61447#0194bda2-8b5f-4f33-be9b-1d679930d5a8 FLAKY 1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/61447#0194bda7-f2ca-4c0d-bb3a-51c814b543c9 FLAKY 1/2
storage_single_thread_rpunit.storage_single_thread_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61447#0194bd5e-411a-41c0-b2b2-15c2d2025f44 FLAKY 1/2
test results on build#61516
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61516#0194ccd6-5808-4417-935c-7796592dc406 FLAKY 1/6
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61516#0194ccd6-5807-44a7-bac8-edbcd90a8919 FLAKY 1/2
rptest.tests.datalake.simple_connect_test.RedpandaConnectIcebergTest.test_translating_avro_serialized_records.cloud_storage_type=CloudStorageType.S3.scenario=simple ducktape https://buildkite.com/redpanda/redpanda/builds/61516#0194ccba-c23a-4ed5-bb55-647e3b47d3fd FLAKY 1/2
rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/61516#0194ccd6-5807-459e-b1fe-5e7798d95c0f FLAKY 1/3
rptest.tests.full_disk_test.FullDiskReclaimTest.test_full_disk_triggers_gc ducktape https://buildkite.com/redpanda/redpanda/builds/61516#0194ccba-c23a-4ed5-bb55-647e3b47d3fd FLAKY 1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/61516#0194ccd6-5807-459e-b1fe-5e7798d95c0f FLAKY 1/3

Copy link
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm just a couple questions/comments

continue;
}

crash_files.emplace_back(entry.path(), entry.last_write_time());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we trust the timestamp on the file? Or should we read the file and check its timestamp? I'm wondering if you could just reuse record::get_recorded_crash().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored this now to reuse get_recorded_crashes(). To be able to handle malformed crash files, I am using sorting the crash files by recorded_crash::timestamp() which uses the timestamp in the file if the file is valid and falls back to the last_write_time if the file could not be parsed.

ss::future<> recorder::start() {
co_await ensure_crashdir_exists();
co_await remove_old_crashfiles();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to make a comment that maybe we should clean up on a certain frequency rather than whenever RP reboots... but these files only get created when RP crashes so yeah this is fine :)

@pgellert pgellert force-pushed the crashlog/remove-old-files branch from 4c98bbb to 97aaece Compare January 31, 2025 17:14
@pgellert
Copy link
Contributor Author

force-push: address code review feedback (reuse record::get_recorded_crash().)

src/v/crash_tracker/recorder.cc Outdated Show resolved Hide resolved
Extract some functions before adding another to `start()`.

Pure refactor, no behaviour change.
The broker will retain the 50 most recent crash files and delete all
older crashes on startup. This is to ensure that crash report files do
not fill up the disk.

Note that we remove files before creating them to ensure cleanup
succeeds before we create new files. This means that technically we will
have up to 51 crash reports on disk (50 kept + 1 created). This is
acceptable since 50 is a loose limit.
Implement a simple test to verify that crash reports are getting removed
on `start()` when their number exceeds `crash_files_to_keep`.
@pgellert
Copy link
Contributor Author

pgellert commented Feb 3, 2025

force-push:

  • rebase to dev to fix merge conflicts
  • Use value_or

Copy link
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized my request for the value_or was wrong. Wasn't fully awake/processing when I suggested that.

@pgellert pgellert merged commit 94b85cf into redpanda-data:dev Feb 3, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants