Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix corrupted wal number when predecessor wal corrupts + minor cleanup #13359

Closed
wants to merge 1 commit into from

Conversation

hx235
Copy link
Contributor

@hx235 hx235 commented Jan 31, 2025

Context/Summary:

02b4197 recently added the ability to detect WAL hole presents in the predecessor WAL. It forgot to update the corrupted wal number to point to the predecessor WAL in that corruption case. This PR fixed it.

As a bonus, this PR also (1) fixed the FragmentBufferedReader() constructor API to expose less parameters as they are never explicitly passed in in the codebase (2) a INFO log wording (3) a parameter naming typo (4) the reporter naming

Test:

  1. Manual printing to ensure the corrupted wal number is set to the right number
  2. Existing UTs

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 requested a review from pdillinger February 4, 2025 04:32
Copy link
Contributor

@jowlyzhang jowlyzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in review. I haven't looked into the details of PredecessorWALInfo creation and handling. Let me know if my comments are infeasible.

db/log_reader.cc Outdated Show resolved Hide resolved
db/db_impl/db_impl_open.cc Outdated Show resolved Hide resolved
db/log_reader.cc Outdated
@@ -403,6 +409,7 @@ void Reader::MaybeVerifyPredecessorWALInfo(
std::to_string(observed_predecessor_wal_info_.GetSizeBytes()) +
" bytes.";
ReportCorruption(fragment.size(), reason.c_str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking if we can use Reporter as the source of truth for the corrupted wal number, we can also update this ReportCorruption API to take an extra wal_number argument. It will either get the current wal number or the predecessor wal number passed depending on the type error.

Copy link
Contributor Author

@hx235 hx235 Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to chat more offline!

I'm thinking if we can use Reporter as the source of truth for the corrupted wal numbe

While it's a good direction to minimize the place we have "corrupted wal number", I found it difficult to use reporter as the source of truth for the corrupted_wal_number for the following reasons:

(1) Reporter, as the new name "DBOpenLogRecordReadReporter" suggests, it actually only tracks corruption per reading log record (e.g, physically corrupted log record, found corruption after reading the predecessor info record). But the corrupted_wal_number should be set even when there is log file level corruption e.g, fail to UpdatePredecessorWALInfo().

(2) Due to the messiness we still have with having multiple unconsolidated Status in ProcessLogFile(), I found it not easy to verify the log number at the time of ReportCorruption() will the be the same number as how we currently set corrupted_wal_number. This is to ensure no behavior change for all the corruption cases reported. That means it's better to accompany the change you suggested with status consolidation tech debt work.

Therefore, I decide to keep the "ugly" if-else statement in setting "corrupted_wal_number = ...." in HandleNonOkStatusOrOldLogRecord() for now.

we can also update this ReportCorruption API to take an extra wal_number argument

Done with some notes.

I do agree with using Reporter to keep track of the corrupted log number though I eventually decided to not using the Reporter::Corruption() to take another wal_number, but another API to deal with the special case where it's the prior WAL being corrupted.

This is because most of the corruption doesn't need to track log number inside of the reporter because the reporter will be automatically associated with the corrupted log file if any corruption happens. So I feel the wal_number parameter is mostly useless and ends up with something like below

 .../rocksdb/db/repair.cc
 Status ConvertLogToTable(const std::string& wal_dir, uint64_t log) {
    struct LogReporter : public log::Reader::Reporter {
      Env* env;
      std::shared_ptr<Logger> info_log;
      uint64_t lognum;
      void Corruption(size_t bytes, const Status& s, uint64_t log_number /* another log number!!!!! */) override {
        // We print error messages for corruption, but continue repairing.
        ROCKS_LOG_ERROR(info_log, "Log #%" PRIu64 ": dropping %d bytes; %s",
                        lognum, static_cast<int>(bytes), s.ToString().c_str());
      }
    .....
  }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave this one open for people who are interested in learning about the context for the part we decided to change vs not change.

@hx235 hx235 force-pushed the fix_actual_wal_num_corrupted branch from 26b7aad to 73aa9ee Compare February 11, 2025 02:07
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 changed the title Fix corrupted wal number when predecessor wal corrupts Fix corrupted wal number when predecessor wal corrupts + some cleanup Feb 11, 2025
@hx235 hx235 changed the title Fix corrupted wal number when predecessor wal corrupts + some cleanup Fix corrupted wal number when predecessor wal corrupts + minor cleanup Feb 11, 2025
@hx235 hx235 changed the title Fix corrupted wal number when predecessor wal corrupts + minor cleanup Fix corrupted wal number when predecessor wal corrupts + minor cleanup Feb 11, 2025
@hx235 hx235 force-pushed the fix_actual_wal_num_corrupted branch from 73aa9ee to da73398 Compare February 11, 2025 02:12
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -162,15 +162,25 @@ class Directories {
std::unique_ptr<FSDirectory> wal_dir_;
};

struct DBOpenLogReporter : public log::Reader::Reporter {
struct DBOpenLogRecordReadReporter : public log::Reader::Reporter {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to better indicate what level of corruption it reports. The level may change in the future once we figure out how to consolidate all the error status in ProcessLogFile() (see existing TODOs)

Copy link
Contributor

@jowlyzhang jowlyzhang Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: alternatively, we can have a Reporter initialized with the wal_number_ it is opened for. If a Reporter::Corruption(, corrupted_wal_number=kMaxSequence) call is not providing a specific wal number, it will track its own wal_number_ as the corrupted_wal_number_. I think that can help eliminate the need of a dedicated PredecessorLogCorruption API. But it's up to you, not a big deal.

This may help us with removing the if else in HandleNonOkStatusOrOldLogRecord, but I agree with you that before we can consolidate the status handling, it's safer to keep it like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see let me think more and fix it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed excluding the constructor part, which I found is better done with the status consolidation.

db/log_reader.h Show resolved Hide resolved
@hx235 hx235 requested a review from jowlyzhang February 11, 2025 02:49
@hx235 hx235 force-pushed the fix_actual_wal_num_corrupted branch from da73398 to f244289 Compare February 11, 2025 23:46
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235
Copy link
Contributor Author

hx235 commented Feb 12, 2025

Fix internal formatting issue

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@jowlyzhang jowlyzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for making this changes!

@hx235 hx235 force-pushed the fix_actual_wal_num_corrupted branch from f244289 to e5dbf29 Compare February 13, 2025 00:41
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@hx235 merged this pull request in affcad0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants