Atomic commit

Heyang Zhou edited this page Sep 21, 2022 · 1 revision

Detecting transaction boundaries

There is no concept of a “transaction” at the VFS level, but transaction boundaries can be detected from calls to the xLock and xUnlock VFS methods.

  • When xLock is called with a lock kind other than None, a transaction is created.
  • When xUnlock is called and the lock is downgraded from >=Reserved to <Reserved, the current transaction is committed and a new transaction is created at the committed version. If the commit fails, the call returns SQLITE_BUSY. If the lock is downgraded to None, the current transaction is cleared.
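
The rules above can be sketched as a small state machine. This is only an illustration: `BoundaryTracker` and `TxnAction` are hypothetical names, the sketch assumes a transaction is created only when none is open yet, and it does not model a failed commit (the SQLITE_BUSY path).

```rust
/// SQLite lock levels, ordered from weakest to strongest.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum LockKind {
    None,
    Shared,
    Reserved,
    Pending,
    Exclusive,
}

/// What the VFS layer would do in response to a lock change (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum TxnAction {
    Nothing,
    Begin,
    CommitAndRestart,
    Clear,
    CommitAndClear,
}

/// Tracks the current lock level and whether a transaction is open.
pub struct BoundaryTracker {
    lock: LockKind,
    txn_open: bool,
}

impl BoundaryTracker {
    pub fn new() -> Self {
        BoundaryTracker { lock: LockKind::None, txn_open: false }
    }

    /// xLock only ever upgrades the lock; any non-None lock needs a transaction.
    pub fn x_lock(&mut self, requested: LockKind) -> TxnAction {
        if requested > self.lock {
            self.lock = requested;
        }
        if requested != LockKind::None && !self.txn_open {
            self.txn_open = true;
            TxnAction::Begin
        } else {
            TxnAction::Nothing
        }
    }

    /// xUnlock downgrades the lock; crossing below Reserved means "commit".
    pub fn x_unlock(&mut self, target: LockKind) -> TxnAction {
        let commits = self.txn_open
            && self.lock >= LockKind::Reserved
            && target < LockKind::Reserved;
        let clears = self.txn_open && target == LockKind::None;
        if target < self.lock {
            self.lock = target;
        }
        if clears {
            self.txn_open = false;
        }
        match (commits, clears) {
            (true, true) => TxnAction::CommitAndClear,
            (true, false) => TxnAction::CommitAndRestart,
            (false, true) => TxnAction::Clear,
            (false, false) => TxnAction::Nothing,
        }
    }
}
```

A write goes None → Shared → Reserved (or higher) and back down. The commit is triggered on the first downgrade that crosses below Reserved.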

Commit

The page-level commit logic lives in mvstore/src/commit.rs. The commit request types are defined as follows:

#[derive(Deserialize)]
pub struct CommitGlobalInit<'a> {
    #[serde(with = "serde_bytes")]
    pub idempotency_key: &'a [u8],

    #[serde(default)]
    pub allow_skip_idempotency_check: bool,

    pub num_namespaces: usize,
}

#[derive(Deserialize)]
pub struct CommitNamespaceInit<'a> {
    pub version: &'a str,
    pub ns_key: &'a str,
    pub ns_key_hashproof: Option<&'a str>,
    pub metadata: Option<&'a str>,
    pub num_pages: u32,
    pub read_set: Option<HashSet<u32>>,
}

#[derive(Deserialize)]
pub struct CommitRequest<'a> {
    pub page_index: u32,
    #[serde(with = "serde_bytes")]
    pub hash: &'a [u8],
}

A client initiates a commit by sending a CommitGlobalInit message. Then, for each namespace (database) included in the commit, it sends a CommitNamespaceInit followed by num_pages CommitRequest messages.
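
To make the framing concrete, here is a minimal sketch that checks whether a message sequence follows this order. The `Message` enum and `validate_stream` are hypothetical simplifications that keep only the counting fields. The wire encoding is not modeled.

```rust
/// Simplified stand-ins for the three request types (hypothetical).
pub enum Message {
    GlobalInit { num_namespaces: usize },
    NamespaceInit { num_pages: u32 },
    Page { page_index: u32 },
}

/// Checks the framing: one GlobalInit, then for each namespace one
/// NamespaceInit followed by exactly `num_pages` page messages.
pub fn validate_stream(msgs: &[Message]) -> Result<(), String> {
    let mut iter = msgs.iter();
    let num_namespaces = match iter.next() {
        Some(Message::GlobalInit { num_namespaces }) => *num_namespaces,
        _ => return Err("stream must start with CommitGlobalInit".to_string()),
    };
    for ns in 0..num_namespaces {
        let num_pages = match iter.next() {
            Some(Message::NamespaceInit { num_pages }) => *num_pages,
            _ => return Err(format!("expected CommitNamespaceInit for namespace {}", ns)),
        };
        for _ in 0..num_pages {
            match iter.next() {
                Some(Message::Page { .. }) => {}
                _ => return Err(format!("namespace {} has fewer pages than declared", ns)),
            }
        }
    }
    if iter.next().is_some() {
        return Err("unexpected trailing messages".to_string());
    }
    Ok(())
}
```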

After the commit completes successfully, mvstore returns a CommitResponse with the following structure:

#[derive(Serialize)]
pub struct CommitResponse {
    pub changelog: HashMap<String, Vec<u32>>,
}

Commit modes

There are three commit modes:

  1. Page-level conflict check (PLCC) single-phase commit. This is the most efficient mode, and is used when the size of the transaction’s read set does not exceed PLCC_READ_SET_SIZE_THRESHOLD (2000 by default).
  2. Database-level conflict check (DLCC) single-phase commit. This mode does not allow concurrent write transactions, so it is less efficient under high write concurrency.
  3. Two-phase/multi-phase commit (MPC). Despite the name, this is not the standard two-phase commit (2PC) protocol, and the two should not be confused. This mode is used for large transactions, where the number of written pages is at least COMMIT_MULTI_PHASE_THRESHOLD (1000 by default).
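
One way to read these rules as code is sketched below. The precedence is an assumption, since the text does not say how the two thresholds interact. The sketch lets the written-page count decide MPC first, and falls back to DLCC when no read set is provided or when it is too large. `select_mode` is a hypothetical helper, not a function from mvstore.

```rust
pub const PLCC_READ_SET_SIZE_THRESHOLD: usize = 2000;
pub const COMMIT_MULTI_PHASE_THRESHOLD: usize = 1000;

#[derive(Debug, PartialEq, Eq)]
pub enum CommitMode {
    Plcc,
    Dlcc,
    Mpc,
}

/// Picks a commit mode from the read-set size (if the client sent one)
/// and the number of written pages. The ordering of the checks is assumed.
pub fn select_mode(read_set_len: Option<usize>, num_written_pages: usize) -> CommitMode {
    if num_written_pages >= COMMIT_MULTI_PHASE_THRESHOLD {
        return CommitMode::Mpc;
    }
    match read_set_len {
        Some(n) if n <= PLCC_READ_SET_SIZE_THRESHOLD => CommitMode::Plcc,
        _ => CommitMode::Dlcc,
    }
}
```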

Step 1: Check page existence

This step validates that every page hash referenced in the commit request refers to an existing page in the content store, so that the page index never contains a bad reference.

If the commit mode is MPC, step 1 and the remaining steps run in two separate transactions. After the check in step 1 succeeds:

  • A random database-scoped commit token is generated and written to FoundationDB (FDB).
  • The transaction is committed.
  • A new transaction is created.
  • The commit token is re-validated in the new transaction.
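
The sketch below illustrates step 1 and the MPC token handoff against an in-memory stand-in for the store. It makes several assumptions: `Store`, `phase_one` and `revalidate_token` are hypothetical names, the token is passed in by the caller rather than generated randomly, and re-validation is taken to mean "the stored token is still the one this commit wrote".

```rust
use std::collections::{HashMap, HashSet};

/// In-memory stand-in for the content store and the per-database token slot.
#[derive(Default)]
pub struct Store {
    pub content: HashSet<Vec<u8>>,
    pub commit_tokens: HashMap<String, u64>,
}

/// First transaction: check that every referenced hash exists, then
/// record this commit's token for the database.
pub fn phase_one(
    store: &mut Store,
    db: &str,
    hashes: &[Vec<u8>],
    token: u64,
) -> Result<(), String> {
    for hash in hashes {
        if !store.content.contains(hash) {
            return Err(format!("unknown page hash {:?}", hash));
        }
    }
    store.commit_tokens.insert(db.to_string(), token);
    Ok(())
}

/// Second transaction: continue only if our token is still the stored one.
pub fn revalidate_token(store: &Store, db: &str, token: u64) -> bool {
    store.commit_tokens.get(db) == Some(&token)
}
```

Under this reading, a second large commit that writes its own token between the two transactions causes the first commit's re-validation to fail.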

Step 2: Conflict check

  1. If commit mode is not PLCC:

    1. Read the last-write-version (LWV) of the database.
    2. If LWV is greater than the client-provided read version, abort with a conflict error.
  2. Write the new LWV.

  3. If commit mode is PLCC:

    1. Read the current versions of all pages included in the read set.
    2. If any of the current versions is greater than the client-provided read version, abort with a conflict error.
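
A minimal in-memory sketch of the conflict check follows. It assumes that in PLCC mode a conflict means some page in the read set has a current version newer than the client-provided read version. It also uses plain `u64` values as stand-ins for versions. `DbState` and `conflict_check` are hypothetical names.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the per-database version state.
pub struct DbState {
    pub last_write_version: u64,
    pub page_versions: HashMap<u32, u64>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Conflict;

/// `read_set` is `Some` in PLCC mode and `None` otherwise.
/// The sketch checks before writing the new LWV because it has no
/// rollback; in a real transaction an abort discards the write anyway.
pub fn conflict_check(
    db: &mut DbState,
    read_version: u64,
    read_set: Option<&[u32]>,
    new_version: u64,
) -> Result<(), Conflict> {
    match read_set {
        // Database-level check: any write after our read version conflicts.
        None => {
            if db.last_write_version > read_version {
                return Err(Conflict);
            }
        }
        // Page-level check: only writes to pages we actually read conflict.
        Some(pages) => {
            for page in pages {
                if let Some(&version) = db.page_versions.get(page) {
                    if version > read_version {
                        return Err(Conflict);
                    }
                }
            }
        }
    }
    db.last_write_version = new_version;
    Ok(())
}
```

The page-level variant lets a transaction commit even though the database has been written since its read version, as long as none of the pages it read were touched.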

Step 3: Write page index

This step inserts the page hashes into the page index, and refreshes the content index for each referenced page.

Step 4: Write changelog

This step appends the list of changed pages to the changelog store. The list holds at most INTERVAL_ENTRY_MAX_SIZE pages (500 by default). If the limit is exceeded, the list becomes saturated and is treated as an infinite list, that is, as if every page had changed.

The transaction is committed after this step.
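
The saturation rule for a changelog entry can be sketched as follows. `ChangelogEntry` and `make_entry` are hypothetical, and how a saturated entry is actually encoded in the changelog store is not covered here.

```rust
pub const INTERVAL_ENTRY_MAX_SIZE: usize = 500;

/// A changelog entry is either an explicit page list or "saturated",
/// which readers must treat as if every page had changed.
#[derive(Debug, PartialEq, Eq)]
pub enum ChangelogEntry {
    Pages(Vec<u32>),
    Saturated,
}

impl ChangelogEntry {
    /// Whether a reader must assume `page` was changed by this entry.
    pub fn may_contain(&self, page: u32) -> bool {
        match self {
            ChangelogEntry::Pages(pages) => pages.contains(&page),
            ChangelogEntry::Saturated => true,
        }
    }
}

/// Builds the entry for one commit, saturating when the list is too long.
pub fn make_entry(changed: &[u32]) -> ChangelogEntry {
    if changed.len() > INTERVAL_ENTRY_MAX_SIZE {
        ChangelogEntry::Saturated
    } else {
        ChangelogEntry::Pages(changed.to_vec())
    }
}
```

Saturation bounds the size of each entry. The cost is that readers are told more pages changed than really did.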

Step 5: Read changelog

This step reads all changelog entries in the half-open interval [client-read-version, commit-versionstamp). Pages seen in these entries are returned to the client, and the client should flush all such pages out of its cache.
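
The interval read and the resulting client-side cache flush might look like the sketch below. It is illustrative only: versions are modeled as `u64`, a saturated entry is modeled as `None`, and `pages_to_flush` and `apply_invalidation` are hypothetical helpers.

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum Invalidation {
    Pages(HashSet<u32>),
    Everything,
}

/// Collects the pages changed in [read_version, commit_version).
/// A `None` entry stands for a saturated list and forces a full flush.
pub fn pages_to_flush(
    changelog: &BTreeMap<u64, Option<Vec<u32>>>,
    read_version: u64,
    commit_version: u64,
) -> Invalidation {
    let mut pages = HashSet::new();
    if read_version >= commit_version {
        // Empty interval; this guard also avoids a panic in `range`.
        return Invalidation::Pages(pages);
    }
    for (_, entry) in changelog.range(read_version..commit_version) {
        match entry {
            Some(list) => pages.extend(list.iter().copied()),
            None => return Invalidation::Everything,
        }
    }
    Invalidation::Pages(pages)
}

/// Client side: drop every cached page named by the invalidation.
pub fn apply_invalidation(cache: &mut HashMap<u32, Vec<u8>>, inv: &Invalidation) {
    match inv {
        Invalidation::Everything => cache.clear(),
        Invalidation::Pages(pages) => cache.retain(|page, _| !pages.contains(page)),
    }
}
```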