Atomic commit
There is no concept of a “transaction” at the VFS level, but transaction boundaries can be detected via calls to the `xLock` and `xUnlock` VFS methods.
- When `xLock` is called with a lock kind that is not `None`, it creates a transaction.
- When `xUnlock` is called and the lock is downgraded from `>= Reserved` to `< Reserved`, it commits the current transaction and re-creates a transaction with the committed version. If the commit fails, it returns `SQLITE_BUSY`. If the lock is downgraded to `None`, the current transaction is cleared.
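The lock transitions above can be sketched as a small state function. This is a minimal illustration, not the mvsqlite implementation: `LockKind`, `TxnAction`, `on_lock`, and `on_unlock` are hypothetical names, and two details are assumptions (a transaction is only created when none is active, and a downgrade from `>= Reserved` straight to `None` commits first and then clears).

```rust
/// SQLite lock levels, ordered from weakest to strongest.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum LockKind {
    None,
    Shared,
    Reserved,
    Pending,
    Exclusive,
}

/// What the VFS layer does in response to a lock transition (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum TxnAction {
    Nothing,
    Begin,
    /// Commit, then re-create a transaction at the committed version.
    CommitAndReopen,
    /// Commit, then clear the transaction (assumed behavior for `>= Reserved` to `None`).
    CommitAndClear,
    Clear,
}

/// xLock: any lock kind other than `None` starts a transaction.
/// Assumption: nothing happens if a transaction is already active.
pub fn on_lock(has_txn: bool, requested: LockKind) -> TxnAction {
    if requested != LockKind::None && !has_txn {
        TxnAction::Begin
    } else {
        TxnAction::Nothing
    }
}

/// xUnlock: crossing the `Reserved` boundary downwards triggers a commit.
pub fn on_unlock(current: LockKind, target: LockKind) -> TxnAction {
    let was_writer = current >= LockKind::Reserved;
    let still_writer = target >= LockKind::Reserved;
    match (was_writer && !still_writer, target == LockKind::None) {
        (true, false) => TxnAction::CommitAndReopen,
        (true, true) => TxnAction::CommitAndClear,
        (false, true) => TxnAction::Clear,
        (false, false) => TxnAction::Nothing,
    }
}
```

In a real VFS, a failed commit in either commit branch would surface as `SQLITE_BUSY` from `xUnlock`.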
The page-level commit logic lives in `mvstore/src/commit.rs`. The commit request types have the following structures:
#[derive(Deserialize)]
pub struct CommitGlobalInit<'a> {
    #[serde(with = "serde_bytes")]
    pub idempotency_key: &'a [u8],
    #[serde(default)]
    pub allow_skip_idempotency_check: bool,
    pub num_namespaces: usize,
}

#[derive(Deserialize)]
pub struct CommitNamespaceInit<'a> {
    pub version: &'a str,
    pub ns_key: &'a str,
    pub ns_key_hashproof: Option<&'a str>,
    pub metadata: Option<&'a str>,
    pub num_pages: u32,
    pub read_set: Option<HashSet<u32>>,
}

#[derive(Deserialize)]
pub struct CommitRequest<'a> {
    pub page_index: u32,
    #[serde(with = "serde_bytes")]
    pub hash: &'a [u8],
}
A client initiates a commit by sending a `CommitGlobalInit` first. Then, for each namespace (database) included in the commit, it sends a `CommitNamespaceInit` followed by `num_pages` instances of `CommitRequest`.
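The framing described above can be illustrated with a simplified model that tracks only message kinds and counts. This is a sketch under assumptions: `CommitMessage`, `commit_stream`, and `is_well_formed` are hypothetical names, the payload fields are omitted, and nothing here reflects the actual wire encoding.

```rust
/// Kinds of messages in a commit stream (payload fields omitted for brevity).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CommitMessage {
    GlobalInit { num_namespaces: usize },
    NamespaceInit { num_pages: u32 },
    Request,
}

/// Builds the send order, given the number of pages written in each namespace.
pub fn commit_stream(pages_per_namespace: &[u32]) -> Vec<CommitMessage> {
    let mut out = vec![CommitMessage::GlobalInit {
        num_namespaces: pages_per_namespace.len(),
    }];
    for &num_pages in pages_per_namespace {
        out.push(CommitMessage::NamespaceInit { num_pages });
        out.extend(std::iter::repeat(CommitMessage::Request).take(num_pages as usize));
    }
    out
}

/// Checks that a stream follows the same framing: one global header, then
/// `num_namespaces` groups of one namespace header plus `num_pages` requests.
pub fn is_well_formed(stream: &[CommitMessage]) -> bool {
    let mut it = stream.iter();
    let num_namespaces = match it.next() {
        Some(CommitMessage::GlobalInit { num_namespaces }) => *num_namespaces,
        _ => return false,
    };
    for _ in 0..num_namespaces {
        let num_pages = match it.next() {
            Some(CommitMessage::NamespaceInit { num_pages }) => *num_pages,
            _ => return false,
        };
        for _ in 0..num_pages {
            if it.next() != Some(&CommitMessage::Request) {
                return false;
            }
        }
    }
    // No trailing messages are allowed.
    it.next().is_none()
}
```

For example, a commit touching two namespaces with 2 and 1 written pages yields 1 + (1 + 2) + (1 + 1) = 6 messages.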
After the commit completes successfully, mvstore returns a `CommitResponse` with the following structure:
#[derive(Serialize)]
pub struct CommitResponse {
    pub changelog: HashMap<String, Vec<u32>>,
}
There are three commit modes:
- Page-level conflict check (PLCC) single-phase commit. This is the most efficient mode, and is used when the size of the transaction’s read set does not exceed `PLCC_READ_SET_SIZE_THRESHOLD` (2000 by default).
- Database-level conflict check (DLCC) single-phase commit. This mode does not allow concurrent write transactions, and is less efficient under high write concurrency.
- Two-phase/multi-phase commit (MPC). The name collides with the standard 2PC protocol, and the two should not be confused. This mode is used for large transactions (number of written pages >= `COMMIT_MULTI_PHASE_THRESHOLD`, 1000 by default).
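As a rough sketch, mode selection from the two thresholds could look like the following. The names `CommitMode` and `select_mode` are hypothetical, and two points are assumptions not stated above: the MPC check takes precedence over the other two, and a commit without a read set falls back to DLCC.

```rust
pub const PLCC_READ_SET_SIZE_THRESHOLD: usize = 2000;
pub const COMMIT_MULTI_PHASE_THRESHOLD: usize = 1000;

#[derive(Debug, PartialEq, Eq)]
pub enum CommitMode {
    Plcc,
    Dlcc,
    Mpc,
}

/// Picks a commit mode from the write count and the optional read set size.
pub fn select_mode(num_written_pages: usize, read_set_size: Option<usize>) -> CommitMode {
    // Assumption: large transactions always use MPC, regardless of the read set.
    if num_written_pages >= COMMIT_MULTI_PHASE_THRESHOLD {
        return CommitMode::Mpc;
    }
    match read_set_size {
        // "Does not exceed" the threshold, so the bound is inclusive.
        Some(n) if n <= PLCC_READ_SET_SIZE_THRESHOLD => CommitMode::Plcc,
        // Assumption: no read set, or an oversized one, means DLCC.
        _ => CommitMode::Dlcc,
    }
}
```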
Step 1 validates that all page hashes referenced in the commit request refer to existing pages in the content store. There should never be bad references in the page index.
If the commit mode is MPC, step 1 and the remaining steps run in two separate transactions. After the check in step 1 succeeds:
- A random database-scoped commit token is generated and written to FDB.
- The transaction is committed.
- A new transaction is created.
- The commit token is re-validated in the new transaction.
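The token handoff between the two transactions can be modeled with an in-memory map standing in for FDB. This is only a sketch: `TokenStore` and its methods are hypothetical, the token is a plain `u64` supplied by the caller rather than generated randomly, and rejecting the commit on a mismatch is an assumption.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the database-scoped commit token kept in FDB.
#[derive(Default)]
pub struct TokenStore {
    tokens: HashMap<String, u64>,
}

impl TokenStore {
    /// End of the first transaction: record the token for this database.
    pub fn write_token(&mut self, db: &str, token: u64) {
        self.tokens.insert(db.to_string(), token);
    }

    /// Start of the second transaction: proceed only if the stored token is
    /// still ours. A different value means another committer wrote a token
    /// in between (assumed here to invalidate this commit).
    pub fn revalidate(&self, db: &str, token: u64) -> bool {
        self.tokens.get(db) == Some(&token)
    }
}
```

For example, if committer A writes token 7 and committer B then writes token 9 for the same database, A's `revalidate("db", 7)` returns `false` while B's `revalidate("db", 9)` returns `true`.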
- If commit mode is not PLCC:
  - Read the last-write-version (LWV) of the database.
  - If LWV is greater than the client-provided read version, abort with a conflict error.
- Write the new LWV.
- If commit mode is PLCC:
  - Read the current versions of all pages included in the read set.
  - If any of these versions is greater than the client-provided read version, abort with a conflict error.
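The two conflict checks can be sketched as plain functions. The names are hypothetical and versions are modeled as `u64` for simplicity. The PLCC branch compares each read page's current version against the client-provided read version, which is an assumption about the intended comparison, made by analogy with the database-level check.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub struct Conflict;

/// Non-PLCC check: the whole database must be unchanged since the client's read version.
pub fn check_database_level(
    last_write_version: u64,
    client_read_version: u64,
) -> Result<(), Conflict> {
    if last_write_version > client_read_version {
        Err(Conflict)
    } else {
        Ok(())
    }
}

/// PLCC check: only pages in the read set must be unchanged.
/// Assumption: a page with no recorded version was never written and cannot conflict.
pub fn check_page_level(
    page_versions: &HashMap<u32, u64>,
    read_set: &[u32],
    client_read_version: u64,
) -> Result<(), Conflict> {
    for page in read_set {
        if let Some(&version) = page_versions.get(page) {
            if version > client_read_version {
                return Err(Conflict);
            }
        }
    }
    Ok(())
}
```

The difference shows up when another writer has touched a page outside the read set: the database-level check aborts, while the page-level check lets the commit through.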
This step inserts the page hashes into the page index, and refreshes the content index for each referenced page.
This step appends the list of changed pages to the changelog store. This list has a maximum size of `INTERVAL_ENTRY_MAX_SIZE` pages (500 by default). If exceeded, the list becomes saturated and is treated as an infinite list.
The transaction is committed after this step.
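One way to picture saturation is as a two-variant entry type. This is a sketch with hypothetical names (`ChangedPages`, `changelog_entry`); the stored representation in mvstore may differ.

```rust
pub const INTERVAL_ENTRY_MAX_SIZE: usize = 500;

/// A changelog entry: either an explicit page list or a marker meaning
/// "too many pages to list, treat every page as changed".
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ChangedPages {
    List(Vec<u32>),
    Saturated,
}

/// Builds the entry for one commit from its list of changed page indices.
pub fn changelog_entry(changed: &[u32]) -> ChangedPages {
    if changed.len() > INTERVAL_ENTRY_MAX_SIZE {
        ChangedPages::Saturated
    } else {
        ChangedPages::List(changed.to_vec())
    }
}
```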
This step reads all changelog entries in the half-open interval `[client-read-version, commit-versionstamp)`. Pages seen in these entries are returned to the client, and the client should flush all such pages out of its cache.
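The half-open interval read maps naturally onto an ordered map range. The sketch below uses a `BTreeMap` keyed by a `u64` version as a stand-in for the changelog store. All names are hypothetical, and treating a saturated entry as "invalidate everything" is an assumption that follows from reading a saturated list as infinite.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ChangedPages {
    List(Vec<u32>),
    Saturated,
}

/// What the client should drop from its page cache.
#[derive(Debug, PartialEq, Eq)]
pub enum Invalidation {
    Pages(BTreeSet<u32>),
    Everything,
}

/// Collects changed pages from entries in [client_read_version, commit_version).
pub fn pages_to_invalidate(
    log: &BTreeMap<u64, ChangedPages>,
    client_read_version: u64,
    commit_version: u64,
) -> Invalidation {
    let mut pages = BTreeSet::new();
    // `BTreeMap::range` panics if start > end, so handle that case up front.
    if client_read_version >= commit_version {
        return Invalidation::Pages(pages);
    }
    for (_, entry) in log.range(client_read_version..commit_version) {
        match entry {
            ChangedPages::List(list) => pages.extend(list.iter().copied()),
            // A saturated entry could contain any page, so nothing cached is safe.
            ChangedPages::Saturated => return Invalidation::Everything,
        }
    }
    Invalidation::Pages(pages)
}
```

The start bound is inclusive and the end bound is exclusive, so an entry written exactly at the commit version is not returned.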