-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: added wal/snapshot doc #25856
Open
praveen-influx
wants to merge
1
commit into
main
Choose a base branch
from
praveen/wal-docs
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,214 @@ | ||
# WAL (write-ahead-log) and Snapshot process | ||
|
||
All writes get buffered into a wal buffer and then every flush interval all the contents are flushed to disk at which point the writes become durable and | ||
the clients are sent the confirmation that their writes are successful. At each flush interval there is also a check to see if queryable buffer need to | ||
evict data to disk, this process is called snapshotting. The rest of this doc discusses all the moving parts in wal flush and snapshotting. | ||
|
||
|
||
## Overview | ||
|
||
``` | ||
|
||
|
||
|
||
┌────────────┐ ┌────────────┐ | ||
│flush buffer│──────────►│ wal buffer │ | ||
└────────────┘ └────────────┘ | ||
▲ | ||
│ | ||
(takes everything from wal buffer and writes to wal file) | ||
│ | ||
│ (actual background_wal_flush uses wal trait - which wal obj store impls) | ||
│ ┌──────────────(background create)───────────────────────────┐ | ||
│ │ ▼ | ||
┌─────┴──────┐ │ ┌────────┐ | ||
┌────►│wal objstore├─┴──(remove wal,need each wal file num / drop semaphore)─►│wal file│ | ||
│ └─────┬──────┘ └────────┘ | ||
│ │ | ||
│ │ | ||
│ │ | ||
│ │ | ||
│ │ | ||
│ │ | ||
│ (notifies wal | ||
│ and snapshot optionally & snapshot semaphore taken) ┌─────────────┐ | ||
│ │ ┌───────────────►│update buffer│ (whatever wal flushed into chunks) | ||
│ ▼ │ └─────────────┘ | ||
│ ┌────────────┐ │ ┌─────────────┐ | ||
│ │query buffer├───────────────────────┼───────────────►│parquet file │ | ||
│ └─────┬──────┘ │ └─────────────┘ | ||
│ │ │ | ||
│ (run snapshot) │ ┌─────────────┐ | ||
└───────────┘ ├───────────────►│snapshot file│ (holds last wal seq number) | ||
│ └─────────────┘ | ||
│ ┌────────────┐ | ||
└───────────────►│clear buffer│ (whatever snapshotted is removed) | ||
└────────────┘ | ||
``` | ||
|
||
|
||
#### Steps | ||
|
||
1. When _writes_ comes in, they go into a write batch in wal buffer. These batches are held per database and the batches keep track of min | ||
and max times within each batch. These batches further hold per table chunks. This chunk is created by taking incoming rows and pinning | ||
them to a period. It is done by `t - (t % gen_1_duration)`. If `gen_1_duration` is 10 mins, then all of the rows will be divided into | ||
10 min chunks. As an example if there are rows for 10.29 and 10.35 then they both go into 2 separate chunks (10.20 and 10.30). And this | ||
10.20 and 10.30 are used later as the key in queryable buffer. | ||
2. Every flush interval, the wal buffer is flushed and all batches are written to to wal file (converts to wal content and gets min/max | ||
times from all batches) and every time wal file contents are written, we add wal period to snapshot tracker that tracks min/max times across | ||
batches that went into single wal file | ||
3. Then snapshot tracker is checked to see if there are enough wal periods to trigger snapshot. There are different conditions checked, the | ||
common scenarios are, | ||
- wal periods > (1.5 * snapshot size). e.g default settings will lead to snapshot size = 600, if we have 900 wal periods trigger snapshot | ||
- force snapshot, irrespective of sizes go ahead and run snapshot | ||
|
||
If going ahead with force snapshotting, pick all the wal periods in the tracker and find the max time from most recent wal period. This will be | ||
used as the `end_time_marker` to evict data from query buffer. Because forcing a snapshot can be triggered when wal buffer is empty (even though | ||
queryable buffer is full), we need to add `Noop` (a no-op WalOp) to the wal file to hold the snapshot details in wal file. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a header on this diagram, e.g., ##### Forced snapshot |
||
``` | ||
|
||
Snapshot (all, emptying wal periods in tracker) | ||
▲ ▲ | ||
│◄───wal in snapshot───────►│ | ||
┌──────┬──────┬──────┬──────┤ | ||
│ 0 │ 1 │ 2 │ 3 │ | ||
│ │ │ │ │ | ||
────┴──────┴──────┴──────┴──────┴─────► (time) | ||
|
||
``` | ||
|
||
If it is a normal snapshot, then leave one wal period (`3` in eg below) and pick the last one (`2` in eg below) max time used as `end_time_marker` | ||
and every wal period with max time is less than this `end_time_marker` will be removed from snapshot tracker. | ||
|
||
``` | ||
Snapshot (wal max < last wal period max, | ||
3 is max here that is left behind) | ||
▲ ▲ ▲ | ||
│◄─ wal in snapshot─►│ │ | ||
┌──────┬──────┬──────┬──────┤ | ||
│ 0 │ 1 │ 2 │ 3 │ | ||
│ │ │ │ │ | ||
────┴──────┴──────┴──────┴──────┴─────────────► (time) | ||
|
||
``` | ||
At this point we may or may not have to snapshot, but the wal buffer will still need to be emptied into the wal file. The clients that have been | ||
writing data are held up at most of flush interval time (defaults to 1s). The clients get notified together that the writes have been successful | ||
when writes are persisted to wal file. | ||
|
||
4. Then query buffer is notified to update the buffer, and query buffer does the following things in sequence, | ||
- it updates the buffer with incoming rows | ||
- writes to parquet file, | ||
- writes snapshot summary file | ||
- clears the buffer (using the `end_time_marker` that has been passed along) | ||
|
||
It is useful to visualize how query buffer holds data internally to understand how the buffer is cleared. | ||
- query buffer holds data as mapping between chunk time and MutableTableChunks (these are col id -> arrow array builder mappings with | ||
min/max times for that chunk), looks roughly like below | ||
``` | ||
|
||
│ | ||
├───10.20───────────►┌────────────────────────────┐ | ||
│ │ chunk 10.20 - 10.29 │ | ||
│ │ │ | ||
├───10.30───────────►├────────────────────────────┤ | ||
│ │ chunk 10.30 - 10.39 │ | ||
│ │ │ | ||
├───10.40───────────►└────────────────────────────┘ | ||
│ | ||
▼ | ||
time | ||
|
||
``` | ||
The crucial thing to note here is, `10.20` and `10.30` are the keys and they're taken from the write batches that have been derived | ||
from newly added rows. These are added to query buffer which manages an arrow backed buffer that holds all the newly added rows. | ||
|
||
- Above mutable chunks are added each time a wal flush happens, when snapshotting it uses the `end_time_marker` to evict data. Say, | ||
10.30 is the `end_time_marker` to evict data from queryable buffer, then query buffer evicts all data before 10.30 and holds it as | ||
snapshot chunk which has converted the arrow arrays to a record batch that's ready to be passed to sort/dedupe before writing to | ||
parquet file. Then a summary snapshot file is written which tracks what wal file number was last snapshotted along with pointers to | ||
parquet files created. This data is updated in persisted files so that any new query which spans that time can find data from the | ||
parquet files as buffer will not have them anymore. | ||
|
||
5. Once snapshotting process is complete, now deletes all the wal files up to a configured number of wal files to retain. When replaying | ||
the last snapshotted file is looked up from snapshot summary file that has been created so none of the wal files that've already been | ||
snapshotted is loaded into the buffer. | ||
6. If server crashed and restarted wal replay happens. But before that all snapshots are loaded, but even though snapshot file is written | ||
out it doesn't guarantee that wal files relevant to snapshot has been removed. | ||
|
||
## Example | ||
|
||
Below section walks through some of the nuances touched in the overall process described above | ||
|
||
- These are the wal files (1-4) with data for different time ranges [20 - 50] etc. Notice data can be overlapping between time periods | ||
``` | ||
1[20 - 50] | ||
2[31 - 70] | ||
3[51 - 90] | ||
4[45 - 110] - snapshot! <= 2 (wal file num), <= 70 (end_time_marker). snapshot details has a file number and end_time_marker. | ||
``` | ||
|
||
- This is a mapping between queryable buffer's chunk time and what wal files the data comes from. This is not how queryable buffer | ||
maps the data internally, it doesn't know about wal files - however this mapping helps visualise the dependency with wal file better. | ||
The data for time slice 20 originates purely from wal file 1, for 40 it holds rows from wal files 1, 2 and 4 etc. | ||
``` | ||
20 - 1 | ||
30 - 1 | ||
40 - 1, 2, 4 | ||
50 - 2, 4 | ||
60 - 2, 3, 4 | ||
70 - 2, 3, 4 | ||
80 - 3, 4 | ||
90 - 3, 4 | ||
``` | ||
|
||
- At this point, lets say we need to snapshot as per the details above (`snapshot! <= 2 (wal file num), <= 70 (end_time_marker)`) file is written out | ||
& parquet files removing all items in queryable buffer upto time 70. This is the left over in queryable buffer - | ||
mainly the data that was in wal files 3, 4 are now in parquet files already. | ||
``` | ||
80 - 3, 4 | ||
90 - 3, 4 | ||
``` | ||
|
||
- If there's been a restart at this point data is loaded from wal files (due to restart), times 60 and 70 will be loaded into memory. so it may look like this, | ||
``` | ||
60 - 3, 4 | ||
70 - 3, 4 | ||
80 - 3, 4 | ||
90 - 3, 4 | ||
``` | ||
|
||
- And if further snapshot is kicked off by replaying earlier snapshot from wal file 4, because the `end_time_marker`(=70) is still there in wal file | ||
we'd end up removing it from the query buffer leaving the query buffer in state as expected. | ||
``` | ||
80 - 3, 4 | ||
90 - 3, 4 | ||
``` | ||
|
||
Say instead we force snapshotted instead of normal snapshotting, | ||
|
||
- Everything will be removed from query buffer. These are the wal files with data for different time ranges [20 - 50] etc. Notice data can be overlapping between time periods | ||
``` | ||
1[20 - 50] | ||
2[31 - 70] | ||
3[51 - 90] | ||
4[45 - 110] - snapshot! <= 4 (wal file num), <= 110 (end_time_marker) | ||
``` | ||
|
||
- And query buffer looks like, | ||
``` | ||
20 - 1 | ||
30 - 1 | ||
40 - 1, 2, 4 | ||
50 - 2, 4 | ||
60 - 2, 3, 4 | ||
70 - 2, 3, 4 | ||
80 - 3, 4 | ||
90 - 3, 4 | ||
100 - 4 | ||
110 - 4 | ||
``` | ||
|
||
- At this point, snapshot file is written out & parquet files removing all items in queryable buffer upto time 110. So, all files 1, 2, 3, 4 are ready for deletion | ||
- There should be nothing in queryable buffer as end time marker is 110 and if there's been a restart at this point there are no wal files to load and everything is in parquet files (snapshotted) | ||
|
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A useful addition to this diagram would be to show the entry point for writes from user, i.e., where do writes go from the user (
wal buffer
?), via an arrow. Otherwise, it is not clear on the order of operations. If you could connect the numbers from the steps described below to locations / arrows on the diagram, that would be helpful.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - I'll try to link the steps to the diagram and add the incoming writes as well.