[nexus] Audit transactions which may have incorrect serialization assumptions #6694
Address Lot deletion uses a transaction to read from …
BGP configuration deletion uses a transaction to check for entries in … If it's possible for entries to be added to …
BGP announce set creation and update both perform the following sequence of updates in a transaction:
In the case where …
Blueprint deletion performs the following operations:
It seems possible that a blueprint is made the active blueprint (by updating …)
Updating DNS seems to use a pattern of:
There's even a comment about a TOCTTOU not being an issue on "read before write" because of the SERIALIZABLE level:
I've tried to repro this with two CockroachDB shells:
I am seeing this result in a serialization error, so I think this transaction is actually working as-is. This may be because we INSERT into the same table we SELECT-ed from. EDIT: I also tried writing a unit test for this case, and am not observing a serialization error there, which suggests this TOCTTOU could be possible. But I'm not sure why the unit test and the two CRDB shells exhibit different behavior here.
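For reference, the shape of the interleaving I was trying looks roughly like this (a reconstruction, not the exact transcript; `dns_version` and its `version` column are stand-in names, not the real schema):

```sql
-- Stand-in schema: CREATE TABLE dns_version (version INT PRIMARY KEY);

-- Shell A:
BEGIN;
SELECT max(version) FROM dns_version;          -- reads, say, 5

-- Shell B:
BEGIN;
SELECT max(version) FROM dns_version;          -- also reads 5
INSERT INTO dns_version (version) VALUES (6);
COMMIT;

-- Shell A, continuing:
INSERT INTO dns_version (version) VALUES (6);  -- overlaps shell B's write
COMMIT;  -- expected: one of the two shells hits a serialization (retry) error
```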
Instance Update Network Interface seems to do the following operations:
I think in this case, the instance record returned by the …

EDIT: This also applies to the other variant where a primary is not being updated -- there's still a check on the result of …
Physical Disk Adoption checks that a Sled is "in service" before inserting a disk and zpool. The guard here is intended to verify that the Sled hasn't been expunged, but it's possible that this sled becomes expunged mid-transaction, and the physical disk adoption happens anyway.
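A minimal sketch of that window, with stand-in table and column names (not the real schema):

```sql
-- txn 1 (physical disk adoption):
BEGIN;
SELECT policy FROM sled WHERE id = 'sled-1';   -- application checks "in service"

-- txn 2 (expungement), concurrently:
UPDATE sled SET policy = 'expunged' WHERE id = 'sled-1';

-- txn 1, continuing: its writes don't overlap the row it read, so there is
-- only a single rw edge (txn 1 -> txn 2) and no cycle:
INSERT INTO physical_disk (id, sled_id) VALUES ('disk-1', 'sled-1');
INSERT INTO zpool (id, sled_id) VALUES ('pool-1', 'sled-1');
COMMIT;  -- can succeed: the disk is adopted onto a now-expunged sled
```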
Silo Group Deletion does the following:
Auditing this, I was concerned that a concurrent writer could add an entry to …
However, this does seem to throw a serialization error as expected.
Loopback Address Creation does the following:
This problem looks like it might appear if the read from …
After spending some quality time with various docs and blog posts, I think we misunderstood the behavior here. Going back to the example that kicked all this off:
Let's call …

I believe we were wrong in inferring that the behavior described above violates our application-level invariant. We've been operating from the idea that "we only want to write these records if the current target blueprint is what we think it is" -- equivalently, "we want to avoid writing these records if the current target blueprint has changed from what we think it is". But this isn't sufficiently precise, because in an MVCC world (like pretty much any modern relational database) there isn't really an objective truth about which blueprint is the target at a given point in time -- there's only the question of what value a particular transaction would see. Instead of talking about whether "the blueprint has changed during this transaction", we need to phrase the invariant in terms of what another observer might see. The behavior we're really trying to avoid is that some other transaction …
All of this is consistent with the application-level invariant we're trying to uphold: at no point was an observer able to see blueprint "X + 1" and not the data that …

Another way to phrase all this: what we observed here was CockroachDB re-ordering things so that the transaction that set the target blueprint, although it completed first in wall-clock time, was logically after the transaction that checked the blueprint state and wrote the data. And there's nothing wrong with that.

Using sqldance, I also went through the example in #6712. That example is trying to show a case where the DNS update code would be broken by another transaction arriving and changing the DNS generation in the middle of the transaction that normally tries to update it. The end result in the test does violate the assumed invariant that there can only be one version, but that's because the transaction in question inserts a new version without checking the current one first. If I instead change the sqldance example to do what the real DNS update code does, which is to read the current DNS version and then try to update it, then one of the two attempted updates always fails. So I think that in the end the DNS update code is correct, and for the reason mentioned in the comment about the transactions being serializable.

I know we all (myself included) have been confused at various points in the last few days, so I'd love someone to verify this reasoning, and it'd also be good to work through any other examples that we've found confusing. But from the cases I've looked at, the code we have seems correct and the concurrency model is as strong as we've been assuming.
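Here's a condensed sqldance-style sketch of the reordering being described (table and column names are stand-ins; whether txn A ever gets pushed can depend on the exact queries and CRDB version, as discussed further below):

```sql
-- txn A (checks the target, then writes data for blueprint "X"):
BEGIN;
SELECT bp_id FROM bp_target ORDER BY version DESC LIMIT 1;  -- sees "X"

-- txn B (sets target "X + 1") commits first in wall-clock time:
BEGIN;
INSERT INTO bp_target (version, bp_id) VALUES (2, 'X+1');
COMMIT;

-- txn A, continuing (its writes touch only fresh rows):
INSERT INTO network_resource (bp_id, payload) VALUES ('X', 'row');
COMMIT;  -- both commit: the only dependency is A -(rw)-> B, which has no
         -- cycle, so the logical order is A < B even though B finished
         -- first on the wall clock
```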
Dave, I spent a while this weekend going through this as well and I came up with essentially the same answer as you. I'm going to write my thinking down, not because I don't think you realize this but to confirm we are on the same page and to provide some additional notes for others. I think you made a mistake in this line here though, which you correct in prose below that:
The above states the physical time of transaction completion, but the logical serialization schedule is …

I think the introduction of …

Now, on the other hand, if 2 years later …

Now coming back to your example with …

FWIW, the CMU intro-to-databases lectures are very good and cover concurrency control well. Starting here and continuing through the next few lectures as needed may help us all.
Agreed!
Just to hammer this home, I figured I'd draw out the serialization conflicts via a graph for the blueprint case within sqldance's recent-exmaples.txt. Looking specifically for RW, WR, or WW conflicts, using the lens @andrewjstone mentioned from the notes on DB concurrency control:
There is a dependency between the two transactions, but no cycles, so according to conflict serializability, it's valid for both of these transactions to commit.
A cycle exists here, which explains the serializability error -- there is no valid ordering of these transactions that doesn't violate RW/WR semantics. Finally, for the blocking case:
From the graph (dotted line showing a dependency that only emerges if we actually try the …) …

From a data-dependency point of view, it's clear that C3's read of …
Thanks a ton for the notes on this, @davepacheco and @andrewjstone - I read through these materials and found them immensely useful. I believe the question "can you find a circular dependency in a graph of the nodes?" is basically the crux of the question "does CRDB consider these transactions conflicting or not?" I bring this up because I think there are cases where the "third observer" matters, and cases where the scenario can be collapsed into two transactions. @davepacheco, for example, with your sqldance snippet which introduces a "third party" to observe the state of the world in the blueprint example and trigger a serialization error:
I don't believe it's actually necessary to have three observers to cause this situation to happen. It can also be triggered with two observers, as long as a circular dependency is forced between RW/WR/WW operations:
(This also results in a serialization error, as there is no valid ordering of C1 and C2 if they're both operating optimistically.)
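For example, here is a minimal two-session write-skew sketch (on a generic stand-in table, not the blueprint schema) that forces such a cycle:

```sql
-- Setup:
CREATE TABLE t (id INT PRIMARY KEY, val INT);
INSERT INTO t VALUES (1, 0), (2, 0);

-- C1:
BEGIN;
SELECT val FROM t WHERE id = 2;      -- C1 reads the row C2 will write

-- C2:
BEGIN;
SELECT val FROM t WHERE id = 1;      -- C2 reads the row C1 will write

-- C1:
UPDATE t SET val = 1 WHERE id = 1;   -- rw edge: C2 -> C1

-- C2:
UPDATE t SET val = 1 WHERE id = 2;   -- rw edge: C1 -> C2, completing a cycle

-- C1:
COMMIT;
-- C2:
COMMIT;  -- one of the two sessions gets a serialization (retry) error
```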
(Addressing some of the earlier examples I dug up) A lot of the prior examples had the form:
Where a concurrent transaction does the following:
What can go wrong: It's possible for the check in txn 1 to pass, and then immediately become invalidated by a concurrent transaction (e.g., a "pointer row" exists to another row that is actively getting deleted). For example, this is the case for blueprint deletion. I came up with an example of this behavior using sqldance -- note that this does not fail with a retry error:
This case ends up with a …
In this scenario, I believe prevention of this case is possible, but must be taken by …
By including this line, we have the following orderings:
Therefore, the latter transaction (…)
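As a sketch of what such a fix can look like (stand-in names; the elided line above may differ): make each transaction's check read rows the other transaction writes, so that a cycle forms and one side is forced to retry:

```sql
-- txn 1 (checks that parent 'p1' exists, then points a child at it):
BEGIN;
SELECT 1 FROM parent WHERE id = 'p1';               -- check passes

-- txn 2 (deletes 'p1' only if nothing points at it):
BEGIN;
SELECT count(*) FROM child WHERE parent_id = 'p1';  -- the added read: it
                                                    -- overlaps txn 1's insert
-- txn 1:
INSERT INTO child (id, parent_id) VALUES ('c1', 'p1');  -- rw edge: txn 2 -> txn 1

-- txn 2:
DELETE FROM parent WHERE id = 'p1';                 -- rw edge: txn 1 -> txn 2

-- txn 1:
COMMIT;
-- txn 2:
COMMIT;  -- the cycle forces one side to fail with a retry error, instead of
         -- both "succeeding" and leaving a dangling pointer row
```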
There is actually a 2020 SIGMOD paper about CRDB that I didn't know existed until yesterday, and which I have yet to read. However, Murat summarizes it well, as usual. The important bits for us are in the concurrency control section. Interestingly, it appears that the only time CRDB blocks is in the …
This is fairly odd in terms of optimistic concurrency control, but CRDB is also an odd and very original beast when it comes to MVCC. It uses only a single timestamp, relying on hybrid logical clocks, which gives it some real-time capability across nodes. So sometimes it re-orders transactions or restarts them, and sometimes, apparently, it waits!
Agreed -- I think at one point we talked about having it look like …
The CockroachDB architecture "Transaction Layer" docs do talk about this some, culminating in the "Transaction conflicts" section, though I had to go back and read the earlier sections about write intents, timestamps, etc. to make sense of it. (Reading this made me want to go find a talk that walks through how writes work in CockroachDB.) Here are some excerpts I found helpful:
Then under "Transaction conflicts":
It goes on to say that in the cases we care about (i.e., where transaction priority is not set and where the writer is still alive), "TxnB enters the TxnWaitQueue to wait for TxnA to complete." I think the summary of all of this is that if SERIALIZABLE transactions can possibly complete successfully by blocking, they do. The docs for the latest stable (24.2.3) talk a lot more about the …
I'm assuming the bits about SERIALIZABLE transactions are valid in 22.1 because it's consistent with our understanding and seems like it wouldn't be something you could just change in a software update without breaking people's applications... but I did not check the release notes to verify that. |
I'm still working through the question of "When is there a Read-Write dependency from Cockroach's point-of-view". I'm finding this surprisingly subtle. Re-using my example from the delete case, I'll walk through some examples. All of these have the setup:
Bad: No Circular dependency
(No transaction retry.) To recap: in this case, we do the wrong thing and "successfully delete the blueprint that's the current target", because …

All the follow-up queries I'm making try to answer: "what is considered an overlap of …"

Good: Read exactly the row we delete

In this example, we see a transaction retry error, presumably because there is a …

Bad: Read exactly a different row

In this example, we do not see a transaction retry error, and we again delete the current target blueprint. This happens because there is no …

Good: Read exactly the same row, but through a more generic query

In this example, we see a transaction retry error: this is similar to directly …

Good: Read exactly a different row, but through a more generic query?

In this example, we do see a transaction retry error, even though we read the row where bp has id = 1. This is worth comparing to the "Read exactly a different row" case -- even though we read a non-overlapping row from …

EDIT: Kudos to @davepacheco -- this last query also conditionally causes a retry error, depending on the version of CockroachDB being used. On version …
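To make the overlap question concrete, here's a stand-in version of the delete case (not the exact queries from the experiments above, and as the EDIT notes, the precise behavior can vary by CRDB version):

```sql
-- Setup (stand-in schema):
CREATE TABLE bp (id INT PRIMARY KEY);
CREATE TABLE bp_target (version INT PRIMARY KEY, bp_id INT);
INSERT INTO bp VALUES (1), (2);
INSERT INTO bp_target VALUES (1, 1);

-- C1 (wants to delete blueprint 2 if it isn't the target):
BEGIN;
SELECT bp_id FROM bp_target WHERE version = 1;  -- narrow read: sees target 1

-- C2 (makes blueprint 2 the new target):
BEGIN;
SELECT id FROM bp WHERE id = 2;                 -- confirm blueprint 2 exists
INSERT INTO bp_target (version, bp_id) VALUES (2, 2);
COMMIT;

-- C1, continuing:
DELETE FROM bp WHERE id = 2;  -- C2 read this row, so C1's write is pushed
COMMIT;
-- Whether C1 now fails depends on whether its earlier read overlapped C2's
-- write: the narrow "version = 1" read does not, so C1's read refresh
-- succeeds and it commits (deleting the current target!); a read covering
-- the inserted row (e.g. SELECT max(version) FROM bp_target) would fail to
-- refresh and produce a retry error instead.
```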
Building off of:

- #6229 (comment)
- #6694
- https://github.com/oxidecomputer/sqldance

This PR acknowledges that `SELECT FOR UPDATE` is a performance "optimization" in the situation where a blueprint's value is checked before performing a subsequent operation, as we do when setting network resources. It arguably could make performance worse in certain cases, as it locks out concurrent read operations from accessing the database.

This PR removes the usage of `SELECT FOR UPDATE`, and significantly overhauls the `test_ensure_external_networking_bails_on_bad_target` test to account for the concurrency control that CockroachDB may be performing with respect to operation re-ordering.

Co-authored-by: John Gallagher <[email protected]>
This makes sense to me. The ordering should result in c1 < c2 so you'll wind up with the invariant violated (target blueprint is not present). It's worth noting that c2's transaction is broken even without any concurrency because it never checked that blueprint 2 was present.
Makes sense. This seems like applying the straightforward "fix" of having c2 check that the blueprint is present.
It makes sense to me that this has the same behavior as the one above since it hasn't changed anything about the relationship between the two transactions. (This version still has the problem that it can set a blueprint to the current target even if it doesn't exist, even without any concurrency.)
I think this is working (arguably) by accident. It happens to read the same row, and so the transactions end up related, but I think that's basically a coincidence here, right? Also, this one will still set a blueprint to the target even if it doesn't exist, even without concurrency, unless the one we're setting as the target happens to also be the latest one by id.
It surprises me that this produces an error. I discovered by accident that it doesn't produce an error in 23.2.4 on my Mac. Looking at it more closely:
I think these are related in that if c1 < c2, then c1 reads bp_target 1, then deletes bp 2; and then c2 reads bp_target 1 and then inserts bp_target 2. As in the above examples, c2's transaction is wrong even without concurrency since it doesn't check if bp2 still exists. If c2 < c1, things would be different: c2 sees bp_target 1 and sets a new target of 2; but this time c1 sees bp_target 2 and then deletes it [anyway, which the application presumably wouldn't actually do]. However, I don't think this ordering is possible if the transactions are interactive like this because Cockroach would have already reported bp_target 1 to c1. So I think that rules out c2 < c1. So I'm back to wondering why this ever produces an error. Offline, @smklein sent me the output:
I'm curious what …
As highlighted by #6229 (comment), the following situation may be a problem with CockroachDB:

1. A transaction `SELECT`s one or more rows from the database.
2. Before that transaction commits, a concurrent transaction modifies those rows (`UPDATE`, `DELETE`, etc).

In this scenario, a row was read by a transaction and modified before the transaction committed. According to CockroachDB's documentation, this is legal: …

Note that in this situation, the read transaction started before the subsequent write transaction, so the read was not considered stale. That being said, the value read by the transaction did get modified before the transaction committed.

This issue tracks taking a look at our usage of interactive transactions for the following pattern:

1. A `SELECT` statement is issued without a corresponding `FOR UPDATE` lock.
2. The result of that `SELECT` statement influences the rest of the transaction in some way. Examples could be: influencing a conditional in Rust code, or using the result to populate a subsequent `UPDATE` or `INSERT`.
3. The rows that were `SELECT`-ed are not themselves modified by the transaction.

If the `SELECT`-ed rows must not be modified before the transaction calls `COMMIT`, then this is a potential problem.

An example of this issue was highlighted within #6229, where we had roughly the following pattern:

1. `SELECT` the latest blueprint target; confirm it is equal to value X.
2. Perform a database modification, assuming that the `SELECT`-ed blueprint target has not been modified.

In this example, the following may occur:

1. Transaction 1 `SELECT`s the blueprint target and sees value "X".
2. A concurrent transaction changes the blueprint target and commits.
3. Transaction 1 performs its database modification and `COMMIT`s. No error is observed; it is not known that the target changed mid-operation.
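A condensed sketch of that sequence (stand-in table and column names; whether any error surfaces depends on the read spans involved, per the discussion above):

```sql
-- Transaction 1 (interactive, SERIALIZABLE):
BEGIN;
SELECT bp_id FROM bp_target WHERE version = 1;  -- believes "X" is still current

-- Transaction 2, concurrently (auto-commits):
INSERT INTO bp_target (version, bp_id) VALUES (2, 'Y');

-- Transaction 1, continuing under the assumption the target is still "X":
INSERT INTO network_resource (bp_id, payload) VALUES ('X', 'row');
COMMIT;  -- commits cleanly: its read and its writes never overlap
         -- transaction 2's write, so no rw/ww cycle exists and no
         -- retry error is raised
```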