Eleveldb version #1
Conversation
…t attempt of using antidote_utils, not working 100% yet.
Was there a particular reason that eleveldb was chosen over something like RocksDB?
The main idea of antidote_db is to abstract the behaviour of the backend data store, allowing developers to try out different possibilities, something not possible today in Antidote. So we could certainly try RocksDB and compare the benchmarks with this LevelDB implementation. At first sight I liked the additions RocksDB makes to LevelDB, but I thought that having Basho maintain the eleveldb code was a plus I didn't see in the other Erlang projects that provide bindings to the different LevelDB modifications. I also think that if the API we end up with is the one we aim for, trying out different backends should be easy, so this implementation could be just the first one, not necessarily the best backend. @cmeiklejohn Do you think that any of the additions made in RocksDB could make a difference compared to LevelDB in the Antidote scenario?
```erlang
        AccIn;
    false ->
        %% check it's an op and its commit time is in the required range
        case not vectorclock:lt(VC1Dict, VCFromDict) of
```
This is hard to read. Why not leave the negation out and switch the branch order of the case?
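For illustration, here is roughly what the suggested rewrite could look like. This is a sketch, not the project's actual code: `vc_lt/2` below is a simplified stand-in for `vectorclock:lt/2`, and the module and function names are invented for the example.

```erlang
%% refactor_sketch.erl — hypothetical illustration of the review suggestion:
%% test the positive condition and swap the branches instead of negating.
-module(refactor_sketch).
-export([filter_op/3]).

%% Simplified stand-in for vectorclock:lt/2: true when every entry of VC1
%% is strictly smaller than the matching entry of VC2 (not the real
%% antidote_utils semantics, just enough for the example).
vc_lt(VC1, VC2) ->
    lists:all(fun({Dc, T}) -> T < maps:get(Dc, VC2, 0) end,
              maps:to_list(VC1)).

%% Positive test first, branches swapped — no negation to mentally invert.
filter_op(OpVC, VCFrom, Acc) ->
    case vc_lt(OpVC, VCFrom) of
        true  -> Acc;          %% op is before the range: skip it
        false -> [OpVC | Acc]  %% op may be in the range: keep it
    end.
```

The logic is unchanged; only the order of the branches and the polarity of the test differ.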
@bieniusa Thanks for the comments, Annette! I fixed the first two already.
@bieniusa I have rebar3 working on a separate branch. We can take a look at that after this PR!
Nice! :) Since @aletomsic and @tcrain are assigned to the pull request, do you guys wanna pull?
I'll be back to work tomorrow. Alejandro Tomsic
I'd investigate something like lets.
One of the nice things about eleveldb is that it was straightforward(ish) for Santiago to write a custom key comparator. If we can do that in lets, though, it does seem nicer to be able to use different backends. I think Santiago is trying to make most of the changes at a high level, so (apart from the key comparator, I guess) it should be possible to move to something like lets without too much trouble. What do you think, Santiago? Do you know how the key comparison works in lets?
@cmeiklejohn The complete justification of all this is in a long email I sent to the technical list on January 17. If that's not clear, let me know and we can continue talking about it! I had never heard about lets. You mean this repo? What we would need in that case is to replace the leveldb dependency in lets with our fork, so we get the custom comparator @tcrain mentioned. What I saw in their repo is that they use leveldb, not eleveldb, so I don't know how that would work. The other thing I noticed is that their last commit is 5 months old and the Travis CI didn't even pass. Is that repo maintained?
```erlang
antidote_db:put(AntidoteDB, {binary_to_atom(Key), SnapshotTimeList, snap}, Snapshot).
```
```erlang
%% Returns a list of operations that have commit time in the range [VCFrom, VCTo]
-spec get_ops(antidote_db:antidote_db(), key(), vectorclock(), vectorclock()) -> list().
```
Can the return be `[#log_record{}]` instead of `list()`?
Yes! Forgot to include that in the updates, I'll update it now.
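For reference, the tightened spec might read as follows. This is a sketch: it assumes `#log_record{}` comes from an included header, whose name here is a guess, not something confirmed in this thread.

```erlang
%% Assumes the log_record record is defined in an included header;
%% "antidote.hrl" is an assumed name, not confirmed by this thread.
-include("antidote.hrl").

-spec get_ops(antidote_db:antidote_db(), key(), vectorclock(), vectorclock()) ->
          [#log_record{}].
```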
@tcrain @santialvarezcolombo We ended up dropping the key comparator in bigset; instead we used binaries composed of null-byte terminated strings. We were able to get the same sorting and we get to ditch the C++ code. I can share the
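To make the idea concrete, here is a minimal sketch of composing sortable binary keys from null-terminated segments. This is my reading of the description above, not bigset's actual code; the module and function names are invented.

```erlang
%% key_sketch.erl — hypothetical example of null-byte separated key
%% segments. Because 0 is the smallest byte value, plain byte-wise
%% comparison (leveldb's default) orders keys segment by segment, so no
%% custom C++ comparator is needed. The segments themselves must not
%% contain null bytes for this to be unambiguous.
-module(key_sketch).
-export([make_key/2]).

make_key(Seg1, Seg2) when is_binary(Seg1), is_binary(Seg2) ->
    <<Seg1/binary, 0, Seg2/binary, 0>>.
```

With this scheme, a key whose first segment is `<<"a">>` always sorts before one whose first segment is `<<"ab">>`, regardless of the later segments.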
@santialvarezcolombo I tried your code, but had some problems. Maybe I am not understanding the API correctly. I have the following test:
The vectorclock of the operation is at time 1 on

If you want to have a look at it: I have this test at https://github.com/peterzeller/antidote_db/blob/eleveldb_version_test/test/db_wrapper_SUITE.erl#L57-L59 (run with
@peterzeller I've talked this example over with @tcrain and @aletomsic.
Same thing happens with

Please let me know if I understood something wrong; no problem in fixing the logic if the range isn't correctly calculated.
Ok, I think I had a wrong understanding of what the meaning of the range is. I assumed that get_ops(Db, Key, VCFrom, VCTo) would return all operations which come after VCFrom and before VCTo. It is still not clear what the range [VCFrom, VCTo] is supposed to contain. If you are including concurrent operations, then the range would be all operations with
I guess that it is not necessary to change the code. I am just suggesting that you make the documentation for get_ops clearer.

PS: Just noticed that this problem might come from the confusing function names in the vectorclock package. I created a ticket for that: SyncFree/antidote_utils#2
Sure, I will definitely improve the documentation once we are clear on what the range should contain. What do you think @bieniusa @tcrain @aletomsic?
Just so we are on the same page, here is my understanding: get_ops is being used to get the list of update ops to generate a new snapshot from an older one. A snapshot is defined by a vector of DCs and clock times. Each update op is defined by a vector that is taken from the snapshot it was generated from, but with the entry for its local DC replaced by the commit time of the op's transaction. A snapshot includes all ops whose vector is smaller or equal for all entries of the snapshot vector.

So to go from an old snapshot to a new one, you need to take all ops that have vectors concurrent or larger than the old snapshot, and smaller or equal for all entries of the new snapshot. This means that the get_ops range function uses a different comparison function for the lower bound vs the upper bound.

So yea, this is confusing. If you want to make it return something else, then handle removing the duplicates elsewhere, no problem. However you want to structure it or name it is fine for me, but yea, definitely include enough documentation to make it clear.
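Under that reading, the membership test for get_ops could be sketched like this. The module is hypothetical, and `vc_le/2` is a local stand-in for the vectorclock comparison from antidote_utils, whose real function names the thread already flagged as confusing.

```erlang
%% range_sketch.erl — hypothetical sketch of the range semantics described
%% above: keep ops that are concurrent-or-larger than the old snapshot and
%% covered by the new one.
-module(range_sketch).
-export([in_range/3]).

%% x ≤ y for vector clocks: every entry of X is ≤ the matching entry of Y
%% (missing entries treated as 0). Stand-in, not the antidote_utils API.
vc_le(X, Y) ->
    lists:all(fun({Dc, T}) -> T =< maps:get(Dc, Y, 0) end,
              maps:to_list(X)).

%% Lower bound uses "not ≤" (which admits concurrent clocks); upper bound
%% uses plain ≤ — two different comparisons, as described above.
in_range(OpVC, VCFrom, VCTo) ->
    (not vc_le(OpVC, VCFrom)) andalso vc_le(OpVC, VCTo).
```

An op concurrent with VCFrom (e.g. bigger on one DC, smaller on another) passes the lower-bound test, while an op fully covered by VCFrom does not.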
Thanks for that explanation @tcrain! I just made the changes, so now the get_ops method returns "all ops that have vectors concurrent or larger than the old snapshot, and smaller or equal for all entries of the new snapshot". I also added a new method, taking an idea from @peterzeller. Let me know what you think.
The clock for logEntry1 is less than

Maybe it is possible to assume that all snapshots are totally ordered, but I think for operations it is not possible to assume any restrictions on the order. Since the last element of the fold could always be a concurrent one, I think there is no valid condition where you could break. Maybe it would be possible to change the vectorclock comparator to sort the clocks by the minimum timestamp in the clock, or something like that.
@peterzeller I did what you said for the first comment. For the second one, I don't think changing the comparator in that way is the best solution; I'm not even sure how that would work. Do you think we can add a method to the vectorclock module or something like that to fix it? The hack that works, making your test pass, is replacing the
Hmm, this seems to be a difficult problem. With the current ordering you only know that the following values will have a smaller value for the first datacenter. If you would first sort by the maximum entry in the vector clock and after that by the lexicographic ordering, then you would know that the vector [1,20] can never come after [10,10], so there would be a valid condition to break the fold.

Another option might be to group the operations by the originating replicas (it looks like this information is already stored). I am not sure if anything of the above makes sense; it's just some ideas.
Yes, it seems a difficult problem. I think those ideas seem reasonable, but they would add a lot of complex logic, so I think we have to find something simpler. I'll keep thinking about whether adding some logic to the vectorclock module can help. For now, I committed the hack I mentioned in the last comment.
The problem seems to be fixed. With @aletomsic we thought that saving the last VC seen and checking for concurrent ones could fix the problem, so that's the code I added in the last commit. I also added a new test, very similar to the one Peter posted here. @peterzeller what do you think of this solution? Do you see any other counterexample?
I think one of the checks is still wrong. But even with this fixed, PropEr found a counterexample after 5362 random tests. Translated as a unit test:
My random tests can produce some combinations of clocks which would not be possible in real executions. I think the above is such an example, but I have to think about it some more.
@peterzeller I think you might be correct with the change to

Regarding the second comment, I think it's a bug and a difficult problem. I don't get why that case is not possible in real executions... Can you tell me why you think that? Maybe there's something I'm not seeing...
If [2,0,1] is a clock of an operation, then [2,1,1] can only be a clock of an operation if that operation was executed on dc2. Since there is a vectorclock [1,3,3], there must have been another operation from dc2 executed at time 3, so after [2,1,1]. But this means that the value for dc1 cannot be 1 in [1,3,3]. So if my thinking above is correct, there is no execution of Antidote that would produce operations with these 3 clocks. However, this is only relevant if your algorithms depend on how the vector clocks are generated.
I looked over again the

Regarding those VCs, I think you are correct that that specific example doesn't apply; I wasn't looking at it like that. What I'm trying to figure out is whether that case can happen with other values... Where my algorithm may have a bug is in what we talked about before regarding the comparator function, and how that may cause problems when folding keys and breaking on the incorrect condition. But if that case can't happen, I think the algorithm looks OK for all other cases now.
As I said here, the
I changed my test-data-generator to produce more realistic examples: I think the following would be a realistic test case and it fails:
I haven't followed the details so far, but it seems to me that instead of discussing examples you should be discussing principles. Vector clocks represent a partial order (<_vc), whereas the LevelDB timestamps are totally ordered (<_db). For correctness, the following must be true: if x <_vc y, then x <_db y. The converse is not necessarily true, i.e. there are values that are ordered by the leveldb timestamp but are not comparable by VC. The set of timestamps x such that x <_db y may contain vectors for which it is not true that x <_vc y; when iterating you must filter them out. Does this help the discussion?
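A toy model of this principle may help. It is purely illustrative: the db-order below is one possible linear extension of <_vc, chosen for the sketch, not what eleveldb or the custom comparator actually does, and the module name is invented.

```erlang
%% order_sketch.erl — hypothetical model of the principle above: store ops
%% in a total db-order that extends <_vc, stop scanning at the upper
%% bound's db-position, and filter out incomparable clocks with the real
%% vc comparison.
-module(order_sketch).
-export([get_range/3]).

%% x ≤_vc y: every entry of X is ≤ the matching entry of Y.
vc_le(X, Y) ->
    lists:all(fun({Dc, T}) -> T =< maps:get(Dc, Y, 0) end,
              maps:to_list(X)).

%% One possible total order extending <_vc: compare the sums of the
%% entries first (a componentwise-smaller clock always has a smaller sum),
%% then break ties with Erlang term order.
db_leq(X, Y) ->
    {lists:sum(maps:values(X)), X} =< {lists:sum(maps:values(Y)), Y}.

%% x ≤_vc U implies x ≤_db U, so the scan can stop at U's db-position;
%% the scanned prefix may still contain clocks incomparable to U (or not
%% above the lower bound L), so each candidate is filtered explicitly.
get_range(Ops, L, U) ->
    Sorted = lists:sort(fun db_leq/2, Ops),
    Prefix = lists:takewhile(fun(Op) -> db_leq(Op, U) end, Sorted),
    [Op || Op <- Prefix, (not vc_le(Op, L)) andalso vc_le(Op, U)].
```

The `takewhile` is where the total order pays off for the upper bound; the list comprehension is the filtering step for everything the total order cannot decide.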
Thanks for the comment, Marc. I think it is a good idea to discuss the idea/algorithm first, before implementing and testing things.
From what we discussed so far, it seems that this is not enough to get an efficient implementation. For a given lower bound l and upper bound u, we want to find operations x such that ¬(x ≤_vc l) and (x ≤_vc u). For the upper bound the db-order is useful: since x ≤_vc u implies x ≤_db u, the scan can stop once it passes u in the db-order. I think for the lower bound the db-order is not so useful.
OK, now I understand the problem better. What if we keep a cache of recent VCs and their corresponding DB timestamps? When scanning, instead of starting from l, start from the previous DB timestamp whose VC timestamp is less than l. Would that work?

PS: If we don't find a suitable point in the cache, use a dichotomic search.
Status?
Hi everyone! I had to focus the last couple of weeks on a final exam I had yesterday, so there wasn't much progress on this topic. As a summary of what Peter found: it's an unlikely scenario in a real Antidote execution, but if it happens, the get_ops method doesn't return the correct values. This is because of how keys get sorted in leveldb and the condition we have to satisfy to stop the fold. As @peterzeller said, the db order for the lower bound isn't useful for knowing when to stop the fold. Folding over all the keys is an obvious solution, but extremely inefficient, since we potentially have lots of keys because we are using leveldb as the logging module as well. Today I'm starting to think about what @marc-shapiro suggested, and also looking at another possibility we discussed with @aletomsic, about using the heartbeats from other DCs to know if this case can happen. If anyone has any other ideas, I'm more than happy to hear and evaluate them!
I'm closing this PR since a new approach is being taken in #2.
This first PR contains the initial implementation of the eleveldb version of antidote_db. In the future, other PRs will include the possibility to use the old logging vnode and disk_log implementation.
The antidote_db module contains a basic API to interact with eleveldb. The leveldb_wrapper module contains a wrapper of the API provided by eleveldb, with added code containing Antidote logic.
Here is an overview of the new architecture proposed: