
Eleveldb version #1

Closed

wants to merge 23 commits into from

Conversation

@santialvarezcolombo (Contributor) commented Aug 26, 2016

This first PR contains the initial implementation of the eleveldb version of antidote_db. Future PRs will add the possibility to keep using the old logging vnode and disk_log implementation.

The antidote_db module contains a basic API to interact with eleveldb. The leveldb_wrapper module wraps the API provided by eleveldb and adds the Antidote-specific logic.
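To make the proposal concrete, here is a minimal usage sketch of the two calls that come up later in this conversation, put_op/4 and get_ops/4; the exact -specs live in antidote_db.erl and may differ, and opening/closing the Db handle is left out.

%% Minimal usage sketch (specs may differ from the real module; Db is an
%% already-opened antidote_db handle).
store_and_read(Db) ->
    ok = antidote_db:put_op(Db, my_key,
                            vectorclock:from_list([{dc1, 1}]),
                            my_log_record),
    antidote_db:get_ops(Db, my_key,
                        vectorclock:from_list([{dc1, 1}]),
                        vectorclock:from_list([{dc1, 5}])).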

Here is an overview of the new architecture proposed:

[architecture diagram: antidotedb future]

@cmeiklejohn (Member)

Was there a particular reason that eleveldb was chosen over something like RocksDB?

@santialvarezcolombo (Contributor, Author)

The main idea of antidote_db is to abstract the behaviour of the backend data store, allowing developers to try out different possibilities, something that is not possible in Antidote today. So we could certainly try RocksDB and compare its benchmarks with this LevelDB implementation.

At first sight I liked the additions RocksDB makes to LevelDB, but I felt that having Basho maintain the eleveldb code was a plus I didn't see in the other Erlang projects that provide bindings to the various LevelDB derivatives. I also think that if the API we end up with is the one we are aiming for, trying out different backends should be easy, so this implementation could be just the first backend, not necessarily the best one.

@cmeiklejohn Do you think that any of the additions made to RocksDB could make a difference compared to LevelDB in the Antidote scenario?

AccIn;
false ->
%% check it's an op and its commit time is in the required range
case not vectorclock:lt(VC1Dict, VCFromDict) of


This is hard to read. Why not leave the negation out and switch the branch order of the case?
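A sketch of the shape being suggested, as a hypothetical standalone helper (the real logic sits inside the eleveldb fold in leveldb_wrapper.erl):

%% Hypothetical helper, for illustration only: positive condition first,
%% accumulator branch second, no negation.
filter_op(VC1Dict, VCFromDict, Op, AccIn) ->
    case vectorclock:lt(VC1Dict, VCFromDict) of
        true ->
            AccIn;
        false ->
            %% it's an op and its commit time is in the required range
            [Op | AccIn]
    end.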

@santialvarezcolombo (Contributor, Author)

@bieniusa Thanks for the comments Annette! I fixed the first two already.
It's great Antidote moved to rebar3! I'll tackle it in the next PR, so I can keep this one clean with only this first version of the leveldb stuff. Sounds good?

@santialvarezcolombo (Contributor, Author)

@bieniusa I have rebar3 working on a separate branch. We can take a look at that after this PR!

@bieniusa

Nice! :)

Since @aletomsic and @tcrain are assigned for the pull request, do you guys wanna pull?

@aletomsic

I'll be back to work tomorrow.
Might be able to join today's mumble if there is one.

Alejandro Tomsic


@cmeiklejohn (Member)

I'd investigate something like lets, which has an interface that allows you to swap between different backends. I'm just curious why you picked eleveldb off the shelf: whether it was an arbitrary choice or there was some motivation behind it.

@tcrain commented Aug 31, 2016

One of the nice things about eleveldb is that it was straightforward(ish) for Santiago to write a custom key comparator.

If we can do that in lets, though, it does seem nicer to be able to use different backends. I think Santiago is trying to make most of the changes at a high level, so (apart from the key comparator, I guess) it should be possible to move to something like lets without too much trouble. What do you think, Santiago? Do you know how key comparison works in lets?

@santialvarezcolombo (Contributor, Author)

@cmeiklejohn the complete justification for all this is in a long email I sent to the technical list on January 17. If that's not clear, let me know and we can keep talking about it!

I had never heard of lets. You mean this repo? What we would need in that case is to replace the leveldb dependency in lets with our fork, so we have the custom comparator @tcrain mentioned. What I saw in their repo is that they use leveldb, not eleveldb, so I don't know how that would work. The other thing I noticed is that their last commit is 5 months old and the Travis CI build didn't even pass. Is that repo maintained?

antidote_db:put(AntidoteDB, {binary_to_atom(Key), SnapshotTimeList, snap}, Snapshot).

%% Returns a list of operations that have commit time in the range [VCFrom, VCTo]
-spec get_ops(antidote_db:antidote_db(), key(), vectorclock(), vectorclock()) -> list().
@tcrain commented Sep 7, 2016

Can the return be [#log_record{}] instead of list()?

@santialvarezcolombo (Contributor, Author)

Yes! Forgot to include that in the updates, I'll update it now.

@russelldb (Member)

@tcrain @santialvarezcolombo we ended up dropping the key comparator in bigset; instead we used binaries composed of null-byte-terminated strings. We were able to get the same sorting and we got to ditch the C++ code. I can share the bigset_keys.erl module with you if you are interested in taking the same approach.
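For reference, a minimal sketch of that key layout (this is not bigset_keys.erl; the helper names are made up). As long as each component is a binary containing no null bytes, leveldb's default byte-wise comparator sorts composed keys component by component, so no custom C++ comparator is needed.

%% Sketch only, not the actual bigset_keys.erl. Components must not contain
%% null bytes for the ordering to hold.
make_key(Bucket, Suffix) when is_binary(Bucket), is_binary(Suffix) ->
    <<Bucket/binary, 0, Suffix/binary, 0>>.

split_key(Composed) when is_binary(Composed) ->
    binary:split(Composed, <<0>>, [global, trim]).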

@peterzeller

@santialvarezcolombo I tried your code, but had some problems. Maybe I am not understanding the API correctly.

I have the following test:

    ok = antidote_db:put_op(Db, d, vectorclock:from_list([{dc1,1},{dc3,2}]), logEntry),
    Records = antidote_db:get_ops(Db, d, vectorclock:from_list([{dc1,2},{dc3,1}]),
                                         vectorclock:from_list([{dc1,3},{dc3,3}])),
    ?assertEqual([], lists:sort(Records)),

The vectorclock of the operation is at time 1 on dc1, so I would assume that it should not be included in the get_ops call above, which only starts from time 2 for dc1. However, this test fails -- the logEntry is included in the returned Records.

If you want to have a look at it: I have this test at https://github.com/peterzeller/antidote_db/blob/eleveldb_version_test/test/db_wrapper_SUITE.erl#L57-L59 (run with rebar3 ct). I also have a property based test in the same folder (run with rebar3 proper).

@santialvarezcolombo (Contributor, Author)

@peterzeller I've talked this example over with @tcrain and @aletomsic.
From that talk, I understood that [{dc1,1},{dc3,2}] and [{dc1,2},{dc3,1}] are considered concurrent. You can see this with the vectorclock methods used today in antidote:

?assertEqual(false, vectorclock:gt(vectorclock:from_list([{dc1,2},{dc3,1}]), vectorclock:from_list([{dc1,1},{dc3,2}]))),
?assertEqual(false, vectorclock:gt(vectorclock:from_list([{dc1,1},{dc3,2}]), vectorclock:from_list([{dc1,2},{dc3,1}]))),

The same thing happens with lt. That's the reason why the logic I wrote returns that operation: it's included in the range you are asking for.

Please let me know if I misunderstood something; it's no problem to fix the logic if the range isn't calculated correctly.

@peterzeller commented Sep 21, 2016

OK, I think I had a wrong understanding of what the range means. I assumed that get_ops(Db, Key, VCFrom, VCTo) would return all operations which come after VCFrom and before VCTo (VCOp >= VCFrom && VCOp <= VCTo).

It is still not clear what the range [VCFrom, VCTo] is supposed to contain. If you are including concurrent operations, then the range would be all operations with !(VCOp <= VCFrom) && !(VCOp >= VCTo). But then the following test should work:

    ok = antidote_db:put_op(Db, d, vectorclock:from_list([{dc1,4},{dc2,1},{dc3,4}]), logEntry),
    Records = antidote_db:get_ops(Db, d, vectorclock:from_list([{dc1,14},{dc2,1},{dc3,4}]), vectorclock:from_list([{dc1,16},{dc2,15},{dc3,7}])),
    ?assertEqual([], lists:sort(Records)),

I guess that it is not necessary to change the code. I am just suggesting that you make the documentation for get_ops more precise, since there is no standard way to interpret a range on a partial order. It would also be nice to document the preconditions (e.g., is it necessary that VCFrom is less than or equal to VCTo?).

PS: Just noticed that this problem might come from the confusing function names in the vectorclock package. I created a ticket for that: SyncFree/antidote_utils#2

@santialvarezcolombo (Contributor, Author)

Sure, I will definitely improve the documentation once we are clear on what the range should contain. What do you think @bieniusa @tcrain @aletomsic?

@tcrain commented Sep 22, 2016

Just so we are on the same page, here is my understanding:

get_ops is used to get the list of update ops needed to generate a new snapshot from an older one.

A snapshot is defined by a vector of DCs and clock times.

Each update op is defined by a vector that is taken from the snapshot that it was generated from, but with the entry for its local DC replaced by the commit time of the op's transaction.

A snapshot includes all ops whose vector is smaller than or equal to the snapshot vector in every entry.

So, to go from an old snapshot to a new one, you need to take all ops whose vectors are concurrent with or larger than the old snapshot, and smaller than or equal to the new snapshot in every entry.

This means that the get_ops range function uses a different comparison for the lower bound than for the upper bound. So yeah, this is confusing; if you want to make it return something else and handle removing the duplicates elsewhere, that's no problem. However you want to structure or name it is fine with me, but definitely include enough documentation to make it clear.
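To make the two different comparisons explicit, here is a minimal sketch of the membership check described above, assuming le/2 from the vectorclock module in antidote_utils:

%% Minimal sketch of the range check described above: an op with clock VCOp
%% belongs in get_ops(Key, VCFrom, VCTo) iff it is NOT already covered by the
%% old snapshot (not <= VCFrom, i.e. concurrent with or larger than it) AND
%% it IS covered by the new snapshot (<= VCTo for every entry).
op_in_range(VCOp, VCFrom, VCTo) ->
    (not vectorclock:le(VCOp, VCFrom)) andalso vectorclock:le(VCOp, VCTo).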

@santialvarezcolombo (Contributor, Author)

Thanks for that explanation @tcrain!

I just made the changes, so now the get_ops method returns "all ops that have vectors concurrent or larger than the old snapshot, and smaller or equal for all entries of the new snapshot".

I also added a new method, get_ops_applicable_to_snapshot, after talking with @aletomsic. Given a key and a VC, it returns the most suitable snapshot committed before the VC, together with the list of operations in the range [SnapshotCommitTime, VC) that need to be applied to that snapshot to generate a new one in Antidote. This method was added so that we can generate a new snapshot in a single fold, instead of calling get_snapshot and then get_ops.
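A hypothetical call shape, just to illustrate the intent (the {Snapshot, Ops} return is an assumption; the real -spec is in this PR):

%% Hypothetical call shape; return value is assumed, see the module for the
%% real -spec. Ops covers [SnapshotCommitTime, VC) and is applied to Snapshot
%% to materialize the state at VC.
{Snapshot, Ops} = antidote_db:get_ops_applicable_to_snapshot(Db, Key, VC),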

I took @peterzeller's idea of the withFreshDb method, refactored my tests, and added comments.

Let me know what you think.

@peterzeller

  1. Now you check vectorclock:strict_ge(VC1Dict, VCToDict) == false in the code for get_ops. But this is not the same as "smaller than or equal to VCTo". So I would replace this with vectorclock:le(VC1Dict, VCToDict) and reverse the condition.
  2. I don't think it is safe to break the fold if you find a clock that is less than or equal to From. For example, consider the following test:
test6(_Config) ->
  withFreshDb(fun(Db) -> 
    ok = antidote_db:put_op(Db, d, vectorclock:from_list([{dc1,9},{dc2,12},{dc3,1}]), logEntry1),
    ok = antidote_db:put_op(Db, d, vectorclock:from_list([{dc1,7},{dc2,2},{dc3,12}]), logEntry2),
    Records = antidote_db:get_ops(Db, d, vectorclock:from_list([{dc1,10},{dc2,17},{dc3,2}]), vectorclock:from_list([{dc1,12},{dc2,20},{dc3,18}])),
    ?assertEqual([logEntry2], lists:sort(Records)),
    true
  end).

The clock for logEntry1 is less than From, but the clock for logEntry2 is concurrent to From and thus logEntry2 should be included in the result.

Maybe it is possible to assume that all snapshots are totally ordered, but I think for operations we cannot assume any restrictions on the order. Since the last element of the fold could always be a concurrent one, I think there is no valid condition where you could break.

Maybe it would be possible to change the vectorclock comparator to sort the clocks by the minimum timestamp in the clock, or something like that.

@santialvarezcolombo (Contributor, Author)

@peterzeller I did what you suggested for the first comment. For the second one, I don't think changing the comparator in that way is the best solution; I'm not even sure how that would work. Do you think we could add a method to the vectorclock module, or something like that, to fix it?

The hack that works and makes your test pass is replacing the throw({break, AccIn}) with AccIn. But I'm not sure that's a good idea, because the break condition is then only when the key changes, so we may be iterating over lots of unnecessary keys. Do you see any other workaround?

@peterzeller

Hmm, this seems to be a difficult problem. With the current ordering you only know that the following values will have a smaller value for the first datacenter.

If you first sorted by the maximum entry in the vector clock and after that by the lexicographic ordering, then you would know that the vector [1,20] can never come after [10,10], so there would be a valid condition to break the fold.
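A rough sketch of such a comparison key (my names; VCList is the [{Dc, Time}] list form used with vectorclock:from_list/1, and how the comparator and fold direction are wired up is left open):

%% Rough sketch: order clocks first by their maximum entry, then
%% lexicographically, so that the maximum entry becomes part of the ordering
%% and the fold gains a condition on which it can safely break.
vc_sort_key(VCList) ->
    {lists:max([T || {_Dc, T} <- VCList]), lists:sort(VCList)}.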

Another option might be to group the operations by the originating replica (it looks like this information is stored in log_record.op_number), so that the order would be:
snapshots, operations from dc1, operations from dc2, operations from dc3, and so on.
Then there would be multiple starting points for collecting the operations, but finding the start and end positions for each should be fast if you have a list of replicas. For example, to get the operations from vectorclock [2,6,1] to vectorclock [10,6,8] you would search replica1 from 3-10 and replica3 from 2-8.

I am not sure whether any of the above makes sense; these are just some ideas.

@santialvarezcolombo (Contributor, Author)

Yes, it seems like a difficult problem. I think those ideas seem reasonable, but they would add a lot of complex logic, so I think we have to find something simpler. I'll keep thinking about whether adding some logic to the vectorclock module could help.

For now, I committed the hack I mentioned in my last comment.

@santialvarezcolombo (Contributor, Author)

The problem seems to be fixed. @aletomsic and I thought that saving the last VC seen and checking for concurrent ones could fix the problem, so that's the code I added in the last commit. I also added a new test, very similar to the one Peter posted here.

@peterzeller what do you think of this solution? Do you see any other counter example?

@peterzeller

I think case vectorclock:lt(VC1Dict, VCFromDict) of should be changed to use vectorclock:le.

But even with this fixed, Proper found a counterexample after 5362 random tests. Translated as a unit test:

test7(_Config) ->
  withFreshDb(fun(Db) -> 
    ok = antidote_db:put_op(Db, d, vectorclock:from_list([{dc1,1},{dc2,3},{dc3,3}]), logEntry1),
    ok = antidote_db:put_op(Db, d, vectorclock:from_list([{dc1,2},{dc2,0},{dc3,1}]), logEntry2),
    ok = antidote_db:put_op(Db, d, vectorclock:from_list([{dc1,2},{dc2,1},{dc3,1}]), logEntry3),
    Records = antidote_db:get_ops(Db, d, vectorclock:from_list([{dc1,3},{dc2,2},{dc3,2}]), vectorclock:from_list([{dc1,10},{dc2,10},{dc3,10}])),
    ?assertEqual([logEntry1], lists:sort(Records)),
    true
  end).

My random tests can produce some combinations of clocks which would not be possible in real executions. I think the above is such an example, but I have to think about it some more.

@santialvarezcolombo (Contributor, Author)

@peterzeller I think you might be correct with the change to le.

Regarding the second comment, I think it's a bug and a difficult problem. I don't get why that case is not possible in real executions. Can you tell me why you think that? Maybe there's something I'm not seeing.

@peterzeller commented Sep 29, 2016

> Regarding the second comment, I think it's a bug and a difficult problem. I don't get why that case is not possible in real executions. Can you tell me why you think that? Maybe there's something I'm not seeing.

If [2,0,1] is the clock of an operation, then [2,1,1] can only be the clock of an operation if that operation was executed on dc2. Since there is a vectorclock [1,3,3], there must have been another operation from dc2 executed at time 3, i.e. after [2,1,1]. But any clock that includes that later dc2 operation has also seen dc1 at 2 or higher, so the value for dc1 cannot be 1 in [1,3,3].

So if my thinking above is correct, there is no execution of Antidote that would produce operations with these 3 clocks. However, this is only relevant if your algorithm depends on how the vector clocks are generated.

@santialvarezcolombo (Contributor, Author)

I looked over the le thing again, and lt is OK. Since I reversed the condition, if I use le I'm not including the op with the same VC as the lower bound (in case it exists). You can change it in your fork and see that some of the tests fail, because that op isn't included when it should be. So I conclude lt is OK.

Regarding those VCs, I think you are correct that that specific example doesn't apply; I wasn't looking at it like that. What I'm trying to figure out is whether that case can happen with other values.

Where my algorithm may have a bug is in what we talked about before regarding the comparator function, and how that may cause problems when folding keys and breaking on an incorrect condition. But if that case can't happen, I think the algorithm looks OK for all other cases now.

@peterzeller

> I looked over the le thing again, and lt is OK. Since I reversed the condition, if I use le I'm not including the op with the same VC as the lower bound (in case it exists).

As I said in SyncFree/antidote_utils#2, the lt operation on vectorclocks does not make much sense. If you want to include the lower bound, you can use strict_le instead of lt. But as I understood it, the lower bound should not be included, as it would already be included in the snapshot. Your documentation also says so: "it returns all ops which have a VectorClock concurrent or larger than VCFrom ...".

> Regarding those VCs, I think you are correct that that specific example doesn't apply; I wasn't looking at it like that. What I'm trying to figure out is whether that case can happen with other values.

I changed my test data generator to produce more realistic examples. I think the following would be a realistic test case, and it fails:

ok = antidote_db:put_op(Db, key, vectorclock:from_list([{dc3,1}]), {logEntry,4}),
ok = antidote_db:put_op(Db, key, vectorclock:from_list([{dc2,4},{dc3,1}]), {logEntry,3}),
ok = antidote_db:put_op(Db, key, vectorclock:from_list([{dc2,2}]), {logEntry,2}),
ok = antidote_db:put_op(Db, key, vectorclock:from_list([{dc2,1}]), {logEntry,1}),
Records = antidote_db:get_ops(Db, key, vectorclock:from_list([{dc2,3}]), 
                                       vectorclock:from_list([{dc2,3},{dc3,1}])),
?assertEqual([{logEntry,3},{logEntry,4}], lists:sort(Records)),

@marc-shapiro

I haven’t followed the details so far, but it seems to me that instead of discussing examples you should be discussing principles.

Vector clocks represent a partial order (<_vc), whereas the LevelDB timestamps are totally ordered (<_db).

For correctness, the following must be true:
∀ x,y: x <_vc y ⟹ x <_db y

The converse is not necessarily true, i.e. there are values that are ordered by the leveldb timestamp but are not comparable by VC. The set of timestamps x such that x <_db y may contain vectors for which it is not true that x <_vc y; when iterating you must filter them out.
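In code, that filtering step could look like the sketch below (names are mine; it assumes vectorclock:le/2 and a list of {VCOp, Op} pairs already read from the db in <_db order):

%% Sketch of the filtering step: the db order only bounds where to stop
%% reading; each candidate is still checked against the vc order before it
%% is kept.
filter_by_vc(Candidates, VCFrom, VCTo) ->
    [Op || {VCOp, Op} <- Candidates,
           not vectorclock:le(VCOp, VCFrom),
           vectorclock:le(VCOp, VCTo)].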

Does this help the discussion?

                                        Marc


@peterzeller

Thanks for the comment, Marc. I think it is a good idea to discuss the idea/algorithm first, before implementing and testing things.

> For correctness, the following must be true:
> ∀ x,y: x <_vc y ⟹ x <_db y

From what we discussed so far, it seems that this is not enough to get an efficient implementation.

For a given lower bound l and upper bound u, we want to find the operations x such that ¬(x ≤_vc l) and (x ≤_vc u).

For the upper bound the db-order is useful:
If there is an operation y with (y ≤_vc u), we know that (y ≤_db u) holds as well.

I think for the lower bound the db-order is not so useful.
If we have an operation y with ¬(y ≤_vc l), for example an operation that is concurrent with all other operations, then it could be at any position in the database order if we don't have more information.

@marc-shapiro

OK now I understand the problem better.

What if we keep a cache of recent VCs and their corresponding DB timestamps? When scanning, instead of starting from l, start from the previous DB timestamp whose VC is less than l. Would that work?

PS If we don’t find a suitable point in the cache, use a dichotomic search.

                Marc


@bieniusa

Status?

@santialvarezcolombo (Contributor, Author)

Hi everyone! I had to focus the last couple of weeks on a final exam I had yesterday, so there hasn't been much progress on this topic.

To summarize what Peter found: it's an unlikely scenario in a real Antidote execution, but if it happens, the get_ops method doesn't return the correct values. This is because of how keys get sorted in leveldb and the condition we have to satisfy to stop the fold. As @peterzeller said, the db order isn't useful for the lower bound, so we can't know when to stop the fold.

Folding over all the keys is an obvious solution, but it is extremely inefficient, since we potentially have lots of keys because we are using leveldb as the logging module as well.

Today I'm starting to think about what @marc-shapiro suggested, and also looking at another possibility @aletomsic and I talked about: using the heartbeats from other DCs to know whether this case can happen.

If anyone has any other ideas, I'm more than happy to hear and evaluate them!

@santialvarezcolombo (Contributor, Author)

I'm closing this PR since a new approach has been taken in #2.
