HBase version querying #47
Comments
First, I would need to know the reason for that filtering. Do you need to store multiple versions of triples and query across them, do you want to implement a basic transaction system based on a timestamp, or is there another reason? Thanks, |
Both. I want to be able to filter triples by the timestamp at which they were inserted, e.g. all news items in the last 6 months, or all records released between 2015 and 2017. I also want to be able to limit a query up to a timestamp, so I don't get results from an update I might be running or from new data currently being added. |
I'm afraid the architecture of the internal HBase timestamp system is not robust enough to provide the requested functionality without a significant performance drawback. Let me give you an example: In practice it might make sense to use the timestamps for limited transactional support (as you described, "don't get results from an update I might be running"). However, even for that purpose there may be drawbacks, namely server compactions. If your update writes directly to HBase, a compaction may already have removed the older records, and your query (restricted by the timestamp) may get incomplete data. We can still elaborate on this solution (since you are interested) and find some boundaries within which it works without race conditions. However, I would rather recommend using Halyard bulk updates and bulk loads, which are transactional. Bulk operation data are indexed into standalone HBase files and bulk-loaded into HBase at the final stage as an almost atomic operation. Running a Halyard bulk load or bulk update does not produce any "dirty" data and does not affect current queries (except for multi-stage SPARQL updates, where each stage represents a standalone bulk operation). To summarise: the Halyard model is not designed to support time (or version) as another queryable dimension (next to subjects, predicates, objects, and named graphs). Doing so would require significant architecture and implementation effort. |
Thanks for your response. Given the temperature example, I imagine that working by only inserting data, with no deletions, and using timestamp filtering to scope it. This assumes the min versions for the column family is set to something like 10, e.g. "alter 't1', NAME => 'f1', MIN_VERSIONS => 10". |
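To make the insert-only idea concrete, here is a minimal sketch of timestamp scoping over an in-memory list. It is not Halyard or HBase code: the class `TimeRangeScopeSketch`, its nested `TimestampedTriple` type, and the methods `within` and `asOf` are all hypothetical names chosen for illustration. The half-open range `[min, max)` is chosen to mirror the semantics of HBase's `Scan.setTimeRange(minStamp, maxStamp)`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model for illustration; not Halyard or HBase API.
final class TimeRangeScopeSketch {

    /** A triple together with the timestamp of the cell that stores it. */
    static final class TimestampedTriple {
        final String subject;
        final String predicate;
        final String object;
        final long timestamp;

        TimestampedTriple(String subject, String predicate, String object, long timestamp) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.timestamp = timestamp;
        }
    }

    /** Keeps triples with minInclusive <= timestamp < maxExclusive (half-open range). */
    static List<TimestampedTriple> within(List<TimestampedTriple> triples,
                                          long minInclusive, long maxExclusive) {
        List<TimestampedTriple> result = new ArrayList<>();
        for (TimestampedTriple t : triples) {
            if (t.timestamp >= minInclusive && t.timestamp < maxExclusive) {
                result.add(t);
            }
        }
        return result;
    }

    /** "As of" read: everything written strictly before the cut-off timestamp. */
    static List<TimestampedTriple> asOf(List<TimestampedTriple> triples, long cutOffExclusive) {
        return within(triples, 0L, cutOffExclusive);
    }

    public static void main(String[] args) {
        // Made-up sample data: three insert-only readings at increasing timestamps.
        List<TimestampedTriple> data = new ArrayList<>();
        data.add(new TimestampedTriple(":sensor1", ":temperature", "20.5", 1000L));
        data.add(new TimestampedTriple(":sensor1", ":temperature", "21.0", 2000L));
        data.add(new TimestampedTriple(":sensor1", ":temperature", "21.5", 3000L));

        System.out.println(within(data, 1000L, 3000L).size()); // prints 2
        System.out.println(asOf(data, 2500L).size());          // prints 2
    }
}
```

The sketch only models the filtering semantics. It says nothing about how compactions or version limits affect which cells are still present when the range is applied.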
Hi, sorry for the late response. Your proposal sounds interesting and I will try to figure out all consequences. Thanks, |
The clash with stored named graphs could be avoided by using SERVICE instead of GRAPH. The SERVICE url could take the form of a REST request for the versioned triples, something like:
Notice, I've extended the behaviour of your existing federated queries. Don't know if it is necessary to have something like halyard:_self to refer to the current dataset. Concat() function can be used to build up the service url dynamically based on vars. |
Here is my branch https://github.com/pulquero/Halyard/tree/versioning, hopefully I haven't overlooked anything. |
Adding service parameters to request a specific HBase timestamp might be possible; however, it is still far from making Halyard a kind of versioned RDF store. The original intent of the TimeAwareHBaseSail was to support Change Data Capture: faster parallel replay of the event backlog using SPARQL Update, without prior sorting and the bottleneck of in-time-order processing, and without a custom complex MapReduce tool. In that case both the event backlog and the final graph are still represented as standard RDF, no data is persisted in any hidden dimension, and everything can be exported to any other RDF system. I would recommend that anyone model their data in standard RDF instead of starting with proprietary tweaks to the system to support another time/version dimension.
Regarding the implementation: you propose to change how delete and insert HBase key-values are differentiated in time. The current Halyard implementation uses exclusively the least significant bit for this, with the original timestamp shifted by one bit. This ensures a deterministic order of deletes and inserts with the same timestamp in any situation. As the timestamp might also be just a sequence of versions, your simplified solution would produce conflicting timestamps between the insert and delete of two subsequent triple versions.
|
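The bit-shifting scheme described in the comment above can be sketched roughly as follows. This is not Halyard's actual code: the class `TimestampCodecSketch` and its methods are hypothetical, and the choice that a set low bit marks an insert (so that an insert sorts after a delete carrying the same user timestamp) is an assumption made for illustration.

```java
// Hypothetical helper for illustration; not Halyard's actual implementation.
final class TimestampCodecSketch {

    // Assumption: low bit 1 marks an insert, low bit 0 marks a delete.
    private static final long INSERT_BIT = 1L;

    /** Shifts the user timestamp left by one bit and stores the operation type in the freed bit. */
    static long encode(long userTimestamp, boolean insert) {
        if (userTimestamp < 0 || userTimestamp > (Long.MAX_VALUE >> 1)) {
            throw new IllegalArgumentException("timestamp does not fit in 62 bits: " + userTimestamp);
        }
        return (userTimestamp << 1) | (insert ? INSERT_BIT : 0L);
    }

    /** Recovers the user-provided timestamp from a stored timestamp. */
    static long decode(long storedTimestamp) {
        return storedTimestamp >>> 1;
    }

    /** Tells whether a stored timestamp belongs to an insert (under the assumption above). */
    static boolean isInsert(long storedTimestamp) {
        return (storedTimestamp & 1L) == INSERT_BIT;
    }

    public static void main(String[] args) {
        long version = 42L;
        long delete = encode(version, false);
        long insert = encode(version, true);
        // The two stored values never collide, and both decode to the same user timestamp.
        System.out.println(delete + " " + insert + " " + decode(insert)); // prints 84 85 42
    }
}
```

Because every user timestamp maps to two distinct stored values, a delete and an insert with the same user timestamp always have a fixed relative order, and neither can collide with the operations of the next version.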
Yes, service parameters allow finer-grained scoping of the data to be queried, so they should probably be considered in their own right, independently of a versioned RDF store. I can fix the bit-shifting; I just need to shift back and forth between user-provided timestamps and Halyard timestamps. For versioned RDF then, essentially we want to write something like |
I now have a working prototype that includes support for INSERT/DELETE on the above mentioned branch. TLDR, relevant tests are https://github.com/pulquero/Halyard/blob/versioning/sail/src/test/java/com/msd/gin/halyard/sail/HBaseSailVersionTest.java and Change Data Capture looks like
|
A function and/or magic property to enable HBase time range filtering, i.e. restrict queries to triples within a specified time range.