HBase version querying #47

pulquero · 2018-08-22T23:38:47Z

A function and/or magic property to enable HBase time range filtering, i.e. restrict queries to triples within a specified time range.

asotona · 2018-08-23T13:33:53Z

First I would need to know the reason for that filtering. Do you need to store multiple versions of triples and query across them, do you want to implement a basic transaction system based on a timestamp, or is there another reason?

Thanks,
Adam

pulquero · 2018-08-23T15:20:56Z

Both. I want to be able to filter triples by the timestamp they were inserted, e.g. all news items in the last 6 months, or all records released between 2015 and 2017. Also, I want to be able to limit a query up to a timestamp so I don't get results from an update I might be running, or new data currently being added.

asotona · 2019-01-02T11:59:18Z

I'm afraid the architecture of the internal HBase timestamp system is not robust enough to allow the requested functionality without a significant performance drawback.
HBase timestamps are used in Halyard only in a very limited case (mainly within one bulk operation), where it is expected that HBase (configured to retain just the latest triple states) will remove old versions of the triples during the compaction process and recover the performance ASAP.
Theoretically it might be possible to configure HBase to "retain all" records and somehow model time-constrained queries; however, the performance would be very poor and I see no option to model time ranges.

Let me give you an example:
Say you are recording one value that changes over time, for example a temperature, every hour:
37, 38, 37, 35, 34, 35, 37
In this case you'll have to store 7 triple insertions and 6 triple deletions.
Whenever you query for the temperature, HBase will have to crawl through all the values and filter out what you need (based on the requested timestamp).
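
To make the cost concrete, here is a minimal sketch in plain Java (no HBase involved) of the hourly readings above kept as an insert/delete log. The class and method names are hypothetical, the hours 0 to 6 are made-up timestamps, and the linear walk only stands in for the filtering HBase would have to do when every version is retained.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Halyard code) of a "retain all" store: every change is a timestamped event. */
final class TemperatureLog {
    /** One stored event: an insertion or a deletion of a value at a given hour. */
    static final class Event {
        final long timestamp;
        final int value;
        final boolean delete;

        Event(long timestamp, int value, boolean delete) {
            this.timestamp = timestamp;
            this.value = value;
            this.delete = delete;
        }
    }

    private final List<Event> events = new ArrayList<>();
    private Integer current = null;

    /** Records a reading: delete the previous value (if any), then insert the new one. */
    void record(long timestamp, int value) {
        if (current != null) {
            events.add(new Event(timestamp, current, true));
        }
        events.add(new Event(timestamp, value, false));
        current = value;
    }

    /** Total number of stored insertions and deletions. */
    int eventCount() {
        return events.size();
    }

    /** Value visible at the given time; walks the events in order until it passes asOf. */
    Integer valueAt(long asOf) {
        Integer visible = null;
        for (Event e : events) {
            if (e.timestamp > asOf) {
                break; // events are appended in time order
            }
            visible = e.delete ? null : e.value;
        }
        return visible;
    }

    /** Builds the log for the seven hourly readings from the example. */
    static TemperatureLog sample() {
        TemperatureLog log = new TemperatureLog();
        int[] readings = {37, 38, 37, 35, 34, 35, 37};
        for (int hour = 0; hour < readings.length; hour++) {
            log.record(hour, readings[hour]);
        }
        return log;
    }
}
```

Calling `TemperatureLog.sample().eventCount()` counts the 7 insertions plus 6 deletions, and `valueAt(hour)` has to step over every earlier event to find the reading that was current at that hour.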

Practically, it might make sense to use the timestamps as limited transactional support (as you described, "don't get results from an update I might be running"). However, even for that purpose there is a drawback: server compactions. If your update is writing directly to HBase, a compaction may already have removed the older records, and your query (restricted by the timestamp) may get incomplete data.

We may still elaborate on this solution (as you are interested) and find some boundaries where it may work without race conditions.

However, I would rather recommend that you use Halyard bulk updates and bulk loads, which are transactional. Bulk operation data are indexed into standalone HBase files and bulk-loaded into HBase at the final stage as an almost atomic operation. Running a Halyard bulk load or bulk update does not produce any "dirty" data and does not affect current queries (except for multi-stage SPARQL updates, where each stage represents a standalone bulk operation).

To summarise: the Halyard model is not designed to support time (or version) as another queryable dimension (next to subjects, predicates, objects, and named graphs). Doing so would require significant architecture and implementation effort.

pulquero · 2019-01-02T13:15:37Z

Thanks for your response. Given the temperature example, I imagine that working by only inserting data, with no deletions, and using timestamp filtering to scope it. This assumes the min versions for a column family is set to something like 10, e.g. "alter 't1', NAME => 'f1', MIN_VERSIONS => 10".
I think that by default HBase returns only the latest version, so "select ?t where {?s rdf:value ?t}" would just see 37. To retrieve older values, I imagine a query like
"select ?t where {?s rdf:value ?t filter(timerange(30, 50000, 4))}" where the timerange function would get translated to scan.setMaxVersions(4); scan.setTimeRange(30, 50000);
This is what I sort of have in mind, though it doesn't allow the actual timestamps to be returned. Maybe they could be returned as named graphs, say "select ?g ?t where {graph ?g {?s rdf:value ?t} filter(timerange(30, 50000, 4))}" where ?g could have values like halyard:/timestamp/4221. Or better, for every inserted triple "graph ?g {?s ?p ?o}", there is also a virtual triple "graph ?v {?s ?p ?o}" where ?v=strcat(?g, ';timestamp=', insertionTimestamp).
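
As a rough illustration of the read semantics this proposal relies on, here is a small stand-alone Java sketch (it does not use the HBase client). It models one cell with several timestamped versions and a read restricted by a half-open time range [minStamp, maxStamp) and a maximum number of versions, which is how HBase scan time ranges behave. The class `VersionedCell` and its methods are hypothetical names used only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model (not the HBase API) of one cell that keeps several timestamped versions. */
final class VersionedCell {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    /** Insert-only write: a new timestamp adds a version instead of replacing one. */
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    /** Newest-first values with minStamp <= timestamp < maxStamp, at most maxVersions of them. */
    List<String> read(long minStamp, long maxStamp, int maxVersions) {
        NavigableMap<Long, String> inRange =
                versions.subMap(minStamp, true, maxStamp, false).descendingMap();
        List<String> result = new ArrayList<>();
        for (String value : inRange.values()) {
            if (result.size() == maxVersions) {
                break;
            }
            result.add(value);
        }
        return result;
    }

    /** Default read: only the latest version, over the whole time range. */
    List<String> readLatest() {
        return read(0L, Long.MAX_VALUE, 1);
    }
}
```

With the seven temperature readings written at increasing timestamps, `readLatest()` returns only the newest value, while `read(30, 50000, 4)` returns up to four of the newest values whose timestamps fall inside the range.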

asotona · 2019-01-23T12:25:46Z

Hi, sorry for the late response. Your proposal sounds interesting and I will try to figure out all the consequences.
First, let's clarify that exposing timestamps through named graphs directly clashes with stored named graphs, and each query would have to state clearly in which "mode" it is supposed to run.
As for passing maxVersions and timeRange to the scan through a custom function, it must be clear which statement patterns (in a more complex query with more joins, unions, subqueries...) are affected. Unfortunately, functions do not work with statement patterns (only with individual variables and values). The simplest solution could be that the presence of such a function affects every statement-pattern-to-scan mapping within the whole query; however, I'm not 100% sure about the real usability of such a solution.
It definitely gives me a lot to think about.
However, it looks like you have already done some prototypes or experiments. Would you be able to share them?

Thanks,
Adam

pulquero · 2019-01-24T15:16:05Z

The clash with stored named graphs could be avoided by using SERVICE instead of GRAPH. The SERVICE url could take the form of a REST request for the versioned triples, something like:

# find triples that exist in the current version and also in some version specified by a SERVICE call

?s ?p ?o.
.....
SERVICE halyard:dataset?graph=....&minVersion=&maxVersion {
?s ?p ?o.
......
}

Notice, I've extended the behaviour of your existing federated queries. Don't know if it is necessary to have something like halyard:_self to refer to the current dataset. Concat() function can be used to build up the service url dynamically based on vars.

pulquero · 2019-01-28T08:06:26Z

Here is my branch https://github.com/pulquero/Halyard/tree/versioning, hopefully I haven't overlooked anything.

asotona · 2019-02-08T21:54:22Z

Adding service parameters to request a specific HBase timestamp might be possible; however, it is still far from making Halyard a kind of versioned RDF store.

The original intent of the TimeAwareHBaseSail was to support Change Data Capture: a faster, parallel replay of the event backlog using SPARQL Update, without prior sorting and the bottleneck of in-time-order processing, and without a custom, complex MapReduce tool. In that case both the event backlog and the final graph are still represented as standard RDF, no data is persisted in any hidden dimension, and everything can be exported to any other RDF system. I would recommend that anyone model their data in standard RDF instead of starting with proprietary tweaks to the system to support another time/version dimension.

Regarding the implementation: you propose to change the way delete and insert HBase key-values are differentiated in time. The current Halyard implementation exclusively uses the least significant bit, with the original timestamp shifted by one bit. This ensures a deterministic order of deletes and inserts with the same timestamp in any situation. As the timestamp might also be just a sequence of versions, your simplified solution would produce conflicting timestamps between the insert and delete of two subsequent triple versions.
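
The bit-shifting idea can be sketched in a few lines of plain Java. This is only an illustration of the scheme as described in this comment, not a copy of the Halyard code: the class name is hypothetical, and the choice that inserts carry the low bit 1 (so they sort after deletes with the same timestamp) is an assumption made for the example.

```java
/** Illustrative sketch of a shifted timestamp with an insert/delete flag in the lowest bit. */
final class ShiftedTimestamps {
    private ShiftedTimestamps() {
    }

    /**
     * Shifts the user timestamp left by one bit and stores the operation in the freed bit.
     * Assumption for this sketch: 1 marks an insert, 0 marks a delete.
     */
    static long encode(long timestamp, boolean insert) {
        return (timestamp << 1) | (insert ? 1L : 0L);
    }

    /** Recovers the original timestamp by dropping the flag bit. */
    static long decode(long stored) {
        return stored >> 1;
    }

    /** Reads the flag bit back. */
    static boolean isInsert(long stored) {
        return (stored & 1L) == 1L;
    }
}
```

Under that assumption, a delete and an insert written with the same timestamp always get different stored values in a fixed order, and neither can collide with the stored values of the next timestamp, even when timestamps are just consecutive version numbers.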


pulquero · 2019-02-09T00:00:22Z

Yes, service parameters allow finer-grained scoping of the data to be queried, so they should probably be considered in their own right, independently of a versioned RDF store. I can fix the bit-shifting; I just need to shift back and forth between user-provided timestamps and Halyard timestamps.
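
A possible shape for that back-and-forth conversion, as a hedged sketch in plain Java: it assumes stored timestamps are the user timestamp shifted left by one bit with a flag in the lowest bit, and that time ranges are half-open as in HBase scans. The class and method names are hypothetical.

```java
/** Sketch of mapping user-facing time ranges onto one-bit-shifted stored timestamps. */
final class TimeRangeMapper {
    /** Largest user timestamp that still fits in a long after a one-bit left shift. */
    static final long MAX_USER_TIMESTAMP = Long.MAX_VALUE >> 1;

    private TimeRangeMapper() {
    }

    /**
     * Converts a user range bound (inclusive min or exclusive max) to a stored bound.
     * A user timestamp t maps to stored values 2t and 2t+1, so [min, max) becomes [2*min, 2*max).
     */
    static long toStoredBound(long userBound) {
        if (userBound < 0) {
            throw new IllegalArgumentException("negative timestamp: " + userBound);
        }
        // Clamp instead of overflowing when the caller passes an "open-ended" upper bound.
        return userBound > MAX_USER_TIMESTAMP ? Long.MAX_VALUE : userBound << 1;
    }

    /** Converts a stored timestamp back to the user-facing value by dropping the flag bit. */
    static long toUser(long stored) {
        return stored >> 1;
    }
}
```

For a user range of 30 to 50000, the stored scan range would be 60 to 100000, and any stored timestamp coming back from the scan is shifted right once before it is shown to the user.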

For versioned RDF then, essentially we want to write something like
?s [halyard:timestamp ?t; ...]
halyard:timestamp would be a magic property that does something similar to my TimestampFunction. Every triple is versioned through the timestamp of its subject. For performance, filters involving ?t could be identified and converted to time ranges on scans involving ?s as a subject. Export is possible with
construct {
?s ?p ?o.
?s halyard:timestamp ?t.
} where {
?s ?p ?o.
?s halyard:timestamp ?t.
}

pulquero · 2019-04-12T13:05:10Z

I now have a working prototype that includes support for INSERT/DELETE on the above-mentioned branch. TL;DR: the relevant tests are https://github.com/pulquero/Halyard/blob/versioning/sail/src/test/java/com/msd/gin/halyard/sail/HBaseSailVersionTest.java and Change Data Capture looks like

                    "PREFIX : <http://whatever/> " +
                    "PREFIX halyard: <http://merck.github.io/Halyard/ns#> " +
                    "DELETE {" +
                    "  GRAPH ?targetGraph {" +
                    "    ?deleteSubj ?deletePred ?deleteObj ." +
                    "  }" +
                    "}" +
                    "WHERE {" +
                    "  ?change :context   ?targetGraph ;" +
                    "          :timestamp ?t ." +
                    "  OPTIONAL {" +
                    "    ?change :deleteGraph ?delGr ." +
                    "    GRAPH ?delGr {" +
                    "      ?deleteSubj ?deletePred ?deleteObj ." +
                    "    }" +
                    "    (?deleteSubj ?deletePred ?deleteObj ?targetGraph) halyard:timestamp ?t ." +
                    "  }" +
                    "}"

                    "PREFIX : <http://whatever/> " +
                    "PREFIX halyard: <http://merck.github.io/Halyard/ns#> " +
                    "INSERT {" +
                    "  GRAPH ?targetGraph {" +
                    "    ?insertSubj ?insertPred ?insertObj ." +
                    "  }" +
                    "  (?insertSubj ?insertPred ?insertObj ?targetGraph) halyard:timestamp ?t ." +
                    "}" +
                    "WHERE {" +
                    "  ?change :context   ?targetGraph ;" +
                    "          :timestamp ?t ." +
                    "  FILTER (halyard:forkAndFilterBy(2, ?change))" +
                    "  OPTIONAL {" +
                    "    ?change :insertGraph ?insGr ." +
                    "    GRAPH ?insGr {" +
                    "      ?insertSubj ?insertPred ?insertObj ." +
                    "    }" +
                    "  }" +
                    "}"

asotona added the enhancement label Feb 18, 2019
