HBase version querying #47

Open
pulquero opened this issue Aug 22, 2018 · 10 comments

Comments

@pulquero

A function and/or magic property to enable HBase time range filtering, i.e. restrict queries to triples within a specified time range.

@asotona
Collaborator

asotona commented Aug 23, 2018

First, I need to understand the reason for this filtering. Do you need to store multiple versions of triples and query across them, do you want to implement a basic transaction system based on timestamps, or is there another reason?

Thanks,
Adam

@pulquero
Author

Both. I want to be able to filter triples by the timestamp they were inserted, e.g. all news items in the last 6 months, or all records released between 2015 and 2017. Also, I want to be able to limit a query up to a timestamp so I don't get results from an update I might be running, or new data currently being added.

@asotona
Collaborator

asotona commented Jan 2, 2019

I'm afraid the internal HBase timestamp system is not robust enough to support the requested functionality without a significant performance penalty.
Halyard uses HBase timestamps only in very limited cases (mainly within a single bulk operation), where HBase, configured to retain just the latest triple states, is expected to remove old versions of the triples during compaction and recover performance as soon as possible.
In theory it might be possible to configure HBase to "retain all" records and somehow model time-constrained queries, but performance would be very poor, and I see no way to model time ranges.

Let me give you an example:
Suppose you record one value that changes over time, say temperature, every hour:
37, 38, 37, 35, 34, 35, 37
In this case you have to store 7 triple insertions and 6 triple deletions.
Whenever you query for the temperature, HBase has to crawl through all the values and filter out what you need (based on the requested timestamp).
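
To make the counting above concrete, here is a minimal plain-Java sketch (no HBase involved). It assumes each new reading deletes the previous triple and inserts a new one, and that a point-in-time read replays the whole log. The class and method names (`TemperatureLog`, `record`, `valuesAt`) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TemperatureLog {
    /** One versioned write: a value inserted or deleted at a timestamp. */
    static final class Event {
        final long timestamp;
        final int value;
        final boolean insert;

        Event(long timestamp, int value, boolean insert) {
            this.timestamp = timestamp;
            this.value = value;
            this.insert = insert;
        }
    }

    /** Builds the log for hourly readings: every reading after the first deletes its predecessor. */
    static List<Event> record(int... readings) {
        List<Event> log = new ArrayList<>();
        for (int hour = 0; hour < readings.length; hour++) {
            if (hour > 0) {
                log.add(new Event(hour, readings[hour - 1], false));
            }
            log.add(new Event(hour, readings[hour], true));
        }
        return log;
    }

    /** Replays every event up to and including maxTimestamp, so the cost grows with the log size. */
    static Set<Integer> valuesAt(List<Event> log, long maxTimestamp) {
        Set<Integer> live = new LinkedHashSet<>();
        for (Event e : log) {
            if (e.timestamp > maxTimestamp) {
                continue;
            }
            if (e.insert) {
                live.add(e.value);
            } else {
                live.remove(e.value);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        List<Event> log = record(37, 38, 37, 35, 34, 35, 37);
        System.out.println(log.size() + " events"); // 7 insertions + 6 deletions
        System.out.println(valuesAt(log, 3));       // value visible at hour 3
    }
}
```

The replay loop in `valuesAt` is the "crawl through all the values" step: nothing in the log can be skipped without knowing which later deletions apply.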

In practice it might make sense to use timestamps for limited transactional support (as you described, "don't get results from an update I might be running"). Even for that purpose there may be a drawback: server compactions. If your update writes directly to HBase, a compaction may already have removed the older records, and your query (restricted by the timestamp) may get incomplete data.

We can still explore this solution (since you are interested) and look for boundaries within which it works without race conditions.

However, I would rather recommend using Halyard bulk updates and bulk loads, which are transactional. Bulk operation data are indexed into standalone HBase files and bulk-loaded into HBase at the final stage as an almost atomic operation. Running a Halyard bulk load or bulk update does not produce any "dirty" data and does not affect running queries (except for multi-stage SPARQL updates, where each stage is a standalone bulk operation).

To summarise: the Halyard model is not designed to support time (or version) as another queryable dimension (next to subjects, predicates, objects, and named graphs). Supporting it would require significant architecture and implementation effort.

@pulquero
Author

pulquero commented Jan 2, 2019

Thanks for your response. Given the temperature example, I imagine it working by only inserting data, with no deletions, and using timestamp filtering to scope it. That assumes the min versions for a column family is set to something like 10, e.g. "alter 't1', NAME => 'f1', MIN_VERSIONS => 10".
I think by default HBase returns only the latest version, so "select ?t where {?s rdf:value ?t}" would just see 37. To retrieve older values, I imagine a query like
"select ?t where {?s rdf:value ?t filter(timerange(30, 50000, 4))}" where the timerange function would get translated to scan.setMaxVersions(4); scan.setTimeRange(30, 50000);
This is roughly what I have in mind, though it doesn't allow the actual timestamps to be returned. Maybe they could be returned as named graphs, say "select ?g ?t where {graph ?g {?s rdf:value ?t} filter(timerange(30, 50000, 4))}" where ?g could have values like halyard:/timestamp/4221. Or better, for every inserted triple "graph ?g {?s ?p ?o}", there is also a virtual triple "graph ?v {?s ?p ?o}" where ?v=strcat(?g, ';timestamp=', insertionTimestamp).
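
As a rough illustration of what the proposed `timerange(min, max, versions)` would select, the sketch below models one cell's versions in plain Java rather than calling HBase. It assumes the semantics of the two scan settings named above: a time range that includes the minimum and excludes the maximum, with the version limit applied to the newest cells inside that range. `TimeRangeSketch`, `select`, and the sample timestamps (10 to 70) are hypothetical and not part of Halyard or HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class TimeRangeSketch {
    /**
     * Returns up to maxVersions values whose timestamps fall in [minStamp, maxStamp),
     * newest first. This mirrors the assumed effect of a time range plus a version limit.
     */
    static <V> List<V> select(NavigableMap<Long, V> versions,
                              long minStamp, long maxStamp, int maxVersions) {
        List<V> out = new ArrayList<>();
        NavigableMap<Long, V> inRange = versions.subMap(minStamp, true, maxStamp, false);
        for (V value : inRange.descendingMap().values()) {
            if (out.size() == maxVersions) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Insert-only history of the temperature example, one version per timestamp.
        NavigableMap<Long, Integer> cell = new TreeMap<>();
        int[] readings = {37, 38, 37, 35, 34, 35, 37};
        for (int i = 0; i < readings.length; i++) {
            cell.put((i + 1) * 10L, readings[i]);
        }
        System.out.println(select(cell, 30, 50000, 4)); // prints [37, 35, 34, 35]
    }
}
```

With these sample timestamps the two oldest readings fall outside the range, and the version limit then drops the oldest reading that remains.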

@asotona
Collaborator

asotona commented Jan 23, 2019

Hi, sorry for the late response. Your proposal sounds interesting and I will try to work out all the consequences.
First, let's clarify that exposing timestamps through named graphs directly clashes with stored named graphs, so each query would have to state clearly in which "mode" it is supposed to run.
As for passing maxVersions and timeRange to the scan through a custom function, it must be clear which statement patterns (in a more complex query with more joins, unions, subqueries...) are affected. Unfortunately, functions do not work with statement patterns, only with individual variables and values. The simplest solution could be that the presence of such a function affects every statement-pattern-to-scan mapping within the whole query, but I'm not 100% sure how usable such a solution would really be.
It definitely gives me a lot to think about.
However, it looks like you have already done some prototypes or experiments; would you be able to share them?

Thanks,
Adam

@pulquero
Author

The clash with stored named graphs could be avoided by using SERVICE instead of GRAPH. The SERVICE url could take the form of a REST request for the versioned triples, something like:

# find triples that exist in the current version and also in some version specified by a SERVICE call

?s ?p ?o.
.....
SERVICE halyard:dataset?graph=....&minVersion=&maxVersion {
?s ?p ?o.
......
}

Note that I've extended the behaviour of your existing federated queries. I don't know whether it is necessary to have something like halyard:_self to refer to the current dataset. The concat() function can be used to build the service URL dynamically from variables.

@pulquero
Author

Here is my branch https://github.com/pulquero/Halyard/tree/versioning, hopefully I haven't overlooked anything.

@asotona
Collaborator

asotona commented Feb 8, 2019 via email

@pulquero
Author

pulquero commented Feb 9, 2019

Yes, service parameters allow finer-grained scoping of the data to be queried, so they should probably be considered in their own right, independently of a versioned RDF store. I can fix the bitshifting; I just need to shift back and forth between user-provided timestamps and Halyard timestamps.
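
The thread does not spell out the internal timestamp layout, so the following is only a sketch of the round trip under a loudly stated assumption: that the stored timestamp is the user timestamp shifted left by a fixed number of bits, with the low bits reserved for internal use. `RESERVED_BITS`, `toInternal`, and `toUser` are hypothetical names and values, not taken from Halyard.

```java
final class TimestampShiftSketch {
    // ASSUMPTION: number of low bits reserved internally; chosen only for illustration.
    static final int RESERVED_BITS = 1;

    /** Converts a user-provided timestamp to the assumed internal form. */
    static long toInternal(long userTimestamp) {
        if (userTimestamp < 0 || userTimestamp > (Long.MAX_VALUE >> RESERVED_BITS)) {
            throw new IllegalArgumentException("timestamp out of range: " + userTimestamp);
        }
        return userTimestamp << RESERVED_BITS;
    }

    /** Converts an internal timestamp back, discarding the reserved low bits. */
    static long toUser(long internalTimestamp) {
        return internalTimestamp >>> RESERVED_BITS;
    }

    public static void main(String[] args) {
        long internal = toInternal(4221L);
        System.out.println(internal);             // 8442 with RESERVED_BITS = 1
        System.out.println(toUser(internal));     // 4221
        System.out.println(toUser(internal | 1)); // 4221, reserved bit is ignored
    }
}
```

Under this assumption, time range bounds supplied in a query would go through `toInternal` before reaching a scan, and any timestamp returned to the user would go through `toUser`.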

For versioned RDF then, essentially we want to write something like
?s [halyard:timestamp ?t; ...]
halyard:timestamp would be a magic property that does something similar to my TimestampFunction. Every triple is versioned through the timestamp of its subject. For performance, filters involving ?t could be identified and converted to time ranges on scans involving ?s as a subject. Export is possible with
construct {
?s ?p ?o.
?s halyard:timestamp ?t.
} where {
?s ?p ?o.
?s halyard:timestamp ?t.
}
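
The idea of converting filters on ?t into scan time ranges can be sketched as follows. This is a plain-Java illustration, assuming filters are simple comparisons of ?t against a constant and that the target is a half-open range [min, max). `TimestampFilterSketch`, `Range`, and `narrow` are hypothetical names; a real implementation would have to walk the query algebra instead of taking operator strings.

```java
final class TimestampFilterSketch {
    /** Half-open timestamp range [min, max). */
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }

        /** The unrestricted range a scan would start with. */
        static Range all() {
            return new Range(0L, Long.MAX_VALUE);
        }
    }

    /**
     * Narrows a range with one comparison of the form "?t op constant".
     * Constants equal to Long.MAX_VALUE are not handled (c + 1 would overflow).
     */
    static Range narrow(Range r, String op, long c) {
        switch (op) {
            case ">=": return new Range(Math.max(r.min, c), r.max);
            case ">":  return new Range(Math.max(r.min, c + 1), r.max);
            case "<":  return new Range(r.min, Math.min(r.max, c));
            case "<=": return new Range(r.min, Math.min(r.max, c + 1));
            default:
                throw new IllegalArgumentException("unsupported operator: " + op);
        }
    }

    public static void main(String[] args) {
        // FILTER(?t >= 30 && ?t < 50000)
        Range r = narrow(narrow(Range.all(), ">=", 30), "<", 50000);
        System.out.println("[" + r.min + ", " + r.max + ")"); // prints [30, 50000)
    }
}
```

Conjunctions narrow the same range step by step; anything the sketch cannot translate (disjunctions, comparisons between variables) would have to stay as an ordinary filter evaluated after the scan.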

@pulquero
Author

I now have a working prototype that includes support for INSERT/DELETE on the above mentioned branch. TLDR, relevant tests are https://github.com/pulquero/Halyard/blob/versioning/sail/src/test/java/com/msd/gin/halyard/sail/HBaseSailVersionTest.java and Change Data Capture looks like

                    "PREFIX : <http://whatever/> " +
                    "PREFIX halyard: <http://merck.github.io/Halyard/ns#> " +
                    "DELETE {" +
                    "  GRAPH ?targetGraph {" +
                    "    ?deleteSubj ?deletePred ?deleteObj ." +
                    "  }" +
                    "}" +
                    "WHERE {" +
                    "  ?change :context   ?targetGraph ;" +
                    "          :timestamp ?t ." +
                    "  OPTIONAL {" +
                    "    ?change :deleteGraph ?delGr ." +
                    "    GRAPH ?delGr {" +
                    "      ?deleteSubj ?deletePred ?deleteObj ." +
                    "    }" +
                    "    (?deleteSubj ?deletePred ?deleteObj ?targetGraph) halyard:timestamp ?t ." +
                    "  }" +
                    "}"

                    "PREFIX : <http://whatever/> " +
                    "PREFIX halyard: <http://merck.github.io/Halyard/ns#> " +
                    "INSERT {" +
                    "  GRAPH ?targetGraph {" +
                    "    ?insertSubj ?insertPred ?insertObj ." +
                    "  }" +
                    "  (?insertSubj ?insertPred ?insertObj ?targetGraph) halyard:timestamp ?t ." +
                    "}" +
                    "WHERE {" +
                    "  ?change :context   ?targetGraph ;" +
                    "          :timestamp ?t ." +
                    "  FILTER (halyard:forkAndFilterBy(2, ?change))" +
                    "  OPTIONAL {" +
                    "    ?change :insertGraph ?insGr ." +
                    "    GRAPH ?insGr {" +
                    "      ?insertSubj ?insertPred ?insertObj ." +
                    "    }" +
                    "  }" +
                    "}"
