Performance of graph.query over large local triple store #1563
-
I have a local triplestore (Sleepycat) with 6,147,995 triples. When I run a query, it gets stuck. Memory usage stays under 50% and CPU usage is around 60%, so I am wondering whether the issue is with RDFLib or my machine. Example query:
SELECT (COUNT(?instanceOfClassA) AS ?count) ?prop
WHERE {
  ?instanceOfClassA a <http://data.linkedmdb.org/resource/movie/actor> .
  ?instanceOfClassB a <http://data.linkedmdb.org/resource/movie/writer> .
  ?instanceOfClassA ?prop ?instanceOfClassB .
}
GROUP BY ?prop
ORDER BY DESC(?count) LIMIT 1
Code Snippet:
qres = graph.query(query)
print qres
for row in qres:
    print row
It gets stuck after printing <rdflib.plugins.sparql.processor.SPARQLResult object at 0x7f14df033f50>. I may need to work with even larger data dumps, so I need to know whether RDFLib has any limit on the size of the triplestore.
Replies: 8 comments
-
RDFLib has no limit as such; the limit is your patience :D
You don't say what store you use; I assume the in-memory store?
RDFLib isn't really built for speed. I would say that 6M triples in the
memory store should be fine, but it always depends on the queries you run
(and your latency requirements).
That query will take the cartesian product of all actors and all writers and
then find the properties between them; that is quite a few triples to check.
If you want better performance, I use Jena Fuseki wrapped in a
SPARQLUpdateStore.
- Gunnar
…On 2 November 2017 at 13:11, Reshma ***@***.***> wrote:
I have a local triplestore with 6,147,995 triples.
Specs of my laptop:
Memory : 8GB
Processor : Intel® Core i7-5500U CPU @ 2.40GHz x 4
Graphics : Intel® HD Graphics 5500 (Broadwell GT2)
OS type : 64 bit
OS : Ubuntu
When I run a query, it gets stuck. The memory usage is under 50% and my
CPU usage is also around 60% approx. So, I am wondering whether it is the
issue of RDFlib or my machine.
Example query:
SELECT (count(?instanceOfClassA) as ?count) ?prop
WHERE {
?instanceOfClassA a <http://data.linkedmdb.org/resource/movie/actor> .
?instanceOfClassB a <http://data.linkedmdb.org/resource/movie/writer> .
?instanceOfClassA ?prop ?instanceOfClassB .
}
GROUP BY ?prop
ORDER BY DESC(?count) LIMIT 1
Code Snippet:
qres = graph.query(query)
print qres
for row in qres:
    print row
It gets stuck after printing <rdflib.plugins.sparql.processor.SPARQLResult
object at 0x7f14df033f50>
I may need to work with even larger datadumps, so I need to know whether
RDFlib has any limit on the size of the triplestore.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#787>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAK3bHro4etibKfxWPIC2QfRydLi5AZ-ks5sybF_gaJpZM4QPlo2>
.
-
@gromgull I am using the Sleepycat triplestore, not memory. I thought once the query result is assigned to qres …
-
@gromgull Do you mean setting up a local endpoint using Jena Fuseki and then accessing it through RDFLib like we would do with other external endpoints? I am very new to linked data (and even Python) technologies, so any help would be highly appreciated.
-
@gromgull I left the script running for around 2 hours and it stayed stuck, so I ended up stopping it. As I understand it, Sleepycat is persistent storage, so if RDFLib does not have such a limitation then maybe I am doing something wrong? This is how I created the store:
graph = ConjunctiveGraph('Sleepycat')
rt = graph.open(path_to_triplestore, create=False)
if rt == NO_STORE:
    graph.open(path_to_triplestore, create=True)
else:
    assert rt == VALID_STORE, 'The underlying store is corrupt'
for i in range(0, 201): # the source dump was broken down into smaller files
    g = Graph()
    folder = "temp_" + source
    path = "../data/" + folder + "/file" + str(i) + ".nt"
    g.parse(path, format="ttl")
    for t in g:
        graph.add(t)
graph.close()
And this is how I am querying:
graph = ConjunctiveGraph('Sleepycat')
graph.open(path_to_triplestore, create=False)
qres = graph.query(query)
print qres
for row in qres:
    print row
-
I mean: RDFLib doesn't have a technical limitation where it says "no, I won't
process this". It will do it, but it is slow, and since your query is
quite demanding it may take hours, days, weeks or months to finish. But
assuming you don't run out of memory, it will get there.
Assuming your dataset has 1M actors and 1M writers, you have to do 1M*1M
SPO queries for that query. That will take a while.
Someone wrote a blog post using the SPARQLUpdateStore against Stardog here:
https://lawlesst.github.io/notebook/rdflib-stardog.html
Using Stardog or Fuseki or similar will give better performance, but your
query is still very complex.
- Gunnar
…On 3 November 2017 at 04:06, Reshma ***@***.***> wrote:
@gromgull <https://github.com/gromgull> I left the script running for
around 2 hours and it still stayed stuck and then I ended up stopping it.
As I understand Sleepycat is a persistent storage, so if RDFlib does not
have such limitation then may be I am doing something wrong?
This is how I created the store:
graph = ConjunctiveGraph('Sleepycat')
rt = graph.open(path_to_triplestore, create=False)
if rt == NO_STORE:
    graph.open(path_to_triplestore, create=True)
else:
    assert rt == VALID_STORE, 'The underlying store is corrupt'
for i in range(0, 201): # the source dump was broken down into smaller files
    g = Graph()
    folder = "temp_" + source
    path = "../data/" + folder + "/file" + str(i) + ".nt"
    g.parse(path, format="ttl")
    for t in g:
        graph.add(t)
graph.close()
And this is how I am querying:
graph = ConjunctiveGraph('Sleepycat')
graph.open(path_to_triplestore, create=False)
qres = graph.query(query)
print qres
for row in qres:
    print row
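To put rough numbers on the 1M*1M estimate above, a back-of-the-envelope calculation; the actor/writer counts come from the reply, and the probe rate is an illustrative assumption, not a measurement:

```python
# Cost of the nested-loop join described above: one s/p/o probe per
# actor/writer pair. The counts and the probe rate are assumptions
# for illustration only.
actors = 1_000_000
writers = 1_000_000
lookups = actors * writers        # one probe per pair
rate = 100_000                    # assumed probes per second
seconds = lookups / rate
days = seconds / 86_400
print(lookups)        # 1000000000000
print(round(days))    # 116
```

At a trillion probes, even an optimistic lookup rate leaves the naive evaluation running for months, which matches the "hours, days, weeks or months" warning above.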
-
@gromgull Thank you for clarifying. Here is the query:
…
However, when I run the same query against Archiveshub's endpoint, it executes in about a second. How can I achieve such performance with my local endpoint?
-
Hi there,
Does anyone have an explanation for why this is the case?
-
Could this be an issue due to the order of triple patterns?
As far as I know, databases such as the one behind a SPARQL endpoint like Virtuoso may reorder triple patterns in order to prevent a large number of intermediate results.
I don't know the inner workings of rdflib, but if it doesn't perform this kind of reordering, the intermediate result of calculating the cross product between actors and writers could become extremely large. However, if you have a database with clever optimization techniques, it could reorder the triple patterns to 3-2-1, at which point the possible values for patterns 1 and 2 would be much smaller.
P.S.: Using rdflib 6 and above, you can use the SERVICE keyword to query another triplestore instead.