Performance of graph.query over large local triple store #1563
-
I have a local triplestore (Sleepycat) with 6,147,995 triples. When I run a query, it gets stuck. Memory usage stays under 50% and CPU usage is around 60%, so I am wondering whether the issue is with RDFLib or my machine. Example query:
SELECT (COUNT(?instanceOfClassA) AS ?count) ?prop
WHERE {
  ?instanceOfClassA a <http://data.linkedmdb.org/resource/movie/actor> .
  ?instanceOfClassB a <http://data.linkedmdb.org/resource/movie/writer> .
  ?instanceOfClassA ?prop ?instanceOfClassB .
}
GROUP BY ?prop
ORDER BY DESC(?count) LIMIT 1
Code Snippet:
qres = graph.query(query)
print qres
for row in qres:
    print row
It gets stuck after printing <rdflib.plugins.sparql.processor.SPARQLResult object at 0x7f14df033f50>. I may need to work with even larger data dumps, so I need to know whether RDFLib has any limit on the size of the triplestore.
Replies: 8 comments
-
RDFLib has no limit as such; the limit is your patience :D
You don't say what store you use; I assume the in-memory store?
RDFLib isn't really built for speed. I would say that 6M triples in the
memory store should be fine, but it always depends on the queries you run
(and your latency requirements).
That query will take the cartesian product of all actors and all writers and
then find the properties between them; that is quite a few triples to check.
If you want better performance, I use Jena Fuseki wrapped in a
SPARQLUpdateStore.
- Gunnar
…On 2 November 2017 at 13:11, Reshma ***@***.***> wrote:
I have a local triplestore with 6,147,995 triples.
Specs of my laptop:
Memory : 8GB
Processor : Intel® Core i7-5500U CPU @ 2.40GHz x 4
Graphics : Intel® HD Graphics 5500 (Broadwell GT2)
OS type : 64 bit
OS : Ubuntu
When I run a query, it gets stuck. The memory usage is under 50% and my
CPU usage is also around 60% approx. So, I am wondering whether it is the
issue of RDFlib or my machine.
Example query:
SELECT (count(?instanceOfClassA) as ?count) ?prop
WHERE {
?instanceOfClassA a <http://data.linkedmdb.org/resource/movie/actor> .
?instanceOfClassB a <http://data.linkedmdb.org/resource/movie/writer> .
?instanceOfClassA ?prop ?instanceOfClassB .
}
GROUP BY ?prop
ORDER BY DESC(?count) LIMIT 1
Code Snippet:
qres = graph.query(query)
print qres
for row in qres:
    print row
It gets stuck after printing <rdflib.plugins.sparql.processor.SPARQLResult
object at 0x7f14df033f50>
I may need to work with even larger datadumps, so I need to know whether
RDFlib has any limit on the size of the triplestore.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#787>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAK3bHro4etibKfxWPIC2QfRydLi5AZ-ks5sybF_gaJpZM4QPlo2>
.
-
@gromgull I am using the Sleepycat triplestore, not memory. I thought once the query result is assigned to qres …
-
@gromgull Do you mean setting up a local endpoint using Jena Fuseki and then accessing it through RDFLib like we would do with other external endpoints? I am very new to linked data (and even Python) technologies, so any help would be highly appreciated.
-
@gromgull I left the script running for around 2 hours and it stayed stuck, so I ended up stopping it. As I understand it, Sleepycat is persistent storage, so if RDFLib does not have such a limitation then maybe I am doing something wrong? This is how I created the store:
graph = ConjunctiveGraph('Sleepycat')
rt = graph.open(path_to_triplestore, create=False)
if rt == NO_STORE:
    graph.open(path_to_triplestore, create=True)
else:
    assert rt == VALID_STORE, 'The underlying store is corrupt'
for i in range(0, 201): # the source dump was broken down into smaller files
    g = Graph()
    folder = "temp_" + source
    path = "../data/" + folder + "/file" + str(i) + ".nt"
    g.parse(path, format="ttl")
    for t in g:
        graph.add(t)
graph.close()
And this is how I am querying:
graph = ConjunctiveGraph('Sleepycat')
graph.open(path_to_triplestore, create=False)
qres = graph.query(query)
print qres
for row in qres:
    print row
-
I mean: RDFLib doesn't have a technical limitation where it says "no, I won't
process this". It will do it, but it is slow, and since your query is
quite demanding it may take hours, days, weeks or months to finish. But
assuming you don't run out of memory, it will get there.
Assuming your dataset has 1M actors and 1M writers, you have to do 1M*1M
SPO queries for that query. That will take a while.
Someone wrote a blog post using the SPARQLUpdateStore against Stardog here:
https://lawlesst.github.io/notebook/rdflib-stardog.html
Using Stardog or Fuseki or similar will give better performance, but your
query is still very complex.
- Gunnar
…On 3 November 2017 at 04:06, Reshma ***@***.***> wrote:
@gromgull <https://github.com/gromgull> I left the script running for
around 2 hours and it still stayed stuck and then I ended up stopping it.
As I understand Sleepycat is a persistent storage, so if RDFlib does not
have such limitation then may be I am doing something wrong?
This is how I created the store:
graph = ConjunctiveGraph('Sleepycat')
rt = graph.open(path_to_triplestore, create=False)
if rt == NO_STORE:
    graph.open(path_to_triplestore, create=True)
else:
    assert rt == VALID_STORE, 'The underlying store is corrupt'
for i in range(0, 201): # the source dump was broken down into smaller files
    g = Graph()
    folder = "temp_" + source
    path = "../data/" + folder + "/file" + str(i) + ".nt"
    g.parse(path, format="ttl")
    for t in g:
        graph.add(t)
graph.close()
And this is how I am querying:
graph = ConjunctiveGraph('Sleepycat')
graph.open(path_to_triplestore, create=False)
qres = graph.query(query)
print qres
for row in qres:
    print row
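To put rough numbers on the 1M*1M estimate above, a back-of-the-envelope calculation; the actor/writer counts come from the reply, and the probe rate is an illustrative assumption, not a measurement:

```python
# Cost of the nested-loop join described above: one s/p/o probe per
# actor/writer pair. The counts and the probe rate are assumptions
# for illustration only.
actors = 1_000_000
writers = 1_000_000
lookups = actors * writers        # one probe per pair
rate = 100_000                    # assumed probes per second
seconds = lookups / rate
days = seconds / 86_400
print(lookups)        # 1000000000000
print(round(days))    # 116
```

At a trillion probes, even an optimistic lookup rate leaves the naive evaluation running for months, which matches the "hours, days, weeks or months" warning above.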
-
@gromgull Thank you for clarifying. Here is the query:
…
However, when I run the same query against Archiveshub's endpoint, it executes in about a second. How can I achieve such performance with my local endpoint?
-
Hi there,
Does anyone have an explanation for why this is the case?
-
Could this be an issue due to the order of triple patterns?
As far as I know, databases such as the one behind a SPARQL endpoint like Virtuoso may reorder triple patterns in order to prevent a large number of intermediate results.
I don't know the inner workings of rdflib, but if it doesn't perform this kind of reordering, the intermediate result of calculating the cross product between actors and writers could become extremely large. However, if you have a database with clever optimization techniques, it could reorder the triple patterns to 3-2-1, at which point the possible values for patterns 1 and 2 would be much smaller.
P.S.: Using rdflib 6 and above, you can use the SERVICE keyword to query another triplestore instead.