handling greedy full text indexes #1064
Comments
We already do so for We could also add a third query variable besides |
I think it would indeed be very useful to add another variable like that. Treating the phrase as "OR"ed words seems to be the default behaviour of the Jena Full Text Search. We have a similar problem with the current Geonames implementation. Searching for "Sas van Gent" returns 195 results, almost all of them irrelevant. Rewriting the query to Note: The GTAA also uses the Jena Full Text Search but does not have this problem; searching for Wim de Bie returns the correct results. Maybe we should approach @wmelder to ask for details on his config settings? |
Thanks for signalling this @EnnoMeijers. I agree, using an AND clause for multiple terms should preferably be standard. @ddeboer what would be the optimal solution in your opinion? Would adding |
Sounds familiar, the problem of too many hits with full text search. The GTAA settings for the Lucene index here.
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb1: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix tdb2: <http://jena.apache.org/2016/tdb#> .
@prefix text: <http://jena.apache.org/text#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix : <#> .
[] rdf:type fuseki:Server .

<#service_public> rdf:type fuseki:Service ;
    fuseki:name "public" ;
    fuseki:label "Service Layer Triple Store (for public access)" ;
    fuseki:serviceQuery "query" , "sparql" ;
    fuseki:serviceReadGraphStore "get" ;
    fuseki:dataset <#text_public> ;
    fuseki:serviceUpdate "update" ;
    fuseki:serviceUpload "upload" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    .

<#text_public> rdf:type text:TextDataset ;
    text:dataset <#tdb_public> ;
    text:index <#index_public> ;
    .

<#index_public> rdf:type text:TextIndexLucene ;
    text:directory "/opt/fuseki/databases/public_text" ;
    text:entityMap <#entity_map_public> ;
    text:analyzer [
        rdf:type text:ConfigurableAnalyzer ;
        text:tokenizer text:StandardTokenizer ;
        text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
    ] ;
    .

<#entity_map_public> rdf:type text:EntityMap ;
    text:defaultField "prefLabel" ;
    text:entityField "uri" ;
    text:uidField "uid" ;
    text:map (
        [
            text:field "prefLabel" ;
            text:predicate <http://www.w3.org/2004/02/skos/core#prefLabel>
        ]
        [
            text:field "altLabel" ;
            text:predicate <http://www.w3.org/2004/02/skos/core#altLabel>
        ]
        [
            text:field "hiddenLabel" ;
            text:predicate <http://www.w3.org/2004/02/skos/core#hiddenLabel>
        ]
    ) ;
    .

<#tdb_public> rdf:type tdb1:DatasetTDB ;
    tdb1:location "/opt/fuseki/databases/public" ;
    ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "120000" ] ;
    .
Note that the GTAA service layer also provides score information, so that results can be relevance ranked. Please let me know if you need more information. |
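For reference, relevance ranking with the Jena text index could look roughly like the sketch below. It builds a SPARQL query that binds the Lucene score through `text:query` and sorts on it. This is not the GTAA service layer's actual query: the helper name `build_ranked_query`, the variable names, and the choice to search only `skos:prefLabel` are assumptions for illustration.

```python
def build_ranked_query(lucene_query: str, limit: int = 10) -> str:
    """Build a SPARQL query that ranks Jena text index hits by score.

    Hypothetical helper: the name, the variables and the restriction to
    skos:prefLabel are illustrative assumptions, not an existing API.
    """
    # Escape backslashes and double quotes so the Lucene query can be
    # embedded safely in a SPARQL string literal.
    literal = lucene_query.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?uri ?prefLabel ?score WHERE {{
  (?uri ?score) text:query (skos:prefLabel "{literal}" {limit}) .
  ?uri skos:prefLabel ?prefLabel .
}}
ORDER BY DESC(?score)"""


if __name__ == "__main__":
    print(build_ranked_query("Wim AND de AND Bie", limit=5))
```

Sorting on `?score` only puts the best matches first; it does not reduce the number of hits, so it complements an AND clause rather than replacing it.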
Ah, thanks for sharing @wmelder! It looks quite similar to the Geonames configuration that we currently use. I can't explain the differences in behaviour for searching between these two, can you @ddeboer? |
Using the |
I ran into a case where the full text index is extremely greedy, see https://api.triplydb.com/s/45wCpVY7r . The only way to work with this set is by using an AND clause on the search terms, like this: https://api.triplydb.com/s/KVXEVl14Y.
Using queryMode=RAW in the GraphQL API enables this way of querying, but this is not an option when using the Reconciliation Service API or the Demonstrator. Could this be a default setting in the dataset definition? For reconciliation it should expand the search query with an AND clause when multiple terms are used.
Any ideas how to handle this?
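One possible direction, as a minimal sketch rather than a proposal for the actual implementation: rewrite the user's input before it reaches the text index, so that multiple terms are joined with AND. The function name `to_and_query` is hypothetical, and the sketch assumes the index is queried with Lucene's classic query syntax and that its analyzer lowercases terms.

```python
import re

# Characters with special meaning in Lucene's classic query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')

# Upper-case words that Lucene would parse as boolean operators.
_LUCENE_OPERATORS = {"AND", "OR", "NOT"}


def to_and_query(search: str) -> str:
    """Join whitespace-separated search terms with AND.

    Hypothetical helper for illustration only.
    """
    terms = []
    for term in search.split():
        if term in _LUCENE_OPERATORS:
            # Lower-case operator words so they are treated as plain terms
            # (assumes the index analyzer lowercases terms anyway).
            term = term.lower()
        # Backslash-escape special characters so user input is matched literally.
        terms.append(_LUCENE_SPECIAL.sub(r"\\\1", term))
    return " AND ".join(terms)


if __name__ == "__main__":
    print(to_and_query("Sas van Gent"))
```

A single term is passed through unchanged apart from escaping, so only multi-term searches are affected.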