handling greedy full text indexes #1064

EnnoMeijers · 2023-08-07T17:52:57Z

I ran into a case where the full text index is extremely greedy, see https://api.triplydb.com/s/45wCpVY7r . The only way to work with this set is by using an AND clause on the search terms, like this: https://api.triplydb.com/s/KVXEVl14Y.

Using queryMode=RAW in the GraphQL API enables this way of querying, but this is not an option when using the Reconciliation Service API or the Demonstrator. Could this be a default setting in the dataset definition? For reconciliation it should expand the search query with an AND clause when multiple terms are used.

Any ideas how to handle this?

ddeboer · 2023-08-07T19:02:57Z

> it should expand the search query with an AND clause when multiple terms are used.

We already do so for ?virtuosoQuery; you can try using that in your SPARQL query from the NoT, although the backend seems to be Fuseki rather than Virtuoso.

We could also add a third query variable besides ?query and ?virtuosoQuery. Something like ?intersectionQuery, which joins all query words with AND but does not quote them like ?virtuosoQuery does.
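A minimal sketch of what such a variable could expand to, assuming the search terms are separated by whitespace. Both function names are hypothetical, and the quoted variant only approximates the ?virtuosoQuery behaviour described above; the actual implementation may differ (for example in how it escapes special characters).

```python
def intersection_query(query: str) -> str:
    """Join all whitespace-separated words with AND, without quoting them."""
    return " AND ".join(query.split())


def quoted_intersection_query(query: str) -> str:
    """Same join, but wrap each word in double quotes.

    Assumption: this is roughly the shape ?virtuosoQuery produces.
    """
    return " AND ".join(f'"{word}"' for word in query.split())


print(intersection_query("Sas van Gent"))
print(quoted_intersection_query("Sas van Gent"))
```

Neither function escapes Lucene operators or reserved characters inside the words, which a real implementation would probably need to handle.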

EnnoMeijers · 2023-08-08T05:48:54Z

I think it would indeed be very useful to add another variable like that. Treating the phrase as "OR"ed words seems to be the default behaviour of the Jena Full Text Search. We have a similar problem with the current Geonames implementation. Searching for "Sas van Gent" returns 195 results, almost all of them irrelevant. Rewriting the query to "(sas AND gent AND van)" returns the two relevant ones, see https://api.triplydb.com/s/8DBt_82Qy.
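To illustrate why OR semantics are so greedy, here is a toy sketch that matches a query against a handful of labels in memory. The labels are made up for illustration and are not actual GeoNames index contents, and the whitespace tokenizer is a simplification of what a Lucene analyzer does.

```python
def tokens(text: str) -> set:
    """Lowercase and split on whitespace (a simplified stand-in for an analyzer)."""
    return set(text.lower().split())


def match_any(query: str, labels: list) -> list:
    """OR semantics: a label matches if it shares at least one word with the query."""
    return [label for label in labels if tokens(query) & tokens(label)]


def match_all(query: str, labels: list) -> list:
    """AND semantics: a label matches only if it contains every query word."""
    return [label for label in labels if tokens(query) <= tokens(label)]


# Hypothetical labels, for illustration only.
labels = ["Sas van Gent", "Gent", "Kanaal van Gent naar Terneuzen", "Sas", "Utrecht"]

print(match_any("Sas van Gent", labels))
print(match_all("Sas van Gent", labels))
```

With OR semantics every label containing "sas", "van" or "gent" is returned, so a common word like "van" alone pulls in many unrelated places; with AND semantics only labels containing all three words remain.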

Note: The GTAA also uses the Jena Full Text Search but does not have this problem; searching for Wim de Bie returns the correct results. Maybe we should approach @wmelder to ask for details on his config settings?

rschalkrce · 2023-08-08T08:00:23Z

Thanks for signalling this @EnnoMeijers. I agree, using an AND clause for multiple terms should preferably be standard.

@ddeboer what would be the optimal solution in your opinion? Would adding ?virtuosoQuery to the queries work for all datasets?

wmelder · 2023-08-08T08:48:09Z

Sounds familiar: the problem of too many hits with full text search. The GTAA settings for the Lucene index are shown below.
The service points to the text datastore instead of the triple datastore. The text datastore in turn points to an index and to the triple datastore.

@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb1:    <http://jena.hpl.hp.com/2008/tdb#> .
@prefix tdb2:    <http://jena.apache.org/2016/tdb#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix :        <#> .

[] rdf:type fuseki:Server .

<#service_public> rdf:type fuseki:Service ;
    fuseki:name "public" ;
    fuseki:label "Service Layer Triple Store (for public access)" ;
    fuseki:serviceQuery "query" , "sparql" ;
    fuseki:serviceReadGraphStore "get" ;
    fuseki:dataset <#text_public> ;
    fuseki:serviceUpdate "update" ;
    fuseki:serviceUpload "upload" ;
    fuseki:serviceReadWriteGraphStore "data" ;
.


<#text_public> rdf:type     text:TextDataset ;
    text:dataset   <#tdb_public> ;
    text:index     <#index_public> ;
.

<#index_public> rdf:type text:TextIndexLucene ;
    text:directory "/opt/fuseki/databases/public_text" ;
    text:entityMap <#entity_map_public> ;
    text:analyzer [
        rdf:type        text:ConfigurableAnalyzer ;
        text:tokenizer  text:StandardTokenizer ;
        text:filters    (text:ASCIIFoldingFilter text:LowerCaseFilter)
    ] ;
.

<#entity_map_public> rdf:type text:EntityMap ;
    text:defaultField     "prefLabel" ;
    text:entityField      "uri" ;
    text:uidField         "uid" ;
    text:map (
        [
            text:field "prefLabel" ;
            text:predicate <http://www.w3.org/2004/02/skos/core#prefLabel>
        ]
        [
            text:field "altLabel" ;
            text:predicate <http://www.w3.org/2004/02/skos/core#altLabel>
        ]
        [
            text:field "hiddenLabel" ;
            text:predicate <http://www.w3.org/2004/02/skos/core#hiddenLabel>
        ]
    ) ;
.

<#tdb_public> rdf:type tdb1:DatasetTDB ;
    tdb1:location "/opt/fuseki/databases/public" ;
    ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "120000" ] ;
.

Note that the GTAA service layer also provides score information, so that results can be relevance ranked.

Please let me know if you need more information.

EnnoMeijers · 2023-08-08T09:17:29Z

Ah, thanks for sharing @wmelder! It looks quite similar to the Geonames configuration that we currently use. I can't explain the differences in behaviour for searching between these two, can you @ddeboer?

EnnoMeijers · 2023-08-08T11:02:11Z

Using the ?virtuosoQuery variable instead of the regular ?query variable in the SPARQL query seems to fix the problems with the Jena full text search. Updated geonames.rq accordingly, see #1065.

ddeboer · 2023-08-08T18:25:06Z

> Note that the GTAA service layer also provides score information, so that results can be relevance ranked.

Thanks, @wmelder, that’s the hint that solved our issue: the GeoNames query wasn’t sorting results by relevance score yet. Fixed in #1066.
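For reference, a hedged sketch of what sorting by relevance score can look like with the Jena text index. The helper name is hypothetical, and the use of skos:prefLabel is an assumption borrowed from the GTAA config above; the actual GeoNames query in #1066 may use a different property and shape.

```python
def ranked_text_query(search: str, limit: int = 10) -> str:
    """Build a SPARQL query that binds the Lucene score and sorts on it."""
    # Escape backslashes and double quotes so the search string stays a valid literal.
    escaped = search.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?uri ?label ?score WHERE {{
  (?uri ?score) text:query (skos:prefLabel "{escaped}") .
  ?uri skos:prefLabel ?label .
}}
ORDER BY DESC(?score)
LIMIT {limit}"""


print(ranked_text_query("sas AND van AND gent", limit=5))
```

Binding the score as the second element of the subject list and ordering by it descending puts the best matches first, so even a greedy OR query surfaces the relevant results at the top.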

EnnoMeijers added the discuss label Aug 7, 2023

EnnoMeijers assigned ddeboer Aug 7, 2023

ddeboer mentioned this issue Aug 8, 2023

feat: Sort GeoNames by relevance score #1066

Merged

rschalkrce mentioned this issue Sep 26, 2023

Fewer results when using bif:contains text indexing #1118

Open
