Not able to search for just numbers in lunr.de #66

cadamini · 2020-09-12T21:45:46Z

Problem

In my German and English test documents I have content with the term Port 1234, but searching for 1234 does not work.

Has someone seen the same or a similar problem? Any ideas?

More tests

Searching for Port 1234 works fine.
Searching for 1234 in an English document works fine with the base lunr.js version 2.3.8.
Using this.use(lunr.de) makes it possible to find German umlauts, but numbers can no longer be found.

Test Code

// The required JS files are correctly included in the site's head

var idx = lunr(function () {
  this.use(lunr.de)
  this.ref('id')
  this.field('text')

  this.add({
    id: 1,
    text: "Port 1234 is a good port for testing a problem"
  })
})

console.log(idx.search('1234'));
console.log(idx.search('Port 1234'));

Result


andrewzola · 2020-10-06T11:14:25Z

My Russian docs have the same problem (I use mkdocs, if that matters).

cadamini · 2020-12-08T12:16:49Z

Related to the trimmer. If I remove the trimmer completely, it works.

The word characters defined in line 74 look really strange:

lunr.de.wordCharacters = "A-Za-z\xAA\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02B8\u02E0-\u02E4\u1D00-\u1D25\u1D2C-\u1D5C\u1D62-\u1D65\u1D6B-\u1D77\u1D79-\u1DBE\u1E00-\u1EFF\u2071\u207F\u2090-\u209C\u212A\u212B\u2132\u214E\u2160-\u2188\u2C60-\u2C7F\uA722-\uA787\uA78B-\uA7AD\uA7B0-\uA7B7\uA7F7-\uA7FF\uAB30-\uAB5A\uAB5C-\uAB64\uFB00-\uFB06\uFF21-\uFF3A\uFF41-\uFF5A";

Translates to:

ʸˠ
ˤᴀ
ᴥᴬ
ᵜᵢ
ᵥᵫ
ᵷᵹ
ᶾḀ
ỿⁱⁿₐ
ₜKÅℲⅎⅠ
ↈⱠ
ⱿꜢ
ꞇꞋ
ꞭꞰ
ꞷꟷ
ꟿꬰ
ꭚꭜ
ꭤﬀ
ﬆＡ
Ｚａ
ｚ

Potential solution:

lunr.de.wordCharacters = "A-Za-züÜÄäÖöß0-9";
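
To illustrate why adding 0-9 helps, here is a minimal sketch of a trimmer that strips every leading and trailing character not listed in wordCharacters. The helper name makeTrimmer is hypothetical, and the sketch assumes the language trimmer builds its two regular expressions from wordCharacters in roughly this way; it is not the library's actual code.

```javascript
// Hypothetical helper (not part of lunr): builds a trimmer that removes
// leading and trailing characters that are NOT in wordCharacters.
function makeTrimmer(wordCharacters) {
  const leading = new RegExp("^[^" + wordCharacters + "]+");
  const trailing = new RegExp("[^" + wordCharacters + "]+$");
  return function (token) {
    return token.replace(leading, "").replace(trailing, "");
  };
}

const lettersOnly = makeTrimmer("A-Za-züÜÄäÖöß");
const withDigits = makeTrimmer("A-Za-züÜÄäÖöß0-9");

console.log(JSON.stringify(lettersOnly("1234"))); // ""
console.log(JSON.stringify(withDigits("1234")));  // "1234"
console.log(lettersOnly("Port"));                 // Port
```

Under that assumption, a digits-only token is trimmed down to an empty string before it reaches the index, which would explain why Port is found but 1234 is not.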

khawkins98 · 2021-05-26T10:03:11Z

lunr.de.wordCharacters = "A-Za-züÜÄäÖöß0-9";

I noticed the German support was also breaking * wildcard support; this change fixes that too.

pizaranha · 2022-08-19T19:43:07Z

I was facing the same issue: no results for numeric searches. I found that adding '\0-9' at the end of line 74 enables numeric searching.

lunr.es.wordCharacters = "A-Za-z\xAA\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02B8\u02E0-\u02E4\u1D00-\u1D25\u1D2C-\u1D5C\u1D62-\u1D65\u1D6B-\u1D77\u1D79-\u1DBE\u1E00-\u1EFF\u2071\u207F\u2090-\u209C\u212A\u212B\u2132\u214E\u2160-\u2188\u2C60-\u2C7F\uA722-\uA787\uA78B-\uA7AD\uA7B0-\uA7B7\uA7F7-\uA7FF\uAB30-\uAB5A\uAB5C-\uAB64\uFB00-\uFB06\uFF21-\uFF3A\uFF41-\uFF5A\0-9";

I think it could be a config option in the future.
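
One caveat when copying the line above: inside a JavaScript string literal, \0 is the escape for the NUL character (U+0000), so "\0-9" ends up as the range NUL to 9 rather than 0 to 9. The sketch below shows the difference on a standalone regular expression; the variable names are made up for the example, and the character class is shortened to A-Za-z for readability.

```javascript
// "\0-9" is three characters: NUL, "-", "9".
const suffix = "\0-9";
console.log(suffix.length);        // 3
console.log(suffix.charCodeAt(0)); // 0

// Used in a character class, it becomes the range U+0000 to "9",
// which also covers spaces and much ASCII punctuation such as "!".
const nulRange = new RegExp("^[^A-Za-z" + suffix + "]+");
const digitRange = new RegExp("^[^A-Za-z0-9]+");

console.log("!!1234".replace(nulRange, ""));   // !!1234 (punctuation kept)
console.log("!!1234".replace(digitRange, "")); // 1234
```

Digits are found either way, but with the NUL range the trimmer no longer strips leading or trailing punctuation in that code point range. Writing plain 0-9 avoids this.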

blackwidow207 · 2022-11-17T02:24:19Z

As of ES6, regular expressions in JavaScript support the unicode flag, so I'm pretty sure this can be used to simplify the trimmer function for all languages when creating the search index. Some of the language implementations seem to use the trimmer during search too, so it may not work for that.
Here is an example in regex101 (https://regex101.com/r/pQMvFL/1); it works for Latin and non-Latin scripts. I have just implemented this in an Angular 14 site to clean the start and end of the search term before executing the search.
I hardcoded it into the lunr.trimmerSupport.generateTrimmer function and ran the tests, and all of them seem to pass, which is a good sign that it will work.

@MihaiValentin I can put this into a PR if you like, but being ES6, it is probably not as backwards compatible as what is currently there.
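
A minimal sketch of what a unicode-flag trimmer could look like. The pattern here is an assumption for illustration, not necessarily the one behind the regex101 link, and trimToken is a hypothetical name. Note that the \p{...} property escapes used below need the u flag and arrived in ES2018, a little later than the flag itself.

```javascript
// Strip leading and trailing characters that are neither letters (\p{L})
// nor numbers (\p{N}) in any script. Requires the "u" flag.
const leadingNonWord = /^[^\p{L}\p{N}]+/u;
const trailingNonWord = /[^\p{L}\p{N}]+$/u;

// Hypothetical helper name, for illustration only.
function trimToken(token) {
  return token.replace(leadingNonWord, "").replace(trailingNonWord, "");
}

console.log(trimToken("«Größe»")); // Größe
console.log(trimToken("1234."));   // 1234
console.log(trimToken("(порт)"));  // порт
```

A pattern like this would also strip a trailing * from a search term, so it fits the index-time pipeline better than wildcard queries.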

dhdaines · 2024-07-06T16:10:31Z

Yes, a lot of the trimmers have this problem (French one too).

You can sometimes get away with just replacing the language-specific trimmer with the default one. But as noted above, \w in JavaScript doesn't mean the same thing as it does in other regex implementations, and the trimmer isn't included in the search pipeline (see olivernn/lunr.js#532), so you may encounter unexpected behaviour (and low recall) with words beginning or ending with non-ASCII characters.
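
The \w point can be seen in isolation. The sketch below assumes the default trimmer boils down to stripping \W from both ends of a token; asciiTrim is a hypothetical stand-in written for this example, not the library function itself.

```javascript
// Without the "u" flag, \w in JavaScript matches only [A-Za-z0-9_],
// so \W treats umlauts and other non-ASCII letters as non-word characters.
function asciiTrim(token) {
  return token.replace(/^\W+/, "").replace(/\W+$/, "");
}

console.log(asciiTrim("1234"));  // 1234  (digits survive)
console.log(asciiTrim("Größe")); // Größe (non-ASCII only in the middle)
console.log(asciiTrim("Äpfel")); // pfel  (leading umlaut is clipped)
console.log(asciiTrim("über"));  // ber
```

If the indexed term is clipped to pfel while the query term stays Äpfel, the two no longer match, which is the low-recall effect described above.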

dhdaines · 2024-07-06T21:28:50Z

In #115 this is fixed in a more systematic way than mentioned above, by using the Unicode definitions, so try that out if you like.

khawkins98 mentioned this issue May 26, 2021

lunr.de demo with unexpected result for umlauts #41

Closed

jonex2 mentioned this issue Oct 15, 2021

lunr.de fails with umlaute in wildcard search #80

Open

This was referenced Jul 6, 2024

feat: extract wordchars from lunr-languages yeraydiazdiaz/lunr.py#150

Closed

Include digits and update unicode regex generation #115

Open

Ynote mentioned this issue Oct 15, 2024

Amélioration de la recherche par texte sur le tableau de suivi betagouv/pitchou#103

Merged
