Solr 7 - Highlighting #4836

mheppler · 2018-07-12T21:19:36Z

[Note: this issue # has been changed to only capture highlighting. The information on search result ordering is being kept for historical reasons and future development.]

I will preface this by saying I am searching production as a super user for the first time in a long while, so maybe I am not familiar with the type of results I should expect... that said...

When searching for "murray" on production -- which I do quite regularly on production as a guest -- I usually expect to find the MRA dataverse up near the top of the results. Currently 4th in this set of results...

As a super user, the MRA dataverse comes in at a cool 39, on the 4th page of results...

I had wrongfully accused @landreev of breaking indexing, but he assures me that all is right in our config settings to bump dataverses. Maybe this relates to recent updates to Solr. I am not sure.

There appear to be a lot of unpublished datasets from the Robert M. Townsend Dataverse in the first four pages of results. There is nothing to indicate why those results return for "murray".

Also, highlighting in the search results was not carried over to the new Solr settings and needs to be restored ASAP.

landreev · 2018-07-13T14:23:03Z

FWIW, the way I understand the issue, the fact that the superuser is seeing results different from regular users is NOT the main problem. (Admins/users with more permissions are supposed to see different search results by design).
The main issue is that our system of "bumping" certain hits up in the sort order - so that dataverses would appear first, then datasets, and then files - is no longer working.
We supply the configuration that's supposed to achieve this in the file solrconfig.xml (we also explain this in the solr install guide), as follows:

<str name="qf">
dvName^170
dvSubject^160
dvDescription^150
dvAffiliation^140
title^130
subject^120
keyword^110
topicClassValue^100
dsDescriptionValue^90
authorName^80
authorAffiliation^70
publicationCitation^60
producerName^50
fileName^40
fileDescription^30
variableLabel^20
variableName^10
text^1.0
</str>

This used to work back when we were using solr 4* (the config lines are taken verbatim from our old solrconfig). But, for whatever reason, they appear not to be producing the desired effect under solr 7.
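To make the intent of the qf setting above concrete, here is a minimal Python sketch that parses the same field^boost list and orders the fields by weight. The helper name `parse_qf` is hypothetical and this is not how Solr computes scores; it only shows that a hit on dvName is meant to carry a much larger multiplier than a hit on fileName or the catch-all text field.

```python
def parse_qf(qf: str) -> dict[str, float]:
    """Parse a whitespace-separated list of field^boost entries."""
    boosts = {}
    for entry in qf.split():
        field, _, boost = entry.partition("^")
        # In this sketch, a field with no explicit boost gets a weight of 1.0.
        boosts[field] = float(boost) if boost else 1.0
    return boosts


# The qf value quoted from solrconfig.xml in this thread.
QF = """
dvName^170 dvSubject^160 dvDescription^150 dvAffiliation^140
title^130 subject^120 keyword^110 topicClassValue^100
dsDescriptionValue^90 authorName^80 authorAffiliation^70
publicationCitation^60 producerName^50 fileName^40
fileDescription^30 variableLabel^20 variableName^10 text^1.0
"""

boosts = parse_qf(QF)

# Fields ordered from most to least heavily boosted.
ranked = sorted(boosts, key=boosts.get, reverse=True)
print(ranked[:4])   # the dataverse-level fields
print(ranked[-1])   # the catch-all field
```

If boosting were applied as configured, a match in one of the first four fields would be expected to outrank a match that only occurs in a file-level field.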

The fact that the superuser is seeing things in a different order most likely means that the order is simply random for all users.

For clarity, I would rename the issue to something like "Ordering of solr search results is broken under solr 7".

(there may be something super simple we are missing; maybe it just needs another piece of configuration somewhere else we are still missing...)

mheppler · 2018-07-13T14:50:01Z

@landreev thank you for investigating and clarifying. :dataverseman:

pdurbin · 2018-07-13T15:43:43Z

Are we sure boosting isn't working under Solr 7? @matthew-a-dunlap wrote "Also, confirm the solr boosting is working as expected. I did a simple test, taking it out and putting it back, and it seemed to work." over at #4158 (comment) and he documented how to remove the boosting (if installations don't like it) in f90e00a.

matthew-a-dunlap · 2018-07-13T15:47:14Z

I know when I did my test it was mainly to see if there was any impact. I did not have a great understanding of what the desired outcome was so I did not test deeply. I don't mind looking into this more but I won't be able to do so before my appointment today.

mheppler · 2018-07-13T15:56:18Z

I am quite sure that as a super user, I found it rather frustrating that the dataverse I was looking for by searching "murray" was four pages of results deep because unpublished and deaccessioned datasets were being returned in the top 10 results just because the MRA is listed as a distributor.

The MRA and the datasets and dataverses that are its children should be bumped higher than a dataset that gets a hit on the distributor field.

I will again bring up that the highlighting from Solr needs to be turned back on, which would make it a lot easier to determine why these results are being returned by displaying values with a bold styling, right in the results card.

matthew-a-dunlap · 2018-07-15T21:39:13Z

I spent a bit of time investigating the solr highlighting issue. We had punted on it during the upgrade because the problem seemed amorphous and we didn't want to hold up the release.

The highlighting is on in some form, but weird and inconsistent. For example, the words "test" and "test1" get highlighted in the description, but "Murray" and "murray" do not. "murrayz" does though. Maybe a dictionary is in play?

It looks like we just use the default configuration at 4.2.1 (or earlier?). When I remove the whole section about it from solrconfig.xml some form of highlighting happens. I don't think a reindex is needed for highlighting config changes.

My guess is that a default configuration exists outside our solrconfig.xml and that configuration differs from the defaults we expected back in 4.2.1. But just a guess. I wouldn't be surprised if this is also part of what's happening with the superuser search results.

The newer documentation does not discuss the xml configuration files, but looking back this section is of help: https://wiki.apache.org/solr/SolrConfigXml#The_Highlighter_plugin_configuration_section . We may need to go about using the new managed schema approach for solr, as no one is documenting the xml configurations for the newer versions (even though they are supported).

Hopefully this'll be of help when we pick up this work.

mheppler · 2018-07-16T14:34:43Z

Thanks for the input @matthew-a-dunlap. I have added highlighting to the issue title. We need this feature back. The combination of the two problems makes for some confusing results.

djbrooke · 2018-07-16T14:39:43Z

Thanks for the investigation!

Note to self for backlog grooming, we should estimate this with and without the highlighting piece and consider smaller batches. I'm OK with no highlighting (I made the call to not include it earlier and we haven't heard any feedback aside from @mheppler) but I'm not OK with no boosting.

TaniaSchlatter · 2018-07-16T15:58:20Z

Is there a description of what "working as expected" is that we can use for a baseline shared understanding, and to make judgements against?

djbrooke · 2018-07-18T19:09:25Z

We used to have rules about boosting and highlighting. We should investigate why these are no longer being followed (solr upgrade related or otherwise) and reapply them.
There are some things happening in the search (ex. highlighting of "king") that we should investigate and document.

There was some discussion of re-evaluating how we rank search results, but we'll not do this now because this would be a large effort. Instead, we'll plan to restore how it was.

mheppler · 2018-07-18T19:14:05Z

The "super user" aspect of this story may be a red herring. After demoing this issue in our sprint planning mtg, we saw questionable results returned for a guest.

The last three results for a guest searching "murray" perfectly illustrate this issue. The 8th and 9th results have "murray" hits in the distributor field (which are not highlighted in bold) and the 10th result has a hit in the title (also not highlighted).

And yes, the top three results are no better. Three files with "murray" in the name are returned higher than a dataverse name hit.

So there appear to be not only issues with dataverse vs dataset vs file bumping, but also an issue with title/name vs distributor bumping.

matthew-a-dunlap · 2018-07-30T22:18:12Z

While waiting on things for other stories, I took another look at the configs around this and touched base with folks on the solr irc. Their recommendations were to start over with a new solrconfig.xml file out of 7.3.0 and customize that as we need it. It seems like a good path as we mostly used defaults before anyways.

[17:04] matthew-dv: Question for y'all! The open source java project I'm a part of (https://dataverse.org/) recently upgraded our solr version from 4.2.1 to 7.3.0. We also updated the java libraries. We opted to use our solrconfig.xml instead of the managed schema approach. After the upgrade most functionality is working but we have noticed that our highlighting is acting differently.
[17:04] matthew-dv: It seems the configuration we have in solrconfig.xml does not have an impact for if I remove the whole section and reload http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1 the highlighting is in place. Where should I look next to understand this and fix our highlighting? Fwiw it looks like we used the default highlighting that came with the sample 4.2.1 configuration.
[17:06] elyograg: i have never used highlighting, don't know how. Probably what you should do is start with the solrconfigl.xml file in the examples for 7.3.0 and build up a config that does what you need.
[17:31] matthew-dv: Thanks @elyoqrag! I think looking over a few various solrconfig.xml's that we may have missed some of the defined default. I'll look to those next
[17:52] ctargett: matthew-dv - between Solr 4.2.1 and 7.3.0, highlighting underwent a big transformation. I think some of the classes that might have been default in 4.2.1 have been removed and replaced. I think it will be a lot easier to re-implement it as new instead of trying to figure out what changed.
[18:13] matthew-dv: Thanks @ctargett, I think that's possibly the best approach. I tried doublechecking how we ported over functionality from before but nothing in that seems too off. I guess we'll have to dig into the new configurations even more.

Also storing two other versions of the config in the project for dev work.

matthew-a-dunlap · 2018-08-06T21:38:29Z

We are not using collections, those are only a part of SolrCloud. This is the first half of fixing our installation steps via recommendation on solr's IRC

Before we were creating a folder with our configs and then installing, but the installer itself expects the folder passed with -d to be a reference template. It did not seem to break anything but is bad practice and came up when asking for help from folks at solr

matthew-a-dunlap · 2018-08-06T22:46:46Z

Discussing with the folks in the solr IRC, I learned that if you do not provide configuration for highlighting but your queries to solr have highlighting params, solr uses a system default. This happens with other aspects of configuration as well. This is why removing the highlighting section from our configs had no effect: either way, it was the same configuration.
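For reference, the query-side half of this looks roughly like the following Python sketch, which only builds a request URL and does not contact Solr. The `hl` and `hl.fl` parameters are standard Solr highlighting parameters; the helper name `build_highlight_query`, the host and core name (taken from the reload URL quoted earlier in this thread), and the choice of fields are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_highlight_query(base_url: str, term: str, fields: list[str]) -> str:
    """Build a select URL that asks Solr to highlight matches in the given fields."""
    params = {
        "q": term,
        "hl": "true",               # turn highlighting on for this request
        "hl.fl": ",".join(fields),  # fields to generate highlighted snippets for
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


# Assumed local core URL and field names; adjust for a real installation.
url = build_highlight_query(
    "http://localhost:8983/solr/collection1/select",
    "murray",
    ["title", "dsDescriptionValue"],
)
print(url)

# Decode the query string again to see what would be sent.
sent = parse_qs(urlsplit(url).query)
print(sent["hl"], sent["hl.fl"])
```

Because the request itself carries the highlighting parameters, Solr will attempt highlighting whether or not solrconfig.xml has a highlighting section, which matches the behavior described above.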

The current solrconfig.xml we have in develop is not actually much different than the default. Tomorrow I'll start modifying the solrconfig.xml section for highlighting to get it back to a more acceptable form. What exactly we want in the end is vague, but if anything I'll try to understand why "Murray" does not highlight but "Murra*" does.

matthew-a-dunlap · 2018-08-07T22:04:06Z

Looks like the highlighting problem is related to how the schema field type text_en has changed between solr 4.6 and 7.3. Switching dsDescription & title to text_general causes the exact matches to show up correctly for highlighting. We switched away from text_general for better English language support in #444.

Next step is to alter text_en's configuration or switch to a newer type if one is available. Though we may want to think of a more holistic approach, as we are looking to support other languages better.

matthew-a-dunlap · 2018-08-08T21:22:04Z

I have created a pull request with just the fix for solr highlighting. My "best practice" solr fixes and my start on fixing the boosting are not in this branch.

For this fix to take effect, the schema.xml file in dataverse needs to be added to solr and we must reindex.

pdurbin · 2018-08-09T15:56:44Z

Pull request #4937 looks good so I moved it to QA in https://waffle.io/IQSS/dataverse

Here's a copy and paste from my review:

Looks good. I'm glad to see the solution ("Solution was to ensure original word is kept by stemmer") and the link to the answer on Stack Overflow.

Please note that if you want to see fewer lines in the diff (it's mostly whitespace changes), you should add ?w=1 like this: https://github.com/IQSS/dataverse/pull/4937/files?w=1 . I mentioned to @matthew-a-dunlap that I posted some thoughts about whitespace and such back in #3418.
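The idea behind the fix quoted in the review above ("ensure original word is kept by stemmer") can be illustrated with a toy Python sketch. This is not Solr's analysis chain and not the actual schema change from the pull request: `toy_stem` is a hypothetical stand-in that applies a single made-up rule, and `analyze` is a hypothetical helper. It only shows the general effect of emitting the original token alongside its stemmed form.

```python
def toy_stem(token: str) -> str:
    """Toy stand-in for a real stemmer: rewrite a trailing 'y' to 'i'."""
    if token.endswith("y") and len(token) > 2:
        return token[:-1] + "i"
    return token


def analyze(text: str, keep_original: bool) -> list[list[str]]:
    """Return the terms emitted at each token position."""
    positions = []
    for token in text.lower().split():
        terms = [toy_stem(token)]
        # Optionally keep the unstemmed token at the same position,
        # skipping it when stemming left the token unchanged.
        if keep_original and token not in terms:
            terms.insert(0, token)
        positions.append(terms)
    return positions


print(analyze("Murray dataverse", keep_original=False))
print(analyze("Murray dataverse", keep_original=True))
```

With `keep_original=True` the literal term "murray" survives analysis next to its stem, so an exact match on the original word remains available.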

mheppler added the Feature: Search/Browse and Type: Bug labels Jul 12, 2018

djbrooke changed the title ~~Search - Super User vs Guest seeing different results?~~ Ordering of solr search results is broken under solr 7 Jul 13, 2018

djbrooke added the Status: Backlog label Jul 13, 2018

mheppler changed the title ~~Ordering of solr search results is broken under solr 7~~ Solr 7 - Highlighting + ordering of search results is broken/wacky Jul 16, 2018

matthew-a-dunlap added Status: This/Next Sprint and removed Status: Backlog labels Aug 6, 2018

matthew-a-dunlap self-assigned this Aug 6, 2018

matthew-a-dunlap added a commit that referenced this issue Aug 6, 2018

Start from fresh solrconfig (with classic schema turned on) #4836

23dab03

Also storing two other versions of the config in the project for dev work.

matthew-a-dunlap added a commit that referenced this issue Aug 6, 2018

One more xml change for a "fresh" solrconfig #4836

f616a5a

matthew-a-dunlap added a commit that referenced this issue Aug 6, 2018

... no this is actually fresh #4836

369e02f

matthew-a-dunlap added a commit that referenced this issue Aug 6, 2018

Switch core name to core1 #4836

c7f56ea

We are not using collections, those are only a part of SolrCloud. This is the first half of fixing our installation steps via recommendation on solr's IRC

matthew-a-dunlap added a commit that referenced this issue Aug 6, 2018

Revert change for core1 change #4836

bb64f83

pdurbin mentioned this issue Aug 7, 2018

Solr Container Scaling #4762

Closed

matthew-a-dunlap added a commit that referenced this issue Aug 7, 2018

removed outdated doc issue ref #4836

facb46f

matthew-a-dunlap added a commit that referenced this issue Aug 8, 2018

Fix highlighting stemmer to hold original word, needed for our highlighting #4836

30ba2b4

matthew-a-dunlap added a commit that referenced this issue Aug 8, 2018

Add back 4.9 solrconfig.xml differences #4836

d6b5d56

matthew-a-dunlap added a commit that referenced this issue Aug 8, 2018

Fix highlighting stemmer to hold original word, needed for our highlighting #4836

bb5f6b7

matthew-a-dunlap mentioned this issue Aug 8, 2018

Fix Solr Highlighting #4937

Merged

5 tasks

matthew-a-dunlap added Status: Code Review and removed Status: Development labels Aug 8, 2018

matthew-a-dunlap removed their assignment Aug 8, 2018

matthew-a-dunlap changed the title ~~Solr 7 - Highlighting + ordering of search results is broken/wacky~~ Solr 7 - Highlighting Aug 8, 2018

matthew-a-dunlap mentioned this issue Aug 8, 2018

Solr search result ordering broken #4938

Closed

djbrooke assigned pdurbin Aug 9, 2018

pdurbin added Status: QA and removed Status: Code Review labels Aug 9, 2018

pdurbin removed their assignment Aug 9, 2018

kcondon self-assigned this Aug 9, 2018

kcondon closed this as completed Aug 9, 2018

kcondon removed the Status: QA label Aug 9, 2018

djbrooke added this to the 4.10 - Additional Data Transfer Options milestone Aug 13, 2018

djbrooke modified the milestones: 4.10 - Additional Data Transfer Options, 4.9.3 - Optional File PIDs, Initial Internationalization Work Sep 18, 2018

matthew-a-dunlap mentioned this issue Sep 21, 2018

4938 solr search order #5080

Merged

5 tasks
