Solr 7 - Highlighting #4836
Comments
FWIW, the way I understand the issue, the fact that the superuser is seeing results different from regular users is NOT the main problem. (Admins/users with more permissions are supposed to see different search results by design).
This used to work back when we were using solr 4* (the config lines are taken verbatim from our old solrconfig). But, for whatever reason, they appear not to be producing the desired effect under solr 7. The fact that the superuser is seeing things in a different order most likely means that the order is simply random for all users. For clarity, I would rename the issue to something like "Ordering of solr search results is broken under solr 7". (there may be something super simple we are missing; maybe it just needs another piece of configuration somewhere else we are still missing...) |
@landreev thank you for investigating and clarifying. :dataverseman:
Are we sure boosting isn't working under Solr 7? @matthew-a-dunlap wrote "Also, confirm the solr boosting is working as expected. I did a simple test, taking it out and putting it back, and it seemed to work." over at #4158 (comment) and he documented how to remove the boosting (if installations don't like it) in f90e00a. |
I know when I did my test it was mainly to see if there was any impact. I did not have a great understanding of what the desired outcome was so I did not test deeply. I don't mind looking into this more but I won't be able to do so before my appointment today. |
I am quite sure that, as a super user, I found it rather frustrating that the dataverse I was looking for by searching "murray" was four pages of results deep because unpublished and deaccessioned datasets were being returned in the top 10 results just because the MRA is listed as a distributor. The MRA and the datasets and dataverses that are its children should be bumped higher than a dataset that gets a hit on the distributor field. I will again bring up that the highlighting from Solr needs to be turned back on. Displaying matched values in bold, right in the results card, would make it a lot easier to determine why these results are being returned.
I spent a bit of time investigating the solr highlighting issue. We had punted on it during the upgrade because the problem seemed amorphous and we didn't want to hold up the release. The highlighting is on in some form, but weird and inconsistent. For example, the words "test" and "test1" get highlighted in the description, but "Murray" and "murray" do not. "murrayz" does though. Maybe a dictionary is in play? It looks like we just use the default configuration from 4.2.1 (or earlier?). When I remove the whole highlighting section from solrconfig.xml, some form of highlighting still happens. I don't think a reindex is needed for highlighting config changes. My guess is that a default configuration exists outside our solrconfig.xml and that it differs from the defaults we expected back in 4.2.1. But that is just a guess. I wouldn't be surprised if this is also part of what's happening with the superuser search results. The newer documentation does not discuss the xml configuration files, but this older section is of help: https://wiki.apache.org/solr/SolrConfigXml#The_Highlighter_plugin_configuration_section . We may need to move to the new managed schema approach for solr, as no one is documenting the xml configurations for the newer versions (even though they are supported). Hopefully this will be of help when we pick up this work.
Thanks for the input @matthew-a-dunlap. I have added highlighting to the issue title. We need this feature back. The combination of the two problems makes for some confusing results. |
Thanks for the investigation! Note to self for backlog grooming, we should estimate this with and without the highlighting piece and consider smaller batches. I'm OK with no highlighting (I made the call to not include it earlier and we haven't heard any feedback aside from @mheppler) but I'm not OK with no boosting. |
Is there a description of what "working as expected" is that we can use for a baseline shared understanding, and to make judgements against? |
There was some discussion of re-evaluating how we rank search results, but we'll not do this now because this would be a large effort. Instead, we'll plan to restore how it was. |
The "super user" aspect of this story may be a red herring. After demoing this issue in our sprint planning meeting, we saw questionable results returned for a guest. The last three results for a guest searching "murray" perfectly illustrate this issue. The 8th and 9th results have "murray" hits in the distributor field (which are not highlighted in bold) and the 10th result has a hit in the title (also not highlighted). And yes, the top three results are no better. Three files with "murray" in the name are returned higher than a dataverse name hit. So there appear to be issues not only with dataverse vs dataset vs file bumping, but also with title/name vs distributor bumping.
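To make the title/name vs distributor point concrete, here is a minimal sketch of how per-field weights can be expressed through Solr's eDisMax `qf` parameter. The field names (`dvName`, `title`, `distributorName`) and the weights are hypothetical and are not taken from our solrconfig.xml; the helper `build_boost_params` is likewise made up for illustration.

```python
def build_boost_params(term, field_boosts):
    """Return eDisMax query parameters that weight matches per field."""
    # qf takes space-separated "field^boost" pairs; a higher boost ranks
    # a hit in that field above a hit in a lower-weighted field.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"q": term, "defType": "edismax", "qf": qf}


# Hypothetical fields and weights: a dataverse name hit should outrank
# a distributor hit for the same search term.
params = build_boost_params(
    "murray", {"dvName": 4, "title": 2, "distributorName": 0.5}
)
print(params["qf"])
```

With weights like these, a hit on a dataverse name would be scored above a hit on the distributor field, which is the ordering described above as the desired outcome.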
While waiting on things for other stories, I took another look at the configs around this and touched base with folks on the solr irc. Their recommendations were to start over with a new solrconfig.xml file out of 7.3.0 and customize that as we need it. It seems like a good path as we mostly used defaults before anyways.
Also storing two other versions of the config in the project for dev work.
Some to-do:
We are not using collections, those are only a part of SolrCloud. This is the first half of fixing our installation steps via recommendation on solr's IRC
Previously we created a folder with our configs and then installed, but the installer itself expects the folder passed with -d to be a reference template. It did not seem to break anything, but it is bad practice and came up when asking the folks at solr for help.
Discussing with the folks in the solr IRC, I learned that if you do not provide configuration for highlighting but your queries to solr have highlighting params, solr uses a system default. This happens with other aspects of configuration as well. This is why removing the highlighting section from our configs had no effect: either way it was the same configuration. The current solrconfig.xml we have in develop is not actually much different from the default. Tomorrow I'll start modifying the solrconfig.xml section for highlighting to get it back to a more acceptable form. What exactly we want in the end is vague, but if nothing else I'll try to understand why "Murray" does not highlight but "Murra*" does.
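For reference, a minimal sketch of what "highlighting params on the query" means. It only builds the request URL, using the standard `hl` and `hl.fl` parameters; the base URL, the core name `collection1`, and the field names are assumptions for illustration, and `build_highlight_url` is a hypothetical helper, not code from this project.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, core, term, fields):
    """Build a Solr select URL that asks for highlighting on the given fields."""
    params = {
        "q": term,
        "hl": "true",               # turn highlighting on for this request
        "hl.fl": ",".join(fields),  # fields to generate highlight snippets for
        "wt": "json",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Assumed local Solr URL, core name, and field names.
url = build_highlight_url(
    "http://localhost:8983/solr", "collection1", "murray", ["title", "description"]
)
print(url)
```

A request like this triggers highlighting whether or not solrconfig.xml contains a highlighting section, which matches the behavior described above.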
Looks like the highlighting problem is related to how the schema field type Next step is to alter |
I have created a pull request with just the fix for solr highlighting. My "best practice" solr fixes and my start on fixing the boosting are not in this branch. For this fix to take effect, the |
Pull request #4937 looks good so I moved it to QA in https://waffle.io/IQSS/dataverse Here's a copy and paste from my review:
Please note that if you want to see fewer lines in the diff (it's mostly whitespace changes), you should add |
[Note: this issue has been changed to only capture highlighting. The information on search result order is being kept for historical reasons and future development.]
I will preface this by saying I am searching production as a super user for the first time in a long while, so maybe I am not familiar with the type of results I should expect... that said...
When searching for "murray" on production -- which I do quite regularly on production as a guest -- I usually expect to find the MRA dataverse up near the top of the results. Currently 4th in this set of results...
As a super user, the MRA dataverse comes in at a cool 39, on the 4th page of results...
I had wrongfully accused @landreev of breaking indexing, but he assures me that all is right in our config settings to bump dataverses. Maybe this relates to recent updates to Solr. I am not sure.
There appear to be a lot of unpublished datasets from the Robert M. Townsend Dataverse in the first four pages of results. There is nothing to indicate why those results return for "murray".
Also, highlighting in the search results was not added to the new Solr settings and needs to be returned ASAP.