Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow some special characters on TagsQuery component #205

Merged
merged 5 commits into from
Apr 25, 2024

Conversation

stellarhoof
Copy link
Member

@stellarhoof stellarhoof commented Apr 1, 2024

The tags component used in the search bar currently strips out non-word characters in each tag. For example a tag my+email will be converted to my email. We do this to visually convey tags that contain text that closely resembles what we send to elastic (see futil-js/contexture-elasticsearch#170 for more information).

There are three places to think about when reviewing this PR:

  1. Contexture react: we strip some special characters for presentation purposes
  2. Contexture elasticsearch: we strip reserved query_string characters before sending the elastic request to prevent our queries from exploding. See https://github.com/smartprocure/contexture/blob/feature/allow-special-keyword-characters/packages/provider-elasticsearch/src/example-types/filters/tagsQuery.js#L36
  3. ElasticSearch: text gets stripped out of special characters when analyzed so there is no point in sending such characters in the query. You can see which tokens our analyzers generate by calling /{index}/_analyze with a payload such as { "analyzer" : "exact", "text" : "student discourse"}

Behavior in this PR

  • Strip the following query_string reserved characters in the tags component: | > < ! ( ) { } [ ] ^ " ~ * ? \.
  • Additionally, strip ; , $ ' as well since they are removed by all our analyzers anyway.
  • Escape the characters & + - = : / in contexture-elasticsearch because those are kept by our analyzers but are also reserved query_string characters.

Further notes

The props wordsMatchPattern, maxWordsPerTag, and maxCharsPerTagWord taken by the ExpandableTagsInput and TagsInput component are not passed from anywhere (including spark) so this PR removes them. This is a case of adding props in case they're useful but without a use case.

A common package contexture-util was introduced to share code between contexture packages.

@stellarhoof stellarhoof self-assigned this Apr 1, 2024
Copy link

changeset-bot bot commented Apr 1, 2024

🦋 Changeset detected

Latest commit: 79e9e2b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
contexture-util Patch
contexture-elasticsearch Patch
contexture-react Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@@ -118,19 +105,4 @@ let TagsInput = forwardRef(
}
)

// Just uses an internal observable array
export let MockTagsInput = inject(() => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This component was unused. Looks like it was used for a test at some point.

@stellarhoof stellarhoof force-pushed the feature/allow-special-keyword-characters branch from 5d43273 to 2bdceb0 Compare April 1, 2024 20:39
@stellarhoof stellarhoof force-pushed the feature/allow-special-keyword-characters branch from 2bdceb0 to 2a33db8 Compare April 1, 2024 20:55
@stellarhoof stellarhoof marked this pull request as ready for review April 1, 2024 21:11
@positiveVibes4all
Copy link
Contributor

positiveVibes4all commented Apr 3, 2024

The tags component used in the search bar currently strips out non-word characters in each tag. For example a tag my+email will be converted to my email. We do this to visually convey tags that contain text that closely resembles what we send to elastic (see smartprocure/contexture-elasticsearch#170 for more information).

There are three places to think about when reviewing this PR:

  1. Contexture react: we strip some special characters for presentation purposes
  2. Contexture elasticsearch: we strip reserved query_string characters before sending the elastic request to prevent our queries from exploding. See https://github.com/smartprocure/contexture/blob/feature/allow-special-keyword-characters/packages/provider-elasticsearch/src/example-types/filters/tagsQuery.js#L36
  3. ElasticSearch: text gets stripped out of special characters when analyzed so there is no point in sending such characters in the query. You can see which tokens our analyzers generate by calling /{index}/_analyze with a payload such as { "analyzer" : "exact", "text" : "student discourse"}

Behavior in this PR

  • Strip the following query_string reserved characters in the tags component: | > < ! ( ) { } [ ] ^ " ~ * ? \.
  • Additionally, strip ; , $ ' & as well since they are removed by all our analyzers anyway.
  • Escape the characters + - = : / in contexture-elasticsearch because those are kept by our analyzers but are also reserved query_string characters.

Further notes

The props wordsMatchPattern, maxWordsPerTag, and maxCharsPerTagWord taken by the ExpandableTagsInput and TagsInput component are not passed from anywhere (including spark) so this PR removes them. This is a case of adding props in case they're useful but without a use case.

I asked Marc for the spreadsheet he made about the analyzers, we might want to update this at some point but at the very least this should live somewhere in our notion documentation: https://docs.google.com/spreadsheets/d/1ObJyl2w0e1jszgYIdk9yUuZU7N9oeSOOtFDFRK_LDwg/edit#gid=0

Here is the code he used to generate this as well: https://github.com/smartprocure/elasticsearch-test-suite/blob/master/src/index.js

_.replace(/([&|!(){}[\]^"~*?\\<>;,$'])/g, ' '),
// These characters are not stripped out by our analyzers but they are
// `query_string` reserved characters so we need to escape them.
_.replace(/([+\-=:/])/g, '\\$1')
Copy link
Contributor

@positiveVibes4all positiveVibes4all Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though we are stripping out the char '\', it should not exist in this replace for reserved characters based on the requirements: found here

Escape + - = & : /

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this context it's escaping the - character

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add & to this list and remove it from the list above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

_.trim,
splitCommas ? splitTagOnComma : _.identity,
_.castArray,
sanitizeTagFn ? _.map(sanitizeTagFn) : _.identity
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what case will this be called without a sanitizeTagFn?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full disclaimer: I think sanitizeTagFn is a terrible API. Having said that, sanitizeTagFn is only currently relevant to transform the tags text for the main search bar, from which we end up sending elastic query_string queries.

An example of where it's not needed is the TagsText component, which I think is used in the "Matches" filter.


let isValidInput = (tag, tags) => !_.isEmpty(tag) && !_.includes(tag, tags)

let TagsInput = forwardRef(
(
{
tags,
addTags,
addTags: setTags,
Copy link
Contributor

@positiveVibes4all positiveVibes4all Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see setTags in the story but nowhere else, where is this used?

Copy link
Member Author

@stellarhoof stellarhoof Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're passed here and here

@positiveVibes4all positiveVibes4all self-requested a review April 4, 2024 22:58
_.replace(/([&|!(){}[\]^"~*?\\<>;,$'])/g, ' '),
// These characters are not stripped out by our analyzers but they are
// `query_string` reserved characters so we need to escape them.
_.replace(/([+\-=:/])/g, '\\$1')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add & to this list and remove it from the list above.

//
// If in doubt, make a request to the `/{index}/analyze` elasticsearch endpoint
// to see exactly which characters get stripped out of text.
let wordRegex = /[^|><!(){}[\]^"~*?\\;,$']+/g
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's consolidate this list with the list here: https://github.com/smartprocure/contexture/pull/205/files#diff-69f43c306be976d94d4493c9ca003a660ca8f6084b779bf99ae2288ca58d4519R38

We can pull this out into a common package that's shared between the front end and back end.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created the common package and moved the regex there 👍

@dubiousdavid
Copy link
Member

Let's also make sure to handle the scenario where there is no text left after stripping reserved characters from a tag. In that case, we should remove the tag altogether.

@@ -68,7 +68,6 @@
"mobx": "^4.3.1",
"mobx-react": "^6.3.0",
"mobx-utils": "^5.0.0",
"moment": "^2.24.0",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency was duplicated.

@stellarhoof stellarhoof force-pushed the feature/allow-special-keyword-characters branch 2 times, most recently from f111190 to 5cdeda6 Compare April 24, 2024 21:06
@stellarhoof stellarhoof force-pushed the feature/allow-special-keyword-characters branch from 5cdeda6 to 79e9e2b Compare April 24, 2024 21:07
Copy link
Contributor

Coverage after merging feature/allow-special-keyword-characters into main will be

90.57%

Coverage Report
FileStmtsBranchesFuncsLinesUncovered Lines
packages/client/src
   exampleTypes.js93.78%85.71%40%97.80%109, 123, 126–133
   index.js98.96%93.85%100%100%211, 213, 284, 73
   lens.js100%100%100%100%
   listeners.js100%100%100%100%
   mockService.js95.65%83.33%100%100%22, 30
   node.js96.75%96%87.50%97.78%72, 76, 76
   reactors.js99.12%100%90.91%100%
   serialize.js100%100%100%100%
   subquery.js97.59%88.89%100%98.55%66, 66
   traversals.js100%100%100%100%
   types.js92.59%83.33%100%95.24%12–13
   validation.js78.79%83.33%100%77.78%20–26
packages/client/src/actions
   index.js100%100%100%100%
   wrap.js96.67%85.71%100%97.92%41, 41
packages/client/src/exampleTypes
   pivot.js82.87%74.29%87.50%83.70%10–13, 138–139, 14, 140–149, 15, 150–161, 167–168, 21–22, 227, 23, 243, 257–263, 48, 53–57, 8–9, 9
packages/client/src/util
   futil.js67.31%100%57.14%65.85%18–19, 24–28, 31, 36–41
   tree.js100%100%100%100%
packages/export/src
   csv.js92.73%80%50%95.83%25, 45–46
   index.js0%100%0%0%1–5
   utils.js100%100%100%100%
packages/export/src/nodes
   index.js0%100%0%0%1–5
   pivot.js98.43%82.35%100%100%52, 56, 89
   results.js100%100%100%100%
   terms_stats.js100%100%100%100%
packages/provider-elasticsearch/src
   index.js81.82%52%80%88.24%104–112, 12–13, 29, 53, 53–57, 62, 65–66, 73, 75, 87, 91, 91
   schema.js92.52%92.86%80%93.18%52–55, 85–87
   types.js100%100%100%100%
packages/provider-elasticsearch/src/example-types
   index.js100%100%100%100%
   schemaMapping.js100%100%100%100%
   testUtils.js98.88%92.31%100%100%12
packages/provider-elasticsearch/src/example-types/filters
   bool.js100%100%100%100%
   date.js84.62%66.67%50%88.24%16, 16, 6–8
   dateRangeFacet.js69.79%100%75%66.67%56–83
   exists.js100%100%100%100%
   facet.js92.80%86.67%66.67%94.39%39, 64, 64–69
   geo.js100%100%100%100%
   number.js52.05%100%50%50%22–29, 33–56
   query.js100%100%100%100%
   tagsQuery.js86.98%90.63%85.71%86.36%111–128, 140, 149–153, 159–161
   tagsText.js100%100%100%100%
   text.js95.73%86.67%100%98.82%11, 22, 40, 71, 71
packages/provider-elasticsearch/src/example-types/legacy
   cardinality.js93.33%100%50%100%
   dateHistogram.js95.45%100%50%100%
   esTwoLevelAggregation.js92.11%86.67%66.67%93.75%34–36, 83–87
   matchStats.js97.06%100%50%100%
   rangeStats.js100%100%100%100%
   smartIntervalHistogram.js97.44%100%50%100%
   statistical.js90.91%100%50%100%
   terms_stats.js97.73%100%66.67%100%
packages/provider-elasticsearch/src/example-types/metricGroups
   dateIntervalGroupStats.js100%100%100%100%
   dateRangesGroupStats.js98.28%100%75%100%
   fieldValuePartitionGroupStats.js87.76%100%40%92.31%34–36
   fieldValuesDelta.js97.44%100%66.67%100%
   fieldValuesGroupStats.js99.07%100%83.33%100%
   groupStatUtils.js86.05%100%75%85.71%20–24
   numberIntervalGroupStats.js100%100%100%100%
   numberRangesGroupStats.js97.87%100%66.67%100%
   percentilesGroupStats.js82.46%75%50%85.71%16–18, 24–26, 43–44
   pivot.js85.62%89.89%71.43%85.65%107, 151–152, 159, 166, 408–410, 460–470, 473, 478, 480, 483, 486, 534–535, 563–619, 75–81, 84–91
   stats.js86.96%100%50%94.12%16
   tagsQueryGroupStats.js100%100%100%100%
packages/provider-elasticsearch/src/example-types/metricGroups/pivotData
   columnResponse.js100%100%100%100%
   columnResult.js100%100%100%100%
   pivotResponse.js100%100%100%100%
   pivotResponseWithFilteredFieldValueGroup.js100%100%100%100%
packages/provider-elasticsearch/src/example-types/results
   index.js18.60%100%0%19.51%10–40, 8–9
packages/provider-elasticsearch/src/example-types/results/highlighting
   request.js95.15%93.75%100%95.26%193, 193–201, 40–42
   response.js100%100%100%100%
   search.js31.15%100%0%31.67%20–60
   testSchema.js100%100%100%100%
   util.js100%100%100%100%
packages/provider-elasticsearch/src/schema-data
   aliases.js100%100%100%100%
   mapping-with-non-objects.js100%100%100%100%
   mapping-with-types.js100%100%100%100%
   mapping-without-types.js100%100%100%100%
   schema-with-types.js100%100%100%100%
   schema-without-types.js100%100%100%100%
packages/provider-elasticsearch/src/utils
   dateUtil.js88.37%85.71%60%90.54%13–15, 18, 64–67
   elasticDSL.js100%100%100%100%
   fields.js100%100%100%100%
   futil.js99.39%100%92.86%100%
   luceneQueryUtils.js100%100%100%100%
   regex.js100%100%100%100%
   results.js100%100%100%100%
   smartInterval.js83.33%33.33%100%92.86%11, 11, 9
packages/provider-mongo/src
   index.js93.33%71.43%83.33%96.77%10, 33, 45, 9
   types.js100%100%100%100%
packages/provider-mongo/src/example-types
   bool.js100%100%100%100%
   date.js79.34%90.91%55.56%80.20%54–58, 61–69, 7, 70–71, 8–9, 90, 90
   dateHistogram.js94.20%62.50%100%98.31%19, 19, 25, 25
   exists.js100%100%100%100%
   facet.js98.95%95.74%100%99.56%123–124, 124
   index.js100%100%100%100%
   mongoId.js100%100%100%100%
   number.js100%100%100%100%
   results.js97.74%92.31%100%99.01%180, 59, 61, 64–66
   statistical.js100%100%100%100%
   tagsText.js100%100%100%100%
   termsStats.js100%100%100%100%
   text.js100%100%100%100%
packages/react/src
   FilterAdder.js0%100%0%0%1, 10–19, 2, 20–29, 3, 30–39, 4, 40–49, 5, 50–52, 6–9
   FilterAdder.stories.js0%100%0%0%1, 10–19, 2, 20–23, 3–9
   FilterButtonList.js0%100%0%0%1, 10, 100–109, 11, 110–119, 12, 120–129, 13, 130–139, 14, 140–149, 15, 150–159, 16, 160–168, 17–19, 2, 20–29, 3, 30–39, 4, 40–49, 5, 50–59, 6, 60–69, 7, 70–79, 8, 80–89, 9, 90–99
   FilterButtonList.stories.js0%100%0%0%1, 10–19, 2, 20–29, 3, 30–39, 4, 40–49, 5, 50–56, 6–9
   FilterList.js0%100%0%0%1, 10, 100–109, 11, 110–119, 12, 120–129, 13, 130–139, 14, 140–149, 15, 150–159, 16, 160–169,

@stellarhoof
Copy link
Member Author

Let's also make sure to handle the scenario where there is no text left after stripping reserved characters from a tag. In that case, we should remove the tag altogether.

Done

Copy link
Member

@dubiousdavid dubiousdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@stellarhoof stellarhoof merged commit 31407f0 into main Apr 25, 2024
4 checks passed
@stellarhoof stellarhoof deleted the feature/allow-special-keyword-characters branch April 25, 2024 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants