feat(9586): implement freetext search in cht datasource #9625
Conversation
@jkuester the e2e tests related to cht-datasource are failing right now. I remember seeing something related to changes in how we deal with auth and its impact on cht-datasource. Do we have any workaround at the moment?

Oh wow. I did not realize that the end result of the changes was to just delete the existing remote cht-datasource tests with no alternative. This is very disappointing. We can do better here, though. The challenge is that in a browser context, the session cookie is automatically included when making the remote requests.

1. Update cht-datasource to accept auth information when creating a remote `DataContext`. This would essentially be addressing #9701. We could pass username/password in the … As I noted on the ticket, though, I am reluctant to make changes to the implementation code just for testing purposes.
2. Re-introduce the MITM for …
@jkuester I did not go with the hook implementation because this auth setup only needs to run for a few test suites, so the hook would have to check filenames to decide when to run, which would have resulted in a not-so-fruitful hook file full of checks.
Alright! We are getting very close here.
```ts
};

/** @internal */
export const getContactLineage = (medicDb: PouchDB.Database<Doc>) => {
```
nitpick: I guess now the `getPrimaryContactIds`, `hydratePrimaryContact`, and `hydrateLineage` functions no longer need to be exported, since they are only used in this file.
```ts
  data: pagedDocs.data.map((doc) => doc._id),
  cursor: pagedDocs.cursor
};
return await getPaginatedDocs(getDocsFn, limit, skip);
```
issue: okay bad news 😞 I found an edge case that breaks this logic... We should never get null id values back from these freetext queries, but it is possible to get duplicate id values.
To test this, I created a contact with:

```json
"name": "Alberto O'Kon Rivera",
"short_name": "River",
```

Then I did a search with `freetext=river`. The list of ids returned contained two instances of the same id value for this contact. This is because the `getByStartsWithFreetext` query will match the emissions for both `River` and `Rivera` (as intended). To make things even worse, even if we dupe-checked the ids returned for a given page, I don't think there is any reason that Couch could not give us more dupes of the same id later on different pages (because the view will return things ordered by key, not by id)... 😬
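The duplicate-id behavior can be reproduced outside of Couch with a small sketch. This is a hypothetical simulation (the `emitFreetextKeys` and `startsWithQuery` helpers below are invented for illustration, not the real view code), but it shows why a starts-with range over the emitted keys returns the same id twice:

```typescript
interface Emission { key: string; id: string; }

// Each word of name/short_name becomes its own emission, all pointing at the same doc.
const emitFreetextKeys = (doc: { _id: string; name: string; short_name: string }): Emission[] =>
  [doc.name, doc.short_name]
    .flatMap(value => value.toLowerCase().split(/\s+/))
    .map(key => ({ key, id: doc._id }));

// Emulates a CouchDB startkey/endkey range query: results sorted by key, not by id.
const startsWithQuery = (emissions: Emission[], prefix: string): string[] =>
  emissions
    .filter(({ key }) => key.startsWith(prefix))
    .sort((a, b) => a.key.localeCompare(b.key))
    .map(({ id }) => id);

const contact = { _id: 'abc123', name: "Alberto O'Kon Rivera", short_name: 'River' };
// Both the "river" and "rivera" emissions match the prefix, so the same id appears twice.
console.log(startsWithQuery(emitFreetextKeys(contact), 'river')); // [ 'abc123', 'abc123' ]
```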
@sugat009 definitely interested to hear your thoughts on the best way to proceed here. Pragmatically, my current inclination is to guarantee that each page will be free from duplicates, but then note in our documentation that different pages could contain the same ids. (I cannot come up with any feasible way to guarantee no dupes across pages).
The good news is that if we decide to just dupe-check on a page-by-page basis, then I think we can easily just reuse the fetch and filter logic here like this:

```ts
const uuidSet = new Set<string>();
const filterFn = (uuid: Nullable<string>): boolean => {
  if (!uuid) {
    return false;
  }
  const { size } = uuidSet;
  uuidSet.add(uuid);
  return uuidSet.size !== size;
};
return await fetchAndFilter(
  getDocsFn,
  filterFn,
  limit
)(limit, skip);
```
(We just need to update the `fetchAndFilter` signature and change `<T extends Doc>` to just be `<T>`. I don't think we actually need the `extends Doc` for the current functionality.)
The same thing is going to apply for the contact logic too.
Okay, after a bunch more investigation, I still believe the best approach is to ensure there are no duplicate entries on each page, but allow duplicate entries to be returned across pages. This is slightly better than the current implementation of `shared-libs/search` where, as far as I can tell, the default page limit was `50` and there was no dupe resolution at all for single-request search queries (so it would have been possible to get duplicate results back even on the same page).
For the record, here are the various approaches I considered to try and avoid returning duplicate entries across pages (and why each of them will not work):
- Edit the view code to only emit once for each contact.
  - This would just break freetext searching, since it depends on mapping multiple search terms to the same contact.
- Use a `reduce` function on the view to combine duplicate results.
  - To properly support freetext searches, we need to emit multiple keys for the same contact. The "duplicate" results are produced at runtime and are caused by doing a "starts-with" search across a range of keys. I do not see any way to use a `reduce` to join results for a contact that would be useful for a "starts-with" type search.
- Emit the contact `_id` value as part of the key, so the query results would be sorted by contact id.
  - You cannot use the `_id` as the first value in a key array (unless you wanted to only search for entries for a particular id). Otherwise you would need to match all for the first element in the array, and that would prevent filtering by any subsequent elements in the array. The `start_key`/`end_key` boundaries would essentially just include everything.
  - You cannot emit the `_id` value as part of the key (concatenated to the actual search key). If you put the `_id` at the beginning of the key value, it would break the "starts-with" range search functionality. If you put the `_id` at the end of the key value, there still would be no way to guarantee no duplicates, because the results would be sorted by key and the same id value might appear later in the results after a slightly different search term.
- Just pull back all the ids and filter them for duplicates.
  - Technically this would work, but it would totally defeat the purpose of having a paged API. It would also be much less performant than our current approach.
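To illustrate the failure mode of the `_id`-in-key idea: if the `_id` were concatenated at the front of each emitted key, a term-prefix range query could never match anything. A tiny sketch with hypothetical keys (not real view output):

```typescript
// Hypothetical emitted keys with the doc _id prepended to the search term.
const keysWithIdFirst = ['abc123:river', 'abc123:rivera', 'def456:river'];

// A "starts-with" range query only matches key prefixes, so searching by
// term no longer works once the _id leads the key.
const matches = keysWithIdFirst.filter(key => key.startsWith('river'));
console.log(matches); // [] — no term-prefix match is possible
```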
I've tried this to learn and see if there's anything else that can be done. I did not find anything different from the options above. The only technically feasible thing to do right now is what @jkuester suggested, i.e. filter duplicate ids on the same page but allow them on different pages.

When you think about it, it isn't that bad realistically, as the default page size is `10000`, so the chances that duplicate IDs will land on different pages are not that high, I think.
```ts
return isRecord(qualifier) &&
  hasField(qualifier, { name: 'freetext', type: 'string' }) &&
  qualifier.freetext.length >= 3 &&
  !qualifier.freetext.includes(' ');
```
Okay, so once again I missed some important logic here. 😬 🤦 I was reviewing the view query code and realized that the keyed freetext values are actually emitted whole and un-split. So, these actually can include whitespace values (as opposed to the non-keyed values, which do not include whitespace). I did some testing to confirm. If you have a doc with the field:

```json
"name": "Auburn's Household"
```

the following keys get emitted for this field:

```
["auburn's"]
["household"]
["name:auburn's household"]
```
We should allow for searching for any of these. In fact, I found this issue when reviewing your `shared-libs/validation` changes. That code definitely relies on this behavior (and we need to avoid having to fall back to the Couch view in the `shared-libs/validation` code...).

So, I think we need to update this logic to validate that the `freetext` value either is keyed (it includes `:`) OR has no whitespace.
Also, our white-space check should probably match the view logic and use `/\s+/` to check for white-space instead of just `' '`.
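Putting the two suggestions together, the updated qualifier check might look something like the following. This is just a sketch of the suggested predicate on the `freetext` string itself (the surrounding `isRecord`/`hasField` checks are omitted), not a final implementation:

```typescript
// A freetext value is valid when it is long enough and is either
// keyed (contains ':', and so may legitimately contain whitespace)
// or un-keyed with no whitespace at all.
const isValidFreetext = (freetext: string): boolean =>
  freetext.length >= 3 &&
  (freetext.includes(':') || !/\s+/.test(freetext));

console.log(isValidFreetext("name:auburn's household")); // true  (keyed, whitespace allowed)
console.log(isValidFreetext('river'));                   // true  (un-keyed, no whitespace)
console.log(isValidFreetext("auburn's household"));      // false (un-keyed with whitespace)
console.log(isValidFreetext('ri'));                      // false (too short)
```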
Honestly, @sugat009, this is probably as good a time as any to pause and reconsider if we should include this kind of validation here in cht-datasource at all... 🤔 Originally I was thinking that it would be best to protect against running searches with keys that could not return any results. Throwing an error for an invalid freetext qualifier allows consumers to distinguish between a search that just did not return any results vs a search with an invalid qualifier.
The more I consider this, though, the less confident I get in that behavior. It seems like someone using cht-datasource to search with (e.g. someone calling the REST apis) might not care about the difference between an invalid qualifier and not finding any results. In both cases, just getting back an empty page might be reasonable/desired behavior??? 🤔
What are your thoughts here? Personally, I am still leaning toward keeping the validation here, if only to force consumers to be aware of some of the weird behavior that these search rules entail. Take the above `"name": "Auburn's Household"` case. If you do a view query for `["name:auburn's household"]`, you will find that contact. However, if you do a view query for `["auburn's household"]`, you will NOT find the contact (because of how things get indexed). This is just weird and unexpected if you are not intimately familiar with the internal logic of the freetext views. Seems better to surface this weirdness here... I think.
Description
Closes: #9586
Code review checklist
Compose URLs
If Build CI hasn't passed, these may 404:
License
The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.