KeywordSearch

Given a collection of free-text documents and a set of keywords, we want to know: which keywords have fuzzy matches in which documents?

There are two types of errors:

  • false positive: we flag a match when it does not exist
  • false negative: we miss a match that should exist

Here, we primarily wish to minimize false negatives (ideally without having too many false positives either!). That is, we wish to err on the side of having too many matches: in cases where we have many haystacks and few needles, we'd rather manually filter out some pins (false needles 😉) than miss a needle.

In particular, we want to be robust to the following types of mistakes in the documents, each of which could cause us to miss a match:

  1. Misspellings
  2. Word conjunctions (a missing space between words)
  3. Hyphenated words conjoined without the hyphen
  4. Hyphenated words/phrases joined with spaces instead of hyphens
  5. Erroneously redacted terms (e.g. "Darwin's Frog" being redacted as suspected PHI, since "Darwin" could be flagged as a person's name, whereas here it is the name of an animal)

We will also pay attention to one source of false positives:

  6. Search terms that are substrings of other, more general terms (not the intended term class). E.g. "ant" could show up as part of "anteater", but that would not be a correct match if we're looking for the little insects.

We address...

  • ...(1) by using fuzzy matching via an edit distance (i.e. a query matches a substring of the document if the two are within n edits of each other, where we must choose n). Edit distances are supplied by StringDistances.jl.
  • ...(2) by not requiring matches to fall on word boundaries, only on substrings (i.e. if the query is cobra and kingcobra shows up in the document, then the substring cobra found in the document is a perfect match).
  • ...(3) and (4) by first replacing hyphens with spaces, and then augmenting our search terms: for any term containing spaces or hyphens, we generate the list of terms given by every possible choice of (space, no space). For example, the query "crab-eating macaque" is augmented to the query ("crab eating macaque" OR "crabeating macaque" OR "crab eatingmacaque" OR "crabeatingmacaque").
  • ...(5) by allowing a global list of replacements (KeywordSearch.AUTOMATIC_REPLACEMENTS) that can be used to manually undo erroneous redactions.
  • ...(6) by noting that our solution to (2) has gotten us into trouble here. The fix is to add spaces around the term: instead of searching for "ant" we can search for " ant " and require an exact match at the boundaries. In the language of KeywordSearch.jl, this is accomplished by word_boundary(Query("ant")); see the sketch after this list.
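
For instance, here is a minimal sketch of the word-boundary approach, using only the calls shown elsewhere in this README (the document text and metadata are made up for illustration):

using KeywordSearch, UUIDs

# Wrapping the query in word_boundary means "ant" must appear as a standalone
# word, so the "ant" inside "anteater" should not be reported as a match.
insect = word_boundary(Query("ant"))
doc = Document("The anteater ignored the ant.", (; document_uuid = uuid4()))
m = match(insect, doc)  # expected to match the standalone "ant" only
explain(m)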

Example

julia> using KeywordSearch, UUIDs

julia> document = Document("""
                         The crabeating macacue ate a crab.
                         """, (; document_uuid = uuid4()))
Document starting with "The crabeating macacue…". Metadata: (document_uuid = UUID("a703302c-eeda-46ba-8755-940a7db86b63"),)

julia> query = augment(FuzzyQuery("crab-eating macaque"))
Or
├─ FuzzyQuery("crab eating macaque", DamerauLevenshtein(), 2)
├─ FuzzyQuery("crabeating macaque", DamerauLevenshtein(), 2)
├─ FuzzyQuery("crab eatingmacaque", DamerauLevenshtein(), 2)
└─ FuzzyQuery("crabeatingmacaque", DamerauLevenshtein(), 2)

julia> m = match(query, document)
QueryMatch with distance 1 at indices 5:22.

julia> explain(m)
The query "crabeating macaque" matched the text "The crabeating macacue ate a crab \n " with distance 1.

Currently, only the StringDistances.DamerauLevenshtein distance measure is supported.
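
Judging from how the augmented queries are displayed above, the FuzzyQuery constructor also appears to accept an explicit distance and maximum edit count. The following is a hedged sketch assuming that signature (inferred from the printed form above, not from the package documentation), tightening the threshold to a single edit:

using KeywordSearch, StringDistances, UUIDs

# Assumed signature: FuzzyQuery(term, distance, max_edits), inferred from the
# printed form FuzzyQuery("crab eating macaque", DamerauLevenshtein(), 2).
strict = FuzzyQuery("macaque", DamerauLevenshtein(), 1)
doc = Document("The macacue ate a crab.", (; document_uuid = uuid4()))
match(strict, doc)  # "macacue" is one edit from "macaque", so this should still match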