Skip to content

Latest commit

 

History

History
165 lines (115 loc) · 9.35 KB

README.md

File metadata and controls

165 lines (115 loc) · 9.35 KB

libsearch 🔎

npm libsearch TypeScript types Build Status

Simple, index-free text search for JavaScript, used across my personal projects like YC Vibe Check, linus.zone/entr, and my personal productivity software. Read the annotated source to understand how it works under the hood.

The API

Let's begin with some quick examples:

import { search } from 'libsearch'; // on Node.js
const { search } = window.libsearch; // in the browser

const articles = [
    { title: 'Weather in Berkeley, California' },
    { title: 'University report: UC Berkeley' },
    { title: 'Berkeley students rise in solidarity...' },
    { title: 'Californian wildlife returning home' },
];

// basic usage
search(articles, 'berkeley cali', a => a.title);
// => [{ title: 'Weather in Berkeley, California' }]
search(articles, 'california', a => a.title);
// => [
//   { title: 'Weather in Berkeley, California' },
//   { title: 'Californian wildlife returning home' },
// ]

// mode: 'word' only returns whole-word matches
search(articles, 'california', a => a.title, { mode: 'word' });
// => [{ title: 'Weather in Berkeley, California' }]

// case sensitivity
search(articles, 'W', a => a.title, { caseSensitive: true });
// => [{ title: 'Weather in Berkeley, California' }]

// empty query returns the full list, unmodified
search(articles, '', a => a.title);
// => [{...}, {...}, {...}, {...}]

More formally, libsearch exposes a single API, the search function. This function takes two required arguments and two optional arguments:

function search<T>(
    items: T[],
    query: string,
    by?: (it: T) => string,
    options?: {
        caseSensitive: boolean,
        mode: 'word' | 'prefix' | 'autocomplete',
    },
): T[]
  • items is a list of items to search. Typically items will be an array of strings or an array of objects with some string property.
  • query is a string query with which to search the list of items.
  • by (optional) is a predicate function that takes an item from items and returns a string value by which to search for that item. For example, if items is a list of objects like { name: 'Linus' }, by will need to be a function x => x.name. This has the value x => String(x) by default, which works for an items of type string[].
  • options (optional) is a dictionary of options:
    • caseSensitive makes a search case-sensitive. It's false by default.
    • mode controls the way in which incomplete query words are matched:
      • mode: 'word' requires every query word to match only full, exact words rather than parts of words. For example, the query "California" will match "University of California" but not "Californian University".
      • mode: 'prefix' means that every query word may be an incomplete "prefix" of the matched word. "Uni Cali" will match both "University of California" and "Californian University" Even in this mode, every query word must match somewhere — "California" is not a match, because it doesn't match the query word "Uni".
      • mode: 'autocomplete' is a hybrid of the other two modes that's useful when used in autocomplete-style searches, where a user is continuously typing in a query as search results are being returned. This mode is identical to mode: 'word', except that the last query word may be incomplete like in mode: 'prefix'. It means "University of Cali" will match "University of California", which is useful because the user may find their match before having typed in their full query.

You can find more examples of how these options combine together in the unit tests.

Installation and usage

On the web, with <script>

Drop this into your HTML:

<script src="https://unpkg.com/libsearch/dist/browser.js"></script>

This will expose the search function as window.libsearch.search.

Via NPM

npm install libsearch
# or
yarn add libsearch

And use in your code:

import { search } from 'libsearch';

// search(...);

Using TypeScript types

libsearch ships with TypeScript type definitions generated from the source file. Using libsearch from NPM should get them picked up by the TypeScript compiler.

How it works

libsearch lets you perform basic full-text search across a list of JavaScript objects quickly, without requiring a pre-built search index, while offering reasonably good TF-IDF ranking of results. It doesn't deliver the wide array of features that come with libraries like FlexSearch and lunr.js, but is a big step above text.indexOf(query) > -1, and is fast enough to be usable for searching thousands of documents on every keystroke in my experience.

There are two key ideas in how libsearch delivers this:

1. Transforming queries into regular expressions

Modern JavaScript engines ship with highly optimized regular expression engines, and libsearch takes advantage of this for fast, index-free text search by transforming query strings into regular expression filters at search time.

Most full-text search libraries work by first requiring the developer to build up an "index" data structure mapping search terms to documents in which they appear. This is usually a good tradeoff, because it moves some of the computational work of "searching" to be done ahead of time, so search itself can remain fast and accurate. It also allows for fancy transformations and data cleanup like lemmatization on the indexed data without destroying search speed. But when building prototypes and simple web apps, I often didn't want to incur the complexity of having a separate "indexing" step to get a "good enough" search solution. An index needs to be stored somewhere and maintained constantly as the underlying dataset changes and grows.

The main task of a search index is mapping "tokens" or keywords that appear in the dataset to the documents in which they appear, so that the question "which documents contain the word X?" is fast (O(1)) to answer at search time. Without an index, this turns into an O(n) question, as every document needs to be scanned for the keyword. But often, on modern hardware, for small-enough datasets (of a few MBs) typical in a client-side web app, the n is pretty small, small enough that O(n) on every keystroke isn't noticeable.

libsearch transforms a query like "Uni of California" into a list of regular expression filters, (^|\W)Uni($|\W), (^|\W)of($|\W), (^|\W)California. It then "searches" without needing an index by filtering the corpus through each of those regular expressions.

2. "Good enough" TF-IDF ranking based on RegExp matches and document length

The conventional TF-IDF metric is computed for each word as:

(# matches) / (# words in the doc) * log(# total docs / # docs that matched)

Getting the number of words in a doc requires tokenizing the document, or at least splitting the document by whitespaces, which is computationally expensive. So libsearch approximates this by using the length of the document (number of characters) instead.

Using the regular expression queries described above, libsearch's TF-IDF formula is:

(# RegExp matches) / (doc.length) * log(# docs / # docs that matched RegExp)

which is computed for each word as the search is performed, and then aggregated at the end for sorting.

Development

libsearch's source code is written in TypeScript. To allow the library to be used across TypeScript, vanilla Node.js and the web, we compile two builds:

  • The ES module build, which is just search.ts type-checked and types removed. This is the code imported when libsearch is imported in Node.js
  • The browser build, which exports the main search function to the window.libsearch global

The ES module build is produced with tsc, the TypeScript compiler, and the minified browser build is further produced with Webpack.

NPM/Yarn commands:

  • lint and fmt, which lint and automatically format source code in the repository
  • test runs unit tests on the latest build of the library; you should run build:tsc before running test
  • Various build:* commands orchestrate producing the different types of library builds:
    • build:tsc builds the ES module build
    • build:w runs build:tsc on every file write
    • build:cjs builds the browser build from the ES module build
    • build:all builds both builds, in order
  • clean removes all generated/build files in dist/
  • docs builds the Litterate-based documentation, which lives at thesephist.github.io/libsearch.

Before pushing to main or publishing, I usually run

yarn fmt && yarn build:all && yarn test && yarn docs

to make sure I haven't forgotten anything.