by Neel Mehta, Harvard University
SearchBetter lets you build powerful, fast, drop-in search engines for any dataset, however small or large. It also offers built-in query rewriting, which uses NLP to help your search engine find content that is semantically related to the user's search term.
For instance, a search for "machine learning" might only return results for items that contain the words "machine learning". But with query rewriting, you would get results not only for "machine learning" but also, say, "artificial intelligence" and "neural networks".
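To make the idea concrete, here is a minimal sketch of query rewriting in plain Python. It does not use SearchBetter's API: `RELATED_TERMS`, `rewrite_query`, and `search` are hypothetical names, and the related terms are hard-coded where a real rewriter would derive them with NLP.

```python
# Toy illustration of query rewriting; this is not SearchBetter's implementation.
# RELATED_TERMS is a hand-written, hypothetical stand-in for an NLP model.
RELATED_TERMS = {
    "machine learning": ["artificial intelligence", "neural networks"],
}

DOCUMENTS = [
    "An introduction to machine learning",
    "Neural networks from scratch",
    "A history of artificial intelligence",
    "Cooking with cast iron",
]


def rewrite_query(term):
    """Return the lowercased term followed by any related terms."""
    key = term.lower()
    return [key] + RELATED_TERMS.get(key, [])


def search(term, documents, use_rewriting=True):
    """Return documents containing the term (or any rewritten term)."""
    terms = rewrite_query(term) if use_rewriting else [term.lower()]
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


plain_results = search("machine learning", DOCUMENTS, use_rewriting=False)
rewritten_results = search("machine learning", DOCUMENTS)
print(plain_results)      # only the document containing the exact phrase
print(rewritten_results)  # also the neural networks and AI documents
```

Without rewriting, only the first document matches. With rewriting, the second and third match too, because the query is expanded before the substring search runs.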
SearchBetter lets you power up your search engines with minimal effort. It's especially useful if you have a small dataset to search, or if you don't have the time or data to build fancy bespoke query-rewriting algorithms.
To drop this module into your app:
pip install searchbetter
For more advanced analysis and research purposes, use the interactive demo to get yourself set up!
For a truly quick-and-dirty dive into SearchBetter (no setup required), use:
from searchbetter import rewriter
query_rewriter = rewriter.WikipediaRewriter()
query_rewriter.rewrite('biochemistry')
Documentation is available online at http://searchbetter.readthedocs.io/.
To build the docs yourself using Sphinx:
cd docs
make html
open _build/html/index.html
Some of the data used with SearchBetter is proprietary to Harvard and HarvardX. Other data, like the Udacity API and the Wikipedia dump, is open to the public.
Name | URL | Save file as |
---|---|---|
Udacity API | https://www.udacity.com/public-api/v0/courses | udacity-api.json |
Wikipedia dump | See below | wikiclean8 |
edX courses | Proprietary | Master CourseListings - edX.csv |
DART data | Proprietary | corpus_HarvardX_LatestCourses_based_on_2016-10-18.csv |
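As a quick sanity check before running anything, a short script along these lines can report which of the files in the table above are still missing. This is only a sketch: `missing_data_files` is a hypothetical helper, and the directory you pass in is an assumption, since where the files should live depends on your setup.

```python
from pathlib import Path

# Filenames copied from the table above.
EXPECTED_FILES = [
    "udacity-api.json",
    "wikiclean8",
    "Master CourseListings - edX.csv",
    "corpus_HarvardX_LatestCourses_based_on_2016-10-18.csv",
]


def missing_data_files(data_dir):
    """Return the expected filenames that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# "." is an assumed location; pass whichever directory holds your data files.
for name in missing_data_files("."):
    print(f"missing: {name}")
```

If you do not have access to the proprietary files, expect the edX and DART entries to be reported as missing.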
Download and unzip the enwik8 dataset from http://www.mattmahoney.net/dc/enwik8.zip. Then run:
perl processing-scripts/wiki-clean.pl enwik8 > wikiclean8
This might take a minute or two to run.
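If you would rather script these steps, the sketch below does the same download, unzip, and clean sequence from Python using only the standard library. The function names are hypothetical, and it assumes that `perl` is on your PATH, that you run it from the repository root (so `processing-scripts/wiki-clean.pl` resolves), and that the archive contains a single file named `enwik8`.

```python
import subprocess
import urllib.request
import zipfile
from pathlib import Path

ENWIK8_URL = "http://www.mattmahoney.net/dc/enwik8.zip"


def download(url, dest):
    """Download url to dest unless the file is already there."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def unzip(zip_path, out_dir):
    """Extract every member of the archive and return the extracted paths."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return [Path(out_dir) / name for name in archive.namelist()]


def clean(raw_path, clean_path, script="processing-scripts/wiki-clean.pl"):
    """Run the Perl cleaning script, writing its stdout to clean_path."""
    with open(clean_path, "wb") as out:
        subprocess.run(["perl", script, str(raw_path)], stdout=out, check=True)


def prepare_wikipedia_dump(work_dir="."):
    """Download, unzip, and clean enwik8, producing wikiclean8 in work_dir."""
    work_dir = Path(work_dir)
    archive = download(ENWIK8_URL, work_dir / "enwik8.zip")
    unzip(archive, work_dir)
    clean(work_dir / "enwik8", work_dir / "wikiclean8")
```

Calling `prepare_wikipedia_dump()` runs all three steps. Nothing is downloaded until you call it, and an existing `enwik8.zip` is reused rather than fetched again.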
SearchBetter was designed as part of a research project by Neel Mehta, Daniel Seaton, and Dustin Tingley for Harvard's CS 91r, a research-for-credit course.
It was originally designed for Harvard DART, a tool that helps educators reuse HarvardX assets such as videos and exercises in their online or offline courses. SearchBetter is especially useful for MOOCs, which often have small corpora and have to deal with many uncommon queries (students will search for the most unfamiliar terms, after all). Still, SearchBetter has been made general-purpose enough that it can be used with any corpus or any search engine.