
neuml/staticvectors


Work with static vector models


staticvectors makes it easy to work with static vector models. This includes word vector models such as Word2Vec, GloVe and FastText. While Transformers-based models are now the primary way to embed content for vector search, these older models still have a purpose.

For example, the FastText language identification model is still one of the fastest and most efficient ways to detect languages. N-grams work well for this task, and inference is lightning fast.

Additionally, there are historical, low-resource and other languages where there just isn't enough training data to build a solid language model. In these cases, a simpler model built with one of these older techniques might be the best option.

What's wrong with the existing libraries?

Unfortunately, the tooling for word vector models is aging and in some cases unmaintained. The world is moving forward, and these libraries are getting harder to install.

As a concrete example, the build script for txtai often has to be modified to get FastText to work on all supported platforms. Pre-compiled versions exist, but they're often slow to support the latest version of Python or to fix issues.

This project breathes life into word vector models and integrates them with modern tooling such as the Hugging Face Hub and Safetensors. While it's pure Python, it's still fast due to its heavy usage of NumPy and vectorization techniques.

Being a single pure Python package also makes the project easier to install and maintain.
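
As a rough illustration of that vectorization (a toy sketch, not this library's actual internals), a batch embedding lookup reduces to a couple of NumPy array operations:

import numpy as np

# Toy sketch only - not the staticvectors internals
# A vocabulary maps tokens to rows in a single embedding matrix
vocab = {"hello": 0, "world": 1, "static": 2, "vectors": 3}
weights = np.random.rand(len(vocab), 100).astype(np.float32)

def embed(tokens):
    # One fancy-indexing call gathers all token vectors at once,
    # then a vectorized mean pools them into a single embedding
    return weights[[vocab[token] for token in tokens]].mean(axis=0)

print(embed("hello world".split()).shape)  # (100,)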

Installation

The easiest way to install is via pip and PyPI

pip install staticvectors

Python 3.9+ is supported. Using a Python virtual environment is recommended.

staticvectors can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/staticvectors

Quickstart

See the following examples for how to use this library. Note that many of the examples below require the train extra.

Convert an existing FastText model

from staticvectors import FastTextConverter, StaticVectors

# Download https://huggingface.co/julien-c/fasttext-language-id/blob/main/lid.176.bin
converter = FastTextConverter()
converter("lid.176.bin", "langid")

# Load the converted model - runs in pure Python, FastText library install not required for inference
model = StaticVectors("langid")
model.predict("Hello, what language is this?")

Load an existing Magnitude SQLite database

This library replaces Magnitude Lite, which is now deprecated. Existing Magnitude SQLite databases are supported by this library.

from staticvectors import StaticVectors

model = StaticVectors("/path/to/vectors.magnitude")

# Get word vector
model.embeddings("hello")

Convert and quantize

from staticvectors import FastTextConverter, StaticVectors

# Download https://huggingface.co/julien-c/fasttext-language-id/blob/main/lid.176.bin
converter = FastTextConverter()

# Quantize with PQ and two subspaces - model goes from 125MB to 4MB with minimal accuracy impact!
converter("lid.176.bin", "langid-pq2x256", quantize=2)

# Load the converted model - runs in pure Python, FastText library install not required for inference
model = StaticVectors("langid-pq2x256")
model.predict("Hello, what language is this?")

Train a new model

from staticvectors import StaticVectorsTrainer

# Internally builds a FastText model then exports it to a StaticVectors model
trainer = StaticVectorsTrainer()
model = trainer("path/to/training.txt", size=100, mincount=1, path="model output path")
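
A minimal end-to-end sketch, assuming the train extra is installed and FastText's standard corpus format (plain whitespace-tokenized text, one document per line):

from staticvectors import StaticVectorsTrainer

# Hypothetical toy corpus - FastText trains on plain, whitespace-tokenized text
with open("training.txt", "w", encoding="utf-8") as corpus:
    corpus.write("the quick brown fox jumps over the lazy dog\n")
    corpus.write("a fast brown fox leaps over a sleepy dog\n")

trainer = StaticVectorsTrainer()
model = trainer("training.txt", size=100, mincount=1, path="toymodel")

# Assumes the returned StaticVectors model exposes embeddings as shown above
model.embeddings("fox")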

See the unit tests in this project for more examples.

Libraries for Static Embeddings with Transformers models

This library is primarily focused on word vector models. There has been a recent push to distill Transformers models into static embeddings models. The difference between staticvectors and those libraries is that their base models are Transformers models. Additionally, they use Transformers tokenizers, whereas word vector models tokenize on whitespace and use n-grams.
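
To make the tokenization difference concrete, here's a sketch of FastText-style character n-grams (with boundary markers), in contrast to the learned subword tokenizers used by Transformers models:

def ngrams(word, minn=3, maxn=6):
    # FastText-style character n-grams with < and > boundary markers
    word = f"<{word}>"
    return [word[i:i + n] for n in range(minn, maxn + 1) for i in range(len(word) - n + 1)]

print(ngrams("hello"))  # ['<he', 'hel', 'ell', 'llo', 'lo>', '<hel', ...]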

Check out these links for more on static embeddings with Transformers models.