Work with static vector models
staticvectors makes it easy to work with static vector models, including word vector models such as Word2Vec, GloVe and FastText. While Transformers-based models are now the primary way to embed content for vector search, these older models still have a purpose.
For example, this FastText language identification model is still one of the fastest and most efficient ways to detect languages. N-grams work well for this task and inference is lightning fast.
Additionally, there are historical, low-resource and other languages where there just isn't enough training data to build a solid language model. In these cases, a simpler model using one of these older techniques might be the best option.
Unfortunately, the tooling to use word vector models is aging and in some cases unmaintained. The world is moving forward and these libraries are getting harder to install.
As a concrete example, the build script for txtai often has to be modified to get FastText to work on all supported platforms. There are pre-compiled versions but they're often slow to support the latest version of Python or fix issues.
This project breathes life into word vector models and integrates them with modern tooling such as the Hugging Face Hub and Safetensors. While it's pure Python, it's still fast due to its heavy use of NumPy and vectorization techniques. This also simplifies maintenance, as it's a single pure Python package.
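For example, converted models can be loaded directly from the Hugging Face Hub by model id. A minimal sketch, assuming a converted model has been published to the Hub (neuml/language-id is a hypothetical id used here for illustration):

from staticvectors import StaticVectors

# Load a model hosted on the Hugging Face Hub
# (neuml/language-id is an assumed model id for illustration)
model = StaticVectors("neuml/language-id")
model.predict("Bonjour tout le monde")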
The easiest way to install is via pip and PyPI.
pip install staticvectors
Python 3.9+ is supported. Using a Python virtual environment is recommended.
staticvectors can also be installed directly from GitHub to access the latest, unreleased features.
pip install git+https://github.com/neuml/staticvectors
See the following examples on how to use this library. Note that many of the examples below require the train extra.
from staticvectors import FastTextConverter, StaticVectors
# Download https://huggingface.co/julien-c/fasttext-language-id/blob/main/lid.176.bin
converter = FastTextConverter()
converter("lid.176.bin", "langid")
# Load the converted model - runs in pure Python, FastText library install not required for inference
model = StaticVectors("langid")
model.predict("Hello, what language is this?")
This library replaces Magnitude Lite, which is now deprecated. Existing Magnitude model files are supported by this library.
from staticvectors import StaticVectors
model = StaticVectors("/path/to/vectors.magnitude")
# Get word vector
model.embeddings("hello")
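The vectors returned by embeddings can be compared directly with NumPy. A minimal sketch, assuming embeddings returns a 1-D NumPy array as in the example above:

import numpy as np

# Cosine similarity between two word vectors
# (assumes embeddings() returns a 1-D NumPy array)
a, b = model.embeddings("hello"), model.embeddings("world")
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))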
from staticvectors import FastTextConverter, StaticVectors
# Download https://huggingface.co/julien-c/fasttext-language-id/blob/main/lid.176.bin
converter = FastTextConverter()
# Quantize with PQ and two subspaces - model goes from 125MB to 4MB with minimal accuracy impact!
converter("lid.176.bin", "langid-pq2x256", quantize=2)
# Load the converted model - runs in pure Python, FastText library install not required for inference
model = StaticVectors("langid-pq2x256")
model.predict("Hello, what language is this?")
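To see why quantization shrinks the model so dramatically, here is a rough sketch of product quantization (PQ) itself. This is an illustration with random, untrained codebooks, not the library's implementation; the dimensions and variable names are assumptions:

import numpy as np

# Product quantization with 2 subspaces x 256 centroids: each 16-dim vector is
# split into two 8-dim halves and each half is replaced by the id of its nearest
# centroid. Storage per vector drops from 16 float32 values (64 bytes) to 2 bytes,
# plus the shared codebooks.
dims, subspaces, centroids = 16, 2, 256
codebooks = np.random.rand(subspaces, centroids, dims // subspaces).astype(np.float32)
vector = np.random.rand(dims).astype(np.float32)

# Encode: nearest centroid id per subvector
codes = [int(np.argmin(np.linalg.norm(codebooks[i] - sub, axis=1)))
         for i, sub in enumerate(np.split(vector, subspaces))]

# Decode: approximate the original vector from the codebook entries
approx = np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])
print(codes, approx.shape)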
from staticvectors import StaticVectorsTrainer
# Internally builds a FastText model then exports it to a StaticVectors model
trainer = StaticVectorsTrainer()
model = trainer("path/to/training.txt", size=100, mincount=1, path="model output path")
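Once training completes, the exported model loads like any other StaticVectors model. A minimal sketch, reusing the output path passed to the trainer above:

from staticvectors import StaticVectors

# Load the trained model from the trainer's output path
model = StaticVectors("model output path")
model.embeddings("hello")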
See the unit tests in this project for more examples.
This library is primarily focused on word vector models. There is a recent push to distill Transformers models into static embedding models. The difference between staticvectors and these libraries is that their base models are Transformers models. Additionally, they use Transformers tokenizers, whereas word vector models tokenize on whitespace and use n-grams.
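To make that tokenization difference concrete, here is a rough sketch of FastText-style character n-grams. The < and > boundary markers and the default n-gram sizes follow the FastText convention; the function itself is a hypothetical illustration, not this library's API:

# Whitespace tokenization plus FastText-style character n-grams
def ngrams(token, minn=3, maxn=6):
    token = f"<{token}>"
    return [token[i:i + n] for n in range(minn, maxn + 1)
            for i in range(len(token) - n + 1)]

for token in "Hello world".split():
    print(token, ngrams(token))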
Check out these links for more on static embeddings with Transformers models.
- Model2Vec - Turn any sentence transformer into a really small static model
- Static Sentence Transformers - Run 100x to 400x faster on CPU than state-of-the-art embedding models