-
Notifications
You must be signed in to change notification settings - Fork 3.7k
FAQ
To analyze a matrix, print
MatrixStats(my_matrix).comments
(Python)
MatrixStats(n, d, my_matrix).comments
(C++)
this will output some (hopefully readable) comments on the matrix content: are there NaNs? Duplicate vectors? Constant dimensions? Are the vectors normalized? It is always useful to run this when faced with some weird behavior in Faiss.
Keep in mind that floating-point computations are prone to round-off errors. These are particularly visible when floats of different magnitudes are added ("catastrophic cancellation").
Two examples:
-
if components of a vector are large and slightly different, and the decomposition
||x-y||^2 = ||x||^2 + ||y||^2 - 2 * <x, y>
is used (this is the case withIndexFlat
with batches of > 20 query vectors), then the differences are may be cancelled out, see this example. A workaround is to center the vector components, which does not change the distances but improves the problem's conditioning. -
if some components are much larger than others, then during accumulation of distances, smaller components may be cancelled out, see this example. Here there is no real workaround. Fortunately it happens mostly with vectors that are far apart, and thus are hopefully not relevant for similarity search.
A formalization of finite precision computations: see chapter 2.7 in the book "Matrix computations", Golub & van Loan, Hopkins univ press.
IndexIVFPQ
(aka "IVFx,PQy"
) relies on vector compression and an inverted list that restricts the distance computations to just a fraction of the dataset.
If the accuracy of an IndexIVFPQ
is too low:
-
set the
nprobe
to the number of centroids to scan the whole dataset instead, and see how it performs. Note that the defaultnprobe
is 1, which is on the low side. -
build
IndexIVFFlat
(aka"IVFx,Flat"
) instead ofIndexIVFPQ
to see how much is lost in the vector compression.
The combination of both should yield 100% accuracy.
Concurrent searches are supported for the CPU code, but not the GPU code. Concurrent search/add or add/add are not supported. There is no locking mechanism in place to protect against this, so the calling code should maintain a lock.
See #492 for workarounds.
Faiss is optimized for batch search. There are three reasons for that:
-
most indexes rely on a clustering of the data that at query time requires a matrix-vector multiplication (for a single query vector) or matrix-matrix multiplication (for a batch of queries). Matrix-matrix multiplications are often much faster than the corresponding amount of matrix-vector multiplications.
-
search parallelization is over the queries. Doing otherwise would require to maintain several result lists per thread and merging them on output, a source of overhead.
-
in a multithreaded environment, several searches can be performed concurrently, to fully occupy the processing cores of the machine.
In C++, do I need to keep a reference to the coarse quantizer around for an IndexIVFFlat/IndexIVFPQ/IndexIVFPQR index?
If you construct the coarse quantizer yourself, the code assumes by default that you will delete it. To transfer ownership to the IndexIVF
, set own_fields
to true.
If you constructed the index with the index_factory
then all sub-indexes belong to the object returned by the factory, so there is no need to worry about ownership.
In Python, ownership management is automatic. See also: https://github.com/facebookresearch/faiss/wiki/Troubleshooting#crashes-in-pure-python-code
For this you can:
-
train an IndexIVF* on a representative sample of the data, store it.
-
for each node, load the trained index, add the local data to it, store the resulting populated index
-
on a central node, load all the populated indexes and merge them. Here is a C++ example on how to merge: test_merge.cpp
If the data on the different machines has a different distribution, then it may be beneficial to do a separate training on each of the machines and merge the results at search time:
-
on each node, train an index on this node's data + add the data to the index
-
save the node's index
-
load all the indexes on a central machine and combine them into an
IndexShards
.
Note that if the index uses strong compression, this second solution may yield distances that are hard to compare, and thus the overall indexing accuracy may be worse than doing a common training. YMMV.
Faiss does not support string ids for vectors (or any datatype other than 64-bit ints). It is unlikely that this will change. See issue #641 for a discussion of this topic.
Why does python Faiss not accept float64
/float32
/float16
vectors or uint64
ids or non-contiguous arrays?
Python Faiss is intended to be a shallow wrapper above C++. Therefore the choice was to not implicitly convert data when the C++ API does not support them.
Then why does the C++ layer not support double precision? Because Faiss is a performance-oriented library and double precision is slower with almost no benefit in precision.
The IndexIVF
and IndexHSNW
variants have trouble indexing large numbers of identical vectors.
In the IndexIVF
case this is because they all end up in the same inverted list, that must then be scanned sequentially.
See #1097 for an analysis of the IndexHNSW
case.
This also applies to near-duplicate vectors.
The workaround to this is to de-duplicate vectors prior to indexing.
Faiss does not do that by default because it would have a run-time and memory impact for use cases where there are no duplicates.
However, the IndexFlatDedup
index does de-duplication.
Also, MatrixStats
will find whether a dataset has duplicates.
When applying k-means algorithm to cluster n points to k centroids, there are several use cases:
-
n < k: this raises an exception with an assertion because we cannot do anything meaningful
-
n < min_points_per_centroid * k: this produces the warning above. It means that usually there are too few points to reliably estimate the centroids. This may still be ok if the dataset to index is as small as the training set.
-
n < max_points_per_centroid * k: comfort zone
-
n > max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled.
The parameters {min,max}_points_per_centroids (39 and 256 by default) belong to the ClusteringParameters
structure.
In the python KMeans object the fields can be set as parameters to the constructor.
For indexes, the k-means routine is called for IndexPQ
, IndexIVFFlat
, and twice for IndexIVFPQ
(once for the coarse quantizer with k=ncentroids
, once for the PQ, with k=2^nbits_per_idx
), and three times for IndexIVFPQR
.
In that case, the appropriate ClusteringIndex
is accessible as IndexPQ::pq::cp
and IndexIVF::cp
.
As a rule of thumb there is no consistent improvement of the k-means quantizer beyond 20 iterations and 1000 * k training points. And these number diminish when k increases (ie. when training for larger k you can do fewer iterations on fewer training points). This is why the k-means clustering function samples vectors by default (see previous question).
As an example, the results you'd see from clustering 6.2B vectors to 80K centroids would probably be about the same quality-wise as sampling a random 20.48M subset to 80K centroids, so it's just saving you work (of course sampling should be unbiased).
If the set fits in RAM, the answer is easy, just sample elements without replacement.
In Python
rs = np.random.RandomState(123)
idx = rs.choice(nt_sample, size=nt, replace=False)
xt = xt[idx]
In C++ there is a helper function: fvecs_maybe_subsample
.
If the dataset does not fit in RAM and/or comes from a stream, you need reservoir sampling, see https://en.wikipedia.org/wiki/Reservoir_sampling
Can I used an index trained on some kind of data to index some other type of data of the same dimension?
No.
The objective of the training stage is to exploit the distribution of the data (clusters, sub-spaces) to improve the efficiency of the index. The distribution is estimated on a sample provided at train time, that should be representative of the data that is indexed. This is of course the case when the train set is the same as the added vectors.
When adding data and searching, Faiss checks only whether the dimensionality of the data is correct (and this only in the Python wrappers). If the distribution is incorrect, this will result in degraded performance in terms of accuracy and/or search time.
Cases when a new training is required:
-
when re-training a CNN that produces descriptors that are indexed
-
when the type of media you index becomes statistically different (eg. class1 grows from 1% to 90% of the data)
No it is not. The states for the index are:
-
is_trained = false, ntotal = 0
, transition to 2 withtrain()
-
is_trained = true, ntotal = 0
, transition to 3 withadd()
-
is_trained = true, ntotal > 0
, transition to 2 withreset()
Since a new index is just a bunch of parameters, it is not worthwhile to support "un-training" an index, it is simpler to just construct a new one.
Here we give some handy code in Python notebooks that can be copy/pasted to perform some useful operations.
See get_matrix_from_PCA.ipynb.
This applies to any LinearTransform
object.
See access_PQ_centroids.ipynb.
See here: get_invlists.ipynb
We have an index file but don't know what's in it.
When accessing the Index
fields of a wrapper index, they show up as a plain Index
object.
The downcast_index
converts this plain index to the "leaf" class the index belongs to.
This snippet is a demo on how to use downcast_index
to extract all info from it:
demo_explore_indedex.ipynb
Sometimes the results returned by queries on some index may be disappointing: if there are 100 instances of the same vector in the dataset, and a query happens to hit one of the instances, then the 99 other instances will fill the result list. From Faiss' point of view, this is the correct thing to do, because these are indeed the nearest neighbors of the query, but it is not satisfying from an application point of view.
Possible solutions:
-
do not to add multiple instances of the same object to an index,
-
query more results than you need, and post-process the result list to remove duplicates and near duplicates.
There is no easy way of restricting elements from the search. Faiss mainly relies on scanning strings of codes and computing distances. During the scan, it accesses the corresponding ID only if the code corresponds to a potentially interesting item. Skipping over some of the codes would require to do some ID-dependent operation, which slows the processing down.
The most sensible thing to do is to query Faiss with a larger k than needed and filter out the irrelevant results post-hoc.
To perform searches in an IndexIVF, the algorithm scans a finite set of elements. If there are not enough elements to fill the result list, the missing results are set to -1. You can increase the number of elements that are visited with ParameterSpace().set_index_parameter(index, 'nprobe', 100)
. The parameter nprobe
adjusts the speed/accuracy tradeoff of the search.
However, note that if nprobe
becomes nlist
(the number of inverted lists), this is an exhaustive, exact (if using IVFFlat), brute-force search again and a Flat index will be faster in this case. nprobe
should be substantially less than nlist
in order to achieve speedup.
The proper choice of nprobe
for IVF will depend upon your speed/accuracy tradeoff requirements.
Currently, the support is:
- the
ProductQuantizer
object supports any code betweennbits=
1 and 16. The code size of the M PQ indices is rounded up to a whole number of bytes, ie. PQ3x4 uses 2 bytes; - the
IndexPQ
supports the same; - the
IndexIVFPQ
supports only PQ with 8 bits; - the
MultiIndexQuantizer
supports up to 16 bits per code.
The nprobe
field cannot be accessed directly, so you can either loop manually over the sub-indexes or use a ParameterSpace
object.
In C++:
IndexShards index_shards = .....;
ParameterSpace().set_index_parameter(index_shards, "nprobe", 123);
In Python:
index_shards = faiss.IndexShards(...)
ParameterSpace().set_index_parameter(index_shards, "nprobe", 123);
On GPU, IndexShards
or IndexProxy
objects are built automatically by the index_cpu_to_gpu*
functions.
Instead of ParameterSpace
, use GpuParameterSpace
.
WARNING: setting index_shards.nprobe = 123
in Python does not generate an error, but the nprobe of the index_shards will not be set. This is because Python can add fields dynamically.
Sometimes the IndexIVF
is opaque (ie. seen by Python as an Index
or as an Index* in C++).
This is the case in particular when directly accessing main_index.index
where main_index
is an IndexPreTransform
.
To set the nprobe there are two possibilities.
- In C++:
auto cpu_index = dynamic_cast<faiss::IndexPreTransform*>(faiss::read_index(faissindex_file));
auto index_ivf = dynamic_cast<faiss::IndexIVF*>(cpu_index->index);
index_ivf->nprobe = 123;
or
auto cpu_index = faiss::read_index(faissindex_file);
ParameterSpace().set_index_parameter(cpu_index, "nprobe", 123);
- In Python:
cpu_index = faiss.read_index(faissindex_file)
index_ivf = faiss.downcast_index(cpu_index.index)
index_ivf.nprobe = 123;
or
cpu_index = faiss.read_index(faissindex_file)
ParameterSpace().set_index_parameter(cpu_index, "nprobe", 123)
If you use GPU indices, replace ParameterSpace
with GpuParameterSpace
.
WARNING: setting cpu_index.index.nprobe = 123
does not generate an error, but the nprobe of the index_ivf will not be set.
This is because index.index
is seen by Python as a generic Index
to which it adds field nprobe
dynamically.
This happens for any SWIG-wrapped object.
Faiss building blocks: clustering, PCA, quantization
Index IO, cloning and hyper parameter tuning
Threads and asynchronous calls
Inverted list objects and scanners
Indexes that do not fit in RAM
Brute force search without an index
Fast accumulation of PQ and AQ codes (FastScan)
Setting search parameters for one query
Binary hashing index benchmark