Releases: xhluca/bm25s
0.2.7post1
What's Changed
- Fix query filtering and vocabulary dict by @mossbee in #92 (1/2)
- Fix query filtering and vocabulary dict by @xhluca in #96 (2/2)
- Update corpus.py by @Restodecoca in #102
- Add pypi and pepy badges by @xhluca in #103
Notes
The behavior of tokenizers has changed with respect to the null token. The null token is now added to the vocabulary first rather than at the end, since the previous approach was inconsistent with the general convention that the empty string ("") maps to ID 0. The change is backward compatible: tokenizers work the same way as before, but expect tokenizers from before 0.2.7 to differ from tokenizers in 0.2.7 and later in this behavior; both will still work with the retriever object.
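The ordering change can be illustrated with a plain-Python sketch (an illustration of the convention only, not the library's internal code; `build_vocab` is a hypothetical helper name):

```python
def build_vocab(tokens, null_token=""):
    """Build a token-to-id mapping with the null token assigned first.

    Since 0.2.7 the null token gets id 0; previously it was appended
    after all corpus tokens, so its id depended on vocabulary size.
    """
    vocab = {null_token: 0}  # "" maps to 0, matching the general standard
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

vocab = build_vocab(["cat", "dog", "cat"])  # {"": 0, "cat": 1, "dog": 2}
```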
New Contributors
- @Restodecoca made their first contribution in #102
Full Changelog: 0.2.6...0.2.7
0.2.7pre3
What's Changed
- Update corpus.py by @Restodecoca in #102
New Contributors
- @Restodecoca made their first contribution in #102
Full Changelog: 0.2.7pre2...0.2.7pre3
0.2.7pre2
Full Changelog: 0.2.7pre1...0.2.7pre2
0.2.7pre1
What's Changed
Notes
- The behavior of tokenizers has changed with respect to the null token. The null token is now added to the vocabulary first rather than at the end, since the previous approach was inconsistent with the general convention that the empty string ("") maps to ID 0. The change is backward compatible: tokenizers work the same way as before, but expect tokenizers from before 0.2.7 to differ from tokenizers in 0.2.7 and later in this behavior; both will still work with the retriever object.
Full Changelog: 0.2.6...0.2.7
0.2.6
0.2.5
0.2.4
What's Changed
- Fix crash tokenizing with empty word_to_id by @mgraczyk in #72
- Create nltk_stemmer.py by @aflip in #77
- aa31a23: improve the handling of unknown tokens during tokenization and retrieval, enhance error handling, and improve logging for easier debugging.
  - `bm25s/__init__.py`: add a check in the `get_scores_from_ids` method that raises a `ValueError` if `max_token_id` exceeds the number of tokens in the index; handle empty queries in the `_get_top_k_results` method by returning zero scores for all documents.
  - `bm25s/tokenization.py`: fix the behavior of `streaming_tokenize` to correctly handle adding new tokens and updating `word_to_id`, `word_to_stem`, and `stem_to_sid`.
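The two guards described for the scoring path can be sketched in plain Python (simplified names and signature; not the library's actual implementation):

```python
def get_scores_from_ids(token_ids, n_tokens, n_docs, score_fn):
    """Score documents for a query given as a list of token ids.

    Mirrors the two fixes described above: reject out-of-range ids
    with a ValueError, and return zero scores for empty queries.
    """
    if token_ids and max(token_ids) >= n_tokens:
        raise ValueError(
            f"max token id {max(token_ids)} exceeds index size {n_tokens}"
        )
    if not token_ids:
        # Empty query: zero scores for all documents instead of crashing.
        return [0.0] * n_docs
    return [score_fn(doc, token_ids) for doc in range(n_docs)]
```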
Full Changelog: 0.2.3...0.2.4
0.2.3
0.2.2
- Improve README with example of memory usage optimization
- Add a `Results.merge` method allowing merging a list of results
- Make `get_max_memory_usage` compatible with macOS
- Add `BM25.load_scores`, which allows loading only the scores of the object
- Add a `load_vocab` parameter, set to `True` by default in `BM25.load`, allowing the vocabulary to not always be loaded

PR: #63
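Conceptually, merging result lists keeps the global top-k across them by score. A minimal sketch of that idea (not the actual `Results.merge` implementation, whose signature and return type may differ):

```python
import heapq

def merge_results(results_list, k):
    """Merge several (doc_id, score) result lists into one global top-k.

    Each element of results_list is a list of (doc_id, score) pairs,
    e.g. from separate index shards; output is sorted by descending score.
    """
    all_hits = [hit for shard in results_list for hit in shard]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])

shard_a = [("d1", 3.2), ("d2", 1.1)]
shard_b = [("d3", 2.5)]
merged = merge_results([shard_a, shard_b], k=2)  # [("d1", 3.2), ("d3", 2.5)]
```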
Full Changelog: 0.2.1...0.2.2
v0.2.1
- Add `Tokenizer.save_vocab` and `Tokenizer.load_vocab` methods to save/load the vocabulary to a JSON file called `vocab.tokenizer.json` by default
- Add `Tokenizer.save_stopwords` and `Tokenizer.load_stopwords` methods to save/load stopwords to a JSON file called `stopwords.tokenizer.json` by default
- Add a `TokenizerHF` class to allow saving/loading from the Hugging Face Hub
- New functions: `load_vocab_from_hub`, `save_vocab_to_hub`, `load_stopwords_from_hub`, `save_stopwords_to_hub`

New tests and examples were added (see `examples/index_to_hf.py` and `examples/tokenizer_class.py`)