Releases: xhluca/bm25s

0.2.7post1

16 Jan 05:16
bff5ad3

What's Changed

Notes

The behavior of tokenizers has changed with respect to the null token. The null token is now added to the vocabulary first rather than at the end, since the previous approach was inconsistent with the common convention that the empty string "" maps to ID 0. The change is backward compatible in that tokenizers continue to work with the retriever object as before; however, vocabularies built before 0.2.7 will order their tokens differently from those built with 0.2.7 and later.
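
A minimal sketch of the ordering change, using plain dicts rather than the actual Tokenizer internals:

```python
# Illustrative only: plain dicts standing in for the tokenizer's vocabulary.

# Before 0.2.7, the null token "" was appended after all real tokens:
old_vocab = {"cat": 0, "dog": 1, "": 2}

# From 0.2.7 on, "" is inserted first, matching the convention that the
# empty string maps to ID 0:
new_vocab = {"": 0, "cat": 1, "dog": 2}

assert new_vocab[""] == 0  # holds for vocabularies built with 0.2.7 and later
```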

Full Changelog: 0.2.6...0.2.7

0.2.7pre3

15 Jan 19:26
813fcdf
Pre-release

Full Changelog: 0.2.7pre2...0.2.7pre3

0.2.7pre2

09 Jan 03:05
ec0bcff
Pre-release

Full Changelog: 0.2.7pre1...0.2.7pre2

0.2.7pre1

29 Dec 01:18
6dfb6ce
Pre-release

What's Changed

Notes

  • The behavior of tokenizers has changed with respect to the null token. The null token is now added to the vocabulary first rather than at the end, since the previous approach was inconsistent with the common convention that the empty string "" maps to ID 0. The change is backward compatible in that tokenizers continue to work with the retriever object as before; however, vocabularies built before 0.2.7 will order their tokens differently from those built with 0.2.7 and later.

Full Changelog: 0.2.6...0.2.7

0.2.6

23 Dec 23:01
ce8f886

What's Changed

  • Extending to Non-ASCII characters with corpora loading and saving by @IssacXid in #93

Full Changelog: 0.2.5...0.2.6

0.2.5

26 Nov 17:00
c4fef24

What's Changed

  • Update README.md by @xhluca in #83
  • Added support for saving and loading non ASCII chars in corpus and vocab by @IssacXid in #86
  • Update README.md by @mrisher in #87

Full Changelog: 0.2.4...0.2.5

0.2.4

13 Nov 22:46
8b5ff10

What's Changed

  • Fix crash tokenizing with empty word_to_id by @mgraczyk in #72
  • Create nltk_stemmer.py by @aflip in #77

Commit aa31a23 focuses on improving the handling of unknown tokens during tokenization and retrieval, enhancing error handling, and improving logging for easier debugging:

  • bm25s/__init__.py: Added checks in the get_scores_from_ids method to raise a ValueError if max_token_id exceeds the number of tokens in the index (sketched below). Enhanced handling of empty queries in the _get_top_k_results method by returning zero scores for all documents.
  • bm25s/tokenization.py: Fixed the behavior of streaming_tokenize to correctly handle the addition of new tokens and the updating of word_to_id, word_to_stem, and stem_to_sid.
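
A standalone sketch of the two guards described above, with hypothetical names and a dense NumPy score matrix standing in for the library's internal structures (the real checks live inside bm25s):

```python
import numpy as np

def get_scores_from_ids_sketch(query_ids, scores, n_vocab):
    """Hypothetical standalone version of the new guards."""
    n_docs = scores.shape[0]
    if len(query_ids) == 0:
        # Empty query: return zero scores for every document instead of crashing.
        return np.zeros(n_docs, dtype=np.float32)
    max_token_id = max(query_ids)
    if max_token_id >= n_vocab:
        # An unseen token ID (e.g. minted after indexing) would index out of
        # bounds, so fail loudly with a ValueError instead.
        raise ValueError(
            f"Token ID {max_token_id} exceeds the number of tokens in the index."
        )
    return scores[:, query_ids].sum(axis=1)

# Usage: 2 documents, vocabulary of 3 tokens.
scores = np.array([[0.5, 0.0, 0.2], [0.0, 0.3, 0.1]], dtype=np.float32)
print(get_scores_from_ids_sketch([0, 2], scores, n_vocab=3))  # [0.7 0.1]
print(get_scores_from_ids_sketch([], scores, n_vocab=3))      # [0. 0.]
```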

Full Changelog: 0.2.3...0.2.4

0.2.3

18 Oct 20:59
e06ecf4

What's Changed

  • PR #67 fixes issue #60
  • More test cases for edge cases of the Tokenizer class, such as update_vocab=True in return_as="ids" mode, which can lead to unseen new token IDs being passed to retriever.retrieve (see the sketch below)
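
A hedged sketch of the edge case under test, assuming the Tokenizer class and the return_as / update_vocab parameters named above; exact defaults may differ between versions:

```python
from bm25s.tokenization import Tokenizer

tokenizer = Tokenizer(stopwords="en")
corpus = ["a cat is a feline and likes purring", "a dog is a canine"]

# Indexing pass: the vocabulary is built here.
corpus_ids = tokenizer.tokenize(corpus, return_as="ids")

# Query pass: update_vocab=True would mint IDs the retriever never indexed;
# keeping the vocabulary frozen avoids passing unseen IDs to retrieve().
query_ids = tokenizer.tokenize(["purple cat"], return_as="ids", update_vocab=False)
```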

Full Changelog: 0.2.2...0.2.3

0.2.2

06 Oct 21:04
0296bf6
  • Improve README with an example of memory usage optimization
  • Add a Results.merge method allowing a list of results to be merged
  • Make get_max_memory_usage compatible with macOS
  • Add BM25.load_scores, which allows loading only the scores of the object
  • Add a load_vocab parameter, set to True by default, to BM25.load, allowing the vocabulary to be skipped when it is not needed (see the sketch below)
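
A short sketch of a memory-conscious load that combines these options; "index_dir" is a hypothetical path for an index previously saved with retriever.save, and load_corpus / mmap are pre-existing options shown alongside the new load_vocab parameter:

```python
import bm25s

# Load only what is needed: skip the corpus, memory-map the scores,
# and (new in 0.2.2) optionally skip the vocabulary as well.
retriever = bm25s.BM25.load(
    "index_dir",
    load_corpus=False,  # do not load documents into memory
    mmap=True,          # memory-map score arrays instead of reading them fully
    load_vocab=True,    # set to False when the vocabulary is not needed
)
```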

PR: #63

Full Changelog: 0.2.1...0.2.2

v0.2.1

22 Sep 18:07
1e636a9
  • Add Tokenizer.save_vocab and Tokenizer.load_vocab methods to save/load the vocabulary to/from a JSON file called vocab.tokenizer.json by default
  • Add Tokenizer.save_stopwords and Tokenizer.load_stopwords methods to save/load stopwords to/from a JSON file called stopwords.tokenizer.json by default
  • Add TokenizerHF class to allow saving/loading from the Hugging Face Hub
    • New functions: load_vocab_from_hub, save_vocab_to_hub, load_stopwords_from_hub, save_stopwords_to_hub

New tests and examples were added (see examples/index_to_hf.py and examples/tokenizer_class.py). A minimal save/load round trip is sketched below.
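
A minimal round-trip sketch of the new methods, using the default file names given above; the save_dir argument follows the patterns in the referenced examples and may differ in detail:

```python
from bm25s.tokenization import Tokenizer

tokenizer = Tokenizer(stopwords="en")
tokenizer.tokenize(["a cat is a feline"], return_as="ids")  # builds the vocab

# Writes vocab.tokenizer.json and stopwords.tokenizer.json by default.
tokenizer.save_vocab(save_dir="my_tokenizer")
tokenizer.save_stopwords(save_dir="my_tokenizer")

# Later, in a fresh process: restore the same vocabulary and stopwords.
fresh = Tokenizer(stopwords="en")
fresh.load_vocab(save_dir="my_tokenizer")
fresh.load_stopwords(save_dir="my_tokenizer")
```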