Releases: xhluca/bm25s
0.2.7post1
What's Changed
- Fix query filtering and vocabulary dict by @mossbee in #92 (1/2)
- Fix query filtering and vocabulary dict by @xhluca in #96 (2/2)
- Update corpus.py by @Restodecoca in #102
- Add pypi and pepy badges by @xhluca in #103
Notes
The behavior of tokenizers has changed with respect to the null token. The null token is now added to the vocabulary first rather than at the end, since the previous approach was inconsistent with the general convention that the empty string ("") maps to ID 0. The change is backward compatible: tokenizers work the same way as before, but expect tokenizers from before 0.2.7 to differ from tokenizers in 0.2.7 and later in this behavior; both will still work with the retriever object.
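The ordering change can be illustrated with a plain-Python sketch (an illustration of the convention only, not the library's internal code; `build_vocab` is a hypothetical helper name):

```python
def build_vocab(tokens, null_token=""):
    """Build a token-to-id mapping with the null token assigned first.

    Since 0.2.7 the null token gets id 0; previously it was appended
    after all corpus tokens, so its id depended on vocabulary size.
    """
    vocab = {null_token: 0}  # "" maps to 0, matching the general standard
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

vocab = build_vocab(["cat", "dog", "cat"])  # {"": 0, "cat": 1, "dog": 2}
```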
New Contributors
- @Restodecoca made their first contribution in #102
Full Changelog: 0.2.6...0.2.7
0.2.7pre3
What's Changed
- Update corpus.py by @Restodecoca in #102
New Contributors
- @Restodecoca made their first contribution in #102
Full Changelog: 0.2.7pre2...0.2.7pre3
0.2.7pre2
Full Changelog: 0.2.7pre1...0.2.7pre2
0.2.7pre1
What's Changed
Notes
- The behavior of tokenizers has changed with respect to the null token. The null token is now added to the vocabulary first rather than at the end, since the previous approach was inconsistent with the general convention that the empty string ("") maps to ID 0. The change is backward compatible: tokenizers work the same way as before, but expect tokenizers from before 0.2.7 to differ from tokenizers in 0.2.7 and later in this behavior; both will still work with the retriever object.
Full Changelog: 0.2.6...0.2.7
0.2.6
0.2.5
0.2.4
What's Changed
- Fix crash tokenizing with empty word_to_id by @mgraczyk in #72
- Create nltk_stemmer.py by @aflip in #77
- aa31a23: improve the handling of unknown tokens during tokenization and retrieval, enhance error handling, and improve logging for easier debugging.
  - `bm25s/__init__.py`: add a check in the `get_scores_from_ids` method that raises a `ValueError` if `max_token_id` exceeds the number of tokens in the index; handle empty queries in the `_get_top_k_results` method by returning zero scores for all documents.
  - `bm25s/tokenization.py`: fix the behavior of `streaming_tokenize` to correctly handle adding new tokens and updating `word_to_id`, `word_to_stem`, and `stem_to_sid`.
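The two guards described for the scoring path can be sketched in plain Python (simplified names and signature; not the library's actual implementation):

```python
def get_scores_from_ids(token_ids, n_tokens, n_docs, score_fn):
    """Score documents for a query given as a list of token ids.

    Mirrors the two fixes described above: reject out-of-range ids
    with a ValueError, and return zero scores for empty queries.
    """
    if token_ids and max(token_ids) >= n_tokens:
        raise ValueError(
            f"max token id {max(token_ids)} exceeds index size {n_tokens}"
        )
    if not token_ids:
        # Empty query: zero scores for all documents instead of crashing.
        return [0.0] * n_docs
    return [score_fn(doc, token_ids) for doc in range(n_docs)]
```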
Full Changelog: 0.2.3...0.2.4
0.2.3
0.2.2
- Improve README with example of memory usage optimization
- Add a `Results.merge` method allowing merging a list of results
- Make `get_max_memory_usage` compatible with macOS
- Add `BM25.load_scores`, which allows loading only the scores of the object
- Add a `load_vocab` parameter, set to `True` by default in `BM25.load`, allowing the vocabulary to not always be loaded

PR: #63
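Conceptually, merging result lists keeps the global top-k across them by score. A minimal sketch of that idea (not the actual `Results.merge` implementation, whose signature and return type may differ):

```python
import heapq

def merge_results(results_list, k):
    """Merge several (doc_id, score) result lists into one global top-k.

    Each element of results_list is a list of (doc_id, score) pairs,
    e.g. from separate index shards; output is sorted by descending score.
    """
    all_hits = [hit for shard in results_list for hit in shard]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])

shard_a = [("d1", 3.2), ("d2", 1.1)]
shard_b = [("d3", 2.5)]
merged = merge_results([shard_a, shard_b], k=2)  # [("d1", 3.2), ("d3", 2.5)]
```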
Full Changelog: 0.2.1...0.2.2
v0.2.1
- Add `Tokenizer.save_vocab` and `Tokenizer.load_vocab` methods to save/load the vocabulary to a JSON file called `vocab.tokenizer.json` by default
- Add `Tokenizer.save_stopwords` and `Tokenizer.load_stopwords` methods to save/load stopwords to a JSON file called `stopwords.tokenizer.json` by default
- Add a `TokenizerHF` class to allow saving/loading from the Hugging Face Hub
- New functions: `load_vocab_from_hub`, `save_vocab_to_hub`, `load_stopwords_from_hub`, `save_stopwords_to_hub`

New tests and examples were added (see `examples/index_to_hf.py` and `examples/tokenizer_class.py`)