All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
bicleaner-ai-sample
command to sample training data from multiple files/language pairs.
- Restrict
numpy
version to <2.
- Update hardrules to 2.10.4.
- New multilingual full models:
large-en-xx
,es-xx
,large-es-xx
. - Options to overwrite source or target language in model config.
- Generate training data with separated command.
- More logging messages during noise generation.
- Compress noise generation intermediate files.
- Multilingual models training documentation.
- Show a warning when no GPU/TPU has been detected.
- Show a warning if some layers do not correctly.
- Huge improvements in accuracy multilingual full models.
- Updated mutilingual full model
en-xx
. - Upload multilingual full models to HF.
- Disable hardrules that need lang parameter when using multilingual model if language has not been overwritten.
- Optional
parallel_train
,parallel_valid
and noise generation inbicleaner_train
. - Test classify full with HF model.
- Re-organize documentation.
- Updated Tensorflow.
- Support for Python 3.11.
- Updated Transformers to 4.36.
- Always English being used in Tokenizer during noise generation.
- Noise generation failing when
freq_noise
was disabled. - Accidentally deleting generated valid file at the end of training.
- Fix compatibility with Hardrules metadata dict.
bicleaner-ai-download
quiet mode.
- Update hardrules to 2.9.0, now hardrules accepts HF identifiers to load metadata.
- Full models download from HF now accept a local path to store the model instead of using HF cache.
- Update HuggingFace Transformers and Hub.
- Update documentation about downloading models and managing HF cache.
- Use a single worker to download from HF, fixes timeout errors during download.
- Fix '--rules_config' parameter.
- Fix HF downloads in some slow docker instances with increased etag timeout.
- Fix builds in PIP>=23 with new Hardrules and FastSpell.
- Support tokenizing by characters (useful for Chinese).
- CLI option to configure minimum words/tokens to be omited/replaced.
- Refactored Tokenizer class.
- Update HF hub.
- Removed external tokenizer option.
- Improved begginers guide.
- Create only one Tokenizer object per process in noise function.
- Update Hardrules to 2.8.0
- Better coverage of Icelandic langid
- Updated KenLM installation instructions.
- KenLM installation.
- Upload full models to Hugging Face Hub.
- Automatic download of full models.
- Hide Tensorflow and Transformers logging messages in executable scripts.
- Redirect Keras prediction progress bar to stderr.
- Huge memory improvements during training.
- Speed improvements using pading
longest
instead ofmax_length
- Models are more insensitive to the presence of capital letter at the start of the sentence.
- Improved performance on HBS Cyrillic transliterating in models which had poor training on cyrillic text.
- Basic test suite.
- Allow changing the base model for XLMR. Any XLMRoberta model can be used.
- Migrate to
pyproject.toml
andsrc/
tree structure, comply with PEP517, PEP518 and PEP621. - Update to Hardrules 2.6
- Rules can be parametrized with
--rules_config config.yaml
- Some rules have been refactored with better names.
--run_all_rules
mode to run each rule instead of stoppping at first discard- Language identification with FastSpell
- Better Serbo-Croatian and Slovene language detection.
- Easier installation! Now KenLM comes pre-compiled.
- Rules can be parametrized with
- Now BICLEANER_AI_THREADS environment variable controls the number of threads.
- Update HF Transformers.
- Update TensorFlow minimum version.
- Removed
glove-python
dependency and use own custom compilation. - Improved download scripts, easier to install and use.
- Set inter/intra_op parallelism to 0 by default.
- Block size by default to 10k, a bit faster.
- Faster noise generation for small datasets with lower block size.
- Model argument can be provided with or without 'metadata.yaml'.
- Add citation info to README.
- Avoid generating empty sentences in omit noise.
- Restore capital letters at the beggining of the sentennce in frequency noise.
- Retrocompatibility with older models.
- Compatibility of
glove
with Python>=3.7. - Fix loading lite models in other Python versions than 3.8.
- Fix unbound variable
lm_stats
. - Other minor fixes.
- Update hardrules to 1.2: adds score only mode.
- Bicleaner train changes:
- Separate most of the training logic in the BaseModel class.
- Re-factor synthetic noise build function.
- Parallelize synthetic noise generation.
- Add fuzzy matching noise and neighbour noise.
- Add Decomposable Attention model.
- Add Transkformer-like model.
- Add XLMRoberta model.
- Bicleaner classify changes:
- Change old classifier by new neural models.
- Move hardrules into a separate package.