- New algorithm for Fuzzy dedup transform
- Sample notebooks for some of the language transforms
- Integrate Semantic profiler and report generation for code profiler transform
- Increase ray agent limit to 10,000 (default was 100)
- Fuzzy dedup new algorithm for Python, Ray and Spark
- Update RAG example to use granite model
- Updated transforms with Docling 2
- Added single package for dpk with extra for [spark] and [ray]
- Added single package for transforms with extra for [all] or [individual-transform-name]
- Fix metadata logging even when actors crash
- Add multilock for ray workers downloads/cleanup
- Multiple updates to spark runtime
- Added support for python 3.12
- Refactoring of data access code
- Modify super pipeline params type (Str/JSON)
- Set kuberay apiserver version
- Add Super pipeline for code transforms
- Enhance pdf2parquet with docling2 support for extracting HTML, DOCS, etc.
- Added web2parquet transform
- Added HAP transform
- Enhanced parameters/configuration so the user can customize crawler settings
- Implement subdomain focus feature in data-prep-connector
- Bug fixes across the repo
- Minor enhancements and experimentation with single packaging techniques using [extra]
- Decoupled the release process for each component so we can be more responsive to the needs of our stakeholders
- For every new release of the data-prep-toolkit, the minor digit of all components is incremented and the patch digit is reset to 0
- The patch digit of any one component can be increased independently of the other components' patch numbers
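
The versioning rule described in the items above can be illustrated with a short sketch. It assumes plain `major.minor.patch` version strings; `bump_minor` and `bump_patch` are hypothetical helper names used only for illustration, and the toolkit's actual release scripts may work differently.

```python
def parse(version: str) -> tuple[int, int, int]:
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_minor(version: str) -> str:
    """New toolkit release: increment the minor digit, reset the patch digit to 0."""
    major, minor, _ = parse(version)
    return f"{major}.{minor + 1}.0"


def bump_patch(version: str) -> str:
    """Independent fix to a single component: increment only its patch digit."""
    major, minor, patch = parse(version)
    return f"{major}.{minor}.{patch + 1}"
```

Under this scheme, two components can share the same minor digit while carrying different patch digits between toolkit releases.
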
- Released first version of the data-prep-toolkit-connector for crawling web sites and downloading HTML and PDF files for ingestion by the pipeline
- Bug fixes across the repo
- Added AI Alliance RAG demo, tutorials and notebooks and tips for running on google colab
- Added new transforms and single package for transforms published to pypi
- Improved CI/CD with targeted workflow triggered on specific changes to specific modules
- New enhancements for cutting a release
- Restructure the repository to distinguish/separate runtime libraries
- Split data-processing-lib/ray into python and ray
- Spark runtime
- Updated pyarrow version
- Define required transform() method as abstract in AbstractTableTransform
- Enable configuration of makefile to use src or pypi for data-prep-kit library dependencies
- Add a configurable timeout before destroying the deployed Ray cluster
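
As a rough illustration of what declaring `transform()` abstract implies, the sketch below uses Python's `abc` module. `TableTransformSketch` and `NoopSketch` are hypothetical stand-ins, and the method signature (a list of row dictionaries in, rows plus metadata out) is an assumption chosen to keep the example dependency-free; it is not the library's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any

Rows = list[dict[str, Any]]


class TableTransformSketch(ABC):
    """Hypothetical stand-in for a base class with a required transform() method."""

    def __init__(self, config: dict[str, Any]):
        self.config = config

    @abstractmethod
    def transform(self, table: Rows) -> tuple[Rows, dict[str, Any]]:
        """Subclasses must return the transformed rows plus a metadata dictionary."""


class NoopSketch(TableTransformSketch):
    """Minimal concrete subclass: passes rows through and reports a row count."""

    def transform(self, table: Rows) -> tuple[Rows, dict[str, Any]]:
        return table, {"rows": len(table)}
```

With the method marked abstract, instantiating the base class directly (or a subclass that forgets to implement `transform()`) raises `TypeError` at construction time rather than failing later during processing.
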
- Added 7 new transforms, including language identification, profiler, repo level ordering, doc quality, pdf2parquet, HTML2Parquet and PII Transform
- Added ededup python implementation and incremental ededup
- Added fuzzy floating point comparison
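
For readers unfamiliar with fuzzy floating point comparison, a minimal sketch using the standard library's `math.isclose` follows. The helper name `floats_match` and the tolerance values are assumptions made for illustration; the toolkit's own comparison logic and tolerances are not specified here.

```python
import math


def floats_match(expected: float, actual: float,
                 rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Compare two floats within a tolerance instead of requiring exact equality."""
    return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)
```

Exact equality fails for values such as `0.1 + 0.2` and `0.3` because of binary rounding, whereas a tolerance-based check treats them as equal.
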
- Many bug fixes across the repo, plus the following specifics.
- Enhanced CI/CD and makefile improvements, including definition of top-level targets (clean, set-versions, build, publish, test)
- Automation of release process branch/tag management
- Documentation improvements
- Split libraries into 3 runtime-specific implementations
- Fix missing final count of processed and add percentages
- Improved fault tolerance in python and ray runtimes
- Report global DataAccess retry metric
- Support for binary data transforms
- Updated Ray version to 2.24
- Updated PyArrow version to 16.1.0
- Add KFP V2 support
- Create a distinct (timestamped) execution.log file for each retry
- Support for multiple inputs/outputs
- Added language/lang_id - detects language in documents
- Added universal/profiler - counts words/tokens in documents
- Converted ingest2parquet tool to transform named code2parquet
- Split transforms, as appropriate, into python, ray and/or spark.
- Added spark implementations of filter, doc_id and noop transforms.
- Switch from using requirements.txt to pyproject.toml file for each transform runtime
- Repository restructured to move kfp workflow definitions to associated transform project directory