Data Prep Kit Release notes

Release 0.2.3 - 12/15/2024

General

  1. New algorithm for Fuzzy dedup transform
  2. Sample notebooks for some of the language transforms
  3. Integrate Semantic profiler and report generation for code profiler transform

data-prep-toolkit libraries (python, ray, spark)

  1. Increase ray agent limit to 10,000 (default was 100)

Transforms

  1. Fuzzy dedup new algorithm for Python, Ray and Spark

Release 0.2.2 - 11/25/2024

General

  1. Update RAG example to use granite model
  2. Updated transforms with Docling 2
  3. Added single package for dpk with extra for [spark] and [ray]
  4. Added single package for transforms with extra for [all] or [individual-transform-name]

data-prep-toolkit libraries (python, ray, spark)

  1. Fix metadata logging even when actors crash
  2. Add multilock for ray workers downloads/cleanup
  3. Multiple updates to spark runtime
  4. Added support for python 3.12
  5. Refactored data access code

KFP Workloads

  1. Modify superpipeline params type Str/json
  2. Set kuberay apiserver version
  3. Add Super pipeline for code transforms

Transforms

  1. Enhance pdf2parquet with docling2 support for extracting HTML, DOCS, etc.
  2. Added web2parquet transform
  3. Added HAP transform

HTTP Connector 0.2.3

  1. Enhanced parameters/configuration allow the user to customize crawler settings
  2. Implemented subdomain focus feature in data-prep-connector

Release 0.2.2 - HTTP Connector Module - 10/23/2024

General

  1. Bug fixes across the repo
  2. Minor enhancements and experimentation with single packaging techniques using [extra]
  3. Decoupled the release process for each component so we can be more responsive to the needs of our stakeholders
  4. The minor digit of the release is incremented for all components, and the patch digit is reset to 0, for every new release of the data-prep-toolkit
  5. The patch digit of any one component's release can be increased independently of the other components' patch numbers

data-prep-toolkit-Connector

  1. Released first version of the data-prep-toolkit-connector for crawling web sites and downloading HTML and PDF files for ingestion by the pipeline

Release 0.2.1 - 9/24/2024

General

  1. Bug fixes across the repo
  2. Added AI Alliance RAG demo, tutorials and notebooks and tips for running on google colab
  3. Added new transforms and single package for transforms published to pypi
  4. Improved CI/CD with targeted workflow triggered on specific changes to specific modules
  5. New enhancements for cutting a release

data-prep-toolkit libraries (python, ray, spark)

  1. Restructure the repository to distinguish/separate runtime libraries
  2. Split data-processing-lib/ray into python and ray
  3. Spark runtime
  4. Updated pyarrow version
  5. Define the required transform() method as abstract in AbstractTableTransform
  6. Enable configuration of makefile to use src or pypi for data-prep-kit library dependencies
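
As an illustration of what declaring transform() abstract means in practice, the following is a minimal sketch that uses only the Python standard library. `TableTransformSketch`, `NoopSketch`, `IncompleteSketch` and `can_instantiate` are hypothetical names introduced here, and the simplified signature (a list of row dicts instead of a pyarrow table) is an assumption, not the library's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any

Row = dict[str, Any]


class TableTransformSketch(ABC):
    """Hypothetical stand-in for AbstractTableTransform (not the library class)."""

    def __init__(self, config: dict[str, Any]):
        self.config = config

    @abstractmethod
    def transform(self, table: list[Row]) -> tuple[list[list[Row]], dict[str, Any]]:
        """Return the output tables plus a metadata dictionary."""


class NoopSketch(TableTransformSketch):
    """Implements transform(), so it can be instantiated."""

    def transform(self, table: list[Row]) -> tuple[list[list[Row]], dict[str, Any]]:
        return [table], {"rows": len(table)}


class IncompleteSketch(TableTransformSketch):
    """Forgets to implement transform(); instantiation raises TypeError."""


def can_instantiate(cls: type) -> bool:
    """Report whether cls can be constructed with an empty config."""
    try:
        cls({})
        return True
    except TypeError:
        return False
```

With the method marked abstract, a subclass that omits transform() fails when it is constructed rather than later, when the runtime first calls it.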

KFP Workloads

  1. Add a configurable timeout before destroying the deployed Ray cluster.

Transforms

  1. Added 7 new transforms: language identification, profiler, repo level ordering, doc quality, pdf2parquet, HTML2Parquet and PII Transform
  2. Added ededup python implementation and incremental ededup
  3. Added fuzzy floating point comparison
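
The notes do not say how the fuzzy floating point comparison is implemented. As a rough sketch of the general technique only, the following compares two rows of values and allows a small tolerance on floats, using the standard library's `math.isclose`. The function name `rows_match` and the default tolerances are assumptions made for illustration.

```python
import math
from typing import Any, Sequence


def rows_match(
    expected: Sequence[Any],
    actual: Sequence[Any],
    rel_tol: float = 1e-9,   # assumed default, not taken from the toolkit
    abs_tol: float = 1e-12,  # assumed default, not taken from the toolkit
) -> bool:
    """Compare two rows; floats may differ within the given tolerances."""
    if len(expected) != len(actual):
        return False
    for exp, act in zip(expected, actual):
        if isinstance(exp, float) and isinstance(act, float):
            # Tolerant comparison for floating point values.
            if not math.isclose(exp, act, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
        elif exp != act:
            # Exact comparison for everything else.
            return False
    return True
```

A comparison of this kind helps when expected outputs are checked against values that went through floating point arithmetic, where `0.1 + 0.2 == 0.3` is false under exact equality.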

Release 0.2.0 - 6/27/2024

General

  1. Many bug fixes across the repo, plus the following specifics.
  2. CI/CD and makefile improvements, including definition of top-level targets (clean, set-versions, build, publish, test)
  3. Automation of release process branch/tag management
  4. Documentation improvements

data-prep-toolkit libraries (python, ray, spark)

  1. Split libraries into 3 runtime-specific implementations
  2. Fix missing final count of processed and add percentages
  3. Improved fault tolerance in python and ray runtimes
  4. Report global DataAccess retry metric
  5. Support for binary data transforms
  6. Updated Ray version to 2.24
  7. Updated to PyArrow version 16.1.0

KFP Workloads

  1. Add KFP V2 support
  2. Create a distinct (timestamped) execution.log file for each retry
  3. Support for multiple inputs/outputs

Transforms

  1. Added language/lang_id - detects language in documents
  2. Added universal/profiler - counts words/tokens in documents
  3. Converted ingest2parquet tool to transform named code2parquet
  4. Split transforms, as appropriate, into python, ray and/or spark.
  5. Added spark implementations of filter, doc_id and noop transforms.
  6. Switch from using requirements.txt to pyproject.toml file for each transform runtime
  7. Repository restructured to move kfp workflow definitions to associated transform project directory

Release 0.1.1 - 5/24/2024

Release 0.1.0 - 5/15/2024

Release 0.1.0 - 5/08/2024