Learned transformer for String indexing #215

Closed
wants to merge 8 commits into from

Conversation

Jiaweihu08
Member

@Jiaweihu08 Jiaweihu08 commented Sep 5, 2023

Description

This PR explores a machine-learning approach to String column indexing by conceptualizing the task of ordering String values lexicographically as learning a Cumulative Distribution Function (CDF) over their ranks. The goal is to place strings that are lexicographically close in neighbouring cubes, so that file-level min/max ranges become narrower.

The idea is explored more thoroughly in this paper.

Training and indexing

  • Extract the unique values from the given column and sort them lexicographically
  • Create labels as the ranks of the sorted unique values
  • Encode strings by converting each character to its ASCII value, creating fixed-size numeric sequences: shorter strings are padded and longer ones are truncated
  • Train a regressor to predict string ranks from their vector encodings
  • Deploy the model as a Transformer / Transformation and index the inferred rank value
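
The data-preparation steps above can be sketched as follows. This is a minimal illustration in Python, not the code in this PR: the function names `encode` and `make_training_set`, the default length of 8, the padding value 0, and the scaling of ranks to [0, 1] are all assumptions made for the example. It also uses `ord`, which returns Unicode code points and matches ASCII values only for ASCII input.

```python
def encode(value: str, length: int = 8, pad: int = 0) -> list:
    """Map a string to a fixed-size list of character codes (hypothetical helper)."""
    codes = [ord(ch) for ch in value[:length]]     # truncate longer strings
    return codes + [pad] * (length - len(codes))   # pad shorter strings


def make_training_set(column, length: int = 8):
    """Return (sorted unique values, features, labels) for a column of strings."""
    unique_sorted = sorted(set(column))            # lexicographic order by code point
    n = len(unique_sorted)
    # Assumption: ranks are scaled to [0, 1] so the labels read as empirical CDF values.
    labels = [i / (n - 1) if n > 1 else 0.0 for i in range(n)]
    features = [encode(v, length) for v in unique_sorted]
    return unique_sorted, features, labels


values, features, labels = make_training_set(["b", "a", "b", "c"], length=4)
# values   -> ['a', 'b', 'c']
# features -> [[97, 0, 0, 0], [98, 0, 0, 0], [99, 0, 0, 0]]
# labels   -> [0.0, 0.5, 1.0]
```

Any regressor can then be fit on `features` and `labels`; the inferred value for a new string is obtained by encoding it the same way and calling the trained model.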

Reading and filtering

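Relying on model predictions to filter strings will lead to false negatives, since the predicted ranks are not guaranteed to preserve lexicographic order. However, because the model is deterministic, the same string always maps to the same predicted value, so equalTo can be safely implemented.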
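
A small sketch of why equality is safe while range filters are not. Everything here is hypothetical: `toy_model` is a deliberately imperfect stand-in for a trained regressor, and `build_file_ranges` / `files_for_equal_to` are illustrative names, not part of this PR or of any library.

```python
def toy_model(value: str) -> float:
    """Hypothetical stand-in for a trained regressor: deterministic but imperfect."""
    return ord(value[0]) + 0.01 * len(value) if value else 0.0


def build_file_ranges(files, model=toy_model):
    """Per-file (min, max) of the predicted value, as an index might store it."""
    ranges = {}
    for name, values in files.items():
        preds = [model(v) for v in values]
        ranges[name] = (min(preds), max(preds))
    return ranges


def files_for_equal_to(literal, file_ranges, model=toy_model):
    """Keep only the files whose predicted-value interval contains model(literal)."""
    p = model(literal)
    return [name for name, (lo, hi) in file_ranges.items() if lo <= p <= hi]


files = {"f1": ["aaa", "ab"], "f2": ["b", "ba"]}
ranges = build_file_ranges(files)

# Equality: the literal gets exactly the prediction it had at indexing time,
# so the file that contains it is always selected.
selected = files_for_equal_to("ab", ranges)   # ['f1']

# Range: "aaa" <= "ab" lexicographically, but the model ranks "aaa" higher,
# so translating `value <= "ab"` into `pred <= model("ab")` would miss "aaa".
misordered = "aaa" <= "ab" and toy_model("aaa") > toy_model("ab")   # True
```

Equality pruning can still return files that do not contain the literal (false positives), which only costs extra reads; the range case is the one that can drop matching rows.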

Other possibilities can be explored, e.g. adding the file-level min/max of the String column to the cube block metadata.

TODOs:

  • New feature / bug fix has been committed following the Contribution guide
  • Add comments to the code (make it easier for the community!)
  • Add tests
  • Your branch is updated to the main branch (dependent changes have been merged)
  • Benchmark the overhead
  • Investigate spark and parquet file pruning
  • Learning API(?) Model loading and saving(?)
  • More robust default model
  • xgboost4j for Mac (Apple Silicon)
  • Update documentation

Test Configuration:

  • Spark Version: 3.3.0
  • Hadoop Version: 3.3.4

@Jiaweihu08 Jiaweihu08 added the type: enhancement Improvement of existing feature or code label Sep 5, 2023
@cdelfosse
Contributor

To be closed

@osopardo1
Member

Let's close this one!

@Jiaweihu08 I give you the honor 🗝️

Replaced by solution with Histograms in #221

@Jiaweihu08 Jiaweihu08 closed this Oct 25, 2023