Learned transformer for String indexing #215

Jiaweihu08 · 2023-09-05T15:09:43Z

Description

This PR explores an ML approach to String column indexing by conceptualizing the task of ordering String values lexicographically as learning a cumulative distribution function (CDF) over their ranks. The goal is to place strings that are lexicographically close in neighbouring cubes in order to narrow file-level min/max ranges.

The idea is explored more thoroughly in this paper.

Training and indexing

Extract the unique values from the given column and sort them lexicographically
Use the rank of each sorted unique value as its label
Encode each string by converting its characters to their ASCII values, producing a fixed-size numeric sequence: shorter strings are padded and longer ones are truncated
Train a regressor to predict string ranks from their vector encodings
Deploy the model as a Transformer / Transformation and index the inferred rank value
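The steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from this PR: the names `encode`, `rank_labels`, `train_linear` and `predict`, the encoding size of 8, the zero padding value, the clamping of non-ASCII characters, the rank normalization to [0, 1] and the choice of a linear model trained by gradient descent are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def encode(value: str, size: int = 8, pad: int = 0) -> List[float]:
    """Map a string to a fixed-size vector of scaled ASCII codes."""
    # Truncate long strings; clamp non-ASCII characters to 127 (assumption).
    codes = [min(ord(ch), 127) for ch in value[:size]]
    # Pad short strings up to the fixed size.
    codes += [pad] * (size - len(codes))
    return [c / 127.0 for c in codes]


def rank_labels(values: Sequence[str]) -> Dict[str, float]:
    """Label each unique value with its lexicographic rank, scaled to [0, 1]."""
    unique = sorted(set(values))
    denom = max(len(unique) - 1, 1)
    return {v: i / denom for i, v in enumerate(unique)}


def train_linear(xs: List[List[float]], ys: List[float],
                 epochs: int = 2000, lr: float = 0.05) -> Tuple[List[float], float]:
    """Fit a linear regressor with full-batch gradient descent on squared error."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model: Tuple[List[float], float], value: str, size: int = 8) -> float:
    """Infer the (approximate) normalized rank that would be indexed."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, encode(value, size))) + b


# Hypothetical column values, used only for illustration.
values = ["banana", "apple", "cherry", "apple", "date"]
labels = rank_labels(values)
unique = sorted(labels)
model = train_linear([encode(v) for v in unique], [labels[v] for v in unique])
```

Calling `predict(model, "apple")` then yields the value that would be indexed in place of the raw string. A real regressor would likely be more expressive than a linear model; the sketch only shows how the encoding, labels and inference fit together.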

Reading and filtering

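Relying on model predictions to filter strings can lead to false negatives, since predicted ranks need not preserve lexicographic order exactly. However, because the models are deterministic, the same string always maps to the same predicted rank, so equalTo can be implemented safely.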
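A small sketch of the difference between the two kinds of filter. `toy_rank` is a hypothetical stand-in for a trained model, chosen only because it is deterministic but not order-preserving; it is not the model from this PR, and the helper names are likewise made up for the example.

```python
def toy_rank(value: str) -> float:
    """Stand-in model: deterministic, but NOT guaranteed to preserve order."""
    codes = [ord(ch) for ch in value[:4]]
    return sum(codes) / len(codes) if codes else 0.0


def may_match_equal(stored_rank: float, literal: str) -> bool:
    """Equality filter: compare the stored rank with the literal's rank."""
    return stored_rank == toy_rank(literal)


def range_on_ranks(stored_rank: float, low: str, high: str) -> bool:
    """Range filter evaluated on predicted ranks instead of on the strings."""
    return toy_rank(low) <= stored_rank <= toy_rank(high)


stored = toy_rank("az")  # rank written at indexing time for the value "az"

# Equality is safe: the same string always produces the same rank.
equal_hit = may_match_equal(stored, "az")

# "az" lies lexicographically between "a" and "c", yet its predicted rank
# falls outside [toy_rank("a"), toy_rank("c")]: a false negative.
in_string_range = "a" <= "az" <= "c"
range_hit = range_on_ranks(stored, "a", "c")
```

Here `equal_hit` and `in_string_range` are both true while `range_hit` is false, which is the failure mode a range filter based purely on predicted ranks would have to guard against.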

Other possibilities can be explored, e.g. adding the file-level min/max of the string column to the cube block metadata.

TODOs:

Test Configuration:

Spark Version: 3.3.0
Hadoop Version: 3.3.4

cdelfosse · 2023-10-24T08:10:47Z

To be closed

osopardo1 · 2023-10-24T13:34:01Z

Let's close this one!

@Jiaweihu08 I give you the honor 🗝️

Replaced by solution with Histograms in #221

Jiaweihu08 added 4 commits September 1, 2023 16:53

Add learned string transformer and transformation for string indexing

c846711

Add LearnedStringTransformation for point string search

cd631cd

Adjust tests for LearnedStringTransformation

7910950

Move model loading to companion object

0720061

Jiaweihu08 requested a review from cugni September 5, 2023 15:09

Jiaweihu08 added the type: enhancement label Sep 5, 2023

Jiaweihu08 added 4 commits September 13, 2023 11:06

Batch inference for LearnedStringTransformations via transposing matrices

f313845

WIP - Vectorize Transformations

88103b0

WIP, vectorize indexing

5b4869a

WIP, remove Breeze, and use transpose

2d7db48

This was referenced Oct 23, 2023

Add length of encoding to String indexed columns #188

Closed

Filtering by equality of string leads to a wrong traversal of the tree #190

Closed

Jiaweihu08 closed this Oct 25, 2023
