Backport string hist encoding to main #234

alexeiakimov · 2023-11-24T14:04:10Z

Description

This PR introduces a new String Transformation that uses String histograms for value mapping.

(Backport the feature from main-1.0.0)

How does it work?

To map String values to the indexing space, a sorted sequence of distinct String values must be provided.

The sequence is treated as an equal-width histogram for the column to index.

To transform a given String value, we look for its insertion position within the sequence using binary search:

import scala.collection.Searching._

val hist = Vector("a", "b", "c", "d", "e")
val coordinate = hist.search("b").insertionPoint.toDouble / (hist.length - 1) // 0.25

Values that are not contained within the limits of the histogram are mapped to their corresponding extremes, i.e. 0.0 or 1.0.
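
The mapping and the clamping rule above can be put together in one small function. This is a minimal sketch, not the code in this PR: the object name `StringHistogramSketch` and the helper `toCoordinate` are hypothetical, and treating a value that falls between two histogram entries as belonging to the next entry is an assumption derived from the insertion-point description.

```scala
import scala.collection.Searching._

object StringHistogramSketch {

  // Hypothetical helper (not the Qbeast API): maps a String to [0.0, 1.0]
  // using its insertion point in a sorted histogram of distinct values.
  def toCoordinate(hist: IndexedSeq[String], value: String): Double = {
    require(hist.length > 1, "the histogram needs at least two values")
    if (value.compareTo(hist.head) <= 0) 0.0      // at or below the lower limit
    else if (value.compareTo(hist.last) >= 0) 1.0 // at or above the upper limit
    else hist.search(value).insertionPoint.toDouble / (hist.length - 1)
  }
}
```

With `Vector("a", "b", "c", "d", "e")`, `toCoordinate(hist, "b")` gives 0.25, while `"0"` and `"z"` fall outside the limits and give 0.0 and 1.0 respectively.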

How to use?

To use the feature, we build the histogram and provide it as part of columnStats when writing.

The :histogram type must also be specified in columnsToIndex for the String column to index.

Compute the histogram:

Use all distinct String values:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def getAllString(df: DataFrame, columnName: String): String = {
  df
    .select(columnName)
    .distinct()
    .na.drop
    .orderBy(col(columnName).asc)
    .collect()
    .map { r =>
      val s = r.getAs[String](0)
      s"'$s'"
    }
    .mkString("[", ",", "]")
}

val histogram = getAllString(df, "test_col_name")

Use a reduced number of String values:

import org.apache.spark.sql.delta.skipping.MultiDimClusteringFunctions
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, min}

def getStringHistogramStr(df: DataFrame, columnName: String, numBins: Int): String = {
  val binStarts = "__bin_starts"
  val stringPartitionColumn = MultiDimClusteringFunctions.range_partition_id(col(columnName), numBins)

  df
    .select(columnName)
    .distinct()
    .na.drop
    .groupBy(stringPartitionColumn)
    .agg(min(columnName).alias(binStarts))
    .select(binStarts)
    .orderBy(binStarts)
    .collect()
    .map { r =>
      val s = r.getAs[String](0)
      s"'$s'"
    }
    .mkString("[", ",", "]")
}

val histogram = getStringHistogramStr(df, "test_col_name", 100)
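
To see what the reduced histogram contains, here is a Spark-free sketch of the same idea on an in-memory collection: sort the distinct values, split them into at most numBins contiguous groups, and keep the smallest value of each group. The object `BinStartsSketch` and the helper `binStarts` are hypothetical, and the equal-sized grouping is an assumption; Spark's range partitioning works from a sample, so its bin boundaries may differ.

```scala
object BinStartsSketch {

  // Hypothetical local analogue of the Spark snippet above: returns the first
  // (smallest) value of each of at most `numBins` contiguous groups.
  def binStarts(values: Seq[String], numBins: Int): Vector[String] = {
    require(numBins > 0, "numBins must be positive")
    // Drop nulls and duplicates, then sort lexicographically.
    val sorted = values.filter(_ != null).distinct.sorted.toVector
    if (sorted.isEmpty) Vector.empty
    else {
      // Group size is rounded up so that no more than numBins groups are produced.
      val groupSize = math.ceil(sorted.length.toDouble / numBins).toInt
      sorted.grouped(groupSize).map(_.head).toVector
    }
  }
}
```

For example, `binStarts(Seq("d", "a", "c", "b", "e", "f"), 3)` keeps `"a"`, `"c"` and `"e"`. When numBins is larger than the number of distinct values, every value is kept.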

Index the data as follows:

val columnStats = s"""{"test_col_name_histogram":$histogram}"""

df
  .write
  .format("qbeast")
  .option("columnsToIndex", "test_col_name:histogram")
  .option("columnStats", columnStats)
  .save(targetPath)
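
For reference, the columnStats string that results from the snippets above can be inspected without Spark. This sketch only mirrors the string formatting used in this description; the object `ColumnStatsSketch` and its helpers are hypothetical names, and values containing a single quote are not escaped here (how such values should be handled is not covered by this description).

```scala
object ColumnStatsSketch {

  // Formats already sorted values the same way as the helper functions above:
  // each value wrapped in single quotes, joined into a bracketed list.
  def histogramString(values: Seq[String]): String =
    values.map(s => s"'$s'").mkString("[", ",", "]")

  // Builds the columnStats entry for one column, using the
  // "<columnName>_histogram" key shown in the write example.
  def columnStats(columnName: String, values: Seq[String]): String =
    s"""{"${columnName}_histogram":${histogramString(values)}}"""
}
```

Calling `ColumnStatsSketch.columnStats("test_col_name", Seq("a", "b", "c"))` yields `{"test_col_name_histogram":['a','b','c']}`.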

If no histogram is provided in columnStats during the first write, a default histogram ("a" to "z") will be used. Subsequent appends will reuse the same histogram.

A new revision will be created when a different custom histogram is provided as "columnStats".

Note: This PR also fixes the bug that required the first write to use the overwrite write mode.

Checklist:

New feature / bug fix has been committed following the Contribution guide.
Add comments to the code (make it easier for the community!).
Change the documentation.
Add tests.
Branch is updated to the main branch.

osopardo1 · 2023-11-27T08:11:50Z

All good!

cugni

LGTM

Jiaweihu08 and others added 20 commits September 27, 2023 13:53

WIP, histogram-based string encoding

c84083e

First working version

1423d0b

StringHistogramTransformation only created via columnStats

d719237

Cover edge cases for string histograms

7f7c0a8

User default string histogram when no value is provided by the user

adcb89f

Change data structure for string hist, cover corner cases for transfo…

3048bc9

…rmation

HashinTransformation should be superseded by StringHistogramTransform…

319b072

…ation

Add and fix tests

e1a7749

Update documentation for String indexing via histograms

44f66fc

Update documentation

9ee36e2

fix issue on appending new revision

47de1fe

scalafmt

1db4b13

Revert changes

79cc16c

Improve extensibility

c92cc52

Fix test

90c59e6

Qbeast-io#228 Implementation of equals/hashCode of CubeId is reworked

c1c144f

Remove null from histogram

fe272cc

Qbeast-io#228 Test coverage of CubeId is improved.

882a7f1

Merge pull request Qbeast-io#231 from alexeiakimov/228-cubeid-has-inc…

cdcc4e2

…orrect-implementation-of-equals-and-hashcode Qbeast-io#228 Implementation of equals/hashCode of CubeId is fixed

Merge pull request Qbeast-io#230 from Jiaweihu08/string-hist-encoding

2152a20

String hist encoding

alexeiakimov self-assigned this Nov 24, 2023

alexeiakimov requested review from Jiaweihu08, cugni and osopardo1 November 24, 2023 14:04

osopardo1 approved these changes Nov 24, 2023

osopardo1 mentioned this pull request Nov 27, 2023

String indexing via column histograms #221

Closed

5 tasks

cugni approved these changes Nov 27, 2023

osopardo1 merged commit 0cbf7aa into Qbeast-io:main Nov 27, 2023
1 check passed
