Backport string hist encoding to main #234

Merged

alexeiakimov
Contributor

@alexeiakimov alexeiakimov commented Nov 24, 2023

Description

This PR introduces a new String Transformation that uses String histograms for value mapping.

(Backport the feature from main-1.0.0)

How does it work?

A sorted sequence of distinct String values must be provided; it is used to map String values to the [0.0, 1.0] index space.

The sequence is treated as an equal-width histogram for the column to index.

To transform a given String value, we look for its insertion position within the sequence using binary search:

import scala.collection.Searching._ // required for .search on Scala 2.12

val hist = Vector("a", "b", "c", "d", "e")
val coordinate = hist.search("b").insertionPoint.toDouble / (hist.length - 1) // 0.25

Values that are not contained within the limits of the histogram are mapped to their corresponding extremes, i.e. 0.0 or 1.0.
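The mapping and the clamping described above can be sketched as a single function. This is only an illustration of the formula in this description: `toCoordinate` is a hypothetical helper, not part of the Qbeast API, and the actual transformation may treat values that fall between two histogram entries differently.

```scala
import scala.collection.Searching._ // needed on Scala 2.12; harmless on 2.13

// Hypothetical helper, not part of the Qbeast API.
def toCoordinate(hist: IndexedSeq[String], value: String): Double = {
  require(hist.length > 1, "the histogram needs at least two entries")
  // Binary search: index of the value if present, otherwise where it would be inserted.
  val position = hist.search(value).insertionPoint.toDouble / (hist.length - 1)
  // Values before the first entry already give 0.0; values after the last
  // entry give a ratio above 1.0, so clamp to the [0.0, 1.0] range.
  math.min(1.0, math.max(0.0, position))
}

val hist = Vector("a", "b", "c", "d", "e")
val inside = toCoordinate(hist, "b")   // 0.25, exact match at index 1
val below  = toCoordinate(hist, "A")   // 0.0, "A" sorts before "a"
val above  = toCoordinate(hist, "zzz") // 1.0, "zzz" sorts after "e"
```

Note that the comparison is plain lexicographic String ordering, so uppercase letters and digits sort before lowercase letters.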

How to use?

To use the feature, we build the histogram and provide it as part of columnStats when writing.

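The histogram transformation must also be requested for the String column to index by appending :histogram to its name in columnsToIndex, e.g. test_col_name:histogram.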

  1. Compute the histogram:
  • Use all distinct String values:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def getAllString(df: DataFrame, columnName: String): String = {
  df
    .select(columnName)
    .distinct()
    .na.drop
    .orderBy(col(columnName).asc)
    .collect()
    .map { r =>
      val s = r.getAs[String](0)
      s"'$s'"
    }
    .mkString("[", ",", "]")
}

val histogram = getAllString(df, "test_col_name")
  • Use a reduced number of String values:
import org.apache.spark.sql.delta.skipping.MultiDimClusteringFunctions
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, min}

def getStringHistogramStr(df: DataFrame, columnName: String, numBins: Int): String = {
  val binStarts = "__bin_starts"
  val stringPartitionColumn = MultiDimClusteringFunctions.range_partition_id(col(columnName), numBins)

  df
    .select(columnName)
    .distinct()
    .na.drop
    .groupBy(stringPartitionColumn)
    .agg(min(columnName).alias(binStarts))
    .select(binStarts)
    .orderBy(binStarts)
    .collect()
    .map { r =>
      val s = r.getAs[String](0)
      s"'$s'"
    }
    .mkString("[", ",", "]")
}

val histogram = getStringHistogramStr(df, "test_col_name", 100)
  2. Index the data as follows:
val columnStats = s"""{"test_col_name_histogram":$histogram}"""

df
  .write
  .format("qbeast")
  .option("columnsToIndex", "test_col_name:histogram")
  .option("columnStats", columnStats)
  .save(targetPath)
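As a rough, Spark-free illustration of what the reduced histogram from step 1 contains, the sketch below splits the sorted distinct values into contiguous groups and keeps the smallest value of each group. `binStarts` and `toHistogramString` are hypothetical helpers written for this example. The Spark version relies on `range_partition_id`, whose bin boundaries are chosen by Spark and need not match this exact split.

```scala
// Hypothetical, Spark-free illustration of the "reduced number of values" idea.
def binStarts(values: Seq[String], numBins: Int): Vector[String] = {
  require(numBins > 0, "numBins must be positive")
  // Drop nulls, deduplicate and sort, mirroring na.drop / distinct / orderBy.
  val sorted = values.filter(_ != null).distinct.sorted.toVector
  if (sorted.isEmpty) Vector.empty
  else {
    val binSize = math.ceil(sorted.length.toDouble / numBins).toInt
    // Each group is a contiguous sorted range, so its head is the bin minimum.
    sorted.grouped(binSize).map(_.head).toVector
  }
}

// Renders the values in the same bracketed, single-quoted form used for columnStats.
def toHistogramString(starts: Seq[String]): String =
  starts.map(s => s"'$s'").mkString("[", ",", "]")

val starts = binStarts(Seq("kiwi", "apple", "fig", "banana", "cherry", "apple", "date"), 3)
val histogram = toHistogramString(starts) // ['apple','cherry','fig']
val columnStats = s"""{"test_col_name_histogram":$histogram}"""
```

Neither this sketch nor the functions in step 1 escape single quotes inside the values, so values containing a quote character would need extra handling before being embedded in columnStats.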

If no histogram is provided in columnStats during the first write, a default histogram ("a" to "z") is used. Subsequent appends reuse the same histogram.

A new revision is created when a different custom histogram is provided in columnStats.
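To get a feel for when a custom histogram is worth providing, the sketch below applies the insertion-point formula from this description to a default histogram. It assumes the default consists of the 26 single-letter strings "a" through "z"; `defaultHist` and `coordinate` are hypothetical names used only for this example.

```scala
import scala.collection.Searching._ // needed on Scala 2.12; harmless on 2.13

// Assumption: the default histogram is the 26 strings "a", "b", ..., "z".
val defaultHist: Vector[String] = ('a' to 'z').map(_.toString).toVector

// Insertion-point mapping from the description, clamped at 1.0.
def coordinate(value: String): Double =
  math.min(1.0, defaultHist.search(value).insertionPoint.toDouble / (defaultHist.length - 1))

val samples = Seq("2023-11-24", "Madrid", "madrid", "zurich")
samples.foreach(s => println(f"$s%-12s -> ${coordinate(s)}%.2f"))
```

Under these assumptions, values that start with a digit or an uppercase letter all sort before "a" and collapse to 0.0, so columns dominated by such values are better served by a histogram computed from the data.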

Note: This PR also fixes the bug that required the first write to use overwrite mode.

Checklist:

  • New feature / bug fix has been committed following the Contribution guide.
  • Add comments to the code (make it easier for the community!).
  • Change the documentation.
  • Add tests.
  • Branch is updated to the main branch.

@osopardo1
Member

All good!

Member

@cugni cugni left a comment

LGTM

@osopardo1 osopardo1 merged commit 0cbf7aa into Qbeast-io:main Nov 27, 2023
1 check passed