You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Was playing around with the BigQuery public-datasets populated using ETL. Was wondering if it has been investigated whether it will be possible to add clustering in addition to partitioning. Would really reduce the processing costs if queries are structured properly.
Instead, the data can be copied to a new table that has clustering enabled.
Example
As an example, the blocks table can get migrated using the following SQL script:
CREATE TABLE
`bigquery-public-data.crypto_ethereum.blocks-cluster`
PARTITION BY
TIMESTAMP_TRUNC(`timestamp`, DAY)
CLUSTER BY
`number`, `hash` AS
SELECT
*
FROM
`bigquery-public-data.crypto_ethereum.blocks`
Thereafter the tables can be renamed to swap out the tables and delete the originals.
ALTER TABLE `bigquery-public-data:crypto_ethereum.blocks` RENAME TO `blocks-legacy`
ALTER TABLE `bigquery-public-data:crypto_ethereum.blocks-cluster` RENAME TO `blocks`
Benchmark
As a comparison, I created a copy of the blocks table and queried by number on the original table and the copy with clustering:
Original
Clustering
Comparison
Looking at the results, the overall amount of data processed was reduced from 14.7 GB to 6.9 MB.
Structure
The major design decision that will need to be made is which columns to cluster on and in what order. It is recommended to keep the order of clustered columns the same as the schema as it simplifies writing queries. Clustering only works if filters (WHERE) are applied in order of clustering.
Clustering has a minor storage overhead, but the benefits generally exceed the disadvantages and as many columns as possible should make use of clustering.
The text was updated successfully, but these errors were encountered:
Was playing around with the BigQuery public-datasets populated using ETL. Was wondering if it has been investigated whether it will be possible to add clustering in addition to partitioning. Would really reduce the processing costs if queries are structured properly.
https://cloud.google.com/bigquery/docs/clustered-tables
Although it possible to add clustering to an existing table, only new data will be clustered and have reduced processing costs:
https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec
Instead, the data can be copied to a new table that has clustering enabled.
Example
As an example, the
blocks
table can get migrated using the following SQL script:Thereafter the tables can be renamed to swap out the tables and delete the originals.
Benchmark
As a comparison, I created a copy of the
blocks
table and queried bynumber
on the original table and the copy with clustering:Original
Clustering
Comparison
Looking at the results, the overall amount of data processed was reduced from 14.7 GB to 6.9 MB.
Structure
The major design decision that will need to be made is which columns to cluster on and in what order. It is recommended to keep the order of clustered columns the same as the schema as it simplifies writing queries. Clustering only works if filters (
WHERE
) are applied in order of clustering.Clustering has a minor storage overhead, but the benefits generally exceed the disadvantages and as many columns as possible should make use of clustering.
The text was updated successfully, but these errors were encountered: