Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE #14922

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Feb 28, 2025

Which issue does this PR close?

Rationale for this change

schema_force_view_type configuration not working for CREATE EXTERNAL TABLE

Because we onlly use infer_schema to infer schema and transform schema, but for those provided shema such as CREATE EXTERNAL TABLE, we also need to transform schema.

What changes are included in this PR?

Add separate transform schema function to do the transform.

Are these changes tested?

Yes

> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
        l_orderkey BIGINT,
        l_partkey BIGINT,
        l_suppkey BIGINT,
        l_linenumber INTEGER,
        l_quantity DECIMAL(15, 2),
        l_extendedprice DECIMAL(15, 2),
        l_discount DECIMAL(15, 2),
        l_tax DECIMAL(15, 2),
        l_returnflag VARCHAR,
        l_linestatus VARCHAR,
        l_shipdate DATE,
        l_commitdate DATE,
        l_receiptdate DATE,
        l_shipinstruct VARCHAR,
        l_shipmode VARCHAR,
        l_comment VARCHAR
) STORED AS parquet
LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem';
0 row(s) fetched.
Elapsed 0.006 seconds.

> select arrow_typeof(l_comment) from lineitem limit 1;
+----------------------------------+
| arrow_typeof(lineitem.l_comment) |
+----------------------------------+
| Utf8View                         |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.092 seconds.

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Feb 28, 2025
@zhuqi-lucas zhuqi-lucas marked this pull request as draft February 28, 2025 03:56
@github-actions github-actions bot added the execution Related to the execution crate label Feb 28, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 28, 2025
@@ -2400,7 +2400,8 @@ async fn write_json_with_order() -> Result<()> {
#[tokio::test]
async fn write_table_with_order() -> Result<()> {
let tmp_dir = TempDir::new()?;
let ctx = SessionContext::new();
let config = SessionConfig::new().with_parquet_force_view_metadata(false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to disable this?

I can explain, because

  1. We select from the datasource or memory which are not forced to utf8view type, but we create external table with utf8view field when we enable the force_view.
  2. When we insert into cases, we check the insert data type and the table data type, so we should make the data type consistent.

Related ticket to support cast it to utf8view automaticlly:
#13408 (comment)

statement ok
create table t(a varchar) as values ('1'), ('2');

query T
select arrow_typeof(a) from t;
----
Utf8
Utf8

statement ok
drop table t

@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review March 2, 2025 00:30
@@ -59,7 +59,8 @@ use tempfile::tempdir;
#[tokio::main]
async fn main() -> Result<()> {
// The SessionContext is the main high level API for interacting with DataFusion
let ctx = SessionContext::new();
let config = SessionConfig::new().with_parquet_force_view_metadata(false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, we need to make the datatype consistent for insert into case.

03)----Sort: aggregate_test_100.c1 ASC NULLS LAST
04)------TableScan: aggregate_test_100 projection=[c1]
physical_plan
01)DataSinkExec: sink=ParquetSink(file_groups=[])
02)--SortExec: expr=[c1@0 ASC NULLS LAST], preserve_partitioning=[false]
03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1], file_type=csv, has_header=true
02)--CoalescePartitionsExec
Copy link
Contributor Author

@zhuqi-lucas zhuqi-lucas Mar 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not related to this PR:

I am confused why we add CAST() for logic plan will cause the physical plan change to add

02)--CoalescePartitionsExec
03)----ProjectionExec: expr=[CAST(c1@0 AS Utf8View) as c1]
04)------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1

@@ -144,7 +144,7 @@ EXPLAIN select b from t_pushdown where a = 'bar' order by b;
----
logical_plan
01)Sort: t_pushdown.b ASC NULLS LAST
02)--TableScan: t_pushdown projection=[b], full_filters=[t_pushdown.a = Utf8("bar")]
02)--TableScan: t_pushdown projection=[b], full_filters=[t_pushdown.a = Utf8View("bar")]
Copy link
Contributor Author

@zhuqi-lucas zhuqi-lucas Mar 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## Create table without pushdown
statement ok
CREATE EXTERNAL TABLE t_pushdown(a varchar, b int, c float) STORED AS PARQUET
LOCATION 'test_files/scratch/parquet_filter_pushdown/parquet_table/';

This is expected, now we support it for external table creation with Utf8View for this PR.

@zhuqi-lucas
Copy link
Contributor Author

Testing result:

> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
        l_orderkey BIGINT,
        l_partkey BIGINT,
        l_suppkey BIGINT,
        l_linenumber INTEGER,
        l_quantity DECIMAL(15, 2),
        l_extendedprice DECIMAL(15, 2),
        l_discount DECIMAL(15, 2),
        l_tax DECIMAL(15, 2),
        l_returnflag VARCHAR,
        l_linestatus VARCHAR,
        l_shipdate DATE,
        l_commitdate DATE,
        l_receiptdate DATE,
        l_shipinstruct VARCHAR,
        l_shipmode VARCHAR,
        l_comment VARCHAR
) STORED AS parquet
LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem';
0 row(s) fetched.
Elapsed 0.006 seconds.

> select arrow_typeof(l_comment) from lineitem limit 1;
+----------------------------------+
| arrow_typeof(lineitem.l_comment) |
+----------------------------------+
| Utf8View                         |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.092 seconds.

@@ -456,13 +456,16 @@ explain insert into table_without_values select c1 from aggregate_test_100 order
----
logical_plan
01)Dml: op=[Insert Into] table=[table_without_values]
02)--Projection: aggregate_test_100.c1 AS c1
02)--Projection: CAST(aggregate_test_100.c1 AS Utf8View) AS c1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also expected, for the sql way insert into, we will automatically cast the type to match the table schema, see details:

The code here:

.cast_to(target_field.data_type(), source.schema())?

@@ -377,6 +377,21 @@ impl FileFormat for ParquetFormat {
Ok(Arc::new(schema))
}

async fn transform_schema(&self, schema: SchemaRef) -> Result<SchemaRef> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this transformation rule adapted from some existing transformation? Because there is an additional rule here: transform_binary_to_string, so I made a such guess.
If so, perhaps we should extract the common logic to keep it consistent, or at least let them reference to each other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we call it at the end of infer_shema, but it's not limited to infer_shema:

     let schema = if self.binary_as_string() {
            transform_binary_to_string(&schema)
        } else {
            schema
        };

        let schema = if self.force_view_types() {
            transform_schema_to_view(&schema)
        } else {
            schema
        };

I will try to extract the common logic to make it consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate execution Related to the execution crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

schema_force_view_type configuration not working for CREATE EXTERNAL TABLE
2 participants