-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE #14922
base: main
Are you sure you want to change the base?
Conversation
02b45c6
to
b41b4fb
Compare
…al table field will default to utf8view
@@ -2400,7 +2400,8 @@ async fn write_json_with_order() -> Result<()> { | |||
#[tokio::test] | |||
async fn write_table_with_order() -> Result<()> { | |||
let tmp_dir = TempDir::new()?; | |||
let ctx = SessionContext::new(); | |||
let config = SessionConfig::new().with_parquet_force_view_metadata(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to disable this?
I can explain, because
- We select from the datasource or memory which are not forced to utf8view type, but we create external table with utf8view field when we enable the force_view.
- When we insert into cases, we check the insert data type and the table data type, so we should make the data type consistent.
Related ticket to support cast it to utf8view automaticlly:
#13408 (comment)
statement ok
create table t(a varchar) as values ('1'), ('2');
query T
select arrow_typeof(a) from t;
----
Utf8
Utf8
statement ok
drop table t
@@ -59,7 +59,8 @@ use tempfile::tempdir; | |||
#[tokio::main] | |||
async fn main() -> Result<()> { | |||
// The SessionContext is the main high level API for interacting with DataFusion | |||
let ctx = SessionContext::new(); | |||
let config = SessionConfig::new().with_parquet_force_view_metadata(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, we need to make the datatype consistent for insert into case.
03)----Sort: aggregate_test_100.c1 ASC NULLS LAST | ||
04)------TableScan: aggregate_test_100 projection=[c1] | ||
physical_plan | ||
01)DataSinkExec: sink=ParquetSink(file_groups=[]) | ||
02)--SortExec: expr=[c1@0 ASC NULLS LAST], preserve_partitioning=[false] | ||
03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1], file_type=csv, has_header=true | ||
02)--CoalescePartitionsExec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not related to this PR:
I am confused why we add CAST() for logic plan will cause the physical plan change to add
02)--CoalescePartitionsExec
03)----ProjectionExec: expr=[CAST(c1@0 AS Utf8View) as c1]
04)------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
@@ -144,7 +144,7 @@ EXPLAIN select b from t_pushdown where a = 'bar' order by b; | |||
---- | |||
logical_plan | |||
01)Sort: t_pushdown.b ASC NULLS LAST | |||
02)--TableScan: t_pushdown projection=[b], full_filters=[t_pushdown.a = Utf8("bar")] | |||
02)--TableScan: t_pushdown projection=[b], full_filters=[t_pushdown.a = Utf8View("bar")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
## Create table without pushdown
statement ok
CREATE EXTERNAL TABLE t_pushdown(a varchar, b int, c float) STORED AS PARQUET
LOCATION 'test_files/scratch/parquet_filter_pushdown/parquet_table/';
This is expected, now we support it for external table creation with Utf8View for this PR.
Testing result: > CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
l_orderkey BIGINT,
l_partkey BIGINT,
l_suppkey BIGINT,
l_linenumber INTEGER,
l_quantity DECIMAL(15, 2),
l_extendedprice DECIMAL(15, 2),
l_discount DECIMAL(15, 2),
l_tax DECIMAL(15, 2),
l_returnflag VARCHAR,
l_linestatus VARCHAR,
l_shipdate DATE,
l_commitdate DATE,
l_receiptdate DATE,
l_shipinstruct VARCHAR,
l_shipmode VARCHAR,
l_comment VARCHAR
) STORED AS parquet
LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem';
0 row(s) fetched.
Elapsed 0.006 seconds.
> select arrow_typeof(l_comment) from lineitem limit 1;
+----------------------------------+
| arrow_typeof(lineitem.l_comment) |
+----------------------------------+
| Utf8View |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.092 seconds. |
@@ -456,13 +456,16 @@ explain insert into table_without_values select c1 from aggregate_test_100 order | |||
---- | |||
logical_plan | |||
01)Dml: op=[Insert Into] table=[table_without_values] | |||
02)--Projection: aggregate_test_100.c1 AS c1 | |||
02)--Projection: CAST(aggregate_test_100.c1 AS Utf8View) AS c1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also expected, for the sql way insert into, we will automatically cast the type to match the table schema, see details:
The code here:
datafusion/datafusion/sql/src/statement.rs
Line 1958 in 2011f52
.cast_to(target_field.data_type(), source.schema())? |
@@ -377,6 +377,21 @@ impl FileFormat for ParquetFormat { | |||
Ok(Arc::new(schema)) | |||
} | |||
|
|||
async fn transform_schema(&self, schema: SchemaRef) -> Result<SchemaRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this transformation rule adapted from some existing transformation? Because there is an additional rule here: transform_binary_to_string
, so I made a such guess.
If so, perhaps we should extract the common logic to keep it consistent, or at least let them reference to each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, we call it at the end of infer_shema, but it's not limited to infer_shema:
let schema = if self.binary_as_string() {
transform_binary_to_string(&schema)
} else {
schema
};
let schema = if self.force_view_types() {
transform_schema_to_view(&schema)
} else {
schema
};
I will try to extract the common logic to make it consistent.
Which issue does this PR close?
schema_force_view_type
configuration not working forCREATE EXTERNAL TABLE
#14909Rationale for this change
schema_force_view_type configuration not working for CREATE EXTERNAL TABLE
Because we onlly use infer_schema to infer schema and transform schema, but for those provided shema such as CREATE EXTERNAL TABLE, we also need to transform schema.
What changes are included in this PR?
Add separate transform schema function to do the transform.
Are these changes tested?
Yes
Are there any user-facing changes?