BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE #14922
base: main
```diff
@@ -377,6 +377,21 @@ impl FileFormat for ParquetFormat {
         Ok(Arc::new(schema))
     }
 
+    async fn transform_schema(&self, schema: SchemaRef) -> Result<SchemaRef> {
+        let schema = if self.binary_as_string() {
+            Arc::new(transform_binary_to_string(schema.as_ref()))
+        } else {
+            schema
+        };
+
+        let schema = if self.force_view_types() {
+            Arc::new(transform_schema_to_view(schema.as_ref()))
+        } else {
+            schema
+        };
+        Ok(schema)
+    }
+
     async fn infer_stats(
         &self,
         _state: &dyn Session,
```

Review comment on `transform_schema`: Is this transformation rule adapted from some existing transformation? Because there is an additional rule here.

Reply: Right, we call it at the end of `infer_schema`, but it is not limited to `infer_schema`:

```rust
let schema = if self.binary_as_string() {
    transform_binary_to_string(&schema)
} else {
    schema
};
let schema = if self.force_view_types() {
    transform_schema_to_view(&schema)
} else {
    schema
};
```

I will try to extract the common logic to make it consistent.
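A minimal sketch of what that extracted common logic could look like, assuming a free function reused by both `infer_schema` and `transform_schema`. The helper name and signature are invented for illustration and are not part of the PR; it relies on the same `transform_binary_to_string` / `transform_schema_to_view` helpers already used in the diff above, whose module they live in:

```rust
use std::sync::Arc;

use arrow::datatypes::SchemaRef;

// Hypothetical shared helper (name and signature are assumptions, not PR code).
// It applies the same two optional rewrites used by infer_schema and by the new
// transform_schema, so the two call sites cannot drift apart.
fn apply_parquet_schema_adaptations(
    schema: SchemaRef,
    binary_as_string: bool,
    force_view_types: bool,
) -> SchemaRef {
    // Optionally rewrite binary columns to string columns.
    let schema = if binary_as_string {
        Arc::new(transform_binary_to_string(schema.as_ref()))
    } else {
        schema
    };
    // Optionally rewrite string/binary columns to their view-typed variants.
    if force_view_types {
        Arc::new(transform_schema_to_view(schema.as_ref()))
    } else {
        schema
    }
}
```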
```diff
@@ -2400,7 +2400,8 @@ async fn write_json_with_order() -> Result<()> {
 #[tokio::test]
 async fn write_table_with_order() -> Result<()> {
     let tmp_dir = TempDir::new()?;
-    let ctx = SessionContext::new();
+    let config = SessionConfig::new().with_parquet_force_view_metadata(false);
+    let ctx = SessionContext::new_with_config(config);
     let location = tmp_dir.path().join("test_table/");
 
     let mut write_df = ctx
```

Review comment on the new config line: Why do we need to disable this? I can explain: we need to keep the datatype consistent for the insert case, because a table created with a `varchar` column still uses `Utf8` rather than `Utf8View`. There is a related ticket to support casting it to Utf8View automatically:

```
statement ok
create table t(a varchar) as values ('1'), ('2');

query T
select arrow_typeof(a) from t;
----
Utf8
Utf8

statement ok
drop table t
```
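For comparison, a minimal sketch of disabling the same behaviour through the string-keyed options API instead of the builder method used in the diff. The exact config key is assumed from the option named in the PR title and may differ between DataFusion versions:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

// Sketch: build a SessionContext with the Parquet force-view option turned off.
// Assumption: the option key matches the schema_force_view_type(s) name from
// the PR title; verify the exact key for your DataFusion version.
fn ctx_without_forced_view_types() -> SessionContext {
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.schema_force_view_types", false);
    SessionContext::new_with_config(config)
}
```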
```diff
@@ -456,13 +456,16 @@ explain insert into table_without_values select c1 from aggregate_test_100 order
 ----
 logical_plan
 01)Dml: op=[Insert Into] table=[table_without_values]
-02)--Projection: aggregate_test_100.c1 AS c1
+02)--Projection: CAST(aggregate_test_100.c1 AS Utf8View) AS c1
 03)----Sort: aggregate_test_100.c1 ASC NULLS LAST
 04)------TableScan: aggregate_test_100 projection=[c1]
 physical_plan
 01)DataSinkExec: sink=ParquetSink(file_groups=[])
-02)--SortExec: expr=[c1@0 ASC NULLS LAST], preserve_partitioning=[false]
-03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1], file_type=csv, has_header=true
+02)--CoalescePartitionsExec
+03)----ProjectionExec: expr=[CAST(c1@0 AS Utf8View) as c1]
+04)------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
+05)--------SortExec: expr=[c1@0 ASC NULLS LAST], preserve_partitioning=[false]
+06)----------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1], file_type=csv, has_header=true
 
 query I
 insert into table_without_values select c1 from aggregate_test_100 order by c1;
```

Review comment on the new `CAST(aggregate_test_100.c1 AS Utf8View)` projection: This is also expected. For the SQL `INSERT INTO` path we automatically cast the select list to match the table schema; see the code in `datafusion/datafusion/sql/src/statement.rs`, line 1958 in 2011f52.

Review comment on the new `CoalescePartitionsExec` line: This is not related to this PR, but I am confused why adding `CAST()` to the logical plan causes the physical plan to pick up `02)--CoalescePartitionsExec`.
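As a self-contained illustration of the cast-on-insert behaviour discussed above (the table name, column name, and path are invented for this sketch, not taken from the test files):

```rust
use datafusion::prelude::*;

// Sketch: when an INSERT goes through the SQL planner, the select list or VALUES
// are cast to the target table's column types, so a Utf8 value feeding a column
// that resolved to Utf8View shows up as CAST(... AS Utf8View) in the plan.
async fn show_insert_cast(ctx: &SessionContext) -> datafusion::error::Result<()> {
    // Hypothetical Parquet-backed target table with a string column.
    ctx.sql(
        "CREATE EXTERNAL TABLE sink_table(c1 VARCHAR) STORED AS PARQUET \
         LOCATION './scratch/sink_table/'",
    )
    .await?;
    // The logical plan should contain a Projection with CAST(... AS Utf8View)
    // when the table's string column uses the view type.
    ctx.sql("EXPLAIN INSERT INTO sink_table VALUES ('a'), ('b')")
        .await?
        .show()
        .await?;
    Ok(())
}
```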
```diff
@@ -144,7 +144,7 @@ EXPLAIN select b from t_pushdown where a = 'bar' order by b;
 ----
 logical_plan
 01)Sort: t_pushdown.b ASC NULLS LAST
-02)--TableScan: t_pushdown projection=[b], full_filters=[t_pushdown.a = Utf8("bar")]
+02)--TableScan: t_pushdown projection=[b], full_filters=[t_pushdown.a = Utf8View("bar")]
 physical_plan
 01)SortPreservingMergeExec: [b@0 ASC NULLS LAST]
 02)--SortExec: expr=[b@0 ASC NULLS LAST], preserve_partitioning=[true]
```

Review comment on the changed `TableScan` filter: This is expected. The table is created with

```
## Create table without pushdown
statement ok
CREATE EXTERNAL TABLE t_pushdown(a varchar, b int, c float) STORED AS PARQUET
LOCATION 'test_files/scratch/parquet_filter_pushdown/parquet_table/';
```

and with this PR we now support `Utf8View` for external table creation, so the pushed-down filter literal becomes `Utf8View("bar")`.
Review comment: Same here, we need to make the datatype consistent for the `INSERT INTO` case.
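Finally, a small end-to-end sketch of the behaviour this PR is about: with the force-view option left at its default, a string column of a Parquet-backed external table should now resolve to `Utf8View`. The file path and column name below are placeholders:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Placeholder path; point this at any Parquet file with a string column.
    ctx.sql(
        "CREATE EXTERNAL TABLE events STORED AS PARQUET \
         LOCATION './data/events.parquet'",
    )
    .await?;
    // Before the fix this reported Utf8 even with schema_force_view_type enabled;
    // with the fix it should report Utf8View ('name' is a placeholder column).
    ctx.sql("SELECT arrow_typeof(name) FROM events LIMIT 1")
        .await?
        .show()
        .await?;
    Ok(())
}
```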