-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix duplicated schema name error from count wildcard #14824
Conversation
// handle count() case | ||
if expr.is_empty() { | ||
return Ok(vec![ | ||
Arc::new(Int64Array::from(vec![1; batch.num_rows()])) as ArrayRef |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is equivalent to count(1)
case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this function is not only used by count
. I'm not quite sure about the impact of this change.
Ideally, this function should not involve the logic of any specific aggregation function.
.collect::<Result<Vec<_>>>()?; | ||
// Handle count(*) case | ||
let values = if expr.is_empty() { | ||
vec![Arc::new(Int64Array::from(vec![1; n_rows])) as ArrayRef] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is equivalent to count(1) case
fix the extended test in main branch |
That is exactly what I was thinking and hopefully will fix tonight. I think a decent short-term fix is to 'lock' the sqllogictest-rs dependency version and add a comment that any update to it will require a full run of the regenerate script before committing. Long term ideally would be to improve my changes to my fork of sqllogictest-rs such that they would be suitable to submit a PR to that project. That is not an insignificant amount of work to be honest and I'm a bit thin on time for the next month or two. |
Makes sesne -- thank you BTW I have another interim workaround here: I think we can use that to regenerate the output for this PR |
After merging apache/datafusion-testing#7 and update commit, I guess is good to go |
I just merged apache/datafusion-testing#7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jayzhan211
@@ -2455,7 +2455,7 @@ async fn test_count_wildcard_on_sort() -> Result<()> { | |||
let ctx = create_join_context()?; | |||
|
|||
let sql_results = ctx | |||
.sql("select b,count(*) from t1 group by b order by count(*)") | |||
.sql("select b,count(1) from t1 group by b order by count(1)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to double check -- the reason this needs to change is that the test is comparing again a dataframe built with count_all()
which now uses count(1)
Though maybe we could change count_all()
to return count(1) as "count(*)"
so it would be consistent with older versions?
@@ -286,6 +286,7 @@ pub struct NamePreserver { | |||
|
|||
/// If the qualified name of an expression is remembered, it will be preserved | |||
/// when rewriting the expression | |||
#[derive(Debug)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
I think we need to update the datafusion-testing pin -- closing/reopening this PR to rerun the tests to make sure |
NM I think things are clean now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jayzhan211 -- since this PR fixes a bunch of tests and gets the main branch back to green, I am going to merge it. We can then address the count_all()
function name as a follow on PR
@@ -2455,7 +2455,7 @@ async fn test_count_wildcard_on_sort() -> Result<()> { | |||
let ctx = create_join_context()?; | |||
|
|||
let sql_results = ctx | |||
.sql("select b,count(*) from t1 group by b order by count(*)") | |||
.sql("select b,count(1) from t1 group by b order by count(1)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found I could avoid the double alias by adding a check in Expr::alias
:
diff --git a/datafusion/expr/src/expr.rs b/datafusion/expr/src/expr.rs
index f8baf9c94..2f3c2c575 100644
--- a/datafusion/expr/src/expr.rs
+++ b/datafusion/expr/src/expr.rs
@@ -1276,7 +1276,14 @@ impl Expr {
/// Return `self AS name` alias expression
pub fn alias(self, name: impl Into<String>) -> Expr {
- Expr::Alias(Alias::new(self, None::<&str>, name.into()))
+ let name = name.into();
+ // don't realias the same thing
+ if matches!(&self, Expr::Alias(Alias {name: existing_name, ..} ) if existing_name == &name)
+ {
+ self
+ } else {
+ Expr::Alias(Alias::new(self, None::<&str>, name))
+ }
}
/// Return `self AS name` alias expression with a specific qualifier
@@ -1285,7 +1292,15 @@ impl Expr {
relation: Option<impl Into<TableReference>>,
name: impl Into<String>,
) -> Expr {
- Expr::Alias(Alias::new(self, relation, name.into()))
+ let relation = relation.map(|r| r.into());
+ let name = name.into();
+ // don't realias the same thing
+ if matches!(&self, Expr::Alias(Alias {name: existing_name, relation: existing_relation, ..} ) if existing_name == &name && relation.as_ref()==existing_relation.as_ref() )
+ {
+ self
+ } else {
+ Expr::Alias(Alias::new(self, relation, name))
+ }
}
/// Remove an alias from an expression if one exists.
diff --git a/datafusion/functions-aggregate/src/count.rs b/datafusion/functions-aggregate/src/count.rs
index a3339f0fc..1faf1968b 100644
--- a/datafusion/functions-aggregate/src/count.rs
+++ b/datafusion/functions-aggregate/src/count.rs
@@ -81,7 +81,7 @@ pub fn count_distinct(expr: Expr) -> Expr {
/// Creates aggregation to count all rows, equivalent to `COUNT(*)`, `COUNT()`, `COUNT(1)`
pub fn count_all() -> Expr {
- count(Expr::Literal(COUNT_STAR_EXPANSION))
+ count(Expr::Literal(COUNT_STAR_EXPANSION)).alias("count(*)")
}
#[user_doc(
@@ -198,7 +198,7 @@ fn between_date32_plus_interval() -> Result<()> { | |||
WHERE col_date32 between '1998-03-18' AND cast('1998-03-18' as date) + INTERVAL '90 days'"; | |||
let plan = test_sql(sql)?; | |||
let expected = | |||
"Aggregate: groupBy=[[]], aggr=[[count(*)]]\ | |||
"Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this certainly seems an improvement
Let's get the tests clean |
Thanks @alamb. I will file related issue as follow-up |
Thansk! Note I did file |
The tests are green again on main! |
Which issue does this PR close?
We convert count(constant) i.e. count(2) to count(*) in previous PR
so
select count(1) * count(2)
produces duplicated schema name error given both arecount(*)
in schema name.select count(), count(*)
does not work #14855Rationale for this change
Instead of converting
count()
andcount(*)
tocount(1)
. We makescount()
possible as a replacement of count wildcard. In this case,count(1)
can be treated as the normal case (although it is equivalent to wildcard), without this we need to handle many different complex case forcount(1)
such ascount(cast(1 as i32))
. The schema name is much more consistent with DuckDB too.What changes are included in this PR?
Implement count with zero arg in aggregate function level.
count()
-> count()count(*)
-> count()count(1)
-> count(1)count(2)
-> count(2)Are these changes tested?
Are there any user-facing changes?