Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix duplicated schema name error from count wildcard #14824

Merged
merged 28 commits into from
Feb 26, 2025

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Feb 22, 2025

Which issue does this PR close?

We convert count(constant) i.e. count(2) to count(*) in previous PR
so select count(1) * count(2) produces duplicated schema name error given both are count(*) in schema name.

Rationale for this change

Instead of converting count() and count(*) to count(1). We makes count() possible as a replacement of count wildcard. In this case, count(1) can be treated as the normal case (although it is equivalent to wildcard), without this we need to handle many different complex case for count(1) such as count(cast(1 as i32)). The schema name is much more consistent with DuckDB too.

What changes are included in this PR?

Implement count with zero arg in aggregate function level.

count() -> count()
count(*) -> count()
count(1) -> count(1)
count(2) -> count(2)

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Feb 22, 2025
@jayzhan211 jayzhan211 changed the title Fix duplicated schema name of count wildcard issue Fix duplicated schema name error from count wildcard Feb 22, 2025
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates labels Feb 22, 2025
// handle count() case
if expr.is_empty() {
return Ok(vec![
Arc::new(Int64Array::from(vec![1; batch.num_rows()])) as ArrayRef
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent to count(1) case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this function is not only used by count. I'm not quite sure about the impact of this change.
Ideally, this function should not involve the logic of any specific aggregation function.

.collect::<Result<Vec<_>>>()?;
// Handle count(*) case
let values = if expr.is_empty() {
vec![Arc::new(Int64Array::from(vec![1; n_rows])) as ArrayRef]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent to count(1) case

@jayzhan211 jayzhan211 marked this pull request as ready for review February 22, 2025 13:51
@jayzhan211 jayzhan211 changed the title Fix duplicated schema name error from count wildcard Implement actual count wildcard in physical layer and fix duplicated schema name error from count wildcard Feb 22, 2025
@jayzhan211 jayzhan211 marked this pull request as draft February 22, 2025 14:14
@github-actions github-actions bot added sql SQL Planner optimizer Optimizer rules substrait Changes to the substrait crate labels Feb 22, 2025
@jayzhan211 jayzhan211 marked this pull request as ready for review February 23, 2025 03:15
@jayzhan211 jayzhan211 requested a review from alamb February 24, 2025 01:20
@jayzhan211
Copy link
Contributor Author

fix the extended test in main branch

@Omega359
Copy link
Contributor

I think the issue is that the runner in https://github.com/Omega359/sqllogictest-rs is based on an older version of the sqllogictests than we use in datafusion.

I have an idea for a workaround, but longer term we probably need to make the update eaiser to maintaine

That is exactly what I was thinking and hopefully will fix tonight. I think a decent short-term fix is to 'lock' the sqllogictest-rs dependency version and add a comment that any update to it will require a full run of the regenerate script before committing.

Long term ideally would be to improve my changes to my fork of sqllogictest-rs such that they would be suitable to submit a PR to that project. That is not an insignificant amount of work to be honest and I'm a bit thin on time for the next month or two.

@alamb
Copy link
Contributor

alamb commented Feb 25, 2025

I think the issue is that the runner in https://github.com/Omega359/sqllogictest-rs is based on an older version of the sqllogictests than we use in datafusion.
I have an idea for a workaround, but longer term we probably need to make the update eaiser to maintaine

That is exactly what I was thinking and hopefully will fix tonight. I think a decent short-term fix is to 'lock' the sqllogictest-rs dependency version and add a comment that any update to it will require a full run of the regenerate script before committing.

Long term ideally would be to improve my changes to my fork of sqllogictest-rs such that they would be suitable to submit a PR to that project. That is not an insignificant amount of work to be honest and I'm a bit thin on time for the next month or two.

Makes sesne -- thank you

BTW I have another interim workaround here:

I think we can use that to regenerate the output for this PR

@jayzhan211
Copy link
Contributor Author

After merging apache/datafusion-testing#7 and update commit, I guess is good to go

@alamb alamb mentioned this pull request Feb 26, 2025
26 tasks
@alamb alamb marked this pull request as ready for review February 26, 2025 12:01
@alamb
Copy link
Contributor

alamb commented Feb 26, 2025

I just merged apache/datafusion-testing#7

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211

@@ -2455,7 +2455,7 @@ async fn test_count_wildcard_on_sort() -> Result<()> {
let ctx = create_join_context()?;

let sql_results = ctx
.sql("select b,count(*) from t1 group by b order by count(*)")
.sql("select b,count(1) from t1 group by b order by count(1)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to double check -- the reason this needs to change is that the test is comparing again a dataframe built with count_all() which now uses count(1)

Though maybe we could change count_all() to return count(1) as "count(*)" so it would be consistent with older versions?

@@ -286,6 +286,7 @@ pub struct NamePreserver {

/// If the qualified name of an expression is remembered, it will be preserved
/// when rewriting the expression
#[derive(Debug)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@alamb
Copy link
Contributor

alamb commented Feb 26, 2025

I think we need to update the datafusion-testing pin -- closing/reopening this PR to rerun the tests to make sure

@alamb alamb closed this Feb 26, 2025
@alamb alamb reopened this Feb 26, 2025
@alamb
Copy link
Contributor

alamb commented Feb 26, 2025

NM I think things are clean now

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- since this PR fixes a bunch of tests and gets the main branch back to green, I am going to merge it. We can then address the count_all() function name as a follow on PR

@@ -2455,7 +2455,7 @@ async fn test_count_wildcard_on_sort() -> Result<()> {
let ctx = create_join_context()?;

let sql_results = ctx
.sql("select b,count(*) from t1 group by b order by count(*)")
.sql("select b,count(1) from t1 group by b order by count(1)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found I could avoid the double alias by adding a check in Expr::alias:

diff --git a/datafusion/expr/src/expr.rs b/datafusion/expr/src/expr.rs
index f8baf9c94..2f3c2c575 100644
--- a/datafusion/expr/src/expr.rs
+++ b/datafusion/expr/src/expr.rs
@@ -1276,7 +1276,14 @@ impl Expr {

     /// Return `self AS name` alias expression
     pub fn alias(self, name: impl Into<String>) -> Expr {
-        Expr::Alias(Alias::new(self, None::<&str>, name.into()))
+        let name = name.into();
+        // don't realias the same thing
+        if matches!(&self, Expr::Alias(Alias {name: existing_name, ..} ) if existing_name == &name)
+        {
+            self
+        } else {
+            Expr::Alias(Alias::new(self, None::<&str>, name))
+        }
     }

     /// Return `self AS name` alias expression with a specific qualifier
@@ -1285,7 +1292,15 @@ impl Expr {
         relation: Option<impl Into<TableReference>>,
         name: impl Into<String>,
     ) -> Expr {
-        Expr::Alias(Alias::new(self, relation, name.into()))
+        let relation = relation.map(|r| r.into());
+        let name = name.into();
+        // don't realias the same thing
+        if matches!(&self, Expr::Alias(Alias {name: existing_name, relation: existing_relation, ..} ) if existing_name == &name && relation.as_ref()==existing_relation.as_ref() )
+        {
+            self
+        } else {
+            Expr::Alias(Alias::new(self, relation, name))
+        }
     }

     /// Remove an alias from an expression if one exists.
diff --git a/datafusion/functions-aggregate/src/count.rs b/datafusion/functions-aggregate/src/count.rs
index a3339f0fc..1faf1968b 100644
--- a/datafusion/functions-aggregate/src/count.rs
+++ b/datafusion/functions-aggregate/src/count.rs
@@ -81,7 +81,7 @@ pub fn count_distinct(expr: Expr) -> Expr {

 /// Creates aggregation to count all rows, equivalent to `COUNT(*)`, `COUNT()`, `COUNT(1)`
 pub fn count_all() -> Expr {
-    count(Expr::Literal(COUNT_STAR_EXPANSION))
+    count(Expr::Literal(COUNT_STAR_EXPANSION)).alias("count(*)")
 }

 #[user_doc(

@@ -198,7 +198,7 @@ fn between_date32_plus_interval() -> Result<()> {
WHERE col_date32 between '1998-03-18' AND cast('1998-03-18' as date) + INTERVAL '90 days'";
let plan = test_sql(sql)?;
let expected =
"Aggregate: groupBy=[[]], aggr=[[count(*)]]\
"Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this certainly seems an improvement

@alamb alamb merged commit 9278233 into apache:main Feb 26, 2025
24 of 47 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 26, 2025

Let's get the tests clean

@jayzhan211
Copy link
Contributor Author

Thanks @alamb. I will file related issue as follow-up

@jayzhan211 jayzhan211 deleted the count-schema-name branch February 26, 2025 13:56
@alamb
Copy link
Contributor

alamb commented Feb 26, 2025

Change in 46: count_all() expr_fn function now displayed as count(1) rather than count(*) #14894

Thansk! Note I did file

@alamb
Copy link
Contributor

alamb commented Feb 26, 2025

The tests are green again on main!
https://github.com/apache/datafusion/actions/runs/13545248421/job/37855153112

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression since 45.0.0: select count(), count(*) does not work Extended sqllite tests are failing on main
5 participants