Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create rule S7191: PySpark "withColumns" should be preferred over "withColumn" when multiple columns are specified #4633

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

github-actions[bot]
Copy link
Contributor

You can preview this rule here (updated a few minutes after each push).

Review

A dedicated reviewer checked the rule description successfully for:

  • logical errors and incorrect information
  • information gaps and missing content
  • text style and tone
  • PR summary and labels follow the guidelines

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments on the description, but looks good overall!
I also left a comment in the implementation ticket, where I think we can clarify more explicitly some use cases (e.g. loops).

@@ -0,0 +1,65 @@
This rule identifies instances where multiple `withColumn` calls are used in succession to add or modify columns in a PySpark DataFrame. It suggests using `withColumns` instead, which is more efficient and concise when dealing with multiple columns.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Maybe we could drop the final "when dealing with multiple columns.", which can be a bit verbose?

from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumn("value_plus_1", col("value") + 1).withColumn("value_plus_2", col("value") + 2).withColumn("value_plus_3", col("value") + 3) # Non-compliant

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: for consistency it should beNoncompliant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum I copy this way of writing it from another PR but can't find which one. Anyway, I follow your suggestion.

from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumn("value_plus_1", col("value") + 1).withColumn("value_plus_2", col("value") + 2).withColumn("value_plus_3", col("value") + 3) # Non-compliant

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the examples should be crystal clear, so we shouldn't mix Noncompliant/Compliant cases there.

"MAINTAINABILITY": "MEDIUM",
"RELIABILITY": "MEDIUM"
},
"attribute": "CONVENTIONAL"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about EFFICIENT here since it's primarily about performances?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a minor nitpick left on the rule title.

@@ -0,0 +1,26 @@
{
"title": "`withColumns` method should be preferred over `withColumn` when multiple columns are specified",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed it in the first round, but I think that formatted text doesn't actually display in rule titles (as opposed with descriptions), so it's better to use double quotes rather than backticks here.

@guillaume-dequenne-sonarsource guillaume-dequenne-sonarsource changed the title Create rule S7191 Create rule S7191: PySpark "withColumns" should be preferred over "withColumn" when multiple columns are specified Feb 4, 2025
Copy link

sonarqube-next bot commented Feb 4, 2025

Quality Gate passed Quality Gate passed for 'rspec-tools'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

Copy link

sonarqube-next bot commented Feb 4, 2025

Quality Gate passed Quality Gate passed for 'rspec-frontend'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants