-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create rule S7191: PySpark "withColumns" should be preferred over "withColumn" when multiple columns are specified #4633
base: master
Are you sure you want to change the base?
Conversation
ee30e4c
to
e24d4d9
Compare
e24d4d9
to
5460d72
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments on the description, but looks good overall!
I also left a comment in the implementation ticket, where I think we can clarify more explicitly some use cases (e.g. loops).
rules/S7191/python/rule.adoc
Outdated
@@ -0,0 +1,65 @@ | |||
This rule identifies instances where multiple `withColumn` calls are used in succession to add or modify columns in a PySpark DataFrame. It suggests using `withColumns` instead, which is more efficient and concise when dealing with multiple columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Maybe we could drop the final "when dealing with multiple columns.", which can be a bit verbose?
rules/S7191/python/rule.adoc
Outdated
from pyspark.sql.functions import col | ||
spark = SparkSession.builder.getOrCreate() | ||
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"]) | ||
df_with_new_cols = df.withColumn("value_plus_1", col("value") + 1).withColumn("value_plus_2", col("value") + 2).withColumn("value_plus_3", col("value") + 3) # Non-compliant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: for consistency it should beNoncompliant
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum I copy this way of writing it from another PR but can't find which one. Anyway, I follow your suggestion.
rules/S7191/python/rule.adoc
Outdated
from pyspark.sql.functions import col | ||
spark = SparkSession.builder.getOrCreate() | ||
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"]) | ||
df_with_new_cols = df.withColumn("value_plus_1", col("value") + 1).withColumn("value_plus_2", col("value") + 2).withColumn("value_plus_3", col("value") + 3) # Non-compliant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the examples should be crystal clear, so we shouldn't mix Noncompliant/Compliant cases there.
rules/S7191/python/metadata.json
Outdated
"MAINTAINABILITY": "MEDIUM", | ||
"RELIABILITY": "MEDIUM" | ||
}, | ||
"attribute": "CONVENTIONAL" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about EFFICIENT
here since it's primarily about performances?
5460d72
to
af438da
Compare
…umns are specified
af438da
to
cb5c361
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a minor nitpick left on the rule title.
rules/S7191/python/metadata.json
Outdated
@@ -0,0 +1,26 @@ | |||
{ | |||
"title": "`withColumns` method should be preferred over `withColumn` when multiple columns are specified", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed it in the first round, but I think that formatted text doesn't actually display in rule titles (as opposed with descriptions), so it's better to use double quotes rather than backticks here.
|
|
You can preview this rule here (updated a few minutes after each push).
Review
A dedicated reviewer checked the rule description successfully for: