-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create rule S7196: Complex logic provided to PySpark withColumn
method should be refactored into a separate expression
#4642
base: master
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
{ | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
{ | ||
"title": "Complex logic provided to PySpark withColumn method should be refactored into a separate expression", | ||
"type": "CODE_SMELL", | ||
"status": "ready", | ||
"remediation": { | ||
"func": "Constant\/Issue", | ||
"constantCost": "5min" | ||
}, | ||
"tags": [ | ||
"data-science", | ||
"pyspark" | ||
], | ||
"defaultSeverity": "Medium", | ||
"ruleSpecification": "RSPEC-7196", | ||
"sqKey": "S7196", | ||
"scope": "All", | ||
"defaultQualityProfiles": ["Sonar way"], | ||
"quickfix": "unknown", | ||
"code": { | ||
"impacts": { | ||
"MAINTAINABILITY": "LOW", | ||
"RELIABILITY": "MEDIUM", | ||
}, | ||
"attribute": "FOCUSED" | ||
} | ||
} |
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,42 @@ | ||||||||
This rule raises an issue when complex functions or expressions are directly passed to withColumn | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here withColumn should be in backticks. |
||||||||
|
||||||||
== Why is this an issue? | ||||||||
|
||||||||
`withColumn` method is commonly used to add or modify columns in a DataFrame. When complex functions or expressions are directly passed to withColumn, it can lead to code that is difficult to read, understand, and maintain. Also, it will become easier to write unit tests for these functions, ensuring that the logic is correct and behaves as expected. This leads to more robust and reliable code. | ||||||||
|
||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I would leave out complex function. As it is ok to pass a single complex function to a withColumn.
Suggested change
|
||||||||
== How to fix it | ||||||||
|
||||||||
To fix this issue, complex logic within `withColumn` logic should be refactored into separate functions or variables before being passed to `withColumn` to improve code clarity and maintainability, | ||||||||
|
||||||||
=== Code examples | ||||||||
|
||||||||
==== Noncompliant code example | ||||||||
|
||||||||
[source,python,diff-id=1,diff-type=noncompliant] | ||||||||
---- | ||||||||
from pyspark.sql.functions import * | ||||||||
df = df.withColumn('Revenue', col('fare_amount').substr(0, 10).cast("float") + col('extra').substr(0, 5).cast("float") + col('tax').substr(0, 3).cast("float")) | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The |
||||||||
---- | ||||||||
|
||||||||
==== Compliant solution | ||||||||
|
||||||||
[source,python,diff-id=1,diff-type=compliant] | ||||||||
---- | ||||||||
from pyspark.sql.functions import * | ||||||||
def convert_to_float(col_str): | ||||||||
return col_str.substr(0, 10).cast("float") | ||||||||
|
||||||||
def compute_revenue(): | ||||||||
fare_amount = col('fare_amount').substr(0, 10).cast("float") | ||||||||
extra = col('extra').substr(0, 5).cast("float") | ||||||||
tax = col('tax').substr(0, 3).cast("float") | ||||||||
|
||||||||
return fare_amount + extra + tax | ||||||||
|
||||||||
df = df.withColumn("Revenue", compute_revenue()) # Compliant | ||||||||
---- | ||||||||
|
||||||||
== Resources | ||||||||
=== Documentation | ||||||||
|
||||||||
* PySpark withColumn Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html[pyspark.sql.DataFrame.withColumn] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think withColumn should be in quotes here (I think in the title we cannot use backticks, but I am not sure)