
Create rule S7196: Complex logic provided to PySpark withColumn method should be refactored into a separate expression #4642

Draft
wants to merge 3 commits into master

Conversation

github-actions[bot] (Contributor)

You can preview this rule here (updated a few minutes after each push).

Review

A dedicated reviewer checked the rule description successfully for:

  • logical errors and incorrect information
  • information gaps and missing content
  • text style and tone
  • PR summary and labels follow the guidelines

@thomas-serre-sonarsource changed the title from "Create rule S7196" to "Create rule S7196: Complex logic provided to PySpark withColumn method should be refactored into a separate expression" on Jan 31, 2025
@joke1196 (Contributor) left a comment


This looks good! I have added a small comment on formatting and maybe a bit of clarification on what the remediation would be. I also think we could add `when` and `filter`, as you mentioned in the ticket.

Comment on lines 5 to 6
`withColumn` method is commonly used to add or modify columns in a DataFrame. When complex functions or expressions are directly passed to withColumn, it can lead to code that is difficult to read, understand, and maintain. Also, it will become easier to write unit tests for these functions, ensuring that the logic is correct and behaves as expected. This leads to more robust and reliable code.


I think I would leave out "complex function", as it is okay to pass a single complex function to `withColumn`.
I would also add that refactoring these expressions is a good thing.
This is just a suggestion, feel free to rework it.

Suggested change
`withColumn` method is commonly used to add or modify columns in a DataFrame. When complex functions or expressions are directly passed to withColumn, it can lead to code that is difficult to read, understand, and maintain. Also, it will become easier to write unit tests for these functions, ensuring that the logic is correct and behaves as expected. This leads to more robust and reliable code.
`withColumn` method is commonly used to add or modify columns in a DataFrame. When long or complex expressions are directly passed to `withColumn`, it can lead to code that is difficult to read, understand, and maintain. Refactoring such expressions into functions or variables will help with readability. Also, it will become easier to write unit tests for these functions, ensuring that the logic is correct and behaves as expected. This leads to more robust and reliable code.
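To make the unit-testing benefit mentioned in the suggested wording concrete, here is a minimal, hypothetical sketch (the function and column names are invented for illustration and are not part of the rule):

```python
from pyspark.sql.functions import col

def fare_in_dollars():
    # Extracted expression: easy to name, reuse, and exercise in isolation.
    return col("fare_cents").cast("float") / 100

# Assuming an existing DataFrame `df` with a `fare_cents` column.
df = df.withColumn("fare_dollars", fare_in_dollars())

# A unit test can now check the expression against a tiny DataFrame, e.g.:
# result = spark.createDataFrame([(150,)], ["fare_cents"]).select(fare_in_dollars())
# assert result.first()[0] == 1.5
```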

@@ -0,0 +1,42 @@
This rule raises an issue when complex functions or expressions are directly passed to withColumn

Here `withColumn` should be in backticks.

@@ -0,0 +1,26 @@
{
"title": "Complex logic provided to PySpark withColumn method should be refactored into a separate expression",

I think `withColumn` should be in quotes here (I think we cannot use backticks in the title, but I am not sure).

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql.functions import *
df = df.withColumn('Revenue', col('fare_amount').substr(0, 10).cast("float") + col('extra').substr(0, 5).cast("float") + col('tax').substr(0, 3).cast("float"))

The `# NonCompliant` comment is missing here. I don't think this is mandatory; it is just a convention that we have used so far.
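For reference, one possible compliant counterpart of the snippet above, sketched with an intermediate variable (the rule's actual compliant example may differ):

```python
from pyspark.sql.functions import col

# The long arithmetic is bound to a named variable first, so the
# withColumn call itself stays short and readable.
revenue = (
    col("fare_amount").substr(0, 10).cast("float")
    + col("extra").substr(0, 5).cast("float")
    + col("tax").substr(0, 3).cast("float")
)

df = df.withColumn("Revenue", revenue)
```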

@joke1196 (Contributor) left a comment

This looks super good. The example really highlights the strength of this rule! The only thing I would add, which I forgot yesterday, is that we should also detect `where`, which is an alias for `filter`, but it is a small detail. We should at least put it in the implementation ticket.
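To make the `where`/`filter` point concrete for the implementation ticket, a small sketch (the condition is invented for illustration):

```python
from pyspark.sql.functions import col

# where() is an alias for filter() on DataFrame, so both calls are
# equivalent and a complex condition passed to either should be detected.
df.filter(col("fare_amount").substr(0, 10).cast("float") > 10)
df.where(col("fare_amount").substr(0, 10).cast("float") > 10)
```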


sonarqube-next bot commented Feb 4, 2025

Quality Gate passed for 'rspec-tools'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube


sonarqube-next bot commented Feb 4, 2025

Quality Gate passed for 'rspec-frontend'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube
