-
Notifications
You must be signed in to change notification settings - Fork 30
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Create rule S7192: The "how" parameter should be specified when joini…
…ng two PySpark DataFrames
- Loading branch information
1 parent
1168630
commit d6ef956
Showing
3 changed files
with
91 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
{ | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
{ | ||
"title": "The \"how\" parameter should be specified when joining two PySpark DataFrames", | ||
"type": "CODE_SMELL", | ||
"status": "ready", | ||
"remediation": { | ||
"func": "Constant\/Issue", | ||
"constantCost": "5min" | ||
}, | ||
"tags": [ | ||
"data-science", | ||
"pyspark" | ||
], | ||
"defaultSeverity": "Major", | ||
"ruleSpecification": "RSPEC-7192", | ||
"sqKey": "S7192", | ||
"scope": "All", | ||
"defaultQualityProfiles": ["Sonar way"], | ||
"quickfix": "unknown", | ||
"code": { | ||
"impacts": { | ||
"RELIABILITY": "MEDIUM" | ||
}, | ||
"attribute": "CONVENTIONAL" | ||
} | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
This rule raises an issue when the parameter `how` is not provided when joining two PySpark `DataFrames`. | ||
|
||
== Why is this an issue? | ||
|
||
In PySpark, when joining two DataFrames, the `how` parameter specifies the type of join operation that will be performed. This parameter is crucial because it determines how the rows from the two DataFrames are combined based on the specified join condition. Common types of join operations include inner join, left join, right join, and outer join. | ||
|
||
If the `how` parameter is not provided, PySpark will default to an inner join. This can lead to unexpected results if the join condition is not met, as rows that do not satisfy the join condition will be excluded from the result. | ||
|
||
Specifying the `how` parameter explicitly is important because it defines the logic of how you want to combine the data from the two DataFrames. Depending on your data analysis needs, you might require different types of joins to get the desired results. For example, if you want to include all records from one DataFrame regardless of whether they have a match in the other, you would use a left or right outer join. If you only want records that have matches in both DataFrames, an inner join would be appropriate. | ||
|
||
== How to fix it | ||
|
||
To fix this issue, make sure to explicitly provide the `how` parameter when joining PySpark `DataFrames`. | ||
|
||
=== Code examples | ||
|
||
==== Noncompliant code example | ||
|
||
[source,python,diff-id=1,diff-type=noncompliant] | ||
---- | ||
# Joining DataFrames without specifying the 'how' parameter | ||
df1 = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ["id", "name"]) | ||
df2 = spark.createDataFrame([(1, 'HR'), (2, 'Finance')], ["id", "department"]) | ||
# Non-compliant: 'how' parameter is not specified | ||
joined_df = df1.join(df2, on="id") | ||
---- | ||
|
||
==== Compliant solution | ||
|
||
[source,python,diff-id=1,diff-type=compliant] | ||
---- | ||
# Joining DataFrames with the 'how' parameter specified | ||
df1 = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ["id", "name"]) | ||
df2 = spark.createDataFrame([(1, 'HR'), (2, 'Finance')], ["id", "department"]) | ||
# Compliant: 'how' parameter is specified | ||
joined_df = df1.join(df2, on="id", how="inner") | ||
---- | ||
|
||
== Resources | ||
=== Documentation | ||
* Spark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html[DataFrame.join] | ||
|
||
=== Related rules | ||
- S6735 - When using pandas.merge or pandas.join, the parameters on, how and validate should be provided | ||
|
||
ifdef::env-github,rspecator-view[] | ||
|
||
''' | ||
== Implementation Specification | ||
(visible only on this page) | ||
|
||
=== Message | ||
|
||
Specify the `how` parameter of this join. | ||
|
||
=== Quickfix | ||
|
||
Add the `how` parameter to the join operation, with the default value set to `"inner"`. | ||
|
||
include::../comments-and-links.adoc[] | ||
|
||
endif::env-github,rspecator-view[] |