
Commit

Create rule S7195: PySpark lit(None) should be used when populating empty columns.
joke1196 committed Jan 31, 2025
1 parent dd9687a commit fbb298d
Showing 2 changed files with 46 additions and 23 deletions.
9 changes: 5 additions & 4 deletions rules/S7195/python/metadata.json
@@ -1,12 +1,14 @@
{
"title": "FIXME",
"title": "PySpark lit(None) should be used when populating empty columns",
"type": "CODE_SMELL",
"status": "ready",
"remediation": {
"func": "Constant\/Issue",
"constantCost": "5min"
},
"tags": [
"data-science",
"pyspark"
],
"defaultSeverity": "Major",
"ruleSpecification": "RSPEC-7195",
@@ -16,9 +18,8 @@
"quickfix": "unknown",
"code": {
"impacts": {
"MAINTAINABILITY": "HIGH",
"RELIABILITY": "MEDIUM",
"SECURITY": "LOW"
"MAINTAINABILITY": "MEDIUM",
"RELIABILITY": "MEDIUM"
},
"attribute": "CONVENTIONAL"
}
60 changes: 41 additions & 19 deletions rules/S7195/python/rule.adoc
@@ -1,44 +1,66 @@
FIXME: add a description

// If you want to factorize the description uncomment the following line and create the file.
//include::../description.adoc[]
This rule raises an issue when a column of a PySpark DataFrame is populated with `lit('')`.

== Why is this an issue?

FIXME: remove the unused optional headers (that are commented out)
In PySpark, when populating a DataFrame column with empty or null values, it is recommended to use `lit(None)`.
Using literals such as `lit('')` as a placeholder for absent values can lead to data misinterpretation and inconsistencies.

//=== What is the potential impact?
`lit(None)` keeps the codebase clear and consistent by making it explicit that the column is intentionally populated with null values.
It also preserves the ability to use functions such as `isnull` or `isnotnull` to check for null values in the DataFrame.
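
As a minimal illustration (assuming the same three-row `df` that is built in the code examples below), the two placeholders behave differently when checked with `isnull`:

[source,python]
----
from pyspark.sql.functions import isnull, lit

# An empty string placeholder is not a null, so isnull does not flag it.
df.withColumn("middle_name", lit('')).filter(isnull("middle_name")).count()    # 0

# A lit(None) placeholder is a real null, so isnull detects every row.
df.withColumn("middle_name", lit(None)).filter(isnull("middle_name")).count()  # 3
----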

== How to fix it
//== How to fix it in FRAMEWORK NAME

To fix this issue, replace `lit('')` with `lit(None)` when populating a DataFrame column with empty or null values.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
FIXME
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName("Example").getOrCreate()
data = [
(1, "Alice"),
(2, "Bob"),
(3, "Charlie")
]
df = spark.createDataFrame(data, ["id", "name"])
df_with_empty_column = df.withColumn("middle_name", lit('')) # Noncompliant: usage of lit('') to represent en empty value
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
FIXME
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName("Example").getOrCreate()
data = [
(1, "Alice"),
(2, "Bob"),
(3, "Charlie")
]
df = spark.createDataFrame(data, ["id", "name"])
df_with_empty_column = df.withColumn("middle_name", lit(None)) # Compliant
----
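
Note that `lit(None)` on its own gives the new column PySpark's null ("void") type. If downstream code expects a concrete type, a cast can be added; a possible variant, assuming the column should be a nullable string:

[source,python]
----
from pyspark.sql.functions import lit

# The values stay null, but the column now has an explicit string type.
df_with_empty_column = df.withColumn("middle_name", lit(None).cast("string"))
----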

//=== How does this work?
== Resources
=== Documentation

//=== Pitfalls
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lit.html#pyspark-sql-functions-lit[pyspark-sql-functions-lit]
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.isnull.html#pyspark-sql-functions-isnull[pyspark-sql-functions-isnull]

//=== Going the extra mile
=== Standards

* Palantir PySpark Style Guide - https://github.com/palantir/pyspark-style-guide?tab=readme-ov-file#empty-columns[empty-columns]

//== Resources
//=== Documentation
//=== Articles & blog posts
//=== Conference presentations
//=== Standards
//=== External coding guidelines
//=== Benchmarks
