
Commit

Create rule S7195: PySpark lit(None) should be used when populating empty columns.
joke1196 committed Jan 31, 2025
1 parent dd9687a commit fbb298d
Showing 2 changed files with 46 additions and 23 deletions.
9 changes: 5 additions & 4 deletions rules/S7195/python/metadata.json
@@ -1,12 +1,14 @@
{
"title": "FIXME",
"title": "PySpark lit(None) should be used when populating empty columns",
"type": "CODE_SMELL",
"status": "ready",
"remediation": {
"func": "Constant\/Issue",
"constantCost": "5min"
},
"tags": [
"data-science",
"pyspark"
],
"defaultSeverity": "Major",
"ruleSpecification": "RSPEC-7195",
@@ -16,9 +18,8 @@
"quickfix": "unknown",
"code": {
"impacts": {
"MAINTAINABILITY": "HIGH",
"RELIABILITY": "MEDIUM",
"SECURITY": "LOW"
"MAINTAINABILITY": "MEDIUM",
"RELIABILITY": "MEDIUM"
},
"attribute": "CONVENTIONAL"
}
60 changes: 41 additions & 19 deletions rules/S7195/python/rule.adoc
@@ -1,44 +1,66 @@
FIXME: add a description

// If you want to factorize the description uncomment the following line and create the file.
//include::../description.adoc[]
This rule raises an issue when a column of a PySpark DataFrame is populated with `lit('')`.

== Why is this an issue?

FIXME: remove the unused optional headers (that are commented out)
In PySpark, when populating a DataFrame column with empty or null values, it is recommended to use `lit(None)`.
Using literals such as `lit('')` as a placeholder for absent values can lead to data misinterpretation and inconsistencies.

//=== What is the potential impact?
`lit(None)` keeps the codebase clear and consistent by making it explicit that the column is intentionally populated with null values.
It also preserves the ability to use functions such as `isnull` or `isnotnull` to check for null values in the DataFrame.
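
As a minimal illustration (assuming the same three-row `df` that is built in the code examples below), the two placeholders behave differently when checked with `isnull`:

[source,python]
----
from pyspark.sql.functions import isnull, lit

# An empty string placeholder is not a null, so isnull does not flag it.
df.withColumn("middle_name", lit('')).filter(isnull("middle_name")).count()    # 0

# A lit(None) placeholder is a real null, so isnull detects every row.
df.withColumn("middle_name", lit(None)).filter(isnull("middle_name")).count()  # 3
----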

== How to fix it
//== How to fix it in FRAMEWORK NAME

To fix this issue, replace `lit('')` with `lit(None)` when populating a DataFrame column with empty or null values.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
FIXME
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName("Example").getOrCreate()
data = [
(1, "Alice"),
(2, "Bob"),
(3, "Charlie")
]
df = spark.createDataFrame(data, ["id", "name"])
df_with_empty_column = df.withColumn("middle_name", lit('')) # Noncompliant: usage of lit('') to represent en empty value
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
FIXME
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName("Example").getOrCreate()
data = [
(1, "Alice"),
(2, "Bob"),
(3, "Charlie")
]
df = spark.createDataFrame(data, ["id", "name"])
df_with_empty_column = df.withColumn("middle_name", lit(None)) # Compliant
----
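
Note that `lit(None)` on its own gives the new column PySpark's null ("void") type. If downstream code expects a concrete type, a cast can be added; a possible variant, assuming the column should be a nullable string:

[source,python]
----
from pyspark.sql.functions import lit

# The values stay null, but the column now has an explicit string type.
df_with_empty_column = df.withColumn("middle_name", lit(None).cast("string"))
----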

//=== How does this work?
== Resources
=== Documentation

//=== Pitfalls
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lit.html#pyspark-sql-functions-lit[pyspark-sql-functions-lit]
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.isnull.html#pyspark-sql-functions-isnull[pyspark-sql-functions-isnull]

//=== Going the extra mile
=== Standards

* Palantir PySpark Style Guide - https://github.com/palantir/pyspark-style-guide?tab=readme-ov-file#empty-columns[empty-columns]

//== Resources
//=== Documentation
//=== Articles & blog posts
//=== Conference presentations
//=== Standards
//=== External coding guidelines
//=== Benchmarks
