Skip to content

Commit

Permalink
added pyspark UDF example
Browse files Browse the repository at this point in the history
  • Loading branch information
mahmoudparsian committed Jan 1, 2022
1 parent 6a2d4ab commit c04f3b5
Show file tree
Hide file tree
Showing 32 changed files with 57 additions and 0 deletions.
Empty file modified tutorial/.DS_Store
100644 → 100755
Empty file.
Empty file modified tutorial/add-indices/add-indices.txt
100644 → 100755
Empty file.
Empty file modified tutorial/basic-average/basic-average.txt
100644 → 100755
Empty file.
Empty file modified tutorial/basic-filter/basic-filter.txt
100644 → 100755
Empty file.
Empty file modified tutorial/basic-join/basicjoin.txt
100644 → 100755
Empty file.
Empty file modified tutorial/basic-map/basic-map.txt
100644 → 100755
Empty file.
Empty file modified tutorial/basic-multiply/basic-multiply.txt
100644 → 100755
Empty file.
Empty file modified tutorial/basic-sort/sort-by-key.txt
100644 → 100755
Empty file.
Empty file modified tutorial/basic-sum/basic-sum.txt
100644 → 100755
Empty file.
Empty file modified tutorial/basic-union/basic-union.txt
100644 → 100755
Empty file.
Empty file modified tutorial/bigrams/bigrams.txt
100644 → 100755
Empty file.
Empty file modified tutorial/cartesian/cartesian.txt
100644 → 100755
Empty file.
Empty file modified tutorial/combine-by-key/README.md
100644 → 100755
Empty file.
Empty file modified tutorial/combine-by-key/combine-by-key.txt
100644 → 100755
Empty file.
Empty file.
Empty file modified tutorial/combine-by-key/spark-combineByKey.md
100644 → 100755
Empty file.
Empty file modified tutorial/combine-by-key/spark-combineByKey.txt
100644 → 100755
Empty file.
Empty file modified tutorial/combine-by-key/standard_deviation_by_combineByKey.md
100644 → 100755
Empty file.
Empty file modified tutorial/dna-basecount/README.md
100644 → 100755
Empty file.
Empty file modified tutorial/dna-basecount/dna-basecount.md
100644 → 100755
Empty file.
Empty file modified tutorial/dna-basecount/dna-basecount2.md
100644 → 100755
Empty file.
Empty file modified tutorial/dna-basecount/dna-basecount3.md
100644 → 100755
Empty file.
Empty file modified tutorial/dna-basecount/dna_seq.txt
100644 → 100755
Empty file.
Empty file modified tutorial/map-partitions/README.md
100644 → 100755
Empty file.
57 changes: 57 additions & 0 deletions tutorial/pyspark-udf/pyspark_udf_maptype.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
$SPARK_HOME/bin/pyspark
Python 3.8.9 (default, Nov 9 2021, 04:26:29)
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.0
/_/

Using Python version 3.8.9 (default, Nov 9 2021 04:26:29)
Spark context Web UI available at http://10.0.0.232:4040
Spark context available as 'sc' (master = local[*], app id = local-1641011178190).
SparkSession available as 'spark'.

>>> from pyspark.sql import Row

>>> data = spark.createDataFrame(
... [Row(zip_code='94087', city='Sunnyvale'),
... Row(zip_code='94088', city='Cupertino'),
... Row(zip_code='95055', city='Santa Clara'),
... Row(zip_code='95054', city='Palo Alto')])

>>>
>>> data.show()
+--------+-----------+
|zip_code| city|
+--------+-----------+
| 94087| Sunnyvale|
| 94088| Cupertino|
| 95055|Santa Clara|
| 95054| Palo Alto|
+--------+-----------+

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql import types as T
>>>
>>> @udf(T.MapType(T.StringType(), T.StringType()))
... def create_structure(zip_code, city):
... return {zip_code: city}
...
>>> data.withColumn('structure', create_structure(data.zip_code, data.city)).toJSON().collect()
[
'{"zip_code":"94087","city":"Sunnyvale","structure":{"94087":"Sunnyvale"}}',
'{"zip_code":"94088","city":"Cupertino","structure":{"94088":"Cupertino"}}',
'{"zip_code":"95055","city":"Santa Clara","structure":{"95055":"Santa Clara"}}',
'{"zip_code":"95054","city":"Palo Alto","structure":{"95054":"Palo Alto"}}'
]

>>> data.withColumn('structure', create_structure(data.zip_code, data.city)).show(truncate=False)
+--------+-----------+----------------------+
|zip_code|city |structure |
+--------+-----------+----------------------+
|94087 |Sunnyvale |{94087 -> Sunnyvale} |
|94088 |Cupertino |{94088 -> Cupertino} |
|95055 |Santa Clara|{95055 -> Santa Clara}|
|95054 |Palo Alto |{95054 -> Palo Alto} |
+--------+-----------+----------------------+
Empty file modified tutorial/split-function/README.md
100644 → 100755
Empty file.
Empty file modified tutorial/top-N/top-N.txt
100644 → 100755
Empty file.
Empty file modified tutorial/wordcount/README.md
100644 → 100755
Empty file.
Empty file modified tutorial/wordcount/word_count.py
100644 → 100755
Empty file.
Empty file modified tutorial/wordcount/word_count_ver2.py
100644 → 100755
Empty file.
Empty file modified tutorial/wordcount/wordcount-shorthand.txt
100644 → 100755
Empty file.
Empty file modified tutorial/wordcount/wordcount.txt
100644 → 100755
Empty file.

0 comments on commit c04f3b5

Please sign in to comment.