pyspark-session-2021-04-12.txt
~/spark-3.1.1 $ ./bin/pyspark
Python 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
21/04/12 20:59:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
Using Python version 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018 02:44:43)
Spark context Web UI available at http://10.0.0.93:4040
Spark context available as 'sc' (master = local[*], app id = local-1618286379380).
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x7fd4cf98d438>
>>>
>>>
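Side note: the startup banner above shows the context is also exposed as 'sc', so 'sc' and 'spark.sparkContext' refer to the same SparkContext. A minimal sketch (not part of the recorded session) that could be run in the same shell:

# in the pyspark shell, 'sc' and 'spark.sparkContext' point to the same SparkContext
sc is spark.sparkContext      # expected: True
sc.version                    # expected: '3.1.1' for this session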
>>> numbers = [1, 2, 3, 6, 7, 8, 99, 10, -10, -30]
>>> numbers
[1, 2, 3, 6, 7, 8, 99, 10, -10, -30]
>>># create an RDD[Integer] from a collection
>>># RDD = Resilient Distributed Dataset
>>> rdd = spark.sparkContext.parallelize(numbers)
>>> rdd.collect()
[1, 2, 3, 6, 7, 8, 99, 10, -10, -30]
>>> rdd.count()
10
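parallelize() also accepts an optional number of partitions, and getNumPartitions() reports how the data is split across them. A minimal sketch (not part of the recorded session) using the same 'numbers' list:

# explicitly request 4 partitions when creating the RDD
rdd4 = spark.sparkContext.parallelize(numbers, 4)
rdd4.getNumPartitions()      # expected: 4
rdd4.collect()               # same 10 elements, regardless of partitioning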
>>># find sum of all numbers in rdd (an RDD[Integer])
>>> total = rdd.reduce(lambda x, y: x+y)
>>> total
96
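reduce() works with any commutative and associative function, not only addition. A short sketch (not from the recorded session) on the same rdd:

# maximum element of rdd
maximum = rdd.reduce(lambda x, y: x if x > y else y)    # expected: 99
# minimum element of rdd
minimum = rdd.reduce(lambda x, y: x if x < y else y)    # expected: -30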
>>># apply a filter: find all positive numbers
>>> positives = rdd.filter(lambda x : x > 0)
>>> positives.collect()
[1, 2, 3, 6, 7, 8, 99, 10]
>>>
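filter() keeps the elements for which the lambda returns True; the complementary predicate selects the rest. A small sketch (not part of the recorded session) on the same rdd:

# keep only the negative numbers
negatives = rdd.filter(lambda x: x < 0)
negatives.collect()      # expected: [-10, -30]
negatives.count()        # expected: 2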
>>># increment every element by 1000
>>> rdd2 = rdd.map(lambda x : x+1000)
>>> rdd2.collect()
[1001, 1002, 1003, 1006, 1007, 1008, 1099, 1010, 990, 970]
>>>
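map() emits exactly one output element per input element, while flatMap() may emit zero or more and flattens the results. A brief sketch (not from the recorded session) contrasting the two on the same rdd:

# map keeps a one-to-one shape: 10 input numbers -> 10 two-element lists
rdd.map(lambda x: [x, x + 1000]).count()        # expected: 10
# flatMap flattens the lists: 10 input numbers -> 20 output elements
rdd.flatMap(lambda x: [x, x + 1000]).count()    # expected: 20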
>>># create (key, value) pairs
>>> data = [("m1", 4), ("m1", 5), ("m2", 3), ("m2", 4), ("m2", 5), ("m3", 2), ("m3", 4)]
>>> data
[('m1', 4), ('m1', 5), ('m2', 3), ('m2', 4), ('m2', 5), ('m3', 2), ('m3', 4)]
>>>
>>> pairs = spark.sparkContext.parallelize(data)
>>> pairs.collect()
[('m1', 4), ('m1', 5), ('m2', 3), ('m2', 4), ('m2', 5), ('m3', 2), ('m3', 4)]
>>># keep pairs whose value (the rating) is greater than 3, i.e. ratings 4 and 5
>>># x[0] refers to key
>>># x[1] refers to value
>>> rating45 = pairs.filter(lambda x : x[1] > 3)
>>> rating45.collect()
[('m1', 4), ('m1', 5), ('m2', 4), ('m2', 5), ('m3', 4)]
>>>
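A common next step with (key, value) pairs is per-key aggregation with reduceByKey(). A minimal sketch (not part of the recorded session) that sums the values for each key of the pairs RDD above:

# sum the values per key
sums = pairs.reduceByKey(lambda a, b: a + b)
sums.collect()
# expected contents (output order may vary): ('m1', 9), ('m2', 12), ('m3', 6)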