pyspark-session-2021-04-12.txt
~/spark-3.1.1 $ ./bin/pyspark
Python 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
21/04/12 20:59:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
Using Python version 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018 02:44:43)
Spark context Web UI available at http://10.0.0.93:4040
Spark context available as 'sc' (master = local[*], app id = local-1618286379380).
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x7fd4cf98d438>
>>>
>>>
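Side note: the startup banner above shows the context is also exposed as 'sc', so 'sc' and 'spark.sparkContext' refer to the same SparkContext. A minimal sketch (not part of the recorded session) that could be run in the same shell:

# in the pyspark shell, 'sc' and 'spark.sparkContext' point to the same SparkContext
sc is spark.sparkContext      # expected: True
sc.version                    # expected: '3.1.1' for this session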
>>> numbers = [1, 2, 3, 6, 7, 8, 99, 10, -10, -30]
>>> numbers
[1, 2, 3, 6, 7, 8, 99, 10, -10, -30]
>>># create an RDD[Integer] from a collection
>>># RDD = Resilient Distributed Dataset
>>> rdd = spark.sparkContext.parallelize(numbers)
>>> rdd.collect()
[1, 2, 3, 6, 7, 8, 99, 10, -10, -30]
>>> rdd.count()
10
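parallelize() also accepts an optional number of partitions, and getNumPartitions() reports how the data is split across them. A minimal sketch (not part of the recorded session) using the same 'numbers' list:

# explicitly request 4 partitions when creating the RDD
rdd4 = spark.sparkContext.parallelize(numbers, 4)
rdd4.getNumPartitions()      # expected: 4
rdd4.collect()               # same 10 elements, regardless of partitioning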
>>># find sum of all numbers in rdd (an RDD[Integer])
>>> total = rdd.reduce(lambda x, y: x+y)
>>> total
96
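reduce() works with any commutative and associative function, not only addition. A short sketch (not from the recorded session) on the same rdd:

# maximum element of rdd
maximum = rdd.reduce(lambda x, y: x if x > y else y)    # expected: 99
# minimum element of rdd
minimum = rdd.reduce(lambda x, y: x if x < y else y)    # expected: -30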
>>># apply a filter: find all positive numbers
>>> positives = rdd.filter(lambda x : x > 0)
>>> positives.collect()
[1, 2, 3, 6, 7, 8, 99, 10]
>>>
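filter() keeps the elements for which the lambda returns True; the complementary predicate selects the rest. A small sketch (not part of the recorded session) on the same rdd:

# keep only the negative numbers
negatives = rdd.filter(lambda x: x < 0)
negatives.collect()      # expected: [-10, -30]
negatives.count()        # expected: 2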
>>># increment every element by 1000
>>> rdd2 = rdd.map(lambda x : x+1000)
>>> rdd2.collect()
[1001, 1002, 1003, 1006, 1007, 1008, 1099, 1010, 990, 970]
>>>
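map() emits exactly one output element per input element, while flatMap() may emit zero or more and flattens the results. A brief sketch (not from the recorded session) contrasting the two on the same rdd:

# map keeps a one-to-one shape: 10 input numbers -> 10 two-element lists
rdd.map(lambda x: [x, x + 1000]).count()        # expected: 10
# flatMap flattens the lists: 10 input numbers -> 20 output elements
rdd.flatMap(lambda x: [x, x + 1000]).count()    # expected: 20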
>>># create (key, value) pairs
>>> data = [("m1", 4), ("m1", 5), ("m2", 3), ("m2", 4), ("m2", 5), ("m3", 2), ("m3", 4)]
>>> data
[('m1', 4), ('m1', 5), ('m2', 3), ('m2', 4), ('m2', 5), ('m3', 2), ('m3', 4)]
>>>
>>> pairs = spark.sparkContext.parallelize(data)
>>> pairs.collect()
[('m1', 4), ('m1', 5), ('m2', 3), ('m2', 4), ('m2', 5), ('m3', 2), ('m3', 4)]
>>># keep pairs whose value (the rating) is greater than 3, i.e. ratings 4 and 5
>>># x[0] refers to key
>>># x[1] refers to value
>>> rating45 = pairs.filter(lambda x : x[1] > 3)
>>> rating45.collect()
[('m1', 4), ('m1', 5), ('m2', 4), ('m2', 5), ('m3', 4)]
>>>
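A common next step with (key, value) pairs is per-key aggregation with reduceByKey(). A minimal sketch (not part of the recorded session) that sums the values for each key of the pairs RDD above:

# sum the values per key
sums = pairs.reduceByKey(lambda a, b: a + b)
sums.collect()
# expected contents (output order may vary): ('m1', 9), ('m2', 12), ('m3', 6)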