-
Notifications
You must be signed in to change notification settings - Fork 474
/
Copy pathpyspark-session-2021-04-29-min-max-avg.txt
executable file
·57 lines (50 loc) · 1.87 KB
/
pyspark-session-2021-04-29-min-max-avg.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
Given billions of numbers, find (minimum, maximum, average)
for all numbers.
I provide 2 solutions: one using tuple of 4: (minimum, maximum, sum, count)
another solution using tuple of 3: (minimum, maximum, sum)
$ ./bin/pyspark
Python 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43)
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.1.1
/_/
Using Python version 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018 02:44:43)
Spark context Web UI available at http://10.0.0.93:4040
Spark context available as 'sc' (master = local[*], app id = local-1619727491830).
SparkSession available as 'spark'.
>>>
>>>
>>> nums = [1, 2, 3, -1, -2, -3, 4, 5, 6, 7, 8]
>>>
>>># Let rd denote billions of numbers
>>> rdd = spark.sparkContext.parallelize(nums)
>>> rdd.collect()
[1, 2, 3, -1, -2, -3, 4, 5, 6, 7, 8]
>>>
>>># Create tuple of 4 elements as: (minimum, maximum, sum, count)
>>> tuple4 = rdd.map(lambda n: (n, n, n, 1))
>>> tuple4.collect()
[(1, 1, 1, 1), (2, 2, 2, 1), (3, 3, 3, 1), (-1, -1, -1, 1), (-2, -2, -2, 1), (-3, -3, -3, 1), (4, 4, 4, 1), (5, 5, 5, 1), (6, 6, 6, 1), (7, 7, 7, 1), (8, 8, 8, 1)]
>>># Perform a reduction on tuple4
>>> min_max_sum_count = tuple4.reduce(lambda x, y: (min(x[0], y[0]), max(x[1],y[1]), x[2]+y[2], x[3]+y[3]) )
>>>
>>># Now, min_max_sum_count represents (minimum, maximum, sum, count)
>>> min_max_sum_count
(-3, 8, 30, 11)
>>> final = (min_max_sum_count[0], min_max_sum_count[1], min_max_sum_count[2] / min_max_sum_count[3])
>>> final
(-3, 8, 2.727272727272727)
>>>
>>># Solution using tuple of 3
>>> tuple3 = rdd.map(lambda n: (n, n, n))
>>> min_max_sum = tuple3.reduce(lambda x, y: (min(x[0], y[0]), max(x[1],y[1]), x[2]+y[2]) )
>>> min_max_sum
(-3, 8, 30)
>>> N = rdd.count()
>>> N
11
>>> final = (min_max_sum[0], min_max_sum[1], min_max_sum[2] / N)
>>> final
(-3, 8, 2.727272727272727)