-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpeaktraffic.rb
110 lines (100 loc) · 4.84 KB
/
peaktraffic.rb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
=begin
PEAK TRAFFIC
CHALLENGE DESCRIPTION:
Credits: This challenge is from the facebook engineering puzzles
Facebook is looking for ways to help users find out which friends they
interact with the most on the site. Towards that end, you have collected data
from your friends regarding who they interacted with on the site. Each piece
of data represents a desirable but one-way interaction between one user of
Facebook towards another user of Facebook. By finding groups of users who
regularly interact with one another, you hope to help users determine who
among their friends they spend the most time with online.
Being a popular user, you have collected a lot of data; so much that you
cannot possibly process it by hand. But being a programmer of no small repute,
you believe you can write a program to do the analysis for you. You are
interested in finding clusters of users within your data pool; in other words,
groups of users who interact among one another. A cluster is defined as a set
of at least three users, where every possible permutation of two users within
the cluster have both received and sent some kind of interaction between the
two.
With your program, you wish to analyze the collected data and find out all
clusters within.
INPUT SAMPLE:
Your program should accept as its first argument a path to a filename. The
input file consists of multiple lines of aggregated log data. Each line starts
with a date entry, whose constituent parts are separated by single white
spaces. The exact format of the date always follows the examples given below.
Following the date is a single tab, and then the email address of the user who
is performing the action. Following that email is another single tab and then
finally the email of the Facebook user who receives the action. The last line
of the file may or may not have a newline at its end.
Thu Dec 11 17:53:01 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:02 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:03 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:04 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:05 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:06 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:07 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:08 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:09 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:10 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:11 PST 2008 [email protected] [email protected]
Thu Dec 11 17:53:12 PST 2008 [email protected] [email protected]
Every line in the input file will follow this format, you are guaranteed that
your submission will run against well formed input files.
OUTPUT SAMPLE:
You must output all clusters detected from the input log file with size of at
least 3 members. A cluster is defined as N >= 3 users on Facebook that have
send and received actions between all possible permutations of any two members
within the cluster.
Your program should print to standard out, exactly one cluster per line. Each
cluster must have its member user emails in alphabetical order, separated by a
comma and a single space character each. There must not be a comma (or white
space) after the final email in the cluster; instead print a single new line
character at the end of every line. The clusters themselves must be printed to
standard out also in alphabetical order; treat each cluster as a whole string
for purposes of alphabetical comparisons. Do not sort the clusters by size or
any other criteria.
Finally, any cluster that is a sub-cluster (in other words, all users within
one cluster are also present in another) must be removed from the output. For
this case, your program should only print the largest super-cluster that
includes the other clusters. Your program must be fast, efficient, and able to
handle extremely large input files.
=end
lines = File.readlines(ARGV[0])
people = []
relations = Hash.new(0)
lines.each do |line|
words = line.chomp.split(" ")
this_interaction = []
words.each do |word|
if word.chars.include?("@")
this_interaction << word
people << word
end
end
relations[this_interaction] = 1
end
clusters = []
people.each do |person|
clusters << [person]
end
people.each do |person|
clusters.each do |cluster|
if !cluster.include?(person)
if cluster.all? { |friend| (relations[[person, friend]] == 1) && (relations[[friend, person]] == 1) }
cluster << person
end
end
end
end
output = []
clusters.each(&:sort!)
clusters.uniq.each do |cluster|
output << cluster.sort.join(", ") if cluster.size > 2
end
output.sort.each do |string|
puts string
end