Using Python, given an input TXT file Artist_lists_small.txt
,
- find pairs that appear at least 50x most frequent
- sort it by frequency
- then output it to
outputFile.csv
The application can be modified:
- minimum appearance can be changed here:
https://github.com/Mr-Ming/High-Occurance-Pairs/blob/master/find_high_occurance_pairs.py#L8
- input file name can be changed here:
https://github.com/Mr-Ming/High-Occurance-Pairs/blob/master/find_high_occurance_pairs.py#L6
- output file name can be changed here:
https://github.com/Mr-Ming/High-Occurance-Pairs/blob/master/find_high_occurance_pairs.py#L7
- Follow the setup instruction to install
python
and cloning the repo - cd into the root of the program directory
cd High-Occurance-Pairs
- run
python find_high_occurance_pairs.py
- Follow the setup instruction to install python and cloning the repo
- cd into the root of the program directory
cd High-Occurance-Pairs
- run
python test/classes/permutation_test.py
Ran 10 tests in 0.001s
OK
- Let install
homebrew
, a package manager that will help us install python
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
- Install Python through brew
brew install python
- Clone the Repo
git clone https://github.com/Mr-Ming/High-Occurance-Pairs.git
Originally I was going to write this script using PHP that does the following
- Read the Input File, then iterate through everything, then generates the pairs through a double for-loop
- Save each pair as an
index
of an array and thevalue
of the array would be its # of occurance - Finally Save the result to the output file while occurance >= 50
But then using that method has a big O complexity of O(n^3) due to 3 nested-loops (1st-reading from input, 2nd & 3rd is generating the pair)
So I though this would be a better time to explore python and read more about it, because python is very good for performing complex computation as its widely use in Machine Learning
My research led me to 2 things
- itertool.combination (https://docs.python.org/2/library/itertools.html) -- Which is a very powerful python package that lets you quickly generate permutation of result in various length you like
- collections.defaultdict(int) (https://docs.python.org/2/library/collections.html#collections.defaultdict) -- Which is the best data structure to store exactly this specific type of data because I wanted a data structure type that allows me to --- (1) Store both the pair and # of occurance as 1 array --- (2) Have O(1) constant time for searching
In this scenario the big O complexity is O(n^2) due to 2 nested-loops because now it only cost 1 loop to generate the pair instead of 2 from the php case.
- input