- Course Outline
- Week 0 - GitHub
- Week 1 - Hands-on the Terminal
- Week 2 - Use Python as a daily tool
- Week 3 - Python for anything
- Week 4 - JSON and API
- Week 5 - Web Scraping Basics
- Week 6 - Advanced Web Scraping
- Week 7 - Table manipulation and 1-D analysis
- Week 8 - Visualisation, presentation and reproducible reports
- Week 9 - Text analysis
- Week 10 - Time series
- Week 11 - Graph theory and social network analysis
- Week 12 - 2D analysis
- Week 13 - High-dimensional analysis
- Week 14 - Clustering
- Week 15 - Classification
- Week 16 - Regression
- Week 17 - Recommender System
- Open topics
Objective:
- Can use GitHub for resource hosting, project management and as a discussion forum.
- Can use GitHub Desktop to sync local repos with remote repos.
- Can use `gh-pages` to host static web pages as one's portfolio.
Objective:
- Able to navigate the file system in Terminal, using the shell
- Create the first Python script and execute it

Mac: `Cmd+Space` to open Spotlight; search "Terminal" to open the terminal

Shell commands:
- `cd` to switch working folder
  - Paths are separated by `/`
  - Special paths: `.`, `..`, `-`, `~`
- `ls` to list files/folders in the current folder
- `pwd` to check the current working folder
- `ls`/`pwd` are your friends; type them often to make sure where you are
- `touch` to create an empty new file; `mkdir` to create a new directory
- `python` to execute Python scripts (usually in `.py`, but not necessarily)
- Format of shell commands: `<command-name> <arg1> <arg2> ...` (space-separated arguments)
Challenge:
- Write a Python script to output "Good evening" in the Terminal.
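One possible answer to this challenge is a one-line script; the file name `good_evening.py` is just a suggestion.

```python
# good_evening.py -- run it from the Terminal with: python good_evening.py
print("Good evening")
```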
Objective:
- Can use Python as a daily tool -- at least as a powerful calculator
- Become comfortable with the Python interpreter -- the REPL pattern (Read-Evaluate-Print Loop)
- Can use `help` to get inline documentation on new modules and functions

Python language introduction:
- Variables and assignment
- Basic data types: `int`, `float`, `str`, `bool`
- Arithmetic: `+`, `-`, `*`, `/`, `//`, `%`, `**`
- Math modules: `math`, `numpy` (may need `pip`)
- Use functions and modules: `import` (and `from ... import ...`)
  - `.` notation to reference a member variable/method
  - `()` notation to call a function
- Common modules and functions:
  - `str.*` functions
  - String templating 1: `str.format`
  - String templating 2: `format_str % (var1, var2, ...)`
  - Random numbers: `random`
  - Scientific computing: `numpy`, `scipy`
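The two string-templating styles can be compared side by side; the name and score below are made-up values for illustration.

```python
name, score = "Alice", 95.5
s1 = "{} scored {:.1f}".format(name, score)   # style 1: str.format
s2 = "%s scored %.1f" % (name, score)         # style 2: % formatting
print(s1)
print(s2)   # both lines read: Alice scored 95.5
```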
Challenge:
- Build a mortgage calculator -- given principal `P`, interest rate `r` and loan period `n`, calculate the amortised monthly payment `A`
- Calculate the area of a circle given its radius `r`
- Given the length of the hypotenuse of a right triangle, calculate the length of its legs. You may want to get values like $$\sin(\frac{\pi}{6})$$ via `numpy.pi` and `numpy.sin`
- Generate 10 random numbers. (It is OK to run your script 10 times.)
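A sketch of the mortgage challenge using the standard amortisation formula `A = P*r*(1+r)^n / ((1+r)^n - 1)`, where `r` is the *monthly* rate and `n` the number of monthly payments; the figures below are made up.

```python
P = 5_000_000            # principal
r = 0.03 / 12            # 3% yearly rate -> monthly rate
n = 30 * 12              # 30-year loan, paid monthly
A = P * r * (1 + r) ** n / ((1 + r) ** n - 1)
print("Monthly payment: {:.2f}".format(A))
```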
References:
- Chapter 1, 2, 3 of the official Python 3 tutorial
- Python format string: https://pyformat.info/
Objective:
- Master the composite data types `[]` and `{}` in Python
- Master the control logic in Python, especially `if` and `for`
- Further understand the different roles of the text editor and the interpreter. Be comfortable writing batch code in a `.py` file and executing it in the shell environment.
- [O] Understand Python engineering

Python language:
- `help`
- `bool` and comparisons; `str` comparison and `int` comparison
- Composite data types: `list` (`[]`), `dict` (`{}`)
- Control flow: `for`, `while`, `if`, `try..except`
- Function, class, module: `def`, `class`, `*.py`; `from`, `import`

Workflow:
- Python interpreter
- pip: `pip3` for `python3`; the `--user` option on shared computers
Challenge:
- Distances among cities:
  - Calculate the "straight line" distance on the earth's surface from several source cities to Hong Kong. The source cities: New York, Vancouver, Stockholm, Buenos Aires, Perth. For each source city, print one line containing the name of the city and the distance. "Great-circle distance" is the academic name you can use to search for the formula.
  - Use `list` and a `for` loop to handle multiple cities
  - Use functions to increase reusability
- Divide HW1 groups randomly: (case contribution)
  - Get the list of student IDs from the lecturer
  - Generate the grouping randomly
- Solve the "media business model" calculator.
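The distance challenge can be sketched with the haversine formula; the city coordinates below are approximate values filled in for illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

HONG_KONG = (22.32, 114.17)
sources = {
    'New York': (40.71, -74.01),
    'Vancouver': (49.28, -123.12),
    'Stockholm': (59.33, 18.07),
    'Buenos Aires': (-34.60, -58.38),
    'Perth': (-31.95, 115.86),
}
for city, (lat, lon) in sources.items():
    print(city, round(great_circle_km(lat, lon, *HONG_KONG)))
```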
References:
- Chapter 4, 5, 6 of official Python 3 tutorial
Objective:
- Learn to use Jupyter notebook. All demos from this week on will be conducted in Jupyter notebook.
- Understand API/JSON and can retrieve data from online databases (Twitter, GitHub, Weibo, Douban, ...)
- Understand basic file formats like `json` and `csv`.
- Be able to comfortably navigate through compound structures like `{}` and `[]`.
- Be able to comfortably use (multiple layers of) for-loops to re-format data.
- Be able to use serialisers to handle input/output to files.

The brief of Application Programming Interface (API):
- Operates in client-and-server mode.
- The client does not have to download the full volume of data from the server; it only uses the data on demand.
- The server can handle intensive computations that are not available on the client.
- The server can send updated data upon request.

Modules:
- Handle HTTP request/response: `requests`
- Serialiser: `json` (`.loads`, `.dumps`) and `csv`
Challenges:
- Taiwan had an earthquake in early Feb. Let's discuss this issue:
  - Search for the earthquake instances around Taiwan in the recent 100 years and analyse the occurrences of earthquakes. You can refer to the same database used here. Check out the API description. The `count` and `query` APIs are useful.
  - Search on Twitter and collect users' discussions about this topic. See if there are any findings. You can approach it from the human interface here (hard mode) or use the python-twitter module (you need to register as a developer and obtain an API key).
- Retrieve and analyse recent movies. Douban's API will be helpful here.
- Use the Google Map API to retrieve geo-locations and canonical names: e.g. get the location of HKBU
- Look up real estate properties on the HK gov open data portal, e.g. the dataset page, the API result
- blockchain.info provides a set of APIs for one to retrieve information related to bitcoin transactions. Pick one wallet address, check its UTXO sets and sum up the values to get the total balance in this wallet.
- A free cryptocurrency API for you to retrieve and study historical exchange rates.
- Implement a basic version of the first automated writer -- QuakeBot from the LA Times
Exercise:
- Request a certain API to acquire information
- Convert JSON to CSV in Python
- Convert CSV to JSON in Python
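The JSON/CSV conversion exercises can be sketched with the two stdlib serialisers; the records and the file name below are made up for illustration.

```python
import csv
import json

records = json.loads('[{"city": "Hualien", "magnitude": 6.4},'
                     ' {"city": "Tainan", "magnitude": 5.6}]')

# JSON (list of dicts) -> CSV
with open('quakes.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['city', 'magnitude'])
    writer.writeheader()
    writer.writerows(records)

# CSV -> JSON; note csv reads every field back as str
with open('quakes.csv') as f:
    rows = list(csv.DictReader(f))
print(json.dumps(rows))
```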
Further readings:
- Use `beautifulsoup` to scrape Twitter timeline content from the Wayback Machine. A good example of investigative journalism, by William Lyon from neo4j.
Objective:
- Understand the basics of the HTML language, the HTTP protocol, web servers and the Internet architecture
- Able to scrape static web pages and turn them into CSV files

Tools: (step-by-step reference)
- Virtualenv -- create isolated environments to avoid projects cluttering each other
- Jupyter notebook -- web-based REPL; ready to distribute; all-in-one presentation

Modules:
- Handle HTTP request/response: `requests`
- Parse web page: `lxml`, Beautiful Soup, `HTMLParser`, `help(str)`
  - Useful string functions: `strip()`, `split()`, `find()`, `replace()`, `str[begin:end]`
- Serialiser: `csv`, `json`
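Parsing a page into rows ready for CSV can be sketched with Beautiful Soup on a canned HTML snippet (a real scraper would first fetch the page with `requests`); the table content is made up.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>Movie A</td><td>8.1</td></tr>
  <tr><td>Movie B</td><td>7.5</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
movies = []
for tr in soup.find_all('tr'):
    title, rating = [td.get_text() for td in tr.find_all('td')]
    movies.append({'title': title, 'rating': rating})
print(movies)   # a list of dicts, ready for csv.DictWriter
```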
Challenges: (save the results to `*.csv`)
- Use `lxml`/`bs4` together with `requests`:
  - Collect a table for the NSFC/RGC joint research fund. A full table can be found here. You are also welcome to collect data of other funding schemes.
  - Collect all the faculty's information from the site and make a contact book.
  - Collect the movie list and their ratings from IMDB.
- Bonus:
  - Collect the tweets from a celebrity like this post. You can search "python twitter" for many useful modules online.
References:
- Allison Parrish's tutorial of scraper in summer 2017.

Further reading:
- Study `urllib` as an alternative to `requests`
- Study Regular Expression and the `re` library in Python
- See how reproducibility is improved with Jupyter notebook and other tools (not only Python).
Objective:
- Bypass anti-crawler by modifying user-agent
- Handle glitches: encoding, pagination, ...
- Handle dynamic page with headless browser
- Handle login with headless browser
- Scrape social networks
- Case studies on different websites
- Further strengthen the list-of-dict data type; organise multi-layer loops / item-based parsing logic.
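Modifying the user-agent can be sketched as below; the UA string is a typical desktop Chrome one and the URL is a placeholder, so the actual request is left commented out.

```python
import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0 Safari/537.36')
}
# Send the header with every request so the server sees a "browser":
# r = requests.get('https://example.com/page', headers=headers)
# print(r.status_code)
```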
Cases:
- https://github.com/hupili/python-for-data-and-media-communication/tree/master/scraper-examples
- https://github.com/data-projects-archive
Objective:
- Master the schema of "data-driven storytelling": the crowd (pattern) and the outlier (anomaly)
- Can efficiently manipulate structured, table-formatted datasets
- Use `pandas` for basic calculation and plotting
Modules:
- `pandas`
- `seaborn`
- `matplotlib`
Statistics:
- mean, median, percentile
- min, max
- variance
- histogram
- sort
- central tendency and spread of data
- scatter plot and correlation
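The statistics above can be computed in a few lines of pandas; the tiny table below is made up (the real exercise uses `openrice.csv`).

```python
import pandas as pd

df = pd.DataFrame({
    'restaurant': ['A', 'B', 'C', 'D'],
    'price': [120, 60, 200, 80],
    'rating': [4.5, 3.8, 4.9, 4.0],
})
print(df['price'].mean())            # central tendency: 115.0
print(df['price'].median())          # 100.0
print(df['price'].quantile(0.9))     # percentile
print(df['price'].min(), df['price'].max(), df['price'].var())
df = df.sort_values('rating', ascending=False)
# df['price'].hist() and df.plot.scatter(x='price', y='rating')
# draw the histogram and the scatter plot.
```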
Datasets to work on:
- openrice.csv contributed by group 1
References:
- First two chapters (i.e. before "3D") of the article The Art of Effective Visualization of Multi-dimensional Data by Dipanjan Sarkar
- Exercise numpy on ShiYanLou
- Exercise pandas on ShiYanLou
Additional notes:
- You need to finish Dataprep before analysis. That is, we start from structured data. There is no single common schema for preparing structured and cleaned data; see the pointers in Dataprep for your own reading.
Objective:
- Understand the theory and common tricks of visualisation.
- Can plot charts using various visualisation libraries.
- Can plot maps.
- Understand the concept of "reproducibility" and can use a GitHub repo plus Jupyter notebook to create such reports.

Libraries:
- `py-plotly`
- `pyecharts`
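As a library-agnostic sketch of a reproducible chart, here is the same idea with matplotlib (used elsewhere in this course; `py-plotly` and `pyecharts` offer similar chart APIs). The data and file name are made up.

```python
import matplotlib
matplotlib.use('Agg')                # off-screen backend, e.g. for scripts
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar']
visits = [120, 150, 90]
plt.bar(months, visits)
plt.title('Monthly visits (made-up data)')
plt.savefig('visits.png')            # commit the notebook + png for reproducibility
```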
Objective:
- Further strengthen the proficiency with pandas: DataFrame and Series
- Learn to plot and adjust charts with `matplotlib`
- Master basic string operations
- Understand some major text mining models and be able to apply algorithms from 3rd-party libraries.

Modules & topics:
- `str` - basic string processing: `.split()`, `in`, `.find()`; `%s` format string; `''.format()` function
- `collections.Counter` for word frequency calculation
- `jieba` - the most widely used Chinese word segmentation package
- (optional) `re` - Regular Expression (regex) is the Swiss-army knife for text pattern matching
- (optional) `nltk` - contains common routines for text analysis
- (optional) `gensim` - topic mining package. It also contains the `Word2Vec` routine.
- (optional) Sentiment analysis - construct a classifier using `sklearn` or use an API like text-processing. `TextBlob` is also useful and applied in group 2's work.
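Word frequency counting can be sketched with `collections.Counter`; the sentence below is made up.

```python
from collections import Counter

text = "the data drive the story and the data drive the chart"
freq = Counter(text.split())
print(freq.most_common(3))
# For Chinese text, segment first, e.g. Counter(jieba.cut(text))
```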
Related cases:
- Quartz's analysis of New York Times's column "Modern Love"
- Prof. Qian Gang's famous analysis of texts in political communication.

References:
- Construct a Naive Bayes based classifier for sentiment analysis. Read here

Datasets to work on:
- NBC Russian Troll on Twitter dataset -- the 200,000 deleted Twitter messages posted by Russian troll accounts. Around 50MB, in CSV format.
- Hillary Clinton email archive from WikiLeaks. There are the plain text and parsed data, but you may need to run a scraper to get the data first.
Objective:
- Understand the principle of timestamps and datetime formats
- Master basic computation on datetime values
- Understand periodical analysis (daily, weekly, monthly, seasonal, etc)
- Can handle timezone conversion

Modules:
- `datetime`
- `dtparser`
- `pandas`
  - basic visualisation: `.plot`
  - zoom in/out: `.resample`, `.aggregate`
- `seaborn`
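Parsing a timestamp and converting timezones can be sketched with the stdlib `datetime` module; the millisecond value below is made up.

```python
from datetime import datetime, timezone, timedelta

ts_ms = 1517824800000                              # a millisecond timestamp
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
hk = dt.astimezone(timezone(timedelta(hours=8)))   # convert to UTC+8 (HK)
print(dt.isoformat())
print(hk.isoformat())
print(hk - dt)                                     # same instant: 0:00:00
```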
References:
- Timestamps usually come in units of milliseconds (1/1000 of a second). [An example](https://github.com/dmep2017/dmep2017.github.io/blob/master/d3-map-sichuan-earthquate/Data Process.ipynb) to parse this timestamp format into `datetime` format.
Datasets:
- NBC Russian Troll on Twitter dataset (used last week)
- Twitter Data of the Donald & Ivanka Trump analysis -- reproduce the charts.
Objective:
- Understand the basics of graph theory
- Understand the most common applications in social network analysis
- Can conduct graph analysis and visualisation in `networkx`

Graph metrics and algorithms:
- Shortest path
- Graph profiling: diameter, degree distribution, clustering coefficient
- Centrality: degree, PageRank, betweenness, closeness, ...
- Community detection
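Several of the metrics above can be computed in a few lines on Zachary's Karate Club, which ships with networkx.

```python
import networkx as nx

G = nx.karate_club_graph()
print(G.number_of_nodes(), G.number_of_edges())   # 34 78
print(nx.diameter(G))                             # longest shortest path
deg = nx.degree_centrality(G)
print(max(deg, key=deg.get))                      # best-connected member
print(nx.shortest_path(G, source=0, target=33))
```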
Challenges:
- Generate the Zachary's Karate Club data: https://en.wikipedia.org/wiki/Zachary's_karate_club .
- SNAP dataset
- Cosponsorship Network Data
- Analyse the Les Misérables graph data.
References:
- 數據新聞﹕政商網絡系列(下)(文:陳電鋸) ("Data journalism: the political-business network series, part 2", by 陳電鋸) -- articulation via centrality
- Clustering Game of Thrones -- application of community detection
- 大家都叫我老杨, 推特上有多少「新五毛」? ("How many 'new wumao' are on Twitter?", by 大家都叫我老杨). The analysis is done in R, but the dataset and topic are interesting to look at.
- Some books for further reading: http://www.socilab.com/#books
Objective:
- Understand correlation and can calculate correlation
- Can articulate the difference between correlation and causality
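Calculating a correlation takes one line in pandas; the two columns below are made-up values for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ad_spend': [1, 2, 3, 4, 5],
                   'pageviews': [2, 4, 5, 4, 6]})
r = df['ad_spend'].corr(df['pageviews'])   # Pearson correlation by default
print(round(r, 3))
# A high correlation alone does not establish causality.
```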
The following are advanced topics for your own reading. We do not discuss these topics in class due to the lack of regular class hours.
Objective:
- Understand correlation and causality. Can conduct visual (explorative) analysis of correlation
- Can interpret common statistical quantities
- Dimensionality reduction

Challenge:
- Explore the HK Legco voting records

Modules:
- `sklearn`
  - `decomposition.PCA`
- `seaborn`
- (optional) `statsmodels`
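Dimensionality reduction with PCA can be sketched on a made-up vote matrix (rows = legislators, columns = motions, 1 = yes, -1 = no), in the spirit of the HK Legco challenge above.

```python
import numpy as np
from sklearn.decomposition import PCA

votes = np.array([
    [ 1,  1, -1,  1],
    [ 1,  1, -1, -1],
    [-1, -1,  1, -1],
    [-1, -1,  1,  1],
])
coords = PCA(n_components=2).fit_transform(votes)
print(coords.shape)   # each legislator becomes a 2-D point ready to plot
```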
References:
- HK Legco 2012 - 2016 dataset from Initium Media, 2016
- HK Legco voting analysis with PCA, an early version, 2014.
These topics may be discussed if there is plenty of Q/A time left in certain weeks. Or, you are welcome to explore these topics via the group project.
- Cloud (AWS)
- Deep learning