- Course Outline
- Week 0 - GitHub
- Week 1 - Hands-on the Terminal
- Week 2 - Use Python as a daily tool
- Week 3 - Python for anything
- Week 4 - JSON and API
- Week 5 - Web Scraping Basics
- Week 6 - Advanced Web Scraping
- Week 7 - Table manipulation and 1-D analysis
- Week 8 - Visualisation, presentation and reproducible reports
- Week 9 - Text analysis
- Week 10 - Time series
- Week 11 - Graph theory and social network analysis
- Week 12 - 2D analysis
- Week 13 - High-dimensional analysis
- Week 14 - Clustering
- Week 15 - Classification
- Week 16 - Regression
- Week 17 - Recommender System
- Open topics
Objective:
- Can use GitHub for resource hosting, project management and as a discussion forum.
- Can use GitHub Desktop to sync local repos with remote repos.
- Can use `gh-pages` to host static web pages as one's portfolio.
Objective:
- Able to navigate the file system in Terminal, using the shell
- Create the first Python script and execute it

Mac: `Cmd+Space` to open Spotlight; search "Terminal" to open the terminal

Shell commands:
- `cd` to switch working folder
  - Paths are separated by `/`
  - Special paths: `.`, `..`, `-`, `~`
- `ls` to list files/folders in the current folder
- `pwd` to check the current working folder
- `ls`/`pwd` are your friends; type them often to make sure where you are
- `touch` to create an empty new file; `mkdir` to create a new directory
- `python` to execute Python scripts (usually in `.py`, but not necessarily)
- Format of shell commands: `<command-name> <arg1> <arg2> ...` (space-separated arguments)
Challenge:
- Write a Python script to output "Good evening" in the Terminal.
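One possible answer to this challenge is a one-line script; the file name `good_evening.py` is just a suggestion.

```python
# good_evening.py -- run it from the Terminal with: python good_evening.py
print("Good evening")
```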
Objective:
- Can use Python as a daily tool -- at least as a powerful calculator
- Become comfortable with the Python interpreter -- the REPL pattern (Read-Evaluate-Print Loop)
- Can use `help` to get inline documentation on new modules and functions

Python language introduction:
- Variables and assignment
- Basic data types: `int`, `float`, `str`, `bool`
- Arithmetic: `+`, `-`, `*`, `/`, `//`, `%`, `**`
- Math modules: `math`, `numpy` (may need `pip`)
- Use functions and modules: `import` (and `from ... import ...`)
  - `.` notation to reference a member variable/method
  - `()` notation to call a function
- Common modules and functions:
  - `str.*` functions
  - String templating 1: `str.format`
  - String templating 2: `format_str % (var1, var2, ...)`
  - Random numbers: `random`
  - Scientific computing: `numpy`, `scipy`
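The two string-templating styles can be compared side by side; the name and score below are made-up values for illustration.

```python
name, score = "Alice", 95.5
s1 = "{} scored {:.1f}".format(name, score)   # style 1: str.format
s2 = "%s scored %.1f" % (name, score)         # style 2: % formatting
print(s1)
print(s2)   # both lines read: Alice scored 95.5
```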
Challenge:
- Build a mortgage calculator -- given principal `P`, interest rate `r` and loan period `n`, calculate the amortised monthly payment `A`
- Calculate the area of a circle given its radius `r`
- Given the length of the hypotenuse of a right triangle, calculate the length of its legs. You may want to get values like $$\sin(\frac{\pi}{6})$$ via `numpy.pi` and `numpy.sin`
- Generate 10 random numbers. (It is OK to run your script 10 times.)
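A sketch of the mortgage challenge using the standard amortisation formula `A = P*r*(1+r)^n / ((1+r)^n - 1)`, where `r` is the *monthly* rate and `n` the number of monthly payments; the figures below are made up.

```python
P = 5_000_000            # principal
r = 0.03 / 12            # 3% yearly rate -> monthly rate
n = 30 * 12              # 30-year loan, paid monthly
A = P * r * (1 + r) ** n / ((1 + r) ** n - 1)
print("Monthly payment: {:.2f}".format(A))
```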
References:
- Chapter 1, 2, 3 of the official Python 3 tutorial
- Python format string: https://pyformat.info/
Objective:
- Master the composite data types `[]` and `{}` in Python
- Master the control logic in Python, especially `if` and `for`
- Further understand the different roles of the text editor and the interpreter. Be comfortable writing batch code in a `.py` file and executing it in the shell environment.
- [O] Understand Python engineering

Python language:
- `help`
- `bool` and comparisons; `str` comparison and `int` comparison
- Composite data types: `list` (`[]`), `dict` (`{}`)
- Control flow: `for`, `while`, `if`, `try..except`
- Function, class, module: `def`, `class`, `*.py`; `from`, `import`

Workflow:
- Python interpreter
- pip: `pip3` for `python3`; the `--user` option on shared computers
Challenge:
- Distances among cities:
  - Calculate the "straight line" distance on the earth's surface from several source cities to Hong Kong. The source cities: New York, Vancouver, Stockholm, Buenos Aires, Perth. For each source city, print one line containing the name of the city and the distance. "Great-circle distance" is the academic name you can use to search for the formula.
  - Use `list` and a `for` loop to handle multiple cities
  - Use functions to increase reusability
- Divide HW1 groups randomly: (case contribution)
  - Get the list of student IDs from the lecturer
  - Generate the grouping randomly
- Solve the "media business model" calculator.
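The distance challenge can be sketched with the haversine formula; the city coordinates below are approximate values filled in for illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

HONG_KONG = (22.32, 114.17)
sources = {
    'New York': (40.71, -74.01),
    'Vancouver': (49.28, -123.12),
    'Stockholm': (59.33, 18.07),
    'Buenos Aires': (-34.60, -58.38),
    'Perth': (-31.95, 115.86),
}
for city, (lat, lon) in sources.items():
    print(city, round(great_circle_km(lat, lon, *HONG_KONG)))
```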
References:
- Chapter 4, 5, 6 of official Python 3 tutorial
Objective:
- Learn to use Jupyter notebook. All demos from this week on will be conducted in Jupyter notebook.
- Understand API/JSON and can retrieve data from online databases (Twitter, GitHub, Weibo, Douban, ...)
- Understand basic file formats like `json` and `csv`.
- Be able to comfortably navigate through compound structures like `{}` and `[]`.
- Be able to comfortably use (multiple layers of) for-loops to re-format data.
- Be able to use serialisers to handle input/output to files.

The brief of Application Programming Interface (API):
- Operates in client-and-server mode.
- The client does not have to download the full volume of data from the server; it only uses the data on demand.
- The server can handle intensive computations that are not available on the client.
- The server can send updated data upon request.

Modules:
- Handle HTTP request/response: `requests`
- Serialiser: `json` (`.loads`, `.dumps`) and `csv`
Challenges:
- Taiwan had an earthquake in early Feb. Let's discuss this issue:
  - Search for the earthquake instances around Taiwan in the recent 100 years and analyse the occurrences of earthquakes. You can refer to the same database used here. Check out the API description. The `count` and `query` APIs are useful.
  - Search on Twitter and collect users' discussions about this topic. See if there are any findings. You can approach it from the human interface here (hard mode) or use the python-twitter module (you need to register as a developer and obtain an API key).
- Retrieve and analyse recent movies. Douban's API will be helpful here.
- Use the Google Map API to retrieve geo-locations and canonical names: e.g. get the location of HKBU
- Look up real estate properties on the HK gov open data portal, e.g. the dataset page, the API result
- blockchain.info provides a set of APIs for one to retrieve information related to bitcoin transactions. Pick one wallet address, check its UTXO sets and sum up the values to get the total balance in this wallet.
- A free cryptocurrency API for you to retrieve and study historical exchange rates.
- Implement a basic version of the first automated writer -- QuakeBot from the LA Times
Exercise:
- Request a certain API to acquire information
- Convert JSON to CSV in Python
- Convert CSV to JSON in Python
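The JSON/CSV conversion exercises can be sketched with the two stdlib serialisers; the records and the file name below are made up for illustration.

```python
import csv
import json

records = json.loads('[{"city": "Hualien", "magnitude": 6.4},'
                     ' {"city": "Tainan", "magnitude": 5.6}]')

# JSON (list of dicts) -> CSV
with open('quakes.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['city', 'magnitude'])
    writer.writeheader()
    writer.writerows(records)

# CSV -> JSON; note csv reads every field back as str
with open('quakes.csv') as f:
    rows = list(csv.DictReader(f))
print(json.dumps(rows))
```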
Further readings:
- Use `beautifulsoup` to scrape Twitter timeline content from the Wayback Machine. A good example of investigative journalism, by William Lyon from neo4j.
Objective:
- Understand the basics of the HTML language, the HTTP protocol, web servers and the Internet architecture
- Able to scrape static web pages and turn them into CSV files

Tools: (step-by-step reference)
- Virtualenv -- create isolated environments to avoid projects cluttering each other
- Jupyter notebook -- web-based REPL; ready to distribute; all-in-one presentation

Modules:
- Handle HTTP request/response: `requests`
- Parse web page: `lxml`, Beautiful Soup, `HTMLParser`, `help(str)`
  - Useful string functions: `strip()`, `split()`, `find()`, `replace()`, `str[begin:end]`
- Serialiser: `csv`, `json`
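Parsing a page into rows ready for CSV can be sketched with Beautiful Soup on a canned HTML snippet (a real scraper would first fetch the page with `requests`); the table content is made up.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>Movie A</td><td>8.1</td></tr>
  <tr><td>Movie B</td><td>7.5</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
movies = []
for tr in soup.find_all('tr'):
    title, rating = [td.get_text() for td in tr.find_all('td')]
    movies.append({'title': title, 'rating': rating})
print(movies)   # a list of dicts, ready for csv.DictWriter
```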
Challenges: (save the results to `*.csv`)
- Use `lxml`/`bs4` together with `requests`:
  - Collect a table for the NSFC/RGC joint research fund. A full table can be found here. You are also welcome to collect data of other funding schemes.
  - Collect all the faculty's information from the site and make a contact book.
  - Collect the movie list and their ratings from IMDB.
- Bonus:
  - Collect the tweets from a celebrity like this post. You can search "python twitter" for many useful modules online.
References:
- Allison Parrish's tutorial of scraper in summer 2017.

Further reading:
- Study `urllib` as an alternative to `requests`
- Study Regular Expression and the `re` library in Python
- See how reproducibility is improved with Jupyter notebook and other tools (not only Python).
Objective:
- Bypass anti-crawler by modifying user-agent
- Handle glitches: encoding, pagination, ...
- Handle dynamic page with headless browser
- Handle login with headless browser
- Scrape social networks
- Case studies on different websites
- Further strengthen the list-of-dict data type; organise multi-layer loops / item-based parsing logic.
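Modifying the user-agent can be sketched as below; the UA string is a typical desktop Chrome one and the URL is a placeholder, so the actual request is left commented out.

```python
import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0 Safari/537.36')
}
# Send the header with every request so the server sees a "browser":
# r = requests.get('https://example.com/page', headers=headers)
# print(r.status_code)
```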
Cases:
- https://github.com/hupili/python-for-data-and-media-communication/tree/master/scraper-examples
- https://github.com/data-projects-archive
Objective:
- Master the schema of "data-driven storytelling": the crowd (pattern) and the outlier (anomaly)
- Can efficiently manipulate structured, table-formatted datasets
- Use `pandas` for basic calculation and plotting
Modules:
- `pandas`
- `seaborn`
- `matplotlib`
Statistics:
- mean, median, percentile
- min, max
- variance
- histogram
- sort
- central tendency and spread of data
- scatter plot and correlation
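The statistics above can be computed in a few lines of pandas; the tiny table below is made up (the real exercise uses `openrice.csv`).

```python
import pandas as pd

df = pd.DataFrame({
    'restaurant': ['A', 'B', 'C', 'D'],
    'price': [120, 60, 200, 80],
    'rating': [4.5, 3.8, 4.9, 4.0],
})
print(df['price'].mean())            # central tendency: 115.0
print(df['price'].median())          # 100.0
print(df['price'].quantile(0.9))     # percentile
print(df['price'].min(), df['price'].max(), df['price'].var())
df = df.sort_values('rating', ascending=False)
# df['price'].hist() and df.plot.scatter(x='price', y='rating')
# draw the histogram and the scatter plot.
```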
Datasets to work on:
- openrice.csv contributed by group 1
References:
- First two chapters (i.e. before "3D") of the article The Art of Effective Visualization of Multi-dimensional Data by Dipanjan Sarkar
- Exercise numpy on ShiYanLou
- Exercise pandas on ShiYanLou
Additional notes:
- You need to finish Dataprep before analysis. That is, we start from structured data. There is no single common schema for preparing structured and cleaned data; see the pointers in Dataprep for your own reading.
Objective:
- Understand the theory and common tricks of visualisation.
- Can plot charts using various visualisation libraries.
- Can plot maps.
- Understand the concept of "reproducibility" and can use a GitHub repo plus Jupyter notebook to create such reports.

Libraries:
- `py-plotly`
- `pyecharts`
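As a library-agnostic sketch of a reproducible chart, here is the same idea with matplotlib (used elsewhere in this course; `py-plotly` and `pyecharts` offer similar chart APIs). The data and file name are made up.

```python
import matplotlib
matplotlib.use('Agg')                # off-screen backend, e.g. for scripts
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar']
visits = [120, 150, 90]
plt.bar(months, visits)
plt.title('Monthly visits (made-up data)')
plt.savefig('visits.png')            # commit the notebook + png for reproducibility
```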
Objective:
- Further strengthen the proficiency with pandas: DataFrame and Series
- Learn to plot and adjust charts with `matplotlib`
- Master basic string operations
- Understand some major text mining models and be able to apply algorithms from 3rd-party libraries.

Modules & topics:
- `str` - basic string processing: `.split()`, `in`, `.find()`; `%s` format string; `''.format()` function
- `collections.Counter` for word frequency calculation
- `jieba` - the most widely used Chinese word segmentation package
- (optional) `re` - Regular Expression (regex) is the Swiss-army knife for text pattern matching
- (optional) `nltk` - contains common routines for text analysis
- (optional) `gensim` - topic mining package. It also contains the `Word2Vec` routine.
- (optional) Sentiment analysis - construct a classifier using `sklearn` or use an API like text-processing. `TextBlob` is also useful and applied in group 2's work.
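Word frequency counting can be sketched with `collections.Counter`; the sentence below is made up.

```python
from collections import Counter

text = "the data drive the story and the data drive the chart"
freq = Counter(text.split())
print(freq.most_common(3))
# For Chinese text, segment first, e.g. Counter(jieba.cut(text))
```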
Related cases:
- Quartz's analysis of New York Times's column "Modern Love"
- Prof. Qian Gang's famous analysis of texts in political communication.

References:
- Construct a Naive Bayes based classifier for sentiment analysis. Read here

Datasets to work on:
- NBC Russian Troll on Twitter dataset -- the 200,000 deleted Twitter messages posted by Russian troll accounts. Around 50MB, in CSV format.
- Hillary Clinton email archive from WikiLeaks. There are the plain text and parsed data, but you may need to run a scraper to get the data first.
Objective:
- Understand the principle of timestamps and datetime formats
- Master basic computation on datetime values
- Understand periodical analysis (daily, weekly, monthly, seasonal, etc)
- Can handle timezone conversion

Modules:
- `datetime`
- `dtparser`
- `pandas`
  - basic visualisation: `.plot`
  - zoom in/out: `.resample`, `.aggregate`
- `seaborn`
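Parsing a timestamp and converting timezones can be sketched with the stdlib `datetime` module; the millisecond value below is made up.

```python
from datetime import datetime, timezone, timedelta

ts_ms = 1517824800000                              # a millisecond timestamp
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
hk = dt.astimezone(timezone(timedelta(hours=8)))   # convert to UTC+8 (HK)
print(dt.isoformat())
print(hk.isoformat())
print(hk - dt)                                     # same instant: 0:00:00
```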
References:
- Timestamps usually come in units of milliseconds (1/1000 of a second). [An example](https://github.com/dmep2017/dmep2017.github.io/blob/master/d3-map-sichuan-earthquate/Data Process.ipynb) to parse this timestamp format into `datetime` format.
Datasets:
- NBC Russian Troll on Twitter dataset (used last week)
- Twitter Data of the Donald & Ivanka Trump analysis -- reproduce the charts.
Objective:
- Understand the basics of graph theory
- Understand the most common applications in social network analysis
- Can conduct graph analysis and visualisation in `networkx`

Graph metrics and algorithms:
- Shortest path
- Graph profiling: diameter, degree distribution, clustering coefficient
- Centrality: degree, PageRank, betweenness, closeness, ...
- Community detection
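Several of the metrics above can be computed in a few lines on Zachary's Karate Club, which ships with networkx.

```python
import networkx as nx

G = nx.karate_club_graph()
print(G.number_of_nodes(), G.number_of_edges())   # 34 78
print(nx.diameter(G))                             # longest shortest path
deg = nx.degree_centrality(G)
print(max(deg, key=deg.get))                      # best-connected member
print(nx.shortest_path(G, source=0, target=33))
```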
Challenges:
- Generate the Zachary's Karate Club data: https://en.wikipedia.org/wiki/Zachary's_karate_club .
- SNAP dataset
- Cosponsorship Network Data
- Analyse the Les Misérables graph data.
References:
- 數據新聞﹕政商網絡系列(下)(文:陳電鋸) ("Data journalism: the political-business network series, part 2", by 陳電鋸) -- articulation via centrality
- Clustering Game of Thrones -- application of community detection
- 大家都叫我老杨, 推特上有多少「新五毛」? ("How many 'new wumao' are on Twitter?", by 大家都叫我老杨). The analysis is done in R, but the dataset and topic are interesting to look at.
- Some books for further reading: http://www.socilab.com/#books
Objective:
- Understand correlation and can calculate correlation
- Can articulate the difference between correlation and causality
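Calculating a correlation takes one line in pandas; the two columns below are made-up values for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ad_spend': [1, 2, 3, 4, 5],
                   'pageviews': [2, 4, 5, 4, 6]})
r = df['ad_spend'].corr(df['pageviews'])   # Pearson correlation by default
print(round(r, 3))
# A high correlation alone does not establish causality.
```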
The following are advanced topics for your own reading. We do not discuss these topics in class due to the lack of regular class hours.
Objective:
- Understand correlation and causality. Can conduct visual (explorative) analysis of correlation
- Can interpret common statistical quantities
- Dimensionality reduction

Challenge:
- Explore the HK Legco voting records

Modules:
- `sklearn`
  - `decomposition.PCA`
- `seaborn`
- (optional) `statsmodels`
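Dimensionality reduction with PCA can be sketched on a made-up vote matrix (rows = legislators, columns = motions, 1 = yes, -1 = no), in the spirit of the HK Legco challenge above.

```python
import numpy as np
from sklearn.decomposition import PCA

votes = np.array([
    [ 1,  1, -1,  1],
    [ 1,  1, -1, -1],
    [-1, -1,  1, -1],
    [-1, -1,  1,  1],
])
coords = PCA(n_components=2).fit_transform(votes)
print(coords.shape)   # each legislator becomes a 2-D point ready to plot
```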
References:
- HK Legco 2012 - 2016 dataset from Initium Media, 2016
- HK Legco voting analysis with PCA, an early version, 2014.
These topics may be discussed if there is plenty of Q/A time left in certain weeks. Or, you are welcome to explore these topics via the group project.
- Cloud (AWS)
- Deep learning