
Course Outline

Week 0 - GitHub

Objective

  • Can use GitHub for resource hosting, project management and as a discussion forum.
  • Can use GitHub Desktop to sync local repos with remote repos.
  • Can use gh-pages to host static web pages as one's portfolio.

Week 1 - Hands-on the Terminal

Objective:

  • Able to navigate the file system in the Terminal using shell commands
  • Create your first Python script and execute it

MAC:

  • Cmd+space to open Spotlight; search “Terminal” to open terminal

Shell commands:

  • cd to switch working folder
    • Path separated by /
    • Special paths: . (current folder), .. (parent folder), - (previous folder), ~ (home folder)
  • ls to list files/ folders in the current folder
  • pwd to check the current working folder
  • ls / pwd are your friends; type them often to make sure where you are
  • touch to create an empty new file; mkdir to create a new directory
  • python to execute Python scripts (usually named with .py, but the extension is not required)
  • Format of shell commands:
    • <command-name> <arg1> <arg2> ... (space-separated arguments)

Challenge:

  1. Write a Python script to output "Good evening" in the Terminal.
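A minimal sketch of this challenge (the filename hello.py is an assumption; run it from the Terminal with python hello.py):

```python
# hello.py - minimal script for the Week 1 challenge
# Run from the Terminal with:  python hello.py
message = "Good evening"
print(message)
```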

References:

Week 2 - Use Python as a daily tool

Objective:

  • Can use Python as a daily tool -- at least a powerful calculator
  • Become comfortable with Python interpreter -- the REPL pattern (Read-Evaluate-Print Loop)
  • Can use help to get inline documentation on new modules and functions

Python language introduction:

  • Variables and assignment
  • Basic data types: int, float, str, bool
  • Arithmetic:
    • +, -, *, /, //, %, **
    • math, numpy (may need pip)
  • Use functions and modules:
    • import (and import ... from ...)
    • . notation to reference a member variable/ method
    • () notation to call function
  • Common modules and functions
    • str.* functions
      • String templating 1: str.format
      • String templating 2: format_str % (var1, var2, ...)
    • random
    • numpy, scipy
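The operators, dot notation and string templating above can be tried directly in the interpreter; a sketch of the "powerful calculator" usage (all values follow standard Python semantics):

```python
import math
import random

# integer vs. float division, remainder, exponentiation
print(7 / 2)    # 3.5  (true division)
print(7 // 2)   # 3    (floor division)
print(7 % 2)    # 1    (remainder)
print(2 ** 10)  # 1024 (exponentiation)

# . notation to reach a module member; () notation to call it
print(math.sqrt(2))

# string templating 1: str.format
print("pi is roughly {:.2f}".format(math.pi))   # pi is roughly 3.14
# string templating 2: format_str % (...)
print("pi is roughly %.2f" % math.pi)

# one random number in [0, 1)
print(random.random())
```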

Challenge:

  1. Build a mortgage calculator - given principal P, interest rate r and loan period n, calculate the amortised monthly payment A
  2. Calculate the area of a circle given its radius r
  3. Given the length of hypotenuse of a right triangle, calculate the length of its legs. You may want to get values like $$\sin(\frac{\pi}{6})$$ via numpy.pi and numpy.sin
  4. Generate 10 random numbers. (it is OK to run your script 10 times)
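A sketch of challenges 1 and 2 using the standard amortisation formula A = P·i·(1+i)^n / ((1+i)^n - 1), where i is the monthly rate; the sample loan figures are assumptions, not course data:

```python
import math

def monthly_payment(P, annual_rate, n_months):
    """Amortised monthly payment A for principal P,
    annual interest rate annual_rate, over n_months."""
    i = annual_rate / 12  # monthly interest rate
    return P * i * (1 + i) ** n_months / ((1 + i) ** n_months - 1)

def circle_area(r):
    """Area of a circle with radius r."""
    return math.pi * r ** 2

# sample figures (assumptions): 1,000,000 at 3% p.a. over 30 years
A = monthly_payment(1_000_000, 0.03, 360)
print(round(A, 2))
print(circle_area(2.0))
```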

References:

Week 3 - Python for anything

Objective:

  • Master the composite data type [] and {} in Python
  • Master the control logics in Python, especially if and for
  • Further understand the different roles of text editor and interpreter. Be comfortable writing batch codes in .py file and execute in Shell environment.
  • [O] Understand Python engineering

Python language:

  • help
  • bool and comparisons
    • str comparison and int comparison
  • Composite data types: list [], dict {}
  • Control flow:
    • for, while
    • if
    • try..except
  • Function, class, module:
    • def
    • class
    • *.py; from, import
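The composite types and control flow above can be sketched in one short script (the student data is hypothetical):

```python
# list [] and dict {} with for, if, try..except and def
def describe(scores):
    """Print pass/fail for each student in a dict of scores."""
    for name, score in scores.items():
        if score >= 60:
            print(name, "passed")
        else:
            print(name, "failed")

scores = {"Alice": 82, "Bob": 55}  # hypothetical data
describe(scores)

# try..except guards against bad input
try:
    int("not a number")
except ValueError:
    print("cannot convert that string to int")
```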

Workflow:

  • Python interpreter
  • pip: pip3 for python3
    • --user option in shared computer

Challenge:

  1. Distances among cities:
    1. Calculate the "straight line" distance on earth surface from several source cities to Hong Kong. The source cities: New York, Vancouver, Stockholm, Buenos Aires, Perth. For each source city, print one line containing the name of the city and distance. "Great-circle distance" is the academic name you use to search for the formula.
    2. Use list and for loop to handle multiple cities
    3. Use function to increase the reusability
  2. Divide HW1 groups randomly: (case contribution)
    1. Get the list of student IDs from the lecturer
    2. Generate the grouping randomly
  3. Solve the "media business model" calculator.
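A sketch of challenge 1 using the great-circle (haversine) formula; the coordinates are approximate and the city list is shortened to two of the five sources:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

HK = (22.32, 114.17)          # approximate coordinates
cities = {                    # a subset of the source cities
    "New York": (40.71, -74.01),
    "Vancouver": (49.28, -123.12),
}
for name, (lat, lon) in cities.items():
    d = great_circle_km(lat, lon, HK[0], HK[1])
    print(name, round(d), "km")
```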

References:

Week 4 - JSON and API

Objective:

  • Learn to use Jupyter notebook. All demos from this week on will be conducted in Jupyter notebook.
  • Understand API/ JSON and can retrieve data from online databases (twitter, GitHub, weibo, douban, ...)
  • Understand basic file formats like JSON and CSV.
    • Be able to comfortably navigate through compound structures like {} and [].
    • Be able to comfortably use (multiple layers of) for-loops to re-format data.
    • Be able to use serialisers to handle input/ output to files.

The brief of Application Programming Interface (API):

  • APIs operate in a client-and-server mode.
  • The client does not have to download the full volume of data from the server; it only uses the data on demand.
  • The server can handle intensive computations that are not feasible on the client.
  • The server can send updated data upon request.

Modules:

  • Handle HTTP request/ response: requests
  • Serialiser: json (.loads, .dumps) and csv
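The json serialiser calls can be sketched on a hand-made payload (the structure below only imitates a typical API response; it is an assumption, not any real service's schema). With requests, response.json() performs the json.loads step for you:

```python
import json

# a JSON string, as it might arrive in an HTTP response body
payload = ('{"count": 2, "features": ['
           '{"place": "Taiwan", "mag": 6.4}, '
           '{"place": "Japan", "mag": 5.1}]}')

data = json.loads(payload)        # str -> Python dict
print(data["count"])
for feature in data["features"]:  # {} nested inside []: a list of dicts
    print(feature["place"], feature["mag"])

text = json.dumps(data, indent=2)  # dict -> str, pretty-printed
print(text)
```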

Challenges:

  • Taiwan had an earthquake in early Feb. Let's discuss this issue:
    • Search for the earthquake instances around Taiwan in the recent 100 years and analyse the occurrences of earthquakes. You can refer to the same database used here. Check out the API description. The count and query API are useful.
    • Search on Twitter and collect users' discussions about this topic. See if there are any findings. You can approach from the human interface here (hard mode) or use the python-twitter module (need to register as a developer and obtain an API key).
  • Retrieve and analyse recent movies. Douban's API will be helpful here.
  • Use the Google Map API to retrieve geo-locations and canonical names: e.g. get the location of HKBU
  • Look up real estate properties on the HK gov open data portal. e.g. the dataset page, the API result
  • blockchain.info provides a set of APIs for one to retrieve information related to bitcoin transactions. Pick one wallet address, check its UTXO sets and sum up the values to get the total balance in this wallet.
  • A free cryptocurrency API for you to retrieve and study historical exchange rates.
  • Implement a basic version of the first automated writer - QuakeBot from the LA Times
    • Get data from USGS API
    • Print a story to the screen using string templating/ string interpolation
    • See here for an introduction of the bot. See here for an incident and think about how to avoid it.

Exercise:

  • Request a certain API to acquire information
  • Convert a JSON to CSV in Python
  • Convert a CSV to JSON in Python
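The two conversion exercises can be sketched in memory with the csv and json serialisers (the records are hypothetical; in the exercise you would read and write real files with open()):

```python
import csv
import io
import json

# hypothetical records, parsed from a JSON string
records = json.loads(
    '[{"city": "Hong Kong", "pop": 7.4}, {"city": "Taipei", "pop": 2.6}]'
)

# JSON (list of dicts) -> CSV
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["city", "pop"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()
print(csv_text)

# CSV -> JSON (list of dicts; note CSV values come back as strings)
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(json.dumps(rows))
```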

Further readings:

Week 5 - Web Scraping Basics

Objective:

  • Understand the basics of HTML language, HTTP protocol, web server and Internet architecture
  • Able to scrape static web pages and turn them into CSV files

Tools: ( Step-by-step reference )

  • Virtualenv -- Create an isolated environment to avoid projects cluttering each other
  • Jupyter notebook -- Web-based REPL; ready to distribute; all-in-one presentation

Modules:

  • Handle HTTP request/ response: requests
  • Parse web page: lxml, Beautiful Soup, HTMLParser, help(str)
    • Useful string functions: strip(), split(), find(), replace(), str[begin:end]
  • Serialiser: csv, json
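The parser libraries above are the robust route; as a minimal illustration of the string functions listed, here is a crude extraction on a hand-written snippet (an assumption for demo purposes - real pages need lxml / Beautiful Soup):

```python
# a tiny hand-written page (assumption, not a real site)
html = '<html><body><h1 class="title">Course list</h1><p>Week 5</p></body></html>'

# crude extraction with find() and slicing
begin = html.find("<h1")           # locate the opening tag
begin = html.find(">", begin) + 1  # skip past its attributes
end = html.find("</h1>", begin)    # locate the closing tag
title = html[begin:end].strip()
print(title)                       # Course list
```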

Challenges: (save to *.csv)

  • Use lxml / bs4 with requests
  • Bonus:
    • Collect the tweets from a celebrity like this post. You can search "python twitter" for many useful modules online.

References:

Further reading:

  • Study urllib as an alternative to requests
  • Study Regular Expression and re library in Python
  • See how reproducibility is improved with Jupyter notebook and other tools (not only Python).

Week 6 - Advanced Web Scraping

Objective:

  • Bypass anti-crawler by modifying user-agent
  • Handle glitches: encoding, pagination, ...
  • Handle dynamic page with headless browser
  • Handle login with headless browser
  • Scrape social networks
  • Case studies on different websites
  • Further strengthen command of the list-of-dicts data type; organise multi-layer loops/ item-based parsing logic.
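Basic anti-crawler checks often look only at the User-Agent header; a sketch of overriding it (the UA string is one example value, and the request itself is left un-sent so the snippet works offline):

```python
import urllib.request

# a browser-like User-Agent string (example value)
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/537.36"
}

# build the request object; the header is attached before sending
req = urllib.request.Request("https://example.com", headers=headers)
print(req.get_header("User-agent"))

# with requests, the same idea is:
#   requests.get(url, headers=headers)
```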

Cases:

Week 7 - Table manipulation and 1-D analysis

Objective:

  • Master the schema of "data-driven storytelling": the crowd (pattern) and the outlier (anomaly)
  • Can efficiently manipulate structured table formatted datasets
  • Use pandas for basic calculation and plotting

Modules:

  • pandas
  • seaborn
  • matplotlib

Statistics:

  • mean, median, percentile
  • min, max
  • variance
  • histogram
  • sort
  • central tendency and spread of data
  • Scatter plot and correlation
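The statistics above map onto one-liners in pandas; a sketch on a hypothetical 1-D dataset:

```python
import pandas as pd

# a hypothetical 1-D dataset
s = pd.Series([3, 7, 8, 5, 12, 14, 21, 13, 18])

print(s.mean(), s.median())        # central tendency
print(s.min(), s.max(), s.var())   # spread
print(s.quantile(0.9))             # 90th percentile
print(s.sort_values().tolist())    # sort
# s.hist() and s.plot() draw the histogram / line chart via matplotlib
```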

Datasets to work on:

References:

Additional notes:

  • You need to finish Dataprep before analysis. That is, we start with structured data. There is no common schema for preparing structured and cleaned data. We have pointers in Dataprep for your own reading.

Week 8 - Visualisation, presentation and reproducible reports

Objective

  • Understand the theory and common tricks of visualisation.
  • Can plot charts using various visualisation libraries.
  • Can plot maps.
  • Understand the concept of "reproducibility" and can use GitHub repo plus Jupyter notebook to create such reports.

Libraries:

  • py-plotly
  • pyecharts

Week 9 - Text analysis

Objective:

  • Further strengthen the proficiency of pandas: DataFrame and Series
  • Learn to plot and adjust charts with matplotlib
  • Master basic string operations
  • Understand some major text mining models and be able to apply algorithm from 3rd party libraries.

Modules & topics:

  • str - basic string processing
    • .split(), in, .find()
    • %s format string
    • ''.format() function
  • collections.Counter for word frequency calculation
  • jieba - the most widely used Chinese word segmentation package.
  • (optional) re - Regular Expression (regex) is the Swiss Army knife for text pattern matching.
  • (optional) nltk - contains common routines for text analysis
  • (optional) gensim - topic mining package. It also contains the Word2Vec routine.
  • (optional) Sentiment analysis - construct classifier using sklearn or use an API like text-processing. TextBlob is also useful and applied in group 2's work.
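The word-frequency workflow above - str.split() plus collections.Counter - can be sketched on an English sample sentence (for Chinese text, jieba would do the segmentation step first):

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"
words = text.split()        # str.split() tokenises on whitespace
freq = Counter(words)       # word -> count

print(freq.most_common(2))  # the two most frequent words
# both templating styles from above:
print("%s appears %d times" % ("the", freq["the"]))
print("{} distinct words".format(len(freq)))
```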

Related cases:

References:

  • Construct Naive Bayes based classifier for sentiment analysis. Read here

Datasets to work on:

Week 10 - Time series

Objective:

  • Understand the principle of timestamp and datetime format
  • Master basic computation on datetime values
  • Understand periodical analysis (daily, weekly, monthly, seasonal, etc)
  • Can handle timezone conversion

Modules:

  • datetime
  • dateutil.parser
  • pandas
    • basic visualisation .plot
    • zoom in/ out: .resample, .aggregate
  • seaborn
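The datetime arithmetic, resampling and timezone topics above in one sketch (the daily counts are hypothetical):

```python
import pandas as pd
from datetime import datetime

# datetime arithmetic: subtracting timestamps yields a timedelta
gap = datetime(2018, 2, 6) - datetime(2018, 2, 1)
print(gap.days)  # 5

# hypothetical daily counts over two weeks
idx = pd.date_range("2018-02-01", periods=14, freq="D")
s = pd.Series(1, index=idx)

# zoom out from days to weeks with .resample
weekly = s.resample("W").sum()
print(weekly)

# timezone conversion: localise first, then convert
hk = s.tz_localize("Asia/Hong_Kong").tz_convert("UTC")
print(hk.index[0])  # UTC is 8 hours behind Hong Kong
```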

References:

Datasets:

Week 11 - Graph theory and social network analysis

Objective:

  • Understand the basics of graph theory
  • Understand most common applications in social network analysis
  • Can conduct graph analysis and visualisation in networkx

Graph metrics and algorithms:

  • Shortest path
  • Graph profiling: diameter, degree distribution, clustering coefficient
  • Centrality: degree, PageRank, betweenness, closeness, ...
  • Community detection
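The metrics above are one call each in networkx; a sketch on a tiny hypothetical friendship graph:

```python
import networkx as nx

# a tiny hypothetical friendship graph
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])

print(nx.shortest_path(G, "A", "E"))  # shortest path from A to E
print(nx.diameter(G))                 # longest shortest path in the graph
print(nx.degree_centrality(G))        # degree centrality per node
print(nx.pagerank(G))                 # PageRank per node
```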

Challenges:

References:

Week 12 - 2D analysis

Objective

  • Understand correlation and can calculate correlation
  • Can articulate correlation and causality
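Calculating a correlation is a single pandas call; a sketch on hypothetical paired observations, with the causality caveat attached:

```python
import pandas as pd

# hypothetical paired observations
df = pd.DataFrame({
    "temperature": [15, 18, 21, 24, 27, 30],
    "ice_cream_sales": [12, 19, 24, 31, 36, 45],
})

r = df["temperature"].corr(df["ice_cream_sales"])  # Pearson's r
print(round(r, 3))
# correlation is not causation: a third variable (e.g. season)
# may drive both columns
```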

The following are advanced topics for your own reading. We do not discuss these topics in class due to the lack of regular class hours.

Week 13 - High-dimensional analysis

Objective:

  • Understand correlation and causality. Can conduct visual (explorative) analysis of correlation
  • Can interpret common statistical quantities
  • Can perform dimensionality reduction

Challenge:

  1. Explore the HK Legco voting records

Modules:

  • sklearn
    • decomposition.PCA
  • seaborn
  • (optional) statsmodels
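A sketch of dimensionality reduction with sklearn's decomposition.PCA (the data matrix is random, standing in for a real dataset such as the Legco voting records):

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical data: 100 samples x 5 features
rng = np.random.RandomState(0)
X = rng.rand(100, 5)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)  # project to 2-D, e.g. for a scatter plot

print(X2.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)   # variance kept per component
```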

References:

Week 14 - Clustering

Week 15 - Classification

Week 16 - Regression

Week 17 - Recommender System

Open topics

Those topics may be discussed if there is spare Q&A time left in a certain week. Or, you are welcome to explore these topics via the group project.

  • Cloud (AWS)
  • Deep learning