Data Manipulation & Analysis
This course aims to help students get started with their own data harvesting, processing, aggregation, and analysis.
Data analysis is crucial to evaluating and designing solutions and applications, as well as to understanding users' information needs and use. In many cases the data we need to access is distributed online among many web pages, stored in a database, or available in a large text file. Often these data (e.g., web server logs) are too large to obtain or process manually. Instead, we need an automated way of gathering the data, parsing it, and summarizing it before we can do more advanced analysis.
Therefore, students will learn to use Python and its modules to accomplish these tasks in a quick and easy, yet useful and repeatable, way. Next, students will learn techniques of exploratory data analysis, using scripting, text parsing, Structured Query Language (SQL), regular expressions, graphing, and clustering methods to explore data. Students will be able to make sense of and see patterns in otherwise intractable quantities of data. The skills students will learn include the following: processing big data; converting messy data into a form that can be analyzed using pandas; computing and visualizing summary statistics of datasets; specifying graphical displays using seaborn and matplotlib; combining graphics with data manipulation to visualize relationships between variables; applying machine learning techniques, including clustering and classification; and applying dimension reduction techniques.
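As a rough illustration of the gather, parse, and summarize workflow described above, the following minimal sketch parses a few web server log lines with a regular expression and counts requests by status code and host. The sample log lines, the pattern, and the function names (`parse_log`, `summarize`) are assumptions made for this example, not course material; it uses only the standard library, though pandas could fill the same role.

```python
import re
from collections import Counter

# Hypothetical sample lines in Common Log Format; a real log would be read from a file.
SAMPLE_LOG = """\
192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
192.0.2.2 - - [10/Oct/2023:13:56:01 +0000] "GET /about.html HTTP/1.1" 200 1500
192.0.2.1 - - [10/Oct/2023:13:57:12 +0000] "GET /missing HTTP/1.1" 404 512
this line is malformed and should be skipped
"""

# Named groups capture the fields we want to summarize.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log(text):
    """Turn raw log text into a list of dicts, skipping lines that do not match."""
    records = []
    for line in text.splitlines():
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # messy input: ignore lines we cannot parse
        record = match.groupdict()
        record["status"] = int(record["status"])
        # A size of "-" means no body was sent; treat it as zero bytes.
        record["size"] = 0 if record["size"] == "-" else int(record["size"])
        records.append(record)
    return records


def summarize(records):
    """Compute simple summary statistics over the parsed records."""
    return {
        "requests": len(records),
        "by_status": Counter(r["status"] for r in records),
        "by_host": Counter(r["host"] for r in records),
        "total_bytes": sum(r["size"] for r in records),
    }


records = parse_log(SAMPLE_LOG)
summary = summarize(records)
print(summary["requests"], dict(summary["by_status"]), summary["total_bytes"])
```

Keeping parsing and summarizing in separate functions makes the script repeatable: the same `summarize` step can be rerun on any log once it has been parsed into records.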
Data manipulation I: pandas DataFrames
Data manipulation II: pandas (Homework 1 (Pandas data manipulation: Olympics))
Data analysis I: univariate stats, visualization, seaborn, intro to correlation (Homework 2 (more data manipulation))
Data analysis II: ANOVA, t-test, linear models (Homework 3 (Visualization and correlation))
Categorical data (contingency tables, chi-square, mosaic plots); Text processing: Regular Expressions (Homework 4 (Linear models))
Natural language processing (NLTK, gensim) (Project Proposal)
Machine Learning I: Clustering (Homework 5 (NLP))
Machine Learning II: Classification (Homework 6 (Clustering))
Machine Learning III: Dimensionality reduction (PCA and t-SNE)
Spark I
Spark II
Homework 1 (Pandas data manipulation: Olympics)
Homework 2 (more data manipulation)
Homework 3 (Visualization and correlation)
Homework 4 (Linear models)
Homework 5 (NLP)
Homework 6 (Clustering)
Homework 7 (Classification)
Homework 8 (Dimension Reduction)
Homework 9 (Spark)