-
Notifications
You must be signed in to change notification settings - Fork 27
/
Copy pathindex.Rmd
259 lines (171 loc) · 14.3 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
---
title: "**STOR 390: Introduction to Data Science**"
output:
html_document:
theme: cosmo
toc: yes
toc_float: yes
---
This course is an application-driven introduction to data science. Statistical and computational tools are valued throughout the modern workplace from Silicon Valley startups, to marine biology labs, to Wall Street firms. These tools require technical skills such as programming and statistics. They also require professional skills such as communication, teamwork, problem solving, and critical thinking.
You will learn these tools and hone these skills through hands-on experience working with datasets such as: Museum of Modern Art records, TCGA Gene Expressions and the text script of Beauty and the Beast. The first half of the semester will cover R programming skills. The second half will cover a number of topics: exploratory data analysis, web scraping, text processing, and effective visualization through a series of modules.
- Instructor: [Iain Carmichael](https://idc9.github.io/)
- Instructional Assistant: [Brendan Brown](http://stat-or-old.oasis.unc.edu/people/graduate-students/brendan-brown)
- Graduate Research Consultant: [Varun Goel](https://varungoel.web.unc.edu/)
See the **[course syllabus](https://idc9.github.io/stor390/course_info/syllabus.html)** for more information.
# **Course Material**
Most of the course material can be found in the notes linked to below. The notes are supplemented by readings (mostly from [R for Data Science](http://r4ds.had.co.nz/)) which are listed in the [#reading](#reading) section.
| Date | Lecture | **Notes** | Slides |
|------|---------|-------|-------|
|January 12 |install R, basic commands | [**getting started**](notes/getting_started/getting_started.html) | [slides](slides/welcome.pdf)|
|January 17 | R Markdown, working directory, R projects |[**workflow**](notes/workflow/workflow.html) | |
|January 19 | select, filter, mutate | [**dplyr**](notes/dplyr/dplyr.html)| |
|January 24 | pipes, group_by, summarise| [**dplyr**](notes/dplyr/dplyr.html)| [slides](slides/dplyr2.html)|
| January 26 | if/else, loops, functions|[**intro programming**](notes/programming/prog_notes.html) |[slides](slides/intro_prog.html) |
| January 31 |vectors, lists and eda |[**more prog**](notes/programming/prog_notes.html) and [**EDA**](notes/EDA/EDA.html) | [prog](slides/intro_prog.html) and [eda](slides/EDA.html)|
|February 2| spread, gather |[**tidy data**](notes/tidy_data/tidy_data.html) | |
|February 7| inner, outer joins ([Marshall Markham](https://www.linkedin.com/in/marshall-markham-840ab24a/)) |[**slides**](slides/joins.pdf) | |
|February 9| match, extract, replace with `stringr`| [**regular expressions**](notes/strings/strings.html) |[slides](slides/regex.html) |
|February 14| look ahead/behinds|[**regular expressions**](notes/strings/strings.html) | |
|February 16| | | |
|February 21| least squares, factors, `lm()`| [**linear regression**](notes/linear_regression/linear_regression.html)| |
|February 23|test set, nonlinear, interactions|[**predictive modeling**](notes/predictive_modeling/predictive_modeling.html) | [slides](slides/predictive_modeling.html) |
|February 28| exploratory data analysis| | |
|March 2| more exploratory analysis| | |
|March 7| nearest centroid, KNN | [**classification**](notes/classification/classification.html)|[slides](slides/classification.html) |
|March 9| cross-validation| [**cross-validation**](notes/cross_validation/cross_validation.html) | [slides](slides/cv_slides.html)|
|March 21|support vector machine, kernels| [**more classification**](notes/more_classification/more_classification.html) | |
|March 23| |[**more classification**](notes/more_classification/more_classification.html) |[slides](slides/slides_more_classification.html) |
|March 28| more classification| | |
|March 30| APIs ([Marshall Markham](https://www.linkedin.com/in/marshall-markham-840ab24a/))|[**APIs**](notes/API/APIs.html) | [slides](slides/API.pdf) |
|April 4|web scraping, twitter, ggplot|[**web scraping**](notes/web_scraping/web_scraping.html), [**rtweets**](notes/API/twitter/rtweet.html), [**custom viz**](notes/custom_viz/custom_viz.html) | |
|April 6| [Shiny](https://shiny.rstudio.com/) ([Frances Tong](https://www.linkedin.com/in/francestong))| [**shiny**](notes/shiny.html)| |
|April 11| communication | [**effective communication**](notes/communication/communication.html)|[slides](slides/communication.pdf) |
|April 13|natural language processing| [**TidyText**](http://tidytextmining.com/tidytext.html) |[NLP](slides/nlp_intro.pdf), [guest Dan Yang](slides/dan_yang.pdf) |
|April 18|tf-idf, stemming | [**text classification**](notes/natural_language_processing/text_classification.html) | [slides](slides/tfidf_slides.html)|
|April 20| K-means,([Ryan Thornburg](https://ryanthornburg.com/)) |[**clustering**](notes/clustering/clustering.html) |[slides](notes/clustering/clustering_slides.html) |
|April 25|in class presentations|[in class presentations](final_project/class_presentation.html) | |
|April 27|data ethics, [Quinn Underriner](https://www.linkedin.com/in/quinn-underriner-69138249/) |[**data ethics**](slides/Data_Ethics.pdf) | [course summary](slides/course_summary.pdf)|
- the datasets use in the class are on [data.world](https://data.world/iain/stor-390/) and [github](https://github.com/idc9/stor390/tree/master/data/)
- all of the course material is on the [github repo](https://github.com/idc9/stor390) including
- [.Rmd files](https://github.com/idc9/stor390/tree/master/notes/) for the notes
- [example code](https://github.com/idc9/stor390/tree/master/example_code/) are also on github
- most of the course material is in the lecture notes (linked to above) and reading -- the slides are visual aids for the lectures.
- options for [extra credit](course_info/extra_credit.html)
# **Reading**
Readings should be complete by the following class. There are three (free) primary references:
- [R for Data Science](http://r4ds.had.co.nz/) (r4ds)
- [Introduction to Statistical Learning with Applications in R](http://www-bcf.usc.edu/~gareth/ISL/) (ISLR)
- [Text Mining with R](http://tidytextmining.com/) (TMR)
**January 12**
- read through the [getting started notes](https://idc9.github.io/stor390/notes/getting_started/getting_started.html)
- read [before we start](http://www.datacarpentry.org/R-ecology-lesson/00-before-we-start.html) from data carpentry
- sections 1, 2, 3.1-3.5 of r4ds
- I suggest copying/pasting and running some of the example code
- [Data science done well looks easy - and that is a big problem for data scientists](http://simplystatistics.org/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists/)
- This tutorial may be helpful for [getting started with R Markdown](http://stat545.com/block007_first-use-rmarkdown.html)
- (Optional) [For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html) and [David Mimno's response](http://www.mimno.org/articles/carpentry/)
- [big data is like teenage sex](https://www.facebook.com/dan.ariely/posts/904383595868)
**January 17**
- sections 3.5 - 3.10 ([data viz](http://r4ds.had.co.nz/data-visualisation.html)) and section 8 ([workflow](http://r4ds.had.co.nz/workflow-projects.html)) from r4ds
- [Reproducibility is not just for researchers](http://www.dataschool.io/reproducibility-is-not-just-for-researchers/)
- (optionally) [The real reason reproducible research is important](http://simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important/) from Simply Statistics
**January 19**
- section 5 ([data transformation](http://r4ds.had.co.nz/transform.html)) from r4ds
- (optionally) [the dplyr flights vignettes](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html)
**January 24**
- section 18 ([pipes](http://r4ds.had.co.nz/pipes.html)) and sections 19.1-19.4 ([functions](http://r4ds.had.co.nz/functions.html)) from r4ds
**January 26**
- section 19.5-7 and sections 21.1-21.3 ([loops](http://r4ds.had.co.nz/iteration.html)) from r4ds
**January 31**
- r4ds section 12 ([tidy data](http://r4ds.had.co.nz/tidy-data.html))
**February 2**
- r4ds section 13.1-13.4 ([relational data](http://r4ds.had.co.nz/relational-data.html))
- read [about non-tidy data](http://simplystatistics.org/2016/02/17/non-tidy-data/)
- (optional) [how to share data with a statistician](https://github.com/jtleek/datasharing)
**February 7**
- r4ds chapter 14 ([strings](http://r4ds.had.co.nz/strings.html#introduction-8))
- (optional) the rest of the relational data chapter (13.5-13.7)
**February 16**
- r4ds section 22, 23.1-23.3 ([models](http://r4ds.had.co.nz/model-intro.html))
**February 21**
- r4ds sections 23.4-23.6 ([models](http://r4ds.had.co.nz/model-intro.html))
- alternate sources for regression (optional)
- [ISLR](http://www-bcf.usc.edu/~gareth/ISL/) sections 3.1-3.2
- [Machine Learning for Hackers](https://www.amazon.com/Machine-Learning-Hackers-Studies-Algorithms/dp/1449303714) chapter 5
- [Machine Learning a Probabilistic Perspective](https://www.amazon.com/Machine-Learning-Probabilistic-Perspective-Computation/dp/0262018020) chapter 7
**February 23**
- [ISLR](http://www-bcf.usc.edu/~gareth/ISL/) sections 2.1-2.2 for an overview of modeling
- r4ds [chapter 24](http://r4ds.had.co.nz/model-building.html)
**February 28**
- [ISLR](http://www-bcf.usc.edu/~gareth/ISL/) sections 6.1-6.1 on variable selection
**March 2**
- make sure you have read and understand linear regression and model selection from [ISLR](http://www-bcf.usc.edu/~gareth/ISL/) i.e. section 2.1, 2.2, 3.1, 3.2, 6.1, 6.2
- ISLR sections 4.1-4.3 about classification
**March 9**
- [What is artificial intelligence?](http://simplystatistics.org/2017/01/19/what-is-artificial-intelligence/) by Jeff Leek
**March 21**
- [An example that isn't that artificial or intelligent
](http://simplystatistics.org/2017/01/20/not-artificial-not-intelligent/)
- be prepared to discuss this article and the other AI article from Jeff Leek in class on Thursday
- [ISLR sections 9.1-9.2](http://www-bcf.usc.edu/~gareth/ISL/) on support vector machines
**March 28**
- [ISLR section 9.3, 9.4](http://www-bcf.usc.edu/~gareth/ISL/) on SVMs.
- [ISLR section 5.1](http://www-bcf.usc.edu/~gareth/ISL/) on cross validation.
- read through the [custom viz notes](https://idc9.github.io/stor390/notes/custom_viz/custom_viz.html)
**Apri 4**
- read the [markup section](https://en.wikipedia.org/wiki/HTML#Markup) of Wikipedia about HTML
**April 11**
- read the following three short articles about communication
- [so what?](http://www.storytellingwithdata.com/blog/2017/3/22/so-what) -- convey message
- [how to asking questions](https://stackoverflow.com/help/how-to-ask)
- [reproducible examples](http://adv-r.had.co.nz/Reproducibility.html)
**April 13**
- [chapter 1](http://tidytextmining.com/tidytext.html) of Text Mining with R
**April 18**
- [chapter 3](http://tidytextmining.com/tfidf.html) from TMR
- (optional) [chapter 4](http://tidytextmining.com/ngrams.html) from TMR
- [stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
**April 20**
- [ISLR](http://www-bcf.usc.edu/~gareth/ISL/) sections 10.1, 10.3.1, and 10.3.3 about K-means and clustering
**April 25**
- [interview with Cathy O'Neil on weapons of math destruction](https://qz.com/819245/data-scientist-cathy-oneil-on-the-cold-destructiveness-of-big-data/)
- [Unroll.me Service Faces Backlash Over a Widespread Practice: Selling User Data](https://www.nytimes.com/2017/04/24/technology/personal-data-firm-slice-unroll-me-backlash-uber.html)
- [Data, privacy, and the greater good](http://erichorvitz.com/data_privacy_greater_good.pdf) (optional, but super interesting)
# **Homework**
The due date is in the link.
| Assigned | Labs | Assignments | In class exercises |
|----------|------|-------------|--------------------|
| January 12 | [data.gov lab 1](labs/1/gov_data.html) | | |
| January 17 |[reproducible data.gov lab 2](labs/2/reproducible_gov_data.html) | | [command line](class_exercises/command_line/command_line.html) and [swirl](class_exercises/swirl/swirl.html)|
| January 19 | |[dplyr and unc departments](assignments/unc_depts/unc_depts.html) | |
| January 26 |[prog lab 3](labs/3/prog_lab.html) | | |
| February 2 | [whales and tidy data lab 4](labs/4/tidy_lab.html) | | |
| February 7 |[joins lab 5](labs/5/joins_lab.html)|| |
| February 9 |[strings lab 6](labs/6/strings_lab.html)| | |
| February 16 | | [harry potter](assignments/harry_potter/harry_potter.html)| |
| February 23 |[Ira Glass on overfitting](labs/7/overfitting_with_scissors) | | |
| February 28 | | [bikes, EDA and predictive modeling](assignments/bikes/bike_sharing.html) |
| March 21 | | [classification: does your iPhone know what you're doing?](assignments/iphone/classification.html) |
| March 21 | | [extra credit: naive bayes](assignments/iphone/naive_bayes.html) | |
| March 23 | | |[What is AI?](class_exercises/AI/what_is_AI.html) |
| March 30 |[APIs](labs/8/moives_api.html) | | |
| April 4 |[Web scraping](labs/9/web_scraping.html) | | |
| April 11 |[effective viz](labs/10/plots_lab.html) | | |
| April 25 | | | [final project presentation](final_project/class_presentation.html)|
| April 27 | | | [data ethics](https://goo.gl/forms/9o91AoU1YlZKpzIv2)|
# **Final project**
See [**this document for the final project description**](final_project/description.html).
| Deliverable | Due Date |
| ------------|----------|
|[proposal](final_project/proposal.html) | 4/6 |
|[exploratory analysis](final_project/exploratory_analysis.html) | 4/25|
|[final analysis](final_project/technical_documents.html) | 5/7|
|[blog post](final_project/blog_post.html) | 5/9 |
I've posted some ideas for project ideas in [this document](final_project/project_ideas.html).
# **Additional resources**
- [the extra resources page](resources/extra_resources.html) contains a number of books, blogs, MOOCS and other courses about data science
- [a collection of lot's of datasets](resources/dataset_collection.html)
- [finding a job/internship](resources/jobs.html)
# **Miscellaneous**
This course was made possible by a grant from the [Data@Carolina](http://data.web.unc.edu/) initiative and a ton of [**input from lots of very smart people**](course_info/acknowledgments.html).
This page was last updated on `r Sys.time()` Eastern Time.