-
Notifications
You must be signed in to change notification settings - Fork 40
/
09-reproducibleresearch2.Rmd
433 lines (262 loc) · 25 KB
/
09-reproducibleresearch2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
# Reproducible research #2
[Download slides for Nichole Monhait's guest lecture](https://github.com/nmonhait/erhs_535/raw/master/Data%20Products%20and%20Interactivity%20with%20R.pdf)
Nichole's full repo for the lecture is at: https://github.com/nmonhait/erhs_535.
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week9.pdf) a pdf of the lecture slides covering this topic.
```{r echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_knit$set(error = TRUE)
```
<!-- ## R Notebooks -->
<!-- From RStudio's [article on R Notebooks](http://rmarkdown.rstudio.com/r_notebooks.html): -->
<!-- > "An R Notebook is an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input." -->
<!-- R Notebooks are a new feature. Right now, if you want to use them, you need to update to RStudio's Preview version. You can get that [here](https://www.rstudio.com/products/rstudio/download/preview/). -->
<!-- You can render an R Notebook document to a final, static version (e.g., pdf, Word, HTML) just like an R Markdown file. -->
<!-- Therefore, you can use R Notebooks as an alternative to R Markdown, with the ability to try out and change chunks interactively as you write the document. -->
<!-- You can open a new R Notebook file by going in RStudio to "File" -> "New File". In the Preview version of RStudio, there's an option there for "R Notebook". -->
<!-- As with R Markdown files, when you choose to create a new R Notebook file, RStudio opens a skeleton file with some example text and formatting already in the file. -->
<!-- The syntax is very similar to an R Markdown file, but the YAML now specifies: -->
<!-- ``` -->
<!-- output: html_notebook -->
<!-- ``` -->
## Templates
R Markdown **templates** can be used to change multiple elements of the style of a rendered document. You can think of these as being the document-level analog to the themes we've used with `ggplot` objects.
To do this, some kind of style file is applied when rendering document. For HTML documents, Cascading Style Sheets (CSS) (`.css`) can be used to change the style of different elements. For pdf files, LaTeX package (style) files (`.sty`) are used.
To open a new R Markdown file that uses a template, in RStudio, go to "File" -> "New File" -> "R Markdown" -> "From Template".
Different templates come with different R packages. A couple of templates come with the `rmarkdown` package, which you likely already have.
Many of these templates will only render to pdf.
To render a pdf from R Markdown, you need to have a version of TeX installed on your computer. Like R, TeX is open source software. RStudio recommends the following installations by system:
- For Macs: MacTeX
- For PCs: MiKTeX
Links for installing both can be found at http://www.latex-project.org/ftp.html
Current version of TeX: 3.14159265.
The `tufte` package has templates for creating handouts typeset like Edward Tufte's books.
This package includes templates for creating both pdf and HTML documents in this style.
The package includes special functions like `newthought`, special chunk options like `fig.fullwidth`, and special knitr engines like `marginfigure`. Special features available in the tufte template include:
- Margin references
- Margin figures
- Side notes
- Full width figures
The `rticles` package has templates for several journals:
- *Journal of Statistical Software*
- *The R Journal*
- *Association for Computing Machinery*
- ACS publications (*Journal of the American Chemical Society*, *Environmental Science & Technology*)
- Elsevier publications
Some of these templates create a whole directory, with several files besides the .Rmd file. For example, the template for *The R Journal* includes:
- The R Markdown file in which you write your article
- "RJournal.sty": A LaTeX package (style) file specific to *The R Journal*. This file tells LaTeX how to render all the elements in your article in the style desired by this journal.
- "RJreferences.bib": A BibTeX file, where you can save citation information for all references in your article.
- "Rlogo.png": An example figure (the R logo).
Once you render the R Markdown document from this template, you'll end up with some new files in the directory:
- "[your file name].tex": A TeX file with the content from your R Markdown file. This will be "wrapped" with some additional formatting input to create "RJwrapper.tex".
- "RJwrapper.tex": A TeX file that includes both the content from your R Markdown file and additional formatting input. Typically, you will submit this file (along with the BibTeX, any figure and image files, and possibly the style file) to the journal.
- "RJwrapper.pdf": The rendered pdf file (what the published article would look like)
This template files will often require some syntax that looks more like LaTeX than Markdown.
For example, for the template for *The R Journal*, you need to use `\citep{}` and `\citet{}` to include citations. These details will depend on the style file of the template.
As a note, you can always use raw LaTeX in R Markdown documents, not just in documents you're creating with a template. You just need to be careful not to mix the two. For example, if you use a LaTeX environment to begin an itemized list (e.g., with `begin{itemize}`), you must start each item with `item`, not `-`.
You can create your own template. You create it as part of a custom R package, and then will have access to the template once you've installed the package. This can be useful if you often write documents in a certain style, or if you ever work somewhere with certain formatting requirements for reports.
RStudio has full instructions for creating your own template: http://rmarkdown.rstudio.com/developer_document_templates.html
## R Projects
### Organization
So far, you have run much of your analysis within a single R script or R Markdown file. Often, any associated data are within the same working directory as your script or R Markdown file, but the files for one project are not separated from files for other projects.
As you move to larger projects, this kind of set-up won't work as well. Instead, you'll want to start keeping all materials for a project in a single and exclusive directory.
Often, it helps to organize the files in a project directory into subdirectories. Common subdirectories include:
- `data-raw`: Raw data and R scripts to clean the raw data.
- `data`: Cleaned data, often saved as `.RData` after being generated by a script in `data-raw`.
- `R`: Code for any functions used in analysis.
- `reports`: Any final products rendered from R Markdown and their original R Markdown files (e.g., paper drafts, reports, presentations).
### Creating R Projects
RStudio allows you to create "Projects" to organize code, data, and results within a directory. When you create a project, RStudio adds a file with the extension ".Rproj" to the directory.
There are some advantages to setting a directory to be an R Project. The project:
- Automatically uses the directory as your current working directory when you open the project.
- Coordinates well with git version control and GitHub repository system.
- Opens a "Files" window for navigating project files in an RStudio pane when you open the project.
You can create a new project from scratch or from an existing directory.
To create an R project from a working directory, in RStudio go to "File" -> "New Project" -> "New Directory". You can then choose where you want to save the new project directory.
## git
Git is a version control system.
It saves information about all changes you make on all files in a repository. This allows you to revert back to previous versions and search through the history for all files in the repository.
Git is open source. You can download it for different operating systems here:
https://git-scm.com/downloads
You will need git on your computer to use git with RStudio and create local git repositories you can sync with GitHub repositories.
Before you use git, you should configure it. For example, you should make sure it has your name and email address.
You can configure git with commands at the shell. For example, I would run the following code at a shell to configure git to have my proper user name and email:
```
git config --global user.name "Brooke Anderson"
git config --global user.email "[email protected]"
```
Sometimes, RStudio will automatically find git (once you've installed git) when you start RStudio.
However, in some cases, you may need to take some more steps to activate git in RStudio. To do this, go to "RStudio" -> "Preferences" -> "Git/SVN". Choose "Enable version control". If RStudio doesn't find your version of git in the "Git executable" box, browse for it.
### Initializing a git repository
You can initialize a git repository using commands from the shell. To do that, take the following steps (first check that it is not already a git repository):
1. Use a shell ("Terminal" on Macs) to navigate to to that directory. You can use `cd` to do that (similar to `setwd` in R).
2. Once you are in the directory, run `git status`. If you get the message `fatal: Not a git repository (or any of the parent directories): .git`, it is not yet a git repository.
3. If you do not get an error from `git status`, the directory is already a repository. If you do get an error, run `git init` to initialize it as a repository.
For example, if I wanted to make the "fars_analysis" directory, which is a direct subdirectory of my home directory, a git repository, I could open a shell and run:
```
cd ~/fars_analysis
git init
```
You can also initialize a git repository for a directory that is an R Project directory through R Studio.
1. Open the Project.
2. Go to "Tools" -> "Version Control" -> "Project Setup".
3. In the box for "Version control system", choose "Git".
**Note:** If you have just installed git, and have not restarted RStudio, you'll need to do that before RStudio will recognize git. If you do not see "Git" in the box for "Version control system", it means either that you do not have git installed on your computer or that RStudio was unable to find it.
Once you initialize the project as a git repository, you should have a "Git" window in one of your RStudio panes (top right pane by default).
As you make and save changes to files, they will show up in this window for you to commit. For example, this is what the Git window for our coursebook looks like when I have changes to the slides for week 9 that I need to commit:
```{r echo = FALSE, out.width="0.8\\textwidth", fig.align="center"}
knitr::include_graphics("figures/ExampleGitWindow.png")
```
### Committing
When you want git to record changes, you *commit* the files with the changes. Each time you commit, you have to include a short commit message with some information about the changes.
You can make commits from a shell. However, in this course we'll just make commits from the RStudio environment.
To make a commit from RStudio, click on the "Commit" button in the Git window. That will open a separate commit window that looks like this:
```{r echo = FALSE, out.width="0.8\\textwidth", fig.align="center"}
knitr::include_graphics("figures/ExampleCommitWindow.png")
```
In this window, to commit changes:
1. Click on the files you want to commit to select them.
2. If you'd like, you can use the bottom part of the window to look through the changes you are committing in each file.
3. Write a message in the "Commit message" box. Keep the message to one line in this box if you can. If you need to explain more, write a short one-line message, skip a line, and then write a longer explanation.
4. Click on the "Commit" button on the right.
Once you commit changes to files, they will disappear from the Git window until you make and save more changes in them.
### Browsing history
On the top left of the Commit window, you can toggle to "History". This window allows you to explore the history of commits for the repository.
```{r echo = FALSE, out.width="0.8\\textwidth", fig.align="center"}
knitr::include_graphics("figures/ExampleHistoryWindow.png")
```
GitHub allows you to host git repositories online. This allows you to:
- Work collaboratively on a shared repository
- Fork someone else's repository to create your own copy that you can use and change as you want
- Suggest changes to other people's repositories through pull requests
To push local repositories to GitHub and fork other people's repositories, you will need a GitHub account.
You can sign up at https://github.com. A free account is fine.
The basic unit for working in GitHub is the repository. You can think of a repository as very similar to an R Project--- it's a directory of files with some supplemental files saving some additional information about the directory.
While R Projects have this additional information saved as an ".RProj" file, git repositories have this information in a directory called ".git". Because this pathname starts with a dot, it won't show up in many of the ways you list files in a directory. From a shell, you can see files that start with `.` by running `ls -a` from within that directory.
### Linking local repo to GitHub repo
If you have a local directory that you would like to push to GitHub, these are the steps to do it.
First, you need to make sure that the directory is under git version control. See the notes on initializing a repository.
Next, you need to create an empty repository on GitHub to sync with your local repository. Do that by:
1. In GitHub, click on the "+" in the upper right corner ("Create new").
2. Choose "Create new repository".
3. Give your repository the same name as the local directory you'd like to connect it to. For example, if you want to connect it to a directory called "fars_analysis" on your computer, name the repository "fars_analysis".
4. Leave everything else as-is (unless you'd like to add a short description in the "Description" box). Click on "Create repository" at the bottom of the page.
Now you are ready to connect the two repositories.
First, you'll want to change some settings in RStudio so GitHub will recognize that your local repository belongs to you, rather than asking for you password every time.
- In RStudio, go to "RStudio" -> "Preferences" -> "Git / svn". Choose to "Create RSA key".
- Click on "View public key". Copy what shows up.
- Go to your GitHub account and navigate to "Settings". Click on "SSH and GPG keys".
- Click on "New SSH key". Name the key something like "RStudio" (you might want to include the device name if you'll have SSH keys from RStudio on several computers). Past in your public key in the "Key box".
### Syncing RStudio and GitHub
Now you're ready to push your local repository to the empty GitHub repository you created.
1. Open a shell and navigate to the directory you want to push. (You can open a shell from RStudio using the gear button in the Git window.)
2. Add the GitHub repository as a remote branch with the following command (this gives an example for adding a GitHub repository named "ex_repo" in my GitHub account, "geanders"):
```
git remote add origin [email protected]:geanders/ex_repo.git
```
3. Push the contents of the local repository to the GitHub repository.
```
git push -u origin master
```
To pull a repository that already exists on GitHub and to which you have access (or that you've forked), first use `cd` to change a shell into the directory where you want to put the repository then run `git clone` to clone the repository locally. For example, if I wanted to clone a GitHub repository called "ex_repo" in my GitHub account, I would run:
```
git clone [email protected]:geanders/ex_repo.git
```
Once you have linked a local R project with a GitHub repository, you can push and pull commits using the blue down arrow (pull from GitHub) and green up arrow (push to GitHub) in the Git window in RStudio.
```{r echo = FALSE, out.width="0.8\\textwidth", fig.align="center"}
knitr::include_graphics("figures/ExampleGitWindow.png")
```
GitHub helps you work with others on code. There are two main ways you can do this:
- **Collaborating:** Different people have the ability to push and pull directly to and from the same repository. When one person pushes a change to the repository, other collaborators can immediately get the changes by pulling the latest GitHub commits to their local repository.
- **Forking:** Different people have their own GitHub repositories, with each linked to their own local repository. When a person pushes changes to GitHub, it only makes changes to his own repository. The person must issue a pull request to another person's fork of the repository to share the changes.
### Issues
Each original GitHub repository (i.e., not a fork of another repository) has a tab for "Issues". This page works like a Discussion Forum.
You can create new "Issue" threads to describe and discuss things that you want to change about the repository.
Issues can be closed once the problem has been resolved. You can close issues on the "Issue" page with the "Close issue" button.
If a commit you make in RStudio closes an issue, you can automatically close the issue on GitHub by including "Close #[issue number]" in your commit message and then pushing to GitHub.
For example, if issue #5 is "Fix typo in section 3", and you make a change to fix that typo, you could make and save the change locally, commit that change with the commit message "Close #5", and then push to GitHub, and issue #5 in "Issues" for that GitHub repository will automatically be closed, with a link to the commit that fixed the issue.
### Pull request
You can use a *pull request* to suggest changes to a repository that you do not own or otherwise have the permission to directly change.
You can also use pull requests within your own repositories. Some people will create a pull request every time they have a big issue they want to fix in one of their repositories.
In GitHub, each repository has a "Pull requests" tab where you can manage pull requests (submit a pull request to another fork or merge in someone else's pull request for your fork).
Take the following steps to suggest changes to someone else's repository:
1. Fork the repository
2. Make changes (locally or on GitHub)
3. Save your changes and commit them
4. Submit a pull request to the original repository
5. If there are not any conflicts and the owner of the original repository likes your changes, he or she can merge them directly into the original repository. If there are conflicts, these need to be resolved before the pull request can be merged.
### Merge conflicts
At some point, you will get *merge conflicts*. These happen when two people have changed the same piece of code in two different ways at the same time.
For example, say Rachel and are both working on local versions of the same repository, and I change a line to `mtcars[1, ]` while Rachel changes the same line to `head(mtcars, 1)`. Rachel pushes to the GitHub version of the repository before I do.
When I pull the latest commits to the GitHub repository, I will have a merge conflict for this line. To be able to commit a final version, I'll need to decide which version of the code to use and commit a version of the file with that code.
Merge conflicts can come up in a few situations:
- You pull in commits from the GitHub branch of a repository you've been working on locally.
- Someone sends a pull request for one of your repositories.
If there are merge conflicts, they'll show up like this in the file:
```
<<<<<<< HEAD
mtcars[1, ]
=======
head(mtcars, 1)
>>>>>>> remote-branch
```
To fix them, search for all these spots in files with conflicts, pick the code you want to use, and delete everything else.
For the example conflict, I might change the file from this:
```
<<<<<<< HEAD
mtcars[1, ]
=======
head(mtcars, 1)
>>>>>>> remote-branch
```
To this:
```
head(mtcars, 1)
```
Then you can save and commit the file.
### Find out more
If you'd like to find out more, Hadley Wickham has a great chapter on using git and GitHub with RStudio in his *R Packages* book:
http://r-pkgs.had.co.nz/git.html
## In-course exercise Chapter 9
<!-- ### Trying out templates -->
<!-- - Open a "Tufte" template document. To open a new document from a template, in RStudio go to "File" -> "New File" -> "R Markdown" -> "From Template". If "Tufte Handout" does not show up as a choice, install and load the `tufte` package, restart R, and try again. -->
<!-- - Render the document to an HTML and open it in a browser. Explore the rendered document. What elements does it include that are not typical for R Markdown documents? -->
<!-- - Now go and look at the .Rmd file. Find the spots where these new elements are specified in the R Markdown file. How are elements like margin plots and all caps at the starts of some paragraphs included in the document? -->
<!-- - Download and install the `rticles` package. Create a new document using the template for *The R Journal*. When you do that, a whole directory will be copied to your current working directory. Check out the files in that directory. If you have a version of TeX installed on your computer, try compiling the example file. -->
<!-- - Select three or four key journals in your field. Go to the journal's website and see if you can figure out: -->
<!-- + Do they allow authors to submit ".tex" files? -->
<!-- + If so, do they provide a LaTeX style file for their journal? -->
### Organizing a project
In this part of the group exercise, you will set up an R Project to use for the next homework assignment. You will want to set up a similar project for your final group project.
- First, you need to create a new project. In RStudio, go to "File" -> "New Project" -> "New Directory". Choose where you want to save this directory and what you want to name it.
- Once you open the project, one of the RStudio panes should have a tab called "Files". This shows the files in this project directory and allows you to navigate through them. Currently, you won't have any files other than the R project file (".Rproj"). As a next step, create several subdirectories. We'll use these to structure the data and R Markdown files for your homework. Create the following subdirectories (you can use the "New Folder" button in the RStudio "Files" pane):
+ `data`
+ `writing`
- Download the data for the homework from the *Washington Post's* GitHub page: https://github.com/washingtonpost/data-homicides. Save this data inside your R project in the `data`
subdirectory
### Initializing git for an R Project
- If you do not already have one, sign up for a GitHub account. The free option is fine.
- If you do not already have git installed on your computer, install it: https://git-scm.com/downloads
- Restart RStudio. go to “RStudio” -> “Preferences” ->“Git/SVN”. Choose “Enable version control”. If RStudio doesn’t find yourversion of git in the “Git executable” box, browse for it.
- Open your homework project in RStudio. Change your Project settings to initialize git for this project (see the course notes for tips on how to do that).
- Open a shell from R using the gear symbol in the "Git" pane you should now see in RStudio. Configure git
from this shell. For example, I would open a shell and run:
```
git config --global user.name "Brooke Anderson"
git config --global user.email "[email protected]"
```
Note that you only need to do this once (until you get a new computer or, maybe, update git).
- Go to the "Commit" window. Click on all of the files you see there, and then make an initial commit
using "Initial commit" as your commit message.
- The `writing` subdirectory will have your R Markdown file and its output. Create a new R Markdown file ("File" -> "New File" -> "R Markdown") and save it to this subdirectory. You can change the name and date for the file if you'd like. Delete all the text that comes as a default. Write a piece of code that lists the files you saved in the `data` subdirectory. Remember that the working directory for an R Markdown file is the directory in which it's saved, so you may need to use a relative pathname that goes up one directory (`..`) and then goes into `data` subdirectory.
- Commit this change using the "Commit" window. After you commit the changes, look at the "History" window to see the history of your commits.
### Linking your project with a GitHub repository
(See course notes for more on these steps.)
- Login to your GitHub account online. Create an empty GitHub repository for the project.
Give it the same name as the name of your R project directory.
- If you do not already have an RSA kay, create one in RStudio and add it as an SSH key in your
GitHub settings. If you already have a key (you almost certainly know if you do), see if you can copy it and
submit it in GitHub.
- Set this empty online GitHub repository as the remote branch of your local git repository for the project.
- Push your local repository to this GitHub repository.
- Go to your GitHub account and make sure the repository was pushed.
- Try making some more changes to your local repository. Commit the changes, then use the green up arrow in the Git window to push the changes to your GitHub repository.