
# (PART) Part I: Preliminaries {-}

```{r echo = FALSE}
library(fortunes)
```

# R Preliminaries 

The video lectures for this chapter are embedded at relevant places in the text, 
with links to download a pdf of the associated slides for each video. 
You can also access [a full playlist for the videos for this chapter](https://www.youtube.com/playlist?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P).

## Objectives

After this chapter, you should: 

- Know what free and open source software is and some of its advantages over proprietary software
- Understand the difference between R and RStudio
- Be able to download both R and RStudio to your own computer
- Understand that R has a basic core of code that you initially download, and
that this "base R" can be expanded by installing a variety of packages
- Be able to install a package from CRAN to your computer
- Be able to load a package that you have installed to use its functions within an R session
- Be able to access help documentation (vignettes, helpfiles) for a package and its functions
- Be able to submit R expressions at the console prompt to communicate with R
- Understand the structure for calling a function and specifying options for that function
- Know what an R object is and how to assign an R object a name to reference it in later code
- Be able to create vector objects of numeric and character classes
- Be able to explore and extract elements from vector objects
- Be able to create dataframe objects
- Be able to explore and extract elements from dataframe objects
- Be able to describe the difference between running R code from the console
versus writing and running R code in an R script

## R and R Studio

<iframe width="768" height="480" src="https://www.youtube.com/embed/9S3ceDvG_78?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1_part_1.pdf) 
a pdf of the lecture slides for this video.

### What is R?

R is an open-source programming language that evolved from the S language. The S
language was developed at Bell Labs in the 1970s, which is the same place (and
about the same time) that the C programming language was developed.

R itself was developed in the 1990s--2000s at the University of Auckland. It is
open-source software, freely and openly distributed under the GNU General Public
License (GPL). The base version of R that you download when you install R on
your computer includes the critical code for running R, but you can also install
and run "packages" that people all over the world have developed to extend R.

With new developments, R is becoming more and more useful for a variety of
programming tasks. However, where it really shines is in working with data and
doing statistical analysis. R is currently popular in a number of fields,
including:

- Statistics
- Machine learning
- Data analysis

R is an **interpreted language**. That means that you can communicate with it 
interactively, from a command line. Other common interpreted languages include
Python and Perl.

```{r interpreted-language, echo = FALSE, out.width = "600pt", fig.align = "center", fig.cap = "Broad types of software programs. R is an interpreted language. 'Point-and-click' programs, like Excel and Word, are often easiest for a new user to get started with, but are slower for the computer and are restricted in the functionality they offer. By contrast, compiled languages (like C and Java), assembly languages, and machine code are faster for the computer and allow you to create a wider range of things, but can take longer to code and take longer for a new user to learn to work with."}
knitr::include_graphics("figures/program_types2.jpg")
```

R has some of the same strengths (quick and easy to code, interfaces well with
other languages, easy to work interactively) and weaknesses (slower than
compiled languages) as Python. For data-related tasks, R and Python are fairly
neck-and-neck (with Julia an up-and-coming option). However, R is still the
first choice of statisticians in most fields, so I would argue that R has an
advantage if you want to have access to cutting-edge statistical methods.

> "The best thing about R is that it was developed by statisticians. The worst thing about R is that... it was developed by statisticians."
> -Bo Cowgill, Google, at the Bay Area R Users Group

### Free and open-source software

> "Life is too short to run proprietary software." -- Bdale Garbee

R is **free and open-source software**. Many other popular statistical
programming languages, conversely, are proprietary (for example, SAS and SPSS).
It's useful to know what it means for software to be "open-source", both
conceptually and in terms of how you will be able to use and add to R in your
own work.

R is free, and it's tempting to think of open-source software just as "free
software". Things, however, are a little more subtle than that. It helps to
consider some different meanings of the word "free". "Free" can mean:

- *Gratis*: Free as in beer
- *Libre*: Free as in speech

```{r open-source-overview, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "An overview of how software can be each type of free (beer and speech). For software programs developed using a compiled programming language, the final product that you open on your computer is run by machine-readable binary code. A developer can give you this code for free (as in beer) without sharing any of the original source code with you. This means you can't dig in to figure out how the software works and how you can extend it. By contrast, open-source software (free as in speech) is software for which you have access to the human-readable code that was used as an input in creating the software binaries. With open-source code, you can figure out exactly how the program is coded."}
knitr::include_graphics("figures/OpenSourceOverview.png")
```

Open-source software is the *libre* type of free (Figure
\@ref(fig:open-source-overview)). This means that, with software that is
open-source, you can:

- Access all of the code that makes up the software
- Change the code as you'd like for your own applications
- Build on the code with your own extensions
- Share the software and its code, as well as your extensions, with others

Often, open-source software is also free, making it "free and open-source software", 
or "FOSS".

Popular open source licenses for R and R packages include the GPL and MIT licenses.

> “Making Linux GPL'd was definitely the best thing I ever did.” -- Linus Torvalds

In practice, this means that, once you are familiar with the software, you can
dig deeply into the code to figure out exactly how it's performing certain
tasks. This can be useful for finding and eliminating bugs, and also can
help researchers figure out if there are any limitations in how the code works
for their specific research.

It also means that you can build your own software on top of existing R software
and its extensions. I explain a bit more about R packages a bit later, but this
open-source nature of R (and other languages, including Python) has created a
large community of people worldwide who develop and share extensions to R. As a
result, you can pull in packages that let you do all kinds of things in R, like
visualizing Tweets, cleaning up accelerometer data, analyzing complex surveys,
fitting machine learning models, and a wealth of other cool things.

> "Despite its name, open-source software is less vulnerable to hacking than the secret, black box systems like those being used in polling places now. That’s because anyone can see how open-source systems operate. Bugs can be spotted and remedied, deterring those who would attempt attacks. This makes them much more secure than closed-source models like Microsoft’s, which only Microsoft employees can get into to fix." -- [Woolsey and Fox. *To Protect Voting, Use Open-Source Software.* New York Times. August 3, 2017.](https://www.nytimes.com/2017/08/03/opinion/open-source-software-hacker-voting.html?mcubz=3)

You can download the latest version of R from
[CRAN](https://cran.r-project.org). Be sure to select the distribution for your
type of computer system; the installer's defaults should be fine for everything.
R is updated occasionally, so you should plan to re-install R at least once a
year to make sure you're working with one of the newer versions. To check
whether your current version is outdated, one option is to run `sessionInfo()`
at the R console.
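
As a minimal sketch of checking your version, the code below shows two options.
`R.version.string` is part of base R but isn't mentioned elsewhere in this
chapter, so treat it as one extra option rather than the required approach. The
output will differ from computer to computer.

```{r}
# A one-line description of the version of R you're running
R.version.string

# More detail: the R version, your operating system, and any loaded packages
sessionInfo()
```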

> "The R engine ... is pretty well uniformly excellent code but you
have to take my word for that. Actually, you don't. The whole engine is open source so, if you wish, you can check every line of it. If people were out to push dodgy software, this is not the way they'd go about it."
   - Bill Venables, R-help (January 2004)

> “Talk is cheap. Show me the code.” - Linus Torvalds

<iframe width="659" height="412" src="https://www.youtube.com/embed/DfW4vskkWbY?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1_part_2.pdf) 
a pdf of the lecture slides for this video.

### What is RStudio?

To get the R software, you'll [download R](https://www.r-project.org) from the R
Project for Statistical Computing. This is enough for you to use R on your own
computer. However, I would suggest one additional, free piece of software to
improve your experience while working with R: RStudio.

RStudio is an integrated development environment (IDE) for R. This basically
means that it provides you an interface for running R and coding in R, with a
lot of nice extras that will make your life easier.

You download RStudio separately from R---you'll want to download and install R
itself first, and then you can [download
RStudio](https://www.rstudio.com/products/rstudio/download2/). You want the
Desktop version with the free license. Defaults should be fine for everything.

RStudio (the company) is a leader in the R community. Currently, the company:

- Develops and freely provides the RStudio IDE
- Provides excellent resources for learning and using R (e.g., cheatsheets, free
online books)
- Is producing some of the most-used R packages
- Employs some of the top people in R development
- Is a key member of The R Consortium (others include Microsoft, IBM, and Google)

R has been advancing by leaps and bounds in terms of what it can do and the
elegance with which it does it, in large part because of the enormous
contributions of people involved with RStudio.

<iframe width="659" height="412" src="https://www.youtube.com/embed/SYk8LTuB-bY?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1_part_3.pdf) 
a pdf of the lecture slides for this video.

## Communicating with R

Because R is an interpreted language, you can communicate with it interactively. You do
this using the following general steps: 

1. Open an **R session**
2. At the **prompt** in the **console**, enter an **R expression**
3. Read R's "response" (the **output**)
4. Repeat 2 and 3
5. Close the R session

### R sessions, the console, and the command prompt

An **R session** is an instance of you using R. To open an R session,
double-click on the icon for "RStudio" on your computer. When RStudio opens, you
will be in a "fresh" R session, unless you restore a saved session (which I
strongly recommend against). This means that, once you open RStudio, you will
need to "set up" your session, including loading any packages you need (which
we'll talk about later) and reading in any data (which we'll also talk about).

In RStudio, the screen is divided into several "panes". We'll start with the
pane called "Console". The **console** is where you "talk" to R, by typing an
**expression** at the **prompt** (the greater-than symbol, ">"). You press the
"Return" key to send this message to R.

```{r r-console, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "Finding the 'Console' pane and the command prompt in RStudio."}
knitr::include_graphics("figures/r_console.jpg")
```

Once you press "Return", R will respond in one of three ways:

1. R does whatever you asked it to do with the expression and prints the output
(if any) of doing that, as well as a new prompt so you can ask it something new
2. R doesn't think you've finished asking it something, and instead of giving you
a new prompt (">") it gives you a "+". This means that R is still listening, waiting 
for you to finish asking it something. 
3. R tries to do what you asked it to, but it can't. It gives you an **error message**, 
as well as a new prompt so you can try again or ask it something new. 
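
The three kinds of response above can be sketched in code as follows. This is
only an illustration: the `try` function isn't covered in this chapter, and it's
used here just so the example keeps running after the error message is shown.

```{r}
# 1. A complete expression: R evaluates it, prints the output, and
#    gives you a new prompt
1 + 2

# 2. An expression that isn't complete at the end of its first line.
#    Typed at the console, R would show "+" after the first line and
#    wait for the rest
1 +
  2

# 3. An expression R can't evaluate, so it responds with an error message
try(log("a"))
```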

### R expressions, function calls, and objects

To "talk" with R, you need to know how to give it a complete **expression**. 
Most expressions you'll want to give R will be some combination of two elements: 

1. **Function calls**
2. **Object assignments**

We'll go through both these pieces and also look at how you can combine them 
together for some expressions.

According to John Chambers, one of the creators of R's precursor S:

1. Everything that exists in R is an **object**
2. Everything that happens in R is a **call to a function**

<iframe width="659" height="412" src="https://www.youtube.com/embed/ZMWDZeXUySw?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1_part_4.pdf) 
a pdf of the lecture slides for this video.

## Functions

In general, function calls in R take the following structure: 

```{r eval = FALSE}
## Generic code (this won't run)
function_name(formal_argument_1 = named_argument_1, 
              formal_argument_2 = named_argument_2,
              [etc.])
```

```{block, type = "rmdwarning"}
Sometimes, we'll show "generic" code in a code block that doesn't actually work if you run it in R, but instead shows the generic structure of an R call. We'll try to always include a comment with any generic code, so you'll know not to try to run it in R.
```

A function call forms a complete R expression, and the output will 
be the result of running `print` or `show` on the object that is output
by the function call. Here is an example of this structure: 

```{r}
print(x = "Hello world")
```

Figure \@ref(fig:function-call) shows an example of the typical elements of a
function call. In this example, we're **calling** a function with the **name**
`print`. It has one **argument**, with a **formal argument** of `x`, for which
this call provides the **named argument** "Hello world".

```{r function-call, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "Main parts of a function call. This example is calling a function with the name 'print'. The function call has one argument, with a formal argument of 'x', which in this call is provided the named argument 'Hello world'."}
knitr::include_graphics("figures/function_call.jpg")
```

The **arguments** are how you customize the call to an R function. For example,
you can change the named argument value to print different messages with the
`print` function:

```{r}
print(x = "Hello world")
print(x = "Hi Fort Collins")
```

Some functions do not require any arguments. For example, the `getRversion` function will 
print out the version of R you are using.

```{r}
getRversion()
```

Some functions will accept multiple arguments. For example, the `print` function allows you 
to specify whether the output should include quotation marks, using the `quote`
formal argument: 

```{r}
print(x = "Hello world", quote = TRUE)
print(x = "Hello world", quote = FALSE)
```

Arguments can be **required** or **optional**. 

For a required argument, if you don't provide a value for the argument when you
call the function, R will respond with an error. For example, `x` is a **required argument**
for the `print` function, so if you try to call the function without it, you'll get an 
error: 

```{r eval = FALSE}
print()
```

```
Error in print.default() : argument "x" is 
  missing, with no default
```

For an **optional argument**, on the other hand, R knows a **default value**. If
you don't give a value for that argument when you call the function, R will just
use the default.

For example, for the `print` function, the `quote` argument has the default value 
`TRUE`. So if you don't specify a value for that argument, R will assume it should 
use `quote = TRUE`. That's why the following two calls give the same result: 

```{r}
print(x = "Hello world", quote = TRUE)
print(x = "Hello world")
```
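
The same idea applies to other functions. As a small sketch, the `paste`
function (used here only as an illustration) has an optional `sep` argument
whose default is a single space. The first two calls below therefore give the
same result, while the third overrides the default:

```{r}
# `sep` is optional; its default value is a single space
paste("Hello", "world")
paste("Hello", "world", sep = " ")

# Override the default to join the words with a hyphen instead
paste("Hello", "world", sep = "-")
```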

Often, you'll want to find out more about a function, including:

- Examples of how to use the function
- Which arguments you can include for the function
- Which arguments are required versus optional
- What the default values are for optional arguments

You can find out all this information in the function's **helpfile**, which 
you can access using the function `?`. For example, the `mean` function will let you calculate the mean (average) of a 
group of numbers. To find out more about this function, at the console type:

```{r eval = FALSE}
?mean
```

This will open a helpfile in the "Help" pane in RStudio. Figure
\@ref(fig:helpfile) shows some of the key elements of an example helpfile, the
helpfile for the `mean` function. In particular, the "Usage" section helps you
figure out which arguments are **required** and which are **optional**.

```{r helpfile, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "Navigating a helpfile. This example shows some key parts of the helpfile for the 'mean' function."}
knitr::include_graphics("figures/helpfile_arguments.jpg")
```
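
If you'd like a quick reminder of a function's arguments without opening the
full helpfile, one option is the `args` function. This is a sketch rather than a
replacement for reading the helpfile. It uses `print.default`, the version of
`print` named in the error message earlier in this section. In the output, an
argument shown with `= value` has a default (so it's optional), while an
argument listed by name without a default, like `x`, usually must be supplied.

```{r}
# List the arguments of print.default, along with any default values
args(print.default)
```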

There's one class of functions that looks a bit different from others. These are
the infix **operator** functions. Instead of using parentheses after the function
name, they usually go *between* two arguments. One common example is the `+`
operator:

```{r}
2 + 3
```

There are operators for several mathematical functions: `+`, `-`, `*`, `/`.
There are also other operators, including **logical operators** and **assignment
operators**, which we'll cover later.
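
To see that operators really are functions, you can call one using the same
`function_name(arguments)` structure as any other function, as long as you wrap
the operator's name in backticks. This is just a sketch to make the point; in
practice you'd almost always write the infix form.

```{r}
# The usual infix form
2 + 3

# The same call, written like an ordinary function call
`+`(2, 3)
```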

<iframe width="659" height="412" src="https://www.youtube.com/embed/M8YrqQzG8Cg?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1_part_5.pdf) 
a pdf of the lecture slides for this video.

## Objects and assignment

In R, a variety of different types and structures of data can be saved in what
are called **objects**. For right now, you can just think of an R object as a discrete
container of data in R.

Function calls will produce an object. If you just
call a function, as we've been doing, then R will respond by printing out that
object. However, we'll often want to use that object some more. For example, we
might want to use it as an argument later in our "conversation" with R, when we
call another function later. If you want to re-use the results of a function
call later, you can **assign** that **object** to an **object name**. This kind
of expression is called an **assignment expression**.

Once you do this, you can use that *object name* to refer to the object. This
means that you don't need to re-create the object each time you need
it---instead you can create it once and then just reference it by name each time
you need it after that. For example, you can read in data from an external file
as a dataframe object and assign it an object name. Then, when you need that
data later, you won't need to read it in again from the external file.

The **gets arrow**, `<-`, is R's assignment operator. It takes whatever you've
created on the right hand side of the `<-` and saves it as an object with the
name you put on the left hand side of the `<-` :

```{r eval = FALSE}
## Note: Generic code-- this will not work
[object name] <- [object]
```

For example, if I just type `"Hello world"`, R will print it back to me, but
won't save it anywhere for me to use later:

```{r}
"Hello world"
```

However, if I assign it to an object, I can "refer" to that object in a later expression. 
For example, the code below assigns the **object** `"Hello world"` the **object name** `message`. 
Later, I can just refer to this object using the name `message`, for example in a function
call to the `print` function:

```{r}
message <- "Hello world"
print(x = message)
```

When you enter an **assignment expression** like this at the R console, if everything
goes right, then R will "respond" by giving you a new prompt, without any kind of 
message. 

However, there are three ways you can check to make sure that the object was 
assigned to the object name: 

1. Enter the object's name at the prompt and press return. The default if you do this
is for R to "respond" by calling the `print` function with that object as the `x`
argument. 
2. Call the `ls` function (which doesn't require any arguments). This will list all the
object names that have been assigned in the current R session. 
3. Look in the "Environment" pane in RStudio. This also lists all the object names that
have been assigned in the current R session.

Here are examples of these strategies:

1. Enter the object's name at the prompt and press return:

```{r}
message
```

2. Call the `ls` function:

```{r}
ls()
```

3. Look in the "Environment" pane in RStudio (see Figure \@ref(fig:environment)).

```{r environment, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "'Environment' pane in RStudio. This shows the names and first few values of all objects that have been assigned to object names in the global environment."}
knitr::include_graphics("figures/environment_pane.jpg")
```

You can make assignments in R using either the gets arrow (`<-`) or `=`. When
you read other people's code, you'll see both. R gurus advise using `<-` rather
than `=` when coding in R, and as you move to doing more complex things, some
subtle problems might crop up if you use `=`. I have heard from someone in the
know that you can tell the age of a programmer by whether he or she uses the
gets arrow or `=`, with `=` more common among the young and hip. For this
course, however, I am asking you to code according to [Hadley Wickham's R style
guide](http://adv-r.had.co.nz/Style.html), which specifies using the gets arrow
for assignment.

While you will be coding with the gets arrow exclusively in this course, it will
be helpful for you to know that the two assignment operators do pretty much the
same thing:

```{r}
one_to_ten <- 1:10
one_to_ten

one_to_ten = 1:10
one_to_ten
```
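
One commonly cited example of where the two differ is inside a function call.
There, `=` matches a value to one of the function's arguments, while `<-`
creates an object. The sketch below uses a hypothetical object name,
`my_numbers`, to show this:

```{r}
# Here `=` just says which argument gets the value 1:10;
# no object named `x` is created by this call
mean(x = 1:10)

# Here `<-` first creates an object named `my_numbers`,
# then passes its value on to `mean`
mean(my_numbers <- 1:10)
my_numbers
```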

While the gets arrow takes two key strokes instead of one (like the equals
sign), you can somewhat get around this limitation by using RStudio's keyboard
shortcut for the gets arrow. This shortcut is Alt + - on Windows and Option + -
on Macs. To see a full list of RStudio keyboard shortcuts, go to the "Help" menu
in RStudio and select "Keyboard Shortcuts".

There are some absolute **rules** for the names you can use for an object name:

- Use only letters, numbers, and underscores (R also accepts periods in names,
but underscores are the separator used in this course)
- Start the name with a letter

If you try to assign an object to a name that doesn't follow the "hard" rules,
you'll get an error. For example, all of these expressions will give you an
error:

```{r eval = FALSE}
1message <- "Hello world"
_message <- "Hello world"
message! <- "Hello world"
```
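
If you're not sure whether a name will be accepted, one way to check without
triggering an error is to compare the name with what the base R function
`make.names` would turn it into: a valid name comes back unchanged. This is a
sketch, and the object names `candidate_names` and `is_valid_name` are made up
for the example. (It also uses the `c` function to bundle several values
together.)

```{r}
# Names to check: the three invalid examples from above, plus one valid name
candidate_names <- c("1message", "_message", "message!", "message_one")

# make.names() repairs invalid names, so a name is valid only if
# it comes back unchanged
is_valid_name <- make.names(candidate_names) == candidate_names
is_valid_name
```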

In addition to these fixed rules, there are also some guidelines for naming
objects that you should adopt now, since they will make your life easier as you
advance to writing more complex code in R. The following three guidelines for
naming objects are from [Hadley Wickham's R style
guide](http://adv-r.had.co.nz/Style.html):

- Use lower case for variable names (`message`, not `Message`)
- Use an underscore as a separator (`message_one`, not `messageOne`)
- Avoid using names that are already defined in R (e.g., don't name an object
`mean`, because a `mean` function exists)

> "Don't call your matrix 'matrix'. Would you call your dog 'dog'? Anyway, it
might clash with the function 'matrix'." - Barry Rowlingson, R-help (October 2004)

Another good practice is to name objects after nouns (e.g., `message`) and
later, when you start writing functions, name those after verbs (e.g.,
`print_message`). You'll want your object names to be short enough that they
don't take forever to type as you're coding, but not so short that you can't
remember what they stand for.

```{block, type = "rmdtip"}
Sometimes, you'll want to create an object that you won't want to keep for very
long. For example, you might want to create a small object to test some code,
but you plan to not need the object again once you've done that. You may want to
come up with some short, generic object names that you use for these kinds of
objects, so that you'll know that you can delete them without problems when you
want to clean up your R session.

There are all kinds of traditions for these placeholder variable names in
computer science. `foo` and `bar` are two popular choices, as are, evidently,
`xyzzy`, `spam`, `ham`, and `norf`. There are different placeholder names in
different languages: for example, `toto`, `truc`, and `azerty` (French); and
`pippo`, `pluto`, `paperino` (Disney character names; Italian). See the
Wikipedia page on [metasyntactic
variables](https://en.wikipedia.org/wiki/Metasyntactic_variable) to find out
more.
```

<iframe width="659" height="412" src="https://www.youtube.com/embed/2NHfrMclytc?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1_part_6.pdf) 
a pdf of the lecture slides for this video.

## More on communicating with R

What if you want to "compose" a call from more than one function call? One way
to do it is to assign the output from the first function call to a name and then
use that name for the next call. For example:

```{r}
message <- paste("Hello", "world")
print(x = message)
```

If you give two objects the same name, the most recent definition will be used (i.e., objects can be overwritten by assigning new content to the same object name). For example: 

```{r}
a <- 1:10
b <- LETTERS[1:3]

a
b

a <- b
a
```

Another way to compose an R expression is to "nest" one function call inside
another function call. For example:

```{r}
print(x = paste("Hello", "world"))
```

Just like with math, the order in which the functions are evaluated moves from
the inner set of parentheses to the outer one (Figure
\@ref(fig:composing-functions)). There's one more way to compose function calls,
called "piping", which we'll look at later.

```{r composing-functions, echo = FALSE, out.width = "500pt", fig.align = "center", fig.cap = "Composing function calls by nesting. R evaluates the function call in the innermost set of parentheses first and then uses its output as an argument in the outer function call."}
knitr::include_graphics("figures/composing_function_calls.jpg")
```
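
To make the inside-out order concrete, the sketch below builds the same result
two ways: one step at a time with intermediate object names, and as a single
nested call. The `toupper` function (which converts text to upper case) and the
object names are just for illustration.

```{r}
# Step by step: each result is assigned a name and used in the next call
greeting <- paste("Hello", "world")
loud_greeting <- toupper(greeting)
print(x = loud_greeting)

# Nested: `paste` is evaluated first, then `toupper`, then `print`
print(x = toupper(paste("Hello", "world")))
```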

## R scripts

This is a good point in learning R for you to start putting your code in R
scripts, rather than entering commands at the console.

An R script is a plain text file where you can save a series of R commands. You
can save the script and open it up later to see (or re-do) what you did earlier,
just like you could with something like a Word document when you're writing a
paper.

To open a new R script in RStudio, go to the menu bar and select "File" -> "New
File" -> "R Script". Alternatively, you can use the keyboard shortcut
Command-Shift-N (Ctrl-Shift-N on Windows). Figure \@ref(fig:rscript) gives an
example of an R script file opened in RStudio and points out some interesting
elements.

```{r rscript, echo = FALSE, fig.align="center", fig.cap = "Example of an R script in RStudio.", out.width= "600pt"}
knitr::include_graphics("figures/ExampleOfRScript.jpg")
```

To save a script you're working on, you can click on the "Save" button (which
looks like a floppy disk) at the top of your R script window in RStudio or use
the keyboard shortcut Command-S (Ctrl-S on Windows). You should save R scripts
using a ".R" file extension.

Within the R script, you'll usually want to type your code so there's one
command per line. If your command runs long, you can write a single call over
multiple lines. It's unusual to put more than one command on a single line of a
script file, but you can if you separate the commands with semicolons (`;`).
These rules all correspond to how you can enter commands at the console.
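
As a sketch, the lines below show what each of these layouts might look like in
a script (the object names are made up for the example):

```{r}
# One command per line (the usual layout)
first_word <- "Hello"
second_word <- "world"

# A single call written over several lines
full_message <- paste(first_word,
                      second_word,
                      "from an R script")

# Two commands on one line, separated by a semicolon (unusual)
a_number <- 1; another_number <- 2
```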

Running R code from a script file is very easy in RStudio. You can use either
the "Run" button or Command-Return (Ctrl-Enter on Windows), and any code that is
selected (i.e., that you've highlighted with your cursor) will run at the
console. You can use this
functionality to run a single line of code, multiple lines of code, or even just
part of a specific line of code. If no code is highlighted, then R will instead
run all the code on the line with the cursor and then move the cursor down to
the next line in the script.

You can also run all of the code in a script. To do this, use the "Source"
button at the top of the script window. You can also run the entire script
either from the console or from within another script by using the `source()`
function, with the filename of the script you want to run as the argument. For
example, to run all of the code in a file named "MyFile.R" that is saved in your
current working directory, run:

```{r, eval = FALSE}
source("MyFile.R")
```
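
If you'd like to try `source()` without creating a file by hand, here is a
self-contained sketch. It writes a tiny two-line script to a temporary file and
then runs it. The script contents and the object names `script_path` and
`script_message` are hypothetical, and `tempfile` and `writeLines` are base R
functions not covered in this chapter.

```{r}
# Create a temporary file path ending in ".R"
script_path <- tempfile(fileext = ".R")

# Write two lines of R code into that file
writeLines(c('script_message <- paste("Hello", "from a script")',
             'print(x = script_message)'),
           con = script_path)

# Run all of the code in the script, from top to bottom
source(script_path)
```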

You can add comments into an R script to let others know (and remind yourself)
what you're doing and why. To do this, use R's comment character, `#`. Any line
in a script that starts with `#` will not be read by R. You can also take
advantage of commenting to comment out certain parts of code that you don't want
to run at the moment.

While it's generally best to write your R code in a script and run it from there
rather than entering it interactively at the R console, there are some
exceptions. A main example is when you're initially checking out a dataset, to
make sure you've read it in correctly. It often makes more sense to run commands
for this task, like `str()`, `head()`, `tail()`, and `summary()`, at the
console. These are all examples of commands where you're trying to look at
something about your data **right now**, rather than code that builds toward
your analysis, or helps you read in or clean up your data.

### Commenting code

Sometimes, you'll want to include notes in your code. You can do this in nearly
all programming languages by using a *comment character* to start the line with your
comment. In R, the comment character is the hash symbol, `#`. R will skip any
line that starts with `#` in a script. For example, if you run the following
code:

```{r}
# Don't print this.
"But print this"
```

R will only print the second, uncommented line. 

You can also use a comment in the middle of a line, to add a note on what you're
doing in that line of the code. R will skip any part of the code from the hash
symbol on. For example:

```{r}
"Print this" ## But not this, it's a comment.
```

There's typically no reason to use code comments when running commands at the R
console. However, it's very important to get in the practice of including
meaningful comments in R scripts. This helps you remember what you did when you
revisit your code later.
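
As a rough sketch of what this might look like in a script, the following code uses comments both to explain each step and to "comment out" a line that shouldn't run right now. The object name `n_kids` and its values are just illustrative choices for this example.

```{r}
# Number of kids for each of three characters (illustrative values)
n_kids <- c(1, 7, 1)

# Add up the total number of kids across all three characters
sum(n_kids)

# The next line is commented out, so R skips it when the script runs
# mean(n_kids)
```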

> “You know you're brilliant, but maybe you'd like to understand what you did 2 weeks from now.” -- Linus Torvalds

<iframe width="659" height="412" src="https://www.youtube.com/embed/yEmBs5EBos4?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1_part_7.pdf) 
a pdf of the lecture slides for this video.

## The "package" system

### R packages

> "Any doubts about R's big-league status should be put to rest, now that we have a Sudoku Puzzle Solver. Take that, SAS!"
   - David Brahm (announcing the sudoku package), R-packages (January 2006)

Your original download of R is only a starting point. You can expand
functionality of R with what are called *packages*, or extensions with new code
and functionality that add to the basic "base R" environment. To me, this is a
bit like the toy train set that my son was obsessed with for a while. You first
buy a very basic set that looks something like Figure
\@ref(fig:toy-train-basic).

```{r toy-train-basic, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap = "The toy version of base R."}
knitr::include_graphics("figures/TrainBasic.JPG")
```

To take full advantage of R, you'll want to add on packages. In the case of the
train set, at this point, a doting grandparent adds on extensively through
birthday presents, so you end up with something that looks like Figure
\@ref(fig:toy-train-fancy).

```{r toy-train-fancy, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap = "The toy version of what your R set-up will look like once you find cool packages to use for your research."}
knitr::include_graphics("figures/TrainComplex.JPG")
```

Each package is basically a bundle of extra R functions. They may also include 
help documentation, datasets, and some other objects, but typically the heart of
an R package is the new functions it provides.

You can get these "add-on" packages in a number of ways. The main source for
installing packages for R remains the Comprehensive R Archive Network, or
[CRAN](https://cran.r-project.org). However, [GitHub](https://github.com) is
growing in popularity, especially for packages that are still in development.
You can also create and share packages among your collaborators or co-workers,
without ever posting them publicly. In the "Advanced" section of this course,
you will learn a bit about writing your own R package.

### Installing from CRAN

```{r cran10000, echo = FALSE, out.width = "600pt", fig.align = "center", fig.cap = "Celebrating CRAN's 10,000th package."}
knitr::include_graphics("figures/CRAN_package_10000.png")
```

The most popular place from which to get packages is currently CRAN, which has
over 10,000 R packages available (Figure \@ref(fig:cran10000)). You can install
packages from CRAN using R code, with the `install.packages` function. For
example, telephone keypads include letters for each number (Figure
\@ref(fig:phone-keypad)), which allow companies to have "named" phone numbers
that are easier for people to remember, like 1-800-GO-FEDEX and 1-800-FLOWERS.

```{r phone-keypad, echo = FALSE, out.width = "150pt", fig.align = "center", fig.cap="Telephone keypad with letters corresponding to each number."}
knitr::include_graphics("figures/telephone_keypad.png")
```

The `phonenumber` package is a cool little package that will convert between
numbers and letters based on the telephone keypad. Since this package is on
CRAN, you can install the package to your computer using the `install.packages`
function:

```{r, eval = FALSE, message = FALSE, warning = FALSE, results = FALSE}
install.packages(pkgs = "phonenumber")
```

This downloads the package from CRAN and saves it in a special location on your
computer where R can load it when you're ready to use it. Once you've installed
a package to your computer this way, you don't need to re-run this
`install.packages` call for the package ever again (unless the package maintainer
posts an updated version).

Just like R itself, packages often evolve and are updated by their maintainers.
You should update your packages as new versions come out. Typically, you have to
reinstall packages when you update your version of R, so this is a good chance
to get the most up-to-date version of the packages you use.
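
If you'd like to check on or update your packages from the console, the following is a minimal sketch using functions that come with base R. It is not evaluated here, since the last two calls need an internet connection and `update.packages` can change the packages on your computer. Setting `ask = FALSE` is just one possible choice; leave it out if you'd prefer to confirm each update.

```{r eval = FALSE}
# Check whether the package is already installed (TRUE if R can find it)
requireNamespace("phonenumber", quietly = TRUE)

# See which version of the package you currently have installed
packageVersion(pkg = "phonenumber")

# List installed packages that have a newer version available
old.packages()

# Update installed packages to their latest available versions
# (ask = FALSE skips the package-by-package confirmation prompts)
update.packages(ask = FALSE)
```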

### Loading an installed package

Once you have installed a package, it will be saved to your computer. However,
you won't be able to access its functions within an R session until you *load*
it in that R session. Loading a package essentially makes all of the package's
functions available to you. 

You can load a package in an R session using the
`library` function, with the package name inside the parentheses.

```{r message = FALSE, warning = FALSE, results = FALSE}
library(package = "phonenumber")
```

Figure \@ref(fig:install-vs-load) provides a conceptual
picture of the different steps of installing and loading a package.

```{r install-vs-load, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="Install a package (with 'install.packages') to get it onto your computer. Load it (with 'library') to get it into your R session."}
knitr::include_graphics("figures/install_vs_library.jpg")
```

Once a package is loaded, you can use all its exported (i.e., public) functions
by calling them directly. For example, the `phonenumber` package has a function
called `letterToNumber` that converts a character string to a number. If you have not
loaded the `phonenumber` package in your current R session and try to use this
function, you will get an error. However, once you've loaded `phonenumber` using
the `library` function, you can use this function in your R session:

```{r}
fedex_number <- "GoFedEx"
letterToNumber(value = fedex_number)
```

```{block, type = "rmdnote"}
R vectors can have several different *classes*. One common class is the
character class, which is the class of the character string we're using here
("GoFedEx"). You'll always put character strings in quotation marks. Another key
class is numeric (numbers). Later in the course, we'll introduce other classes
that vectors can have, including factors and dates. For the simplest vector
classes, these classes are determined by the type of data that the vector
stores.
```

When you open RStudio, unless you reload the history of a previous R session
(which I typically strongly **do not** recommend), you will start your work in a
"fresh" R session. This means that, once you open RStudio, you will need to run
the code to load any packages, define any objects, and read in any data that you
will need for analysis in that session.

If you are using a package in academic research, you should cite it, especially
if it implements an algorithm or method that is not standard. You can use the
`citation` function to get the information you need about how to cite a package:

```{r}
citation(package = "phonenumber")
```

```{block, type = "rmdnote"}
We've talked here about loading packages using the `library` function to access
their functions. However, this is not the only way to access the package's
functions. The syntax `[package name]::[function name]` (e.g.,
`phonenumber::letterToNumber(fedex_number)`) will allow you to use a function from a
package you have installed on your computer, even if its package has not been
loaded in the current R session. Typically, this syntax is not used much in data
analysis scripts, in part because it makes the code much longer. However, you
will occasionally see it used to distinguish between two functions from
different packages that have the same name, as this format makes the desired
function unambiguous. One example where this syntax often is needed is when both
`plyr` and `dplyr` packages are loaded in an R session, since these share
functions with the same name.
```
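
Here is a small sketch of the `::` syntax in use. It assumes that the `phonenumber` package is installed on your computer, but it does not require a `library` call first:

```{r}
# Call letterToNumber directly from the phonenumber package;
# the package only needs to be installed, not loaded
phonenumber::letterToNumber(value = "GoFedEx")
```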

Packages typically include some documentation to help users. These include: 

- **Package vignettes**: Longer, tutorial-style documents that walk the user
through the basics of how to use the package and often give some helpful example
cases of the package in use.
- **Function helpfiles**: Files for each external function (i.e., the package
maintainer wants it to be used by others) within the package, following an
established structure. These include information about what inputs are required
and optional for the function, what output will be created, and what options can
be selected by the user. In many cases, these also include examples of using the
function.

To determine which vignettes are available for a package, you can use the
`vignette` function, with the package's name specified for the `package` option:

```{r eval = FALSE}
vignette(package = "phonenumber")
```

From the output of this, you can call any of the package's vignettes directly.
For example, the previous call tells you that this package only has one
vignette, and that vignette has the same name as the package ("phonenumber").
Once you know the name of the vignette you would like to open, you can also use
`vignette` to open it:

```{r eval = FALSE}
vignette(topic = "phonenumber")
```

To access the helpfile for any function within a package you've loaded, you can
use `?` followed by the function's name:

```{r eval = FALSE}
?letterToNumber
```
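
There are a few other ways to reach the same helpfiles, sketched below using base R's `help` function. This code is not evaluated here because it opens help pages interactively, and it assumes the `phonenumber` package is installed.

```{r eval = FALSE}
# Open the helpfile with the help function instead of the ? shortcut
help(topic = "letterToNumber", package = "phonenumber")

# Open the helpfile for a function without loading its package first
?phonenumber::letterToNumber

# List all of the documented functions and datasets in the package
help(package = "phonenumber")
```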

<iframe width="659" height="412" src="https://www.youtube.com/embed/drZMFYN8bGY?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1_part_8.pdf) 
a pdf of the lecture slides for this video.

## R's most basic object types

An R object stores some type of data that you want to use later in your R code,
without fully recreating it. The content of R objects can vary from very simple
(the `"GoFedEx"` string in the example code above) to very complex objects with
lots of elements (for example, a machine learning model).

Objects can be structured in different ways, in terms of how they "hold" data.
These different structures are called **object classes**. One class of objects
can be a subtype of a more general object class.

There are a variety of different object types in R, shaped to fit different
types of data ranging from the simple to the complex. In this section, we'll
start by describing two object types that you will use most often in basic data
analysis, **vectors** (1-dimensional objects) and **dataframes** (2-dimensional
objects).

For these two object classes (vectors and dataframes), we'll look at: 

1. How that class is structured
2. How to make a new object with that class
3. How to extract values from objects with that class

In later classes, we'll spend a lot of time learning how to do other things
with objects from these two classes, plus learn some other classes.

### Vectors

To get an initial grasp of the *vector* object type in R, think of it as a
1-dimensional object, or a string of values. Figure \@ref(fig:vector-example)
provides an example of the structure for a very simple vector, one that holds
the names of the three main characters in the *Harry Potter* book series.

```{r vector-example, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="An example of the structure of an R object with the vector class. This object class contains data as a string of values, all with the same data type."}
knitr::include_graphics("figures/example_vector.jpg")
```

All values in a vector must be of the same data type (i.e., all numbers, all
characters, all dates). If you try to create a vector with elements from
different types (like "FedEx", which is a character, and 3, a number), R will
coerce all of the elements to the most generic type of any of the elements
(i.e., "FedEx" and "3" will both become characters, since "3" can be changed to
a character, but "FedEx" can't be changed to a number). Figure
\@ref(fig:vector-example-classes) gives some examples of different classes of
vectors.

```{r vector-example-classes, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="Examples of vectors of different classes. All the values in a vector must be of the same type (e.g., all numbers, all characters). There are different classes of vectors depending on the type of data they store."}
knitr::include_graphics("figures/vector_class_examples.jpg")
```

To create a vector from different elements, you'll use the concatenation
function, `c`, to join them together, with commas between the elements. For
example, to create the vector shown in Figure \@ref(fig:vector-example), you 
can run:

```{r}
c("Harry", "Ron", "Hermione")
```

If you want to use that object later, you can assign it an object name in the expression: 

```{r}
main_characters <- c("Harry", "Ron", "Hermione")
print(x = main_characters)
```

This **assignment expression**, for assigning a vector an object name, follows
the structure we covered earlier for function calls and assignment expressions
(Figure \@ref(fig:vector-assignment)).

```{r vector-assignment, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="Elements of the assignment expression for creating a vector and assigning it an object name."}
knitr::include_graphics("figures/vector_class_examples.jpg")
```

If you create a numeric vector, you should not put the values in quotation marks:

```{r}
n_kids <- c(1, 7, 1)
```

If you mix classes when you create the vector, R will coerce all the elements
to the most generic of the elements' classes:

```{r}
mixed_classes <- c(1, 3, "five")
mixed_classes
```

Notice that the two numbers, 1 and 3, are now in quotation marks, once they 
are put in a vector with a value with the character data type. You can use the
`class` function to determine the class of an object: 

```{r}
class(x = mixed_classes)
```
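
To get a feel for what "most generic" means, here is a short sketch with two more mixed vectors. It uses logical values (`TRUE` and `FALSE`), a class that comes up again later in this chapter; the object names are just illustrative.

```{r}
# Logical and numeric values together: the logical is coerced to numeric
logical_and_numeric <- c(TRUE, 2)
class(x = logical_and_numeric)

# Numeric and character values together: everything is coerced to character
numeric_and_character <- c(2, "five")
class(x = numeric_and_character)
```

In both cases, R picks the class that can represent every element: a logical can be stored as a number, and a number can be stored as a character string, but not the other way around.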

A vector's *length* is the number of elements in the vector. You can use the
`length` function to determine a vector's length:

```{r}
length(x = mixed_classes)
```

Once you create an object, you will often want to reference the whole object in
future code. However, there will be some times when you'll want to reference
just certain elements of the object (for example, the first three values). You
can pull out certain values from a vector by using indexing with square brackets
(`[...]`) to identify the locations of the element you want to extract. For
example, to extract the second element of the `main_characters` vector, you can
run:

```{r}
main_characters[2] # Get the second value
```

You can use this same method to extract more than one value. You just need to
create a numeric vector with the position of each element you want to extract
and pass that in the square brackets. For example, to extract the first and
third elements of the `main_characters` vector, you can run:

```{r}
main_characters[c(1, 3)] # Get first and third values
```

The `:` operator can be very helpful with extracting values from a vector. 
This operator creates a sequence of values from the value before the `:` to the
value after `:`, going by units of 1. For example, if you want to create a vector
of the whole numbers from 1 to 10, you can run: 

```{r}
1:10
```

If you want to extract the first two values from the `main_characters` vector, you 
can use the `:` operator: 

```{r}
main_characters[1:2] # Get the first two values
```

You can also use logic to pull out some values of a vector. For example, you
might only want to pull out the even values from a numeric vector. We'll cover
using logical expressions to index vectors later in the book.
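
As a brief preview (just a sketch, using a made-up vector named `example_values`), extracting the even values from a numeric vector could look something like this:

```{r}
example_values <- c(1, 2, 3, 5, 8, 13)

# TRUE where the remainder after dividing by 2 is zero (i.e., the value is even)
is_even <- example_values %% 2 == 0
is_even

# Keep only the elements where is_even is TRUE
example_values[is_even]
```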

```{block, type = 'rmdtip'}
One thing that people often find confusing when they start using R is knowing
when to use and not use quotation marks. The general rule is that you use
quotation marks when you want to refer to a character string literally, but no
quotation marks when you want to refer to the value in a previously-defined
object. For example, if you saved the string `"Anderson"` as the object
`my_name` (`my_name <- "Anderson"`), then in later code, if you type `my_name`
(no quotation marks), you'll get `"Anderson"`, while if you type out `"my_name"`
(with quotation marks), you'll get `"my_name"` (what you typed, literally).

One thing that makes this rule confusing is that there are a few cases in R
where you really should (by this rule) use quotation marks, but the function is
coded to let you be lazy and get away without them. One example is the `library`
function. In the code earlier in this section to load the "phonenumber" package,
you want to literally load the package "phonenumber", rather than load whatever
character string is saved in the object named `phonenumber`. However, `library`
is one of the functions where you can be lazy and skip the quotation marks, and
it will still load "phonenumber" for you. Therefore, if you want, this function
also works if you call `library(package = phonenumber)` (without the quotation marks)
instead of how we actually called it (`library(package = "phonenumber")`). 
```
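
You can try out the general rule from this tip at the console with a sketch like the following, which uses the same `my_name` example:

```{r}
my_name <- "Anderson"

my_name    # No quotation marks: R returns the value stored in the object
"my_name"  # Quotation marks: R returns the literal character string
```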

<iframe width="659" height="412" src="https://www.youtube.com/embed/COA9VuHBmIw?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1_part_9.pdf) 
a pdf of the lecture slides for this video.

### Dataframes

A dataframe is a 2-dimensional object, and is made of one or more vectors of the
same length stuck together side-by-side. It is the closest thing R has to an Excel
spreadsheet-type structure. Figure \@ref(fig:example-dataframe) gives a
conceptual example of a dataframe created from several of the vector examples in
Figure \@ref(fig:vector-example-classes).

```{r example-dataframe, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="An example dataframe, created from several vectors of the same length and with observations aligned across vector positions (for example, the first value in each vector provides a value for Harry, the second for Ron)."}
knitr::include_graphics("figures/example_dataframe.jpg")
```

Here's how the dataframe in Figure \@ref(fig:example-dataframe) will look in R: 

```{r echo = FALSE, message = FALSE, results = TRUE}
library(package = "tibble")
hp_data <- tibble(first_name = c("Harry", "Ron", "Hermione"),
                  last_name = c("Potter", "Weasley", "Granger"),
                  n_kids = c(1, 7, 1),
                  survived = c(TRUE, TRUE, TRUE))
hp_data
```

This dataframe is arranged in rows and columns, with column names for each
column (Figure \@ref(fig:annotated-dataframe)). Note that each row of this
dataframe gives a different observation (in this case, our unit of observation
is a Harry Potter character). Each column gives a different type of information
(first name, last name, number of kids, and whether they survived) for each of
the observations (characters). Notice that the number of elements in each of the
columns must be the same in this dataframe, but that the different columns can
have different classes of data (e.g., character vectors for `first_name` and
`last_name`, logical value of TRUE or FALSE for `survived`).

```{r annotated-dataframe, echo = FALSE, out.width = "400pt", fig.align = "center", fig.cap="The elements of a dataframe: columns, rows, and column names."}
knitr::include_graphics("figures/example_dataframe_labeled.jpg")
```

We'll be working with a specific class of dataframe called a **tibble**. You can
create tibble dataframes using the `tibble` function from the `tibble` package.
However, most often you will create a dataframe by reading in data from a file,
using something like `read_csv` from the `readr` package.

```{block, type = "rmdnote"}
There are base R functions for both of these tasks (`data.frame` and `read.csv`,
respectively), eliminating the need to load additional packages with a `library`
call. However, the series of packages that make up what's called the "tidyverse"
have brought a huge improvement in the ease and speed of working with data in R.
We will be teaching these tools in this course, and that's why we're going
directly to `tibble` and `read_csv` from the start, rather than base R
equivalents. Later in the course, we'll talk more about this "tidyverse" and
what makes it so great. 
```

To create a dataframe, you can use the `tibble` function from the `tibble`
package. The general format for using `tibble` is:

```{r eval = FALSE}
## Note: Generic code
[name of object] <- tibble([1st column name] = [1st column content],
                           [2nd column name] = [2nd column content])
```

with an equals sign between the column name and column content for each column,
and commas between each of the columns.

Here is an example of the code used to create the *Harry Potter* tibble
dataframe shown above:

```{r}
library(package = "tibble")
hp_data <- tibble(first_name = c("Harry", "Ron", "Hermione"),
                  last_name = c("Potter", "Weasley", "Granger"),
                  n_kids = c(1, 7, 1),
                  survived = c(TRUE, TRUE, TRUE))
hp_data
```

You can also create a dataframe by sticking together vectors you already have saved as R objects. For example:

```{r}
hp_data <- tibble(first_name = main_characters,
                  last_name = c("Potter", "Weasley", "Granger"),
                  n_kids = n_kids,
                  survived = c(TRUE, TRUE, TRUE))
hp_data
```

Note that this call requires that the `main_characters` and `n_kids` vectors are
the same length, although they don't have to be (and in this case aren't) the
same class of objects (`main_characters` is a character class, `n_kids` is
numeric).
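
To see this length requirement in action, here is a small sketch. The `house` column is a made-up addition for illustration. As far as I know, the one exception `tibble` makes is for a single value, which it repeats for every row; other mismatched lengths cause an error. The call to `try` is only there so that the rest of the code keeps running after that error.

```{r}
library(package = "tibble")

# A single value is repeated to fill every row
tibble(first_name = c("Harry", "Ron", "Hermione"),
       house = "Gryffindor")

# Vectors of length 3 and length 2 can't be lined up, so this call fails
mismatched <- try(tibble(first_name = c("Harry", "Ron", "Hermione"),
                         n_kids = c(1, 7)),
                  silent = TRUE)
class(x = mismatched)
```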

```{block, type = "rmdnote"}
You can put more than one function call in a single line of R code, as in this
example (the `c` creates a vector, while the `tibble` creates a dataframe,
using the vectors created by the calls to `c`). When you use multiple functions
within a single R call, R will evaluate starting from the inner-most parentheses
out, much like the order of operations in a math equation with parentheses.
```
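
As a simple sketch of this inside-out evaluation, the nested call below gives the same result as the step-by-step version that follows it. The object name `character_names` is just an illustrative choice.

```{r}
# Nested version: c() is evaluated first, then length() uses its result
length(x = c("Harry", "Ron", "Hermione"))

# Step-by-step version of the same calculation
character_names <- c("Harry", "Ron", "Hermione")
length(x = character_names)
```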

So far, we've only shown how to create dataframes from scratch within an R
session. Usually, however, you'll create R dataframes instead by reading in data
from an outside file using the `read_csv` function from the `readr` package or related
functions. For example, you might want to analyze data on all the guests that
came on the *Daily Show*, *circa* Jon Stewart. If you have this data in a
comma-separated (csv) file on your computer called "daily_show_guests.csv" 
(see the In-Course Exercise for instructions on downloading it), you
can read it into your R session with the following code:

```{r echo = FALSE, message = FALSE}
library(package = "readr")
daily_show <- read_csv(file = "data/daily_show_guests.csv",
                       skip = 4)
```
```{r eval = FALSE}
library(package = "readr")
daily_show <- read_csv(file = "daily_show_guests.csv",
                       skip = 4)
```

In this code, the `read_csv` function is reading in the data from the file
"daily_show_guests.csv", while the gets arrow (`<-`) assigns that data to the
object `daily_show`, which you can then reference in later code to explore and
plot the data.
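
The `skip` option in this call tells `read_csv` how many lines at the top of the file to ignore before it starts reading the data. If you'd like to experiment with this without downloading anything, here is a self-contained sketch. It writes a tiny, made-up csv file (with two lines of notes above the column names) to a temporary location and then reads it back in; the file contents and object names are invented for this example.

```{r message = FALSE}
library(package = "readr")

# Write a small, made-up csv file with two lines of notes above the header
example_file <- tempfile(fileext = ".csv")
writeLines(text = c("Notes about the data",
                    "More notes",
                    "name,n_kids",
                    "Harry,1",
                    "Ron,7"),
           con = example_file)

# skip = 2 tells read_csv to ignore the first two lines of the file
example_data <- read_csv(file = example_file, skip = 2)
example_data
```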

You can use the functions `dim`, `nrow`, and `ncol` to figure out the dimensions
(number of rows and columns) of a dataframe:

```{r}
dim(x = daily_show)
nrow(x = daily_show)
ncol(x = daily_show)
```

Base R also has some useful functions for quickly exploring dataframes:

- `str`: Show the structure of an R object, including a dataframe
- `summary`: Give summaries of each column of a dataframe

For example, you can explore the data we just pulled in on the *Daily Show* 
with: 

```{r}
str(object = daily_show)
summary(object = daily_show)
```

To extract data from a dataframe, you can use some functions from the `dplyr`
package, `select` and `slice`. The `select` function will pull out columns,
while the `slice` function will pull out rows. In this chapter, we'll talk about 
how to extract certain rows or columns of a dataframe by their *position* (i.e., 
row or column number). Later in the book, we'll talk about other ways to extract
values from dataframes.

For example, if you wanted to get the first two rows of the `hp_data`
dataframe, you could run:

```{r message = FALSE, warning = FALSE}
library(package = "dplyr")
slice(.data = hp_data, c(1:2))
```

If you wanted to get the first and fourth columns, you could run: 

```{r}
select(.data = hp_data, c(1, 4))
```

You can compose calls from both functions. For example, you could extract the
values in the first and fourth columns of the first two rows with: 

```{r}
select(.data = slice(.data = hp_data, c(1:2)), c(1, 4))
```
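
If you find the nested call hard to read, one alternative is to save the result of the first step as its own object and then use that object in the second step. The sketch below recreates `hp_data` so that it can run on its own; the object names `first_two_rows` and `rows_and_columns` are just illustrative.

```{r message = FALSE, warning = FALSE}
library(package = "tibble")
library(package = "dplyr")

hp_data <- tibble(first_name = c("Harry", "Ron", "Hermione"),
                  last_name = c("Potter", "Weasley", "Granger"),
                  n_kids = c(1, 7, 1),
                  survived = c(TRUE, TRUE, TRUE))

# Step 1: pull out the first two rows
first_two_rows <- slice(.data = hp_data, c(1:2))

# Step 2: from those rows, pull out the first and fourth columns
rows_and_columns <- select(.data = first_two_rows, c(1, 4))
rows_and_columns
```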

You can use square-bracket indexing (`[..., ...]`) for dataframes, too, but now
the index will have two dimensions (rows, then columns). Put the rows you want before
the comma, the columns after. If you want all of something (e.g., all rows in
the dataframe), leave the designated spot blank. Here are two examples of using
square-bracket indexing to pull a subset of the `hp_data` dataframe we
created above:

```{r}
hp_data[1:2, 2] # First two rows, second column
hp_data[3, ] # Last row, all columns
```

```{block type = "rmdnote"}
If you forget to put the comma in the indexing for a dataframe (e.g.,
`hp_data[1:2]`), you will index out the *columns* that fall at that
position or positions. To avoid confusion, I suggest that you always use
indexing with a comma when working with dataframes. 
```
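
Here is a quick sketch comparing indexing with and without the comma. It recreates the `hp_data` dataframe so that the code can run on its own.

```{r}
library(package = "tibble")

hp_data <- tibble(first_name = c("Harry", "Ron", "Hermione"),
                  last_name = c("Potter", "Weasley", "Granger"),
                  n_kids = c(1, 7, 1),
                  survived = c(TRUE, TRUE, TRUE))

hp_data[1:2]    # No comma: the first two *columns*, all rows
hp_data[1:2, ]  # With comma: the first two *rows*, all columns
```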


<!-- ## Using R functions -->

<!-- ### Function structure -->

<!-- In general, functions in R take the following structure: -->

<!-- ```{r, eval = FALSE} -->
<!-- ## Generic code -->
<!-- function.name(parameter 1 = argument 1, parameter 2 = argument 2, -->
<!--               parameter 3 = argument 3)   -->
<!-- ``` -->

<!-- The result of the function will be output to your R session, unless you choose to save the output in an object: -->

<!-- ```{r, eval = FALSE} -->
<!-- ## Generic code -->
<!-- new_object <- function.name(parameter 1 = argument 1, -->
<!--                             parameter 2 = argument 2, -->
<!--                             parameter 3 = argument 3)   -->
<!-- ``` -->

<!-- Here are some example function calls, to give you examples of this structure: -->

<!-- - The `head` function prints out the first few rows of a dataframe. By default, it prints out six rows. With the `n` option, you can change the number of rows it prints out: -->

<!-- ```{r} -->
<!-- head(daily_show) -->
<!-- head(daily_show, n = 3) -->
<!-- ``` -->

<!-- The `read_csv` function (from the `readr` package) reads data from a comma-separated flat file into R. The `skip` option allows you to skip some of the first few lines of the file, in case the data starts a few lines into the file. -->

<!-- ```{r eval = FALSE} -->
<!-- daily_show <- read_csv("daily_show_guests.csv", skip = 4) -->
<!-- ``` -->

<!-- ```{r echo = FALSE, message = FALSE} -->
<!-- daily_show <- read_csv("data/daily_show_guests.csv", skip = 4) -->

<!-- ``` -->


<!-- Within the function call, *parameters* allow you to customize the function to run in a certain way (e.g., use a certain dataframe as an input, change the number of lines printed out from the default value, skip some lines when reading in data). Some function parameters will have *default arguments*, which means that you don't have to put a value for that parameter for the function to run, but you can if you want the function to do something other than the default.  -->

<!-- ### Function help files -->

<!-- You can find out more about a function, include what parameters it has and what the default values, if any, are by using `?` before the function name in the R console. For example, to find out more about the `read_csv` command, run:  -->

<!-- ```{r eval = FALSE} -->
<!-- ?read_csv -->
<!-- ``` -->

<!-- From the "Usage" section of the help file, you can figure out that the only required parameter is `file`, the pathname of the file that you want to read in, since this is the only argument in the "Usage" example without an argument value: -->

<!-- ``` -->
<!-- read_csv(file, col_names = TRUE, col_types = NULL, -->
<!--   locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, -->
<!--   quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, -->
<!--   guess_max = min(1000, n_max), progress = show_progress()) -->
<!-- ``` -->

<!-- You can also see from this "Usage" section that the default value of `col_names` is `TRUE` (use the first row of the data as the column names), the default value of `skip` is 0 (don't skip any lines), etc. -->

<!-- The "Arguments" section explains each of the parameters, and possible arguments that each can take. For example, here is the explanation of the `n_max` parameter in the `read_csv` function:  -->

<!-- ``` -->
<!-- n_max:	Maximum number of records to read. -->
<!-- ``` -->

<!-- From this, you can determine that you should put in a whole number, 0 or higher, and the function will skip that many line of the file before it starts reading in a dataframe when you run `read_csv`. -->

<!-- ### Function parameters -->

<!-- Each function parameter has a name (e.g., `n_max`, `col_names`, `file`). The safest way to call a function in R is to use the structure `parameter name = argument value` for every parameter, like this:  -->

<!-- ```{r eval = FALSE} -->
<!-- head(x = daily_show, n = 3) -->
<!-- ``` -->

<!-- However, you can also give argument values by position. For example, in the `head` function, the first parameter is `x`, the object you want to look at, and the second is `n`, the number of elements you want to include when you look at the object. If you know this, you can call `head` using the shorter call:  -->

<!-- ```{r eval = FALSE} -->
<!-- head(daily_show, 3) -->
<!-- ``` -->

<!-- If you use position alone, you will have problems if you don't include arguments in exactly the right order. However, if you use parameter names to set each argument, it doesn't matter what order you include arguments when calling a function:  -->

<!-- ```{r eval = FALSE} -->
<!-- # These two calls return the exact same object -->
<!-- head(x = daily_show, n = 3) -->
<!-- head(n = 3, x = daily_show) -->
<!-- ``` -->

<!-- Because code tends to be more robust to errors when you use parameter names to set arguments, we recommend against using position, rather than name, to give arguments when calling functions, at least while you're learning R. It's too easy to forget the exact order and get errors in your code. However, there is one exception-- the first argument to a function is almost always required (i.e., there's not a default value), and you very quickly learn what the first parameter of most functions are as soon as you start using the function regularly. Therefore, it's fine to use position alone to specify the first argument in a function, but for now always use the parameter name to set any later arguments: -->

<!-- ```{r eval = FALSE} -->
<!-- head(daily_show, n = 3) -->
<!-- ``` -->

<!-- ```{block, type='rmdtip'} -->
<!-- Using the full parameter names for arguments can take a bit more time, since it requires more typing. However, RStudio helps you out with that by offering *code completion* or *tab completion*. Once you start typing the first few letters of a parameter name within a function call, try pressing the tab key. All possible arguments that start with those letters should show up, and you can scroll through to pick the right one, or keep typing until the argument you want is atnthe top of the list of choices, and then press the tab key again. -->
<!-- ``` -->

<iframe width="659" height="412" src="https://www.youtube.com/embed/oXL1B3AWvA8?list=PLuGPtwgRXxqK1d-3V2MKCad24h-v75k8P" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week1_part_10.pdf) 
a pdf of the lecture slides for this video.

## In-course Exercise Chapter 1

You will take turns sharing your screens as you work through this exercise. Before you 
start, open your R session and use the `sample` function, with all of your group members'
names, to randomly shuffle your names (revisit the in-course exercise in the "Course Overview"
chapter if you need a reminder). 

You should do this on only one group member's computer. The order that you get from R is the 
order that you should take turns sharing your screen and leading the effort on coding for
your group. When you are not sharing your screen, help out with suggestions, especially 
for general directions you could take to approach a question. (There are standards for 
pair programming that we'll discuss next week, and these will provide more advice on how 
to productively code in a group.)
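
As a reminder of what the shuffling step might look like, here is a minimal sketch. The names in `group_members` are placeholders (not real group members), and the object names are made up for illustration.

```{r eval = FALSE}
# Placeholder names: replace these with your own group members' names
group_members <- c("Member A", "Member B", "Member C")

# Randomly shuffle the names to set the order for sharing screens
sharing_order <- sample(x = group_members)
sharing_order
```

Because `sample` shuffles at random, the order you get will likely differ each time you run it.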

### Trying out the code in slides for first lecture videos

Have one person in your group share their screen and take the lead in typing the code
or doing the other work for this part. 

To start, you'll try running some simple code in R, using examples from the video lectures for Chapter 1. Take the following steps:

1. Open an R session and find the "Console" pane. 
2. Go through the slides for video lectures 4 ("Function calls") and 5 ("Objects and assignments"). Find any examples of R
expressions and try them out at the prompt in the console.
3. Once you've run an assignment expression, find the "Environment" pane. Check
that the object name that you assigned now appears there.
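
If you do not have the slides handy, the kinds of expressions this step asks for look something like the following minimal sketch. The object name `example_sum` is made up for illustration and does not come from the slides.

```{r eval = FALSE}
# A simple expression: R evaluates it and prints the result
1 + 2

# A function call, with the argument set by parameter name
sqrt(x = 16)

# An assignment expression: nothing prints, but the object is created
example_sum <- 1 + 2

# Typing the object's name prints its value
example_sum
```

After running the assignment, `example_sum` should be listed in the "Environment" pane.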

### Writing your code as an R script

While the R console is fine for initially exploring data, you should get in the
habit of writing up R code in an R script for most of your data analysis
projects in R.

* Open a new R script and save it to your current working directory (i.e.,
wherever you saved the data you downloaded for this exercise).
* Take some of the code that you wrote for this exercise. Put it in the R
script. Do not put more than one function call per line (but it's fine to have
longer function calls span a few lines).
* Use the "Run" button to run a single line of this code. Check the console to
see what happens when you do.
* Highlight a few lines of the code and use "Run" to run them.
* Try using the keyboard shortcut (Command-Return on a Mac, Ctrl-Enter on
Windows or Linux) to run the line of code your cursor is currently on. Try doing
this with a function call that runs across several lines of the R script file--
what do you see at the console?
* Try running the whole script using "Source". Again, look at the console after
you "source" the script.
* Close your R session (and save any changes to your R script). Do **not** save
your R session history. Re-open R and see if you can re-open your R script and
re-run it. Try using `ls()` to list the objects in your R session before and
after you re-run your script. Does anything about the result surprise you?
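
As a rough picture of what such a script might contain, here is a minimal sketch. The object name `example_numbers` is hypothetical; your own script will hold the code you wrote earlier in this exercise.

```{r eval = FALSE}
# List the objects currently defined in the R session
ls()

# A longer function call can span several lines, as long as each line
# break comes at a point where the call is clearly unfinished (here,
# after a comma)
example_numbers <- c(1, 2, 3,
                     4, 5, 6)

# One function call per line
length(x = example_numbers)
mean(x = example_numbers)

# The new object's name now shows up in the list of objects
ls()
```

Comparing the two `ls()` results is one way to see which objects a script creates when you run it.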

### About the dataset

Trade the screen sharing to the next member of your group.

For the rest of today's class, you'll be using a dataset of all the guests on
*The Daily Show* when Jon Stewart was the host. This data was originally
collected by Nate Silver's website,
[FiveThirtyEight](http://fivethirtyeight.com), and is available on
[FiveThirtyEight's GitHub
page](https://github.com/fivethirtyeight/data/tree/master/daily-show-guests)
under the [Creative Commons Attribution 4.0 International
License](http://creativecommons.org/licenses/by/4.0/). I have copied this data
into my GitHub repository for this class. The only change made to the original
file was to add (commented) attribution information at the start of the file.

**First, check out a bit more about this data and its source:**

<!-- * Check out [the Creative Commons -->
<!-- license](http://creativecommons.org/licenses/by/4.0/). What are we allowed to do -->
<!-- with this data? What restrictions are there on using the data? -->
* It's often helpful to use prior knowledge to help check out or validate your
dataset. One thing we might want to know about this data is if it covers the
whole time that Jon Stewart hosted *The Daily Show*. Use Google to find out the
dates he started and finished as host.
* Briefly browse around [FiveThirtyEight's GitHub data
page](https://github.com/fivethirtyeight/data). What are some other datasets
available that you find interesting? For any dataset, you can scroll to the
bottom of the page to get to the compiled README.md content, which gives the
full titles and links to relevant datasets. You can also click on any dataset to
get more information.
* Look at [the GitHub page for this *Daily Show*
data](https://github.com/fivethirtyeight/data/tree/master/daily-show-guests).
How many columns will be in this dataset? What kind of information does the data
include? What do the columns show? What do the rows show?

```{block, type = "rmdnote"}
In this exercise, you're using data posted by
[FiveThirtyEight](http://fivethirtyeight.com) on [GitHub](https://github.com).
We'll be using a lot of data that's on GitHub this semester, and GitHub is being
used behind-the-scenes for both this book and the course note slides. We'll talk
more about GitHub later, but you might find it interesting to explore a bit now.
It's a place where people can post, work on, and share code in a number of
programming languages-- it's been referred to as "Facebook for Nerds". You can
search GitHub repositories and code specifically by programming language, so it
can be a good way to find example R code from which to learn.
```

<!-- If you have extra time: -->

<!-- * Check out [the related -->
<!-- article](http://fivethirtyeight.com/datalab/every-guest-jon-stewart-ever-had-on-the-daily-show/) -->
<!-- on FiveThirtyEight. What are some specific questions they used this data to -->
<!-- answer for this article? -->
<!-- * Who is Nate Silver? -->

### Manually creating vectors 

Start by manually creating some vectors and data frames with a small subset of
this data.

* Use the concatenate function (`c`) to create a vector "from scratch" with the
names of five guests who appeared on the show (these could be the first five
guests, or you could also randomly pick five guests). Assign this vector the
object name `five_guests`. What class (numeric or character) do you think this
vector will be? Will you need to use quotation marks for each element you add to
the vector?
* Use square bracket indexing to print out the following subsets of this vector
(you'll have one R expression per subset): (1) The first guest in the vector; (2) The
third and fifth guests; (3) The second through fourth guests.
* Create a new vector called `first_guest` with just the first guest in the
vector, using the square bracket indexing you used in the previous step.
* In the same way, create a vector with the year of each of these five guests'
appearances. Assign this vector to an object named `appearance_year`. What class
(numeric or character) do you think this vector will be? Will you need to use
quotation marks for each element you add to the vector?
* Use the `class` function to determine the classes (e.g., numeric, character)
of each of the vectors you just created.

Example R code:

```{r}
# I picked five random guests from throughout the dataset. The guests you pick will 
# likely be different.

# Create a vector with the names of five guests
five_guests <- c("Miss Piggy", "Stanley Tucci", "Kermit the Frog",
                 "Hank Azaria", "Al Gore")

# Use square-bracket indexing to print out some subsets of the data
five_guests[1]
five_guests[c(3, 5)]
five_guests[2:4]

# Save just the first guest in a separate object
first_guest <- five_guests[1]
first_guest

# Create a vector with the year of the appearance of each guest
appearance_year <- c(1999, 2000, 2001, 2001, 2002)

# Figure out the classes of the two vectors you just created
class(x = five_guests)
class(x = appearance_year)
```

### Installing and using a package

Trade the screen sharing to the next member of your group. Have the person who 
was sharing their screen save their R script and send it to this person through 
the Zoom chat. The new person should open this R script and use it to re-run the
last part of the analysis, so that the vectors defined in the last part of the
exercise can be used here.

The `stringr` package includes a number of functions that make it easier to work
with character strings in R. In particular, it includes functions to change the
capitalization of words in character strings. Here, you'll install and load this
package and then use it to work with the `five_guests` vector we created in the
last section.

* If you have not already installed the `stringr` package, install it from CRAN.
* Load the `stringr` package in your current R session, so you will be able to
use its functions.
* Check if the package has a vignette. If so, check out that vignette.
* See if you can use the `str_to_lower` function from the `stringr` package to
convert all the names in your `five_guests` vector to lowercase.
* See if you can find a function in the `stringr` package that you can use to
convert all the names in your `five_guests` vector to uppercase. (Hint: At the R
console, try typing `?stringr::` and then the Tab key.)

Example R code: 

```{r eval = FALSE}
# If you need to, install the package from CRAN
install.packages(pkgs = "stringr")
```
```{r}
# Load the package into your current R session
library(package = "stringr")
```
```{r eval = FALSE}
# Open the package's vignette
vignette(topic = "stringr")
```
```{r}
# Convert the `five_guests` strings to lowercase
str_to_lower(string = five_guests)

# Convert the `five_guests` strings to uppercase
str_to_upper(string = five_guests)
```
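
Tab completion is one way to browse a package's functions. Another option, sketched here, is to list the names a package makes available once it is loaded and then search them for a pattern. This assumes `stringr` is already installed, and the object name `stringr_functions` is made up for illustration.

```{r eval = FALSE}
# Load the package so its functions are attached to the session
library(package = "stringr")

# List the names of everything the attached package provides
stringr_functions <- ls(name = "package:stringr")

# Keep only the names that start with "str_to_"
case_functions <- grep(pattern = "^str_to_", x = stringr_functions,
                       value = TRUE)
case_functions
```

The exact list depends on the version of `stringr` you have installed.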

### Manually creating a dataframe

* Combine the two vectors you created earlier, `five_guests` and
`appearance_year`, to create a dataframe named `guest_list`. For the columns, use
the same column names used in the original, raw data for the guest names and
appearance year. Print out this dataframe at the R console to make sure it looks
like you thought it would.
* Use functions from the `dplyr` package to print out the following subsets of
this dataframe (you'll have one R call per subset): (1) The appearance year of
the first guest; (2) Names of the third through fifth guests; (3) Names of all
guests; (4) Both names and appearance years of the first and third guests.
* The `str` function can be used to figure out the structure of a dataframe. Run
this command on the `guest_list` dataframe you created. What information does
this give you? Use the helpfile for `str` to help you figure this out (which you
can access by running `?str`). Do you see anything that surprises you?
* Use the `ls` function to list all the objects you currently have defined in
your R session. Compare this list to the "Environment" pane in RStudio.

Example R code:

```{r}
# Create the data frame, then print it out to make sure it looks like you thought
# it would
library(package = "tibble")
guest_list <- tibble(Raw_Guest_List = five_guests,
                     YEAR = appearance_year)
guest_list

# Use functions from the dplyr package to extract values from the dataframe
library(package = "dplyr")
slice(.data = select(.data = guest_list, 2), 1)
slice(.data = select(.data = guest_list, 1), 3:5)
select(.data = guest_list, 1)
slice(.data = guest_list, c(1, 3))

# Use `str` to check out the structure of the data frame you created
str(guest_list)
```
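
Nested calls like `slice(.data = select(...), 1)` are read from the inside out, which can be hard to follow at first. As a sketch of an alternative, the same result can be built in two steps by saving the intermediate result as its own object. The object name `guest_years` is made up for illustration.

```{r eval = FALSE}
library(package = "tibble")
library(package = "dplyr")

# Recreate the dataframe from the example above
guest_list <- tibble(Raw_Guest_List = c("Miss Piggy", "Stanley Tucci",
                                        "Kermit the Frog", "Hank Azaria",
                                        "Al Gore"),
                     YEAR = c(1999, 2000, 2001, 2001, 2002))

# Step 1: keep only the second column (appearance year)
guest_years <- select(.data = guest_list, 2)

# Step 2: keep only the first row of that one-column dataframe
slice(.data = guest_years, 1)

# This gives the same result as the nested call
slice(.data = select(.data = guest_list, 2), 1)
```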

### Getting the data onto your computer

Next, we will work with the whole dataset. Download the data [from
GitHub](https://github.com/geanders/RProgrammingForResearch/blob/master/data/daily_show_guests.csv)
onto your computer. It is **very important** for you to use [this link](https://github.com/geanders/RProgrammingForResearch/blob/master/data/daily_show_guests.csv) rather than downloading the data from the FiveThirtyEight GitHub page, because there's a small difference between the two files. 

In class, we created an R Project for you to use for this
class. Put the *Daily Show* data in that directory.

**Take the following steps to get the data onto your computer**

* Download the file [from
GitHub](https://github.com/geanders/RProgrammingForResearch/blob/master/data/daily_show_guests.csv).
Right click on `Raw` and then choose "Download linked file". Put the file into
the directory you created for this course.
* Use the `list.files` command to make sure that the "daily_show_guests.csv"
file is in your current working directory (we'll talk more about working
directories, listing files in your working directory, and R Projects later in
the semester).

```{r eval = FALSE}
# List the files in your current working directory 
list.files()
```

```
[1] "daily_show_guests.csv"
```
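
If the file does not show up, a few other base R functions may help you track down the problem. This is only a sketch; it assumes you saved the file under the name "daily_show_guests.csv".

```{r eval = FALSE}
# Print the path of the current working directory
getwd()

# List only the csv files in the working directory
list.files(pattern = "\\.csv$")

# Check directly whether the data file is there (returns TRUE or FALSE)
file.exists("daily_show_guests.csv")
```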

### Getting the data into R

Now that you have the dataset in your working directory, you can read it into R.
This dataset is in a *csv* (comma separated values) format. (We will talk more
about different file formats in Week 2.) You can read csv files into R using the
function `read_csv` from the `readr` package.

**Read the data into your R session**

* If you do not already have it, install the `readr` package. Then load this
package within your current R session using `library`.
* Use the `read_csv` function from the `readr` package to read the data into R
and save it as the object `daily_show` (see tips in the next few bullets).
* Use the help file for the `read_csv` function to figure out how this function
works. To pull that up, type `?read_csv` at the R console. Can you figure out
why it's critical to use the `skip` option and set it to 4? (We will be talking
a lot more about the `read_csv` function in Week 2, so don't worry if you don't
completely understand it right now.)
* Note that you need to put the file name in quotation marks.
* What would have happened if you'd used `read_csv` but hadn't saved the result
as the object `daily_show`? (For example, you'd run the code
`read_csv("daily_show_guests.csv", skip = 4)` rather than `daily_show <-
read_csv("daily_show_guests.csv", skip = 4)`.)

Example R code:

```{r, eval=FALSE}
# Install (if needed) and load the `readr` package
install.packages(pkgs = "readr") # You only need to do this if you 
                          # do not already have the `readr`
                          # package.
library(package = "readr")

# Read in dataframe from the csv file with Daily Show guests
daily_show <- read_csv(file = "daily_show_guests.csv", skip = 4)
```

```{r echo = FALSE}
daily_show <- read_csv(file = "data/daily_show_guests.csv", skip = 4)
```

```{r}
# Print out the first few rows
daily_show
```
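
To get a feel for what the `skip` option does, it can help to try it on a tiny file where you can see every line. The sketch below writes a small temporary file and reads it back in. The file contents are invented purely for illustration (they are not taken from the *Daily Show* data), and the object names `example_file` and `example_data` are hypothetical.

```{r eval = FALSE}
library(package = "readr")

# Write a tiny, made-up csv file with two extra lines above the header
example_file <- tempfile(fileext = ".csv")
writeLines(text = c("# Source: an invented example",
                    "# These two lines are placeholder text",
                    "name,year",
                    "Guest A,1999",
                    "Guest B,2000"),
           con = example_file)

# Skipping the first two lines lets `read_csv` find the real header row
example_data <- read_csv(file = example_file, skip = 2)
example_data
```

Try re-running the last call without `skip = 2` to see how the result changes.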

If you have extra time:

* Say this was a really big dataset. You want to check out just the first 10
rows to make sure that you've got your code right before you take the time to
pull in the whole dataset. Use the help file for `read_csv` to figure out how to
only read in a few rows.
* Look through the help file for other options available for `read_csv`. Can you
think of examples when some of these options would be useful?
* Look again at the version of this raw data on FiveThirtyEight's GitHub page
(rather than the course's GitHub repository, where you downloaded the data for
the course exercise). How are these two versions of the raw data different? How
would you need to change your `read_csv` call if you changed to use the
FiveThirtyEight version of the raw data?

Example R code:

```{r, eval=FALSE}
# Read in only the first 10 rows of the dataset
daily_show_first10 <- read_csv(file = "daily_show_guests.csv", 
                       skip = 4, n_max = 10)
```

```{r echo = FALSE}
daily_show_first10 <- read_csv(file = "data/daily_show_guests.csv", 
                       skip = 4, n_max = 10)
```

```{r}
# Check the dataframe
daily_show_first10
```

### Checking out the data

Trade who is sharing their screen again. The new coder will need to download
the data file fresh and move it into a "data" subdirectory of the R project
created at the start of the class meeting. The previous coder should save
their R script and send it to the new person through the Zoom chat. The new 
person should start by running that code and making sure everything's working
well on their computer.


You now have the data available in your current R session as the `daily_show`
object. You'll want to check it out to make sure it read in correctly, and also
to get a feel for the data. Throughout, you can use the help pages to figure out
more about any of the functions being used (for example, `?dim`).

**Take the following steps to check out the dataset**

* Use the `dim` function to find out how many rows and columns this dataframe
has. Based on what you found out about the data from the GitHub page, does it
have the number of columns you expected? Based on what you know about the data
(that it includes all the guests who came on *The Daily Show* with Jon Stewart),
do you think it has about the right number of rows?
* Use functions from the `dplyr` package to look at the first two rows of the
dataset. Based on this, what does each row "measure" (**unit of observation**)?
What information (**variables**) do you get for each "measurement"?
* The `head` function can be used to explore the first few rows of dataframes
(see the helpfile at `?head`). Use the `head` function to look at the first few
rows of the dataframe. Does it look like the rows go in order by date? What was
the date of Jon Stewart's first show? Does it look like this dataset covers that
first show?
* Use the `tail` function to look at the last few rows of the dataframe. What is
the last show date covered by the dataframe? Who was the last guest?

Example R code:

```{r message = FALSE}
# Extract values from the dataframe
library(package = "dplyr") # Load the 'dplyr' package
slice(.data = daily_show, 1:2) # Look at the first two rows of data

# Check the dimensions of the data
dim(x = daily_show)

# Look at the first and last few rows of the data
head(x = daily_show)
tail(x = daily_show)
```

If you have extra time:

* Say you wanted to look at the first ten rows of the dataframe, rather than the
first six. How could you use an option with `head` to do this?

Example R code:

```{r}
# Look at the first ten rows of the data
head(x = daily_show, n = 10)
```

### Using the data to answer questions

Nate Silver was a guest on *The Daily Show*. Let's use this data to figure out
how many times he was a guest and when he was on the show.

**Find out more about Nate Silver on The Daily Show**

(Don't worry if you don't make it to this section! I've put it here for groups
that move through the rest quickly.)

* The `filter` function from the `dplyr` package can be combined with logical
statements to help you create a specific subset of data. For example, if you
only wanted data from guest visits in 1999, you could run `filter(.data =
daily_show, YEAR == 1999)`. Check out the helpfile for `filter` and use the
function to create a new dataframe that only has the rows of `daily_show` when
Nate Silver was a guest (`Raw_Guest_List == "Nate Silver"`). Save this as an
object named `nate_silver`.
* Print out the full `nate_silver` dataframe by typing `nate_silver`. (You could
just use this to answer both questions, but still try the next steps. They would
be important with a bigger dataset.)
* To count the number of times Nate Silver was a guest, you'll need to count the
number of rows in the new dataset. You can either use the `dim` function or the
`nrow` function to do this. What additional information does the `dim` function
give you?
* To get the dates when Nate Silver was a guest, you can print out just the
`Show` column of the dataframe. There are a few ways you can do this using the
`select` function from the `dplyr` package.

Example R code:

```{r}
library(package = "dplyr")
# Create a subset of the data with just Nate Silver appearances
nate_silver <- filter(.data = daily_show, Raw_Guest_List == "Nate Silver")

# Investigate this subset of the data
nate_silver
dim(x = nate_silver)
nrow(x = nate_silver)
select(.data = nate_silver, 3)
```
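
The example above picks the `Show` column by its position. As a sketch of the other ways `select` can identify a column, the code below uses a tiny made-up dataframe. The values are placeholders rather than real guest data, and the object name `example_visits` is hypothetical.

```{r eval = FALSE}
library(package = "tibble")
library(package = "dplyr")

# A tiny made-up dataframe (placeholder values, not real guest data)
example_visits <- tibble(YEAR = c(2001, 2002),
                         Show = c("1/1/01", "2/2/02"),
                         Raw_Guest_List = c("Guest A", "Guest A"))

# Three ways to pull out the `Show` column with `select`
select(.data = example_visits, 2)       # by column position
select(.data = example_visits, Show)    # by unquoted column name
select(.data = example_visits, "Show")  # by quoted column name
```

Using the column name tends to be easier to read later, and it keeps working if the order of the columns changes.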

If you have extra time:

* Was Nate Silver the only statistician to be a guest on the show?
* What were the occupations that were only represented by one guest visit? You
can use the `table` function on the `GoogleKnowlege_Occupation` column (which
`read_csv` reads in as a character column) to count the number of times each
value of `GoogleKnowlege_Occupation` shows up. You can save these counts as a new
object and then pull out only the values that equal 1 (so, only had one guest
visit). (Note that "Statistician" doesn't show up-- there was only one person who
was a guest, but he had three visits.) Pick your favorite "one-off" example and
find out who the guest was for that occupation.

Example R code:

```{r}
statisticians <- filter(.data = daily_show,
                        GoogleKnowlege_Occupation == "Statistician")
statisticians
```
```{r}
num_visits <- table(daily_show$GoogleKnowlege_Occupation)
head(x = num_visits) # Note: This is a table of counts rather than a dataframe

# Use a logical test inside square brackets to extract values that meet a condition
single_visits <- num_visits[num_visits == 1]
names(single_visits)

filter(.data = daily_show, GoogleKnowlege_Occupation == "chess player")
filter(.data = daily_show, GoogleKnowlege_Occupation == "mathematician")
filter(.data = daily_show, GoogleKnowlege_Occupation == "orca trainer")
filter(.data = daily_show, GoogleKnowlege_Occupation == "Puzzle Creator")
filter(.data = daily_show, GoogleKnowlege_Occupation == "Scholar")
```
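
The combination of `table` and a logical test inside square brackets can be easier to follow on a very small example. The sketch below uses a short made-up vector. The values are placeholders, and the object names are hypothetical.

```{r eval = FALSE}
# A short, made-up vector of occupations (placeholder values)
example_jobs <- c("actor", "chef", "actor", "pilot", "actor")

# Count how many times each value appears
job_counts <- table(example_jobs)
job_counts

# This test is TRUE only for values that appear exactly once
job_counts == 1

# Keep only those counts, then pull out the occupation names
single_jobs <- job_counts[job_counts == 1]
names(x = single_jobs)
```

Here, "chef" and "pilot" each appear once, so those are the two names that come back.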