week5.Rmd

---
title: "Week 5 -- Data manipulation with dplyr"
---

> #### Learning objectives
>
> * Chain operations together into a workflow using pipes
> * Learn the three key **dplyr** functions for manipulating your data
>     * **`arrange()`** for sorting the observations in your table 
>     * **`mutate()`** for creating a new variable or modifying an existing variable
>     * **`summarise()`** for collapsing values in one of more columns to a single summary value
> * Learn about faceting in ggplot2 to split your data into separate categories and create a series of sub-plots arranged in a grid

---

# Data manipulation with dplyr

**dplyr** is one of the packages that gets loaded as part of the tidyverse.

```{r}
library(tidyverse)
```

dplyr is the Swiss army knife in the tidyverse, providing many useful functions
for manipulating tabular data in data frames or tibbles. We're going to look at
the key functions for filtering our data, modifying the contents and computing
summary statistics.

We'll also introduce the pipe operator, **`%>%`**, for chaining operations
together into mini workflows in a way that makes for more readable and
maintainable code.

Finally, we'll return to plotting and look at a powerful feature of ggplot2,
**faceting**, that allows you to divide your plots into subplots by splitting
the observations based on one or more categorical variables.

We'll again use the METABRIC data set to illustrate how these operations work.

```{r message = FALSE}
metabric <- read_csv("data/metabric_clinical_and_expression_data.csv")
metabric
```

---


# dplyr verbs

We will be looking at the 3 key dplyr functions this week:

* **`arrange()`** for sorting rows
* **`mutate()`** for modifying columns or creating new ones
* **`summarise()`** for computing summary values

In looking at each of these in turn, we'll be applying these to the entire data
set. It is possible to combine these with the `group_by()` function to instead
operate on separate groups within our data set but this is something we'll cover
in detail next week.

The dplyr operations are commonly referred to as "verbs" in a data manipulation
grammar. These verbs have a common syntax and work together in a consistent and
uniform manner. They all have the following shared behaviours:

* The first argument in each function is a data frame (or tibble)

* Any additional arguments describe what operation to perform on the data frame

* Variable names, i.e. column names, are referred to without using quotes

* The result of an operation is a new data frame

---

# Chaining operations using `%>%`

Let's consider again an earlier example in which we filtered the METABRIC data
set to retain just the patients who were still alive at the time of the study
and had survived for more than 10 years (120 months). We use `filter()` to
select the rows corresponding to the patients meeting these criteria and can
then use `select()` to only display the variables (columns) we're most
interested in.

```{r}
patients_of_interest <- filter(metabric, Survival_status == "LIVING", Survival_time > 120)
patient_details_of_interest <- select(patients_of_interest, Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
patient_details_of_interest
```

Here we've used an intermediate variable, `patients_of_interest`, which we only
needed in order to get to the final result. We could just have used the same
name to avoid cluttering our environment and overwritten the results from the
`filter()` operation with those of the `select()` operation.

```{r eval = FALSE}
patients_of_interest <- select(patients_of_interest, Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
```

Another less readable way of writing this code is to nest the `filter()`
function call inside the `select()`. Although this looks very unwieldy and is
not easy to follow, nested function calls are very common in a lot of R code
you may come across.

```{r}
patients_of_interest <- select(filter(metabric, Survival_status == "LIVING", Survival_time > 120), Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
nrow(patients_of_interest)
```

However, there is another way chaining together a series of operations into a
mini workflow that is elegant, intuitive and makes for very readable R code.
For that we need to introduce a new operator, the **pipe** operator, **`%>%`**.

```{block type = "rmdblock"}
**The pipe operator `%>%`**

The pipe operator takes the output from one operation, i.e. whatever is on the
left-hand side of `%>%` and passes it in as the first argument to the second
operation, or function, on the right-hand side.

**`x %>% f(y)`** is equivalent to **`f(x, y)`**

For example:

`select(starwars, name, height, mass)`

can be rewritten as

`starwars %>% select(name, height, mass)`

This allows for chaining of operations into workflows, e.g.

`starwars %>%`<br/>
&nbsp;&nbsp;&nbsp;&nbsp;`filter(species == "Droid") %>%`<br/>
&nbsp;&nbsp;&nbsp;&nbsp;`select(name, height, mass)`

The `%>%` operator comes from the `magrittr` package (do you get the reference?)
and is available when we load the tidyverse using `library(tidyverse)`.

Piping in R was motivated by the Unix pipe, `|`, in which the output from one
process is redirected to be the input for the next. This is so named because the
flow from one process or operation to the next resembles a pipeline.
```

We can rewrite the code for our filtering and column selection operations as
follows.

```{r}
patients_of_interest <- metabric %>%
  filter(Survival_status == "LIVING", Survival_time > 120) %>%
  select(Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
```

Note how each operation takes the output from the previous operation as its
first argument. This way of coding is embraced wholeheartedly in the tidyverse
hence almost every tidyverse function that works on data frames has the data
frame as its first argument. It is also the reason why tidyverse functions
return a data frame regardless of whether the output could be recast as a vector
or a single value.

"Piping", the act of chaining operations together, becomes really useful when
there are several steps involved in filtering and transforming a data set.

The usual way of developing a workflow is to build it up one step at a time,
testing the output produced at each stage. Let's do that for this case.

We start by considering just the patients who are living.

```{r}
patients_of_interest <- metabric %>%
  filter(Survival_status == "LIVING")
```

We then add another filter for the survival time.

```{r}
patients_of_interest <- metabric %>%
  filter(Survival_status == "LIVING") %>%
  filter(Survival_time > 120)
```

At each stage we look at the resulting `patients_of_interest` data frame to check
we're happy with the result.

Finally we only want certain columns, so we add a `select()` operation.

```{r}
patients_of_interest <- metabric %>%
  filter(Survival_status == "LIVING") %>%
  filter(Survival_time > 120) %>%
  select(Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
# print out the result
patients_of_interest
```

When continuing our workflow across multiple lines, we need to be careful to
ensure the `%>%` is at the end of the line. If we try to place this at the start
of the next line, R will think we've finished the workflow prematurely and will
report an error at what it considers the next statement, i.e. the line that
begins with `%>%`.

```{r eval = FALSE}
# R considers the following to be 2 separate commands, the first of which ends
# with the first filter operation and runs successfully.
# The second statement is the third and last line, is not a valid commmand and
# so you'll get an error message
patients_of_interest <- metabric %>%
  filter(Survival_status == "LIVING")
  %>% filter(Survival_time > 120)
```

This is very similar to what we saw with adding layers and other components to
a ggplot using the `+` operator, where we needed the `+` to be at the end of a
line.

We'll be using the pipe `%>%` operator throughout the rest of the course so
you'd better get used to it. But actually we think you'll come to love it as
much as we do.

---

# Sorting using `arrange()`

It is sometimes quite useful to sort our data frame based on the values in one
or more of the columns, particularly when displaying the contents or saving them
to a file. The `arrange()` function in dplyr provides this sorting capability.

For example, we can sort the METABRIC patients into order of increasing
survival time.

```{r}
arrange(metabric, Survival_time)
```

Or we might be more interested in the patients that survived the longest so
would need the order to be of decreasing survival time. For that we can use
the helper function `desc()` that works specifically with `arrange()`.

```{r}
arrange(metabric, desc(Survival_time))
```

As with the other tidyverse functions and, in particular, the other 4 key dplyr
'verbs', `arrange()` works rather well in workflows in which successive
operations are chained using `%>%`.

```{r}
patients_of_interest <- metabric %>%
  filter(Survival_status == "LIVING") %>%
  filter(Survival_time > 120) %>%
  select(Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index) %>%
  arrange(desc(Survival_time))
# print out the result
patients_of_interest
```

We can sort by more than one variable by adding more variable arguments to
`arrange()`.

```{r}
arrange(patients_of_interest, Tumour_stage, Nottingham_prognostic_index)
```

Here we've sorted first by tumour stage from lowest to highest value and then
by the Nottingham prognostic index for when there are ties, i.e. where the
tumour stage is the same.

Sorting is most commonly used in workflows as one of the last steps before
printing out a data frame or writing out the table to a file.

---

# Modifying data using `mutate()`

In one of the examples of filtering observations using `filter()` above, we
created a new logical variable called `Deceased`.

```{r}
metabric$Deceased <- metabric$Survival_status == "DECEASED"
```

This is an example of a rather common type of data manipulation in which we
crate a new column based on the values contained in one or more other columns.
The `dplyr` package provides the `mutate()` function for this purpose.

```{r}
metabric <- mutate(metabric, Deceased = Survival_status == "DECEASED")
```

Both of these methods adds the new column to the end.

Note that variables names in the `mutate()` function call do not need to be
prefixed by `metabric$` just as they don't in any of the `dplyr` functions.

We can overwrite a column and this is quite commonly done when we want to change
the units. For example, our tumour size values are in millimetres but the
formula for the Nottingham prognostic index needs the tumour size to be in
centimetres. We can convert to `Tumour_size` to centimetres by dividing by 10.

```{r}
metabric <- mutate(metabric, Tumour_size = Tumour_size / 10)
```

```{block type = "rmdblock"}
**Nottingham Prognostic Index**

The Nottingham prognostic index (**NPI**) is used to determine prognosis
following surgery for breast cancer. Its value is calculated using three
pathological criteria: the size of the tumour, the number of lymph nodes
involved, and the grade of the tumour.

The formula for the Nottingham prognostic index is:

**`NPI = (0.2 * S) + N + G`**

where

* **S** is the size of the tumour in centimetres
* **N** is the node status (0 nodes = 1, 1-3 nodes = 2, >3 nodes = 3)
* **G** is the grade of tumour (Grade I = 1, Grade II = 2, Grade III = 3)
```

We'll recalculate the Nottingham prognostic index using the values
`Tumour_size`, `Neoplasm_histologic_grade` and `Lymph_node_status` columns in
our METABRIC data frame.

```{r}
metabric <- mutate(metabric, NPI = 0.2 * Tumour_size + Lymph_node_status + Neoplasm_histologic_grade)
select(metabric, Tumour_size, Lymph_node_status, Neoplasm_histologic_grade, NPI, Nottingham_prognostic_index)
```

There is a discrepency. Can you see what the problem is?

It appears that the tumour size wasn't correctly converted into centimetres in
the original NPI calculation or that the wrong scaling factor for tumour size
was used. This could explain why the plots we created in weeks 1 and 2 featuring
the Nottingham prognostic index looked so odd. We'll see if they look any better
with our recalculated values.

```{r scatter_plot_1}
ggplot(data = metabric) +
  geom_point(mapping = aes(x = Age_at_diagnosis, y = NPI), na.rm = TRUE)
```

There is still some banding but nothing like the NPI values downloaded from
cBioPortal which line up very closely to whole numbers.

The **`round()`** function is really useful for rounding numerical values to a
specified number of decimal places. We'll read in the METABRIC data again and
create a small workflow that carries out the tumour size conversion, computes
the NPI, rounds the tumour size and the resulting NPI value to 1 decimal place
and displays the results in decreasing order of NPI.

```{r message = FALSE}
read_csv("data/metabric_clinical_and_expression_data.csv") %>%
  mutate(Tumour_size = Tumour_size / 10) %>%
  mutate(NPI = 0.2 * Tumour_size + Lymph_node_status + Neoplasm_histologic_grade) %>%
  mutate(Tumour_size = round(Tumour_size, digits = 1)) %>%
  mutate(NPI = round(NPI, digits = 1)) %>%
  arrange(desc(NPI)) %>%
  select(Tumour_size, Lymph_node_status, Neoplasm_histologic_grade, NPI)
```

## Mutating multiple columns

In that last workflow we included the same rounding operation applied to two
different variables. It would be nice to be able to carry out just the one
`mutate()` but apply it to both `Tumour_size` and `NPI` columns and we can using
**`mutate_at()`**.

```{r}
metabric %>%
  mutate_at(vars(Tumour_size, NPI), round, digits = 1) %>%
  select(Patient_ID, Tumour_size, NPI)
```

This is slightly more complicated. We had to group the variables for which we
want the function to apply using `vars()`. We then gave the function name as the
next argument and finally any additional arguments that the function needs, in
this case the number of digits.

We can use the range operator and the same helper functions as we did for
selecting columns using `select()` inside `vars()`.

For example, we might decide that our expression values are given to a much
higher degree of precision than is strictly necessary.

```{r}
metabric %>%
  mutate_at(vars(ESR1:MLPH), round, digits = 2) %>%
  select(Patient_ID, ESR1:MLPH)
```

Or we could decide that all the columns whose names end with "_status" are in
fact categorical variables and should be converted to factors.

```{r}
metabric %>%
  mutate_at(vars(ends_with("_status")), as.factor) %>%
  select(Patient_ID, ends_with("_status"))
```

## Anonymous functions

The `mutate_at()` function, and the related `mutate_if()` and `mutate_all()` 
functions, are really very powerful but with that comes additional complexity.

For example, we may come across situations where we'd like to apply the same
operation to multiple columns but where there is no available function in R
to do what we want.

Let's say we want to convert the petal and sepal measurements in the `iris`
data set from centimetres to millimetres. We'd either need to create a new
function to do this conversion or we could use what is known as an anonymous
function, also known as a lambda expression.

There is no 'multiply by 10' function and it seems a bit pointless to create
one just for this conversion so we'll use an anonymous function instead --
anonymous because it has no name, it's an _in situ_ function only used in our
`mutate_at()` function call.

```{r}
iris %>%
  as_tibble() %>%
  mutate_at(vars(Sepal.Length:Petal.Width), ~ . * 10)
```

The **`~`** denotes that we're using an anonymous function (it is the symbol for
formulae in R) and the `.` is a placeholder for the column being operated on. In
this case, we're multiplying each of the columns between `Sepal.Length` and
`Petal.Width` inclusive by 10.

If you think this is getting fairly complicated you'd be right. We'll leave it
there for now but point you to the help page for `mutate_at` if you're
interested in finding out more.

---

# Computing summary values using `summarise()`

We'll cover the fifth of the main dplyr 'verb' functions, **`summarise()`**,
only briefly here. This function computes summary values for one or more
variables (columns) in a table. Here we will summarise values for the entire
table but this function is much more useful in combination with `group_by()`
in working on groups of observations within the data set. We will look at
summarizing groups of observations next week.

Any function that calculates a single scalar value from a vector can be used
with `summarise()`. For example, the `mean()` function calculates the arithmetic
mean of a numeric vector. Let's calculate the average ESR1 expression for tumour
samples in the METABRIC data set.

```{r}
mean(metabric$ESR1)
```

The equivalent operation using `summarise()` is:

```{r}
summarise(metabric, mean(ESR1))
```

If you prefer Oxford spelling, in which -ize is preferred to -ise, you’re in
luck as dplyr accommodates the alternative spelling.

Both of the above statements gave the same average expression value but these
were output in differing formats. The `mean()` function collapses a vector to
single scalar value, which as we know is in fact a vector of length 1. The
`summarise()` function, as with most tidyverse functions, returns another data
frame, albeit one in which there is a single row and a single column.

Returning a data frame might be quite useful, particularly if we’re summarising
multiple columns or using more than one function, for example computing the
average and standard deviation.

```{r}
summarise(metabric, ESR1_mean = mean(ESR1), ESR1_sd = sd(ESR1))
```

Notice how we also named the output columns in this last example.

```{block type = "rmdblock"}
**`summarise()`**

`summarise()` collapses a data frame into a single row by calculating summary
values of one or more of the columns.

It can take any function that takes a vector of values and returns a single
value. Some of the more useful functions include:

* Centre: **`mean()`**, **`median()`**
* Spread: **`sd()`**, **`mad()`**
* Range: **`min()`**, **`max()`**, **`quantile()`**
* Position: **`first()`**, **`last()`**
* Count: **`n()`**

Note the `first()`, `last()` and `n()` are only really useful when working on
groups of observations using **`group_by()`**.

**`n()`** is a special function that returns the number of observations; it
doesn't take a vector argument, i.e. a column.
```

It is also possible to summarise using a function that takes more than one
value, i.e. from multiple columns. For example, we could compute the correlation
between the expression of FOXA1 and MLPH.

```{r}
summarise(metabric, correlation = cor(FOXA1, MLPH))
```

## Summarizing multiple columns

Much like `mutate()` with its `mutate_at()`, `mutate_if()` and `mutate_all()`
variants, there is a family of `summarise()` functions similarly named for
applying the same summarization function to multiple columns in a single
operation. These work in much the same way as their `mutate` cousins.

**`summarise_at()`** allows us to select the columns on which to operate using
an additional `vars()` argument.

```{r}
summarise_at(metabric, vars(FOXA1, MLPH), mean)
```

Selecting the columns is done in the same way as for `mutate_at()` and
`select()`.

**`summarise_all()`** summarises values in all columns.

```{r}
metabric %>%
  select(ESR1:MLPH) %>%
  summarise_all(mean)
```

You have to be careful with `summarise_all()` that all columns can be summarised
with the given summary function. For example, what happens if we try to compute
an average of a set of character values?

```{r}
summarise_all(metabric, mean, na.rm = TRUE)
```

We get a lot of warning messages and `NA` values for those columns for which
computing an average does not make sense.

**`summarise_if()`** can be used to select those values for which a
summarization function is appropriate.

```{r}
summarise_if(metabric, is.numeric, mean, na.rm = TRUE)
```

It is possible to summarise using more than one function in which case a list of
functions needs to be provided.

```{r}
summarise_at(metabric, vars(ESR1, ERBB2, PGR), list(mean, sd))
```

Pretty neat but I'm not sure about those column headings in the output --
fortunately we have some control over these.

```{r}
summarise_at(metabric, vars(ESR1, ERBB2, PGR), list(average = mean, stdev = sd))
```

## Anonymous functions

The `mutate_` functions and `summarise_` functions work in a very similar
manner, very much in line with the coherent and consistent framework provided
by `dplyr` and the entire tidyverse. For example, we could use an anonymous
function in a `summarise_at()` operation applied to multiple variables. In the
assignment from last week, we asked you to compute the correlation of the
expression for FOXA1 against all other genes to see which was most strongly
correlated. Here is how we could do this in a single `summarise_at()` statement
using an anonymous function.

```{r}
summarise_at(metabric, vars(ESR1:MLPH, -FOXA1), ~ cor(., FOXA1))
```

Notice how we selected all genes between ESR1, the first gene column in our
data frame, and MLPH, the last gene column, but then excluded FOXA1 as we're not
all that interested in the correlation of FOXA1 with itself (we know the answer
is 1).

---

# Faceting with ggplot2

Finally, let's change track completely and take a look at a very useful feature
of ggplot2 -- **faceting**.

Faceting allows you to split your plot into subplots, or facets, based on one
or more categorical variables. Each of the subplots displays a subset of the
data.

There are two faceting functions, **`facet_wrap()`** and **`facet_grid()`**.

Let's create a scatter plot of GATA3 and ESR1 expression values where we're
displaying the PR positive and PR negative patients using different colours.
This is a very similar to a plot we created last week.

```{r scatter_plot_2}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PR_status)) +
  geom_point(size = 0.5, alpha = 0.5)
```

An alternative is to use faceting with **`facet_wrap()`**.

```{r scatter_plot_3}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(PR_status))
```

This produces two plots, side-by-side, one for each of the categories in the
`PR_status` variable, with a banner across the top of each for the category.

The variable(s) used for faceting are specified using `vars()` in a similar way
to the selection of variables for `mutate_at()` and `summarise_at()`.

We can still use separate colours if we prefer things to be, well, colourful.

```{r scatter_plot_4}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PR_status)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(PR_status))
```

Faceting is usually better than displaying groups using different colours when
there are more than two or three groups when it can be difficult to really tell
which points belong to each group. A case in point is for the 3-gene
classification in the GATA3 vs ESR1 scatter plot we created last week. Let's
create a faceted version of that plot.

```{r scatter_plot_5}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(`3-gene_classifier`))
```

This helps explain why the function is called `facet_wrap()`. When it has too
many subplots to fit across the page, it wraps around to another row. We can
control how many rows or columns to use with the `nrow` and `ncol` arguments.

```{r scatter_plot_6}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(`3-gene_classifier`), nrow = 1)
```

```{r scatter_plot_7}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(`3-gene_classifier`), ncol = 2)
```

We can combine faceting on one variable with a colour aesthetic for another
variable. For example, let's show the tumour stage status using faceting and the HER2
status using colours.

```{r scatter_plot_8}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = HER2_status)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(Neoplasm_histologic_grade))
```

Instead of this we could facet on more than variable.

```{r scatter_plot_9}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_wrap(vars(Neoplasm_histologic_grade, HER2_status))
```

Faceting on two variables is usually better done using the other faceting
function, **`facet_grid()`**. Note the change in how the formula is written.

```{r scatter_plot_10}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(vars(Neoplasm_histologic_grade), vars(HER2_status))
```

Again we can use colour aesthetics alongside faceting to add further information
to our visualization.

```{r scatter_plot_11}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(vars(Neoplasm_histologic_grade), vars(HER2_status))
```

Finally, we can use a `labeller` to change the labels for each of the
categorical values so that these are more meaningful in the context of this
plot.

```{r scatter_plot_12}
grade_labels <- c("1" = "Grade I", "2" = "Grade II", "3" = "Grade III")
her2_status_labels <- c("Positive" = "HER2 positive", "Negative" = "HER2 negative")
#
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(vars(Neoplasm_histologic_grade),
             vars(HER2_status),
             labeller = labeller(
               Neoplasm_histologic_grade = grade_labels,
               HER2_status = her2_status_labels
              )
            )
```

This would certainly be necessary if we were to use ER and HER2 status on one
side of the grid.

```{r scatter_plot_13}
er_status_labels <- c("Positive" = "ER positive", "Negative" = "ER negative")
#
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
  geom_point(size = 0.5, alpha = 0.5) +
  facet_grid(vars(Neoplasm_histologic_grade),
             vars(ER_status, HER2_status),
             labeller = labeller(
               Neoplasm_histologic_grade = grade_labels,
               ER_status = er_status_labels,
               HER2_status = her2_status_labels
              )
            )
```

---

# Summary

In this session we have covered the following concepts:

* Building up workflows by chaining operations together using the pipe operator
* Sorting rows based on values in one or more columns
* Modifying a data frame by either adding new columns or modifying existing ones
* Summarizing the values in one or more columns
* Faceting of ggplot2 visualizations

---

## Exercises

The week's exercises will test your ability to manipulate the METABRIC data set
by changing the values of existing columns or adding new columns by computing
new variables from existing ones.

We are expecting you to use the 5 main dplyr 'verb' functions: `select()`,
`filter()`, `arrange()`, `mutate()` and `summarize()`. Please use the pipe
operator, `%>%`, in cases where more than one operation is required to achieve
the outcome requested.

```{r message = FALSE, warning = FALSE}
library(tidyverse)
metabric <- read_csv("data/metabric_clinical_and_expression_data.csv")
```

---

**1. Investigate the subset of long-surviving breast cancer patients that didn't receive chemo or radiotherapy**

First obtain the subset of patients that received neither chemotherapy or
radiotherapy and survived for more than 20 years.

```{r}
patients_of_interest <- filter(metabric, Chemotherapy == "NO", Radiotherapy == "NO", Survival_time / 12 > 20)
# alternative using separate steps
patients_of_interest <- metabric %>%
  filter(Chemotherapy == "NO") %>%
  filter(Radiotherapy == "NO") %>%
  mutate(Survival_time_years = Survival_time / 12) %>%
  filter(Survival_time_years > 20)
#
patients_of_interest
```

Now look at the breakdown of these patients in terms of ER status. Count the
numbers of ER positive and ER negative patients in this subset. Calculate the
proportion that are ER positive.

```{r}
table(patients_of_interest$ER_status)
mean(patients_of_interest$ER_status == "Positive")
```

What does this tell us? Calculate the proportion of ER positive patients in the
whole cohort by way of a comparison.

```{r}
table(metabric$ER_status)
mean(metabric$ER_status == "Positive")
```


**2. Which patients have spent the largest proportion of their lives dealing with breast cancer?**

```{r}
### Hint 
### Create Survival_time_years
### Add it to Age_at_diagnosis
### Create Proportion: Survival_time_years/Age

metabric %>%
  mutate(Survival_time_years = Survival_time / 12) %>%
  mutate(Age = Age_at_diagnosis + Survival_time_years) %>%
  mutate(Proportion_of_life_since_diagnosis = Survival_time_years / Age) %>%
  select(Patient_ID, Age, Age_at_diagnosis, Survival_time_years, Proportion_of_life_since_diagnosis) %>%
  arrange(desc(Proportion_of_life_since_diagnosis)) %>%
  mutate_if(is.numeric, round, digits = 2)

```

---

**3. Convert the expression values for each of the genes into standardized z-scores**

Some genes are generally expressed at higher levels than others. This can make
comparisons of changes between groups for a set of genes somewhat difficult,
particularly if the expression for those genes are on very different scales. The
expression values in our METABRIC are on a log~2~ scale which helps to reduce
the range of values but another method for representing expression measurements
is to standardize these to produce z-scores.

Standardization of a set of measurements involves subtracting the mean from each
and dividing by the standard deviation. This will produce values with a mean of
0 and a standard deviation of 1.

Create a modified version of the `metabric` data frame containing a new column
with the standardized expression values (z-scores) for the ESR1 gene.

```{r}
### Hint 
### ESR1 - mean(ESR1)) / sd(ESR1)
metabric_z_scores <- mutate(metabric, ESR1_z_score = (ESR1 - mean(ESR1)) / sd(ESR1))
```

Check that you've done this correctly by calculating the mean and standard
deviation of your new z-score variable.

```{r}
summarise(metabric_z_scores, mean = mean(ESR1_z_score), sd = sd(ESR1_z_score))
```

Add another column to your modified `metabric` data frame containing a z-score
for GATA3 and then create a scatter plot of the z-scores of GATA3 against ESR1.
Modify your plot to facet by the PAM50 classification.

```{r}
metabric_z_scores <- mutate(metabric_z_scores, GATA3_z_score = (GATA3 - mean(GATA3)) / sd(GATA3))
ggplot(data = metabric_z_scores) +
  geom_point(mapping = aes(x = GATA3_z_score, y = ESR1_z_score)) +
  facet_wrap(vars(PAM50))
```

---

**4. Which Star Wars characters need to go on a diet?**

Compute the body mass index (BMI) of characters in the `starwars` tibble.

```{r}
starwars <- starwars %>%
  mutate(height = height / 100) %>%
  mutate(BMI = mass / height ^ 2)
```

Filter for human characters that are overweight (BMI > 25) and display in
decreasing order of BMI.

```{r}
starwars %>%
  filter(species == "Human") %>%
  filter(BMI > 25) %>%
  arrange(desc(BMI)) %>%
  select(name, height, mass, BMI)
```