forked from bioinformatics-core-shared-training/r-intro
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathweek5.Rmd
881 lines (661 loc) · 31 KB
/
week5.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
---
title: "Week 5 -- Data manipulation with dplyr"
---
> #### Learning objectives
>
> * Chain operations together into a workflow using pipes
> * Learn the three key **dplyr** functions for manipulating your data
> * **`arrange()`** for sorting the observations in your table
> * **`mutate()`** for creating a new variable or modifying an existing variable
> * **`summarise()`** for collapsing values in one of more columns to a single summary value
> * Learn about faceting in ggplot2 to split your data into separate categories and create a series of sub-plots arranged in a grid
---
# Data manipulation with dplyr
**dplyr** is one of the packages that gets loaded as part of the tidyverse.
```{r}
library(tidyverse)
```
dplyr is the Swiss army knife in the tidyverse, providing many useful functions
for manipulating tabular data in data frames or tibbles. We're going to look at
the key functions for filtering our data, modifying the contents and computing
summary statistics.
We'll also introduce the pipe operator, **`%>%`**, for chaining operations
together into mini workflows in a way that makes for more readable and
maintainable code.
Finally, we'll return to plotting and look at a powerful feature of ggplot2,
**faceting**, that allows you to divide your plots into subplots by splitting
the observations based on one or more categorical variables.
We'll again use the METABRIC data set to illustrate how these operations work.
```{r message = FALSE}
metabric <- read_csv("data/metabric_clinical_and_expression_data.csv")
metabric
```
---
# dplyr verbs
We will be looking at the 3 key dplyr functions this week:
* **`arrange()`** for sorting rows
* **`mutate()`** for modifying columns or creating new ones
* **`summarise()`** for computing summary values
In looking at each of these in turn, we'll be applying these to the entire data
set. It is possible to combine these with the `group_by()` function to instead
operate on separate groups within our data set but this is something we'll cover
in detail next week.
The dplyr operations are commonly referred to as "verbs" in a data manipulation
grammar. These verbs have a common syntax and work together in a consistent and
uniform manner. They all have the following shared behaviours:
* The first argument in each function is a data frame (or tibble)
* Any additional arguments describe what operation to perform on the data frame
* Variable names, i.e. column names, are referred to without using quotes
* The result of an operation is a new data frame
---
# Chaining operations using `%>%`
Let's consider again an earlier example in which we filtered the METABRIC data
set to retain just the patients who were still alive at the time of the study
and had survived for more than 10 years (120 months). We use `filter()` to
select the rows corresponding to the patients meeting these criteria and can
then use `select()` to only display the variables (columns) we're most
interested in.
```{r}
patients_of_interest <- filter(metabric, Survival_status == "LIVING", Survival_time > 120)
patient_details_of_interest <- select(patients_of_interest, Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
patient_details_of_interest
```
Here we've used an intermediate variable, `patients_of_interest`, which we only
needed in order to get to the final result. We could just have used the same
name to avoid cluttering our environment and overwritten the results from the
`filter()` operation with those of the `select()` operation.
```{r eval = FALSE}
patients_of_interest <- select(patients_of_interest, Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
```
Another less readable way of writing this code is to nest the `filter()`
function call inside the `select()`. Although this looks very unwieldy and is
not easy to follow, nested function calls are very common in a lot of R code
you may come across.
```{r}
patients_of_interest <- select(filter(metabric, Survival_status == "LIVING", Survival_time > 120), Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
nrow(patients_of_interest)
```
However, there is another way chaining together a series of operations into a
mini workflow that is elegant, intuitive and makes for very readable R code.
For that we need to introduce a new operator, the **pipe** operator, **`%>%`**.
```{block type = "rmdblock"}
**The pipe operator `%>%`**
The pipe operator takes the output from one operation, i.e. whatever is on the
left-hand side of `%>%` and passes it in as the first argument to the second
operation, or function, on the right-hand side.
**`x %>% f(y)`** is equivalent to **`f(x, y)`**
For example:
`select(starwars, name, height, mass)`
can be rewritten as
`starwars %>% select(name, height, mass)`
This allows for chaining of operations into workflows, e.g.
`starwars %>%`<br/>
`filter(species == "Droid") %>%`<br/>
`select(name, height, mass)`
The `%>%` operator comes from the `magrittr` package (do you get the reference?)
and is available when we load the tidyverse using `library(tidyverse)`.
Piping in R was motivated by the Unix pipe, `|`, in which the output from one
process is redirected to be the input for the next. This is so named because the
flow from one process or operation to the next resembles a pipeline.
```
We can rewrite the code for our filtering and column selection operations as
follows.
```{r}
patients_of_interest <- metabric %>%
filter(Survival_status == "LIVING", Survival_time > 120) %>%
select(Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
```
Note how each operation takes the output from the previous operation as its
first argument. This way of coding is embraced wholeheartedly in the tidyverse
hence almost every tidyverse function that works on data frames has the data
frame as its first argument. It is also the reason why tidyverse functions
return a data frame regardless of whether the output could be recast as a vector
or a single value.
"Piping", the act of chaining operations together, becomes really useful when
there are several steps involved in filtering and transforming a data set.
The usual way of developing a workflow is to build it up one step at a time,
testing the output produced at each stage. Let's do that for this case.
We start by considering just the patients who are living.
```{r}
patients_of_interest <- metabric %>%
filter(Survival_status == "LIVING")
```
We then add another filter for the survival time.
```{r}
patients_of_interest <- metabric %>%
filter(Survival_status == "LIVING") %>%
filter(Survival_time > 120)
```
At each stage we look at the resulting `patients_of_interest` data frame to check
we're happy with the result.
Finally we only want certain columns, so we add a `select()` operation.
```{r}
patients_of_interest <- metabric %>%
filter(Survival_status == "LIVING") %>%
filter(Survival_time > 120) %>%
select(Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index)
# print out the result
patients_of_interest
```
When continuing our workflow across multiple lines, we need to be careful to
ensure the `%>%` is at the end of the line. If we try to place this at the start
of the next line, R will think we've finished the workflow prematurely and will
report an error at what it considers the next statement, i.e. the line that
begins with `%>%`.
```{r eval = FALSE}
# R considers the following to be 2 separate commands, the first of which ends
# with the first filter operation and runs successfully.
# The second statement is the third and last line, is not a valid commmand and
# so you'll get an error message
patients_of_interest <- metabric %>%
filter(Survival_status == "LIVING")
%>% filter(Survival_time > 120)
```
This is very similar to what we saw with adding layers and other components to
a ggplot using the `+` operator, where we needed the `+` to be at the end of a
line.
We'll be using the pipe `%>%` operator throughout the rest of the course so
you'd better get used to it. But actually we think you'll come to love it as
much as we do.
---
# Sorting using `arrange()`
It is sometimes quite useful to sort our data frame based on the values in one
or more of the columns, particularly when displaying the contents or saving them
to a file. The `arrange()` function in dplyr provides this sorting capability.
For example, we can sort the METABRIC patients into order of increasing
survival time.
```{r}
arrange(metabric, Survival_time)
```
Or we might be more interested in the patients that survived the longest so
would need the order to be of decreasing survival time. For that we can use
the helper function `desc()` that works specifically with `arrange()`.
```{r}
arrange(metabric, desc(Survival_time))
```
As with the other tidyverse functions and, in particular, the other 4 key dplyr
'verbs', `arrange()` works rather well in workflows in which successive
operations are chained using `%>%`.
```{r}
patients_of_interest <- metabric %>%
filter(Survival_status == "LIVING") %>%
filter(Survival_time > 120) %>%
select(Patient_ID, Survival_time, Tumour_stage, Nottingham_prognostic_index) %>%
arrange(desc(Survival_time))
# print out the result
patients_of_interest
```
We can sort by more than one variable by adding more variable arguments to
`arrange()`.
```{r}
arrange(patients_of_interest, Tumour_stage, Nottingham_prognostic_index)
```
Here we've sorted first by tumour stage from lowest to highest value and then
by the Nottingham prognostic index for when there are ties, i.e. where the
tumour stage is the same.
Sorting is most commonly used in workflows as one of the last steps before
printing out a data frame or writing out the table to a file.
---
# Modifying data using `mutate()`
In one of the examples of filtering observations using `filter()` above, we
created a new logical variable called `Deceased`.
```{r}
metabric$Deceased <- metabric$Survival_status == "DECEASED"
```
This is an example of a rather common type of data manipulation in which we
crate a new column based on the values contained in one or more other columns.
The `dplyr` package provides the `mutate()` function for this purpose.
```{r}
metabric <- mutate(metabric, Deceased = Survival_status == "DECEASED")
```
Both of these methods adds the new column to the end.
Note that variables names in the `mutate()` function call do not need to be
prefixed by `metabric$` just as they don't in any of the `dplyr` functions.
We can overwrite a column and this is quite commonly done when we want to change
the units. For example, our tumour size values are in millimetres but the
formula for the Nottingham prognostic index needs the tumour size to be in
centimetres. We can convert to `Tumour_size` to centimetres by dividing by 10.
```{r}
metabric <- mutate(metabric, Tumour_size = Tumour_size / 10)
```
```{block type = "rmdblock"}
**Nottingham Prognostic Index**
The Nottingham prognostic index (**NPI**) is used to determine prognosis
following surgery for breast cancer. Its value is calculated using three
pathological criteria: the size of the tumour, the number of lymph nodes
involved, and the grade of the tumour.
The formula for the Nottingham prognostic index is:
**`NPI = (0.2 * S) + N + G`**
where
* **S** is the size of the tumour in centimetres
* **N** is the node status (0 nodes = 1, 1-3 nodes = 2, >3 nodes = 3)
* **G** is the grade of tumour (Grade I = 1, Grade II = 2, Grade III = 3)
```
We'll recalculate the Nottingham prognostic index using the values
`Tumour_size`, `Neoplasm_histologic_grade` and `Lymph_node_status` columns in
our METABRIC data frame.
```{r}
metabric <- mutate(metabric, NPI = 0.2 * Tumour_size + Lymph_node_status + Neoplasm_histologic_grade)
select(metabric, Tumour_size, Lymph_node_status, Neoplasm_histologic_grade, NPI, Nottingham_prognostic_index)
```
There is a discrepency. Can you see what the problem is?
It appears that the tumour size wasn't correctly converted into centimetres in
the original NPI calculation or that the wrong scaling factor for tumour size
was used. This could explain why the plots we created in weeks 1 and 2 featuring
the Nottingham prognostic index looked so odd. We'll see if they look any better
with our recalculated values.
```{r scatter_plot_1}
ggplot(data = metabric) +
geom_point(mapping = aes(x = Age_at_diagnosis, y = NPI), na.rm = TRUE)
```
There is still some banding but nothing like the NPI values downloaded from
cBioPortal which line up very closely to whole numbers.
The **`round()`** function is really useful for rounding numerical values to a
specified number of decimal places. We'll read in the METABRIC data again and
create a small workflow that carries out the tumour size conversion, computes
the NPI, rounds the tumour size and the resulting NPI value to 1 decimal place
and displays the results in decreasing order of NPI.
```{r message = FALSE}
read_csv("data/metabric_clinical_and_expression_data.csv") %>%
mutate(Tumour_size = Tumour_size / 10) %>%
mutate(NPI = 0.2 * Tumour_size + Lymph_node_status + Neoplasm_histologic_grade) %>%
mutate(Tumour_size = round(Tumour_size, digits = 1)) %>%
mutate(NPI = round(NPI, digits = 1)) %>%
arrange(desc(NPI)) %>%
select(Tumour_size, Lymph_node_status, Neoplasm_histologic_grade, NPI)
```
## Mutating multiple columns
In that last workflow we included the same rounding operation applied to two
different variables. It would be nice to be able to carry out just the one
`mutate()` but apply it to both `Tumour_size` and `NPI` columns and we can using
**`mutate_at()`**.
```{r}
metabric %>%
mutate_at(vars(Tumour_size, NPI), round, digits = 1) %>%
select(Patient_ID, Tumour_size, NPI)
```
This is slightly more complicated. We had to group the variables for which we
want the function to apply using `vars()`. We then gave the function name as the
next argument and finally any additional arguments that the function needs, in
this case the number of digits.
We can use the range operator and the same helper functions as we did for
selecting columns using `select()` inside `vars()`.
For example, we might decide that our expression values are given to a much
higher degree of precision than is strictly necessary.
```{r}
metabric %>%
mutate_at(vars(ESR1:MLPH), round, digits = 2) %>%
select(Patient_ID, ESR1:MLPH)
```
Or we could decide that all the columns whose names end with "_status" are in
fact categorical variables and should be converted to factors.
```{r}
metabric %>%
mutate_at(vars(ends_with("_status")), as.factor) %>%
select(Patient_ID, ends_with("_status"))
```
## Anonymous functions
The `mutate_at()` function, and the related `mutate_if()` and `mutate_all()`
functions, are really very powerful but with that comes additional complexity.
For example, we may come across situations where we'd like to apply the same
operation to multiple columns but where there is no available function in R
to do what we want.
Let's say we want to convert the petal and sepal measurements in the `iris`
data set from centimetres to millimetres. We'd either need to create a new
function to do this conversion or we could use what is known as an anonymous
function, also known as a lambda expression.
There is no 'multiply by 10' function and it seems a bit pointless to create
one just for this conversion so we'll use an anonymous function instead --
anonymous because it has no name, it's an _in situ_ function only used in our
`mutate_at()` function call.
```{r}
iris %>%
as_tibble() %>%
mutate_at(vars(Sepal.Length:Petal.Width), ~ . * 10)
```
The **`~`** denotes that we're using an anonymous function (it is the symbol for
formulae in R) and the `.` is a placeholder for the column being operated on. In
this case, we're multiplying each of the columns between `Sepal.Length` and
`Petal.Width` inclusive by 10.
If you think this is getting fairly complicated you'd be right. We'll leave it
there for now but point you to the help page for `mutate_at` if you're
interested in finding out more.
---
# Computing summary values using `summarise()`
We'll cover the fifth of the main dplyr 'verb' functions, **`summarise()`**,
only briefly here. This function computes summary values for one or more
variables (columns) in a table. Here we will summarise values for the entire
table but this function is much more useful in combination with `group_by()`
in working on groups of observations within the data set. We will look at
summarizing groups of observations next week.
Any function that calculates a single scalar value from a vector can be used
with `summarise()`. For example, the `mean()` function calculates the arithmetic
mean of a numeric vector. Let's calculate the average ESR1 expression for tumour
samples in the METABRIC data set.
```{r}
mean(metabric$ESR1)
```
The equivalent operation using `summarise()` is:
```{r}
summarise(metabric, mean(ESR1))
```
If you prefer Oxford spelling, in which -ize is preferred to -ise, you’re in
luck as dplyr accommodates the alternative spelling.
Both of the above statements gave the same average expression value but these
were output in differing formats. The `mean()` function collapses a vector to
single scalar value, which as we know is in fact a vector of length 1. The
`summarise()` function, as with most tidyverse functions, returns another data
frame, albeit one in which there is a single row and a single column.
Returning a data frame might be quite useful, particularly if we’re summarising
multiple columns or using more than one function, for example computing the
average and standard deviation.
```{r}
summarise(metabric, ESR1_mean = mean(ESR1), ESR1_sd = sd(ESR1))
```
Notice how we also named the output columns in this last example.
```{block type = "rmdblock"}
**`summarise()`**
`summarise()` collapses a data frame into a single row by calculating summary
values of one or more of the columns.
It can take any function that takes a vector of values and returns a single
value. Some of the more useful functions include:
* Centre: **`mean()`**, **`median()`**
* Spread: **`sd()`**, **`mad()`**
* Range: **`min()`**, **`max()`**, **`quantile()`**
* Position: **`first()`**, **`last()`**
* Count: **`n()`**
Note the `first()`, `last()` and `n()` are only really useful when working on
groups of observations using **`group_by()`**.
**`n()`** is a special function that returns the number of observations; it
doesn't take a vector argument, i.e. a column.
```
It is also possible to summarise using a function that takes more than one
value, i.e. from multiple columns. For example, we could compute the correlation
between the expression of FOXA1 and MLPH.
```{r}
summarise(metabric, correlation = cor(FOXA1, MLPH))
```
## Summarizing multiple columns
Much like `mutate()` with its `mutate_at()`, `mutate_if()` and `mutate_all()`
variants, there is a family of `summarise()` functions similarly named for
applying the same summarization function to multiple columns in a single
operation. These work in much the same way as their `mutate` cousins.
**`summarise_at()`** allows us to select the columns on which to operate using
an additional `vars()` argument.
```{r}
summarise_at(metabric, vars(FOXA1, MLPH), mean)
```
Selecting the columns is done in the same way as for `mutate_at()` and
`select()`.
**`summarise_all()`** summarises values in all columns.
```{r}
metabric %>%
select(ESR1:MLPH) %>%
summarise_all(mean)
```
You have to be careful with `summarise_all()` that all columns can be summarised
with the given summary function. For example, what happens if we try to compute
an average of a set of character values?
```{r}
summarise_all(metabric, mean, na.rm = TRUE)
```
We get a lot of warning messages and `NA` values for those columns for which
computing an average does not make sense.
**`summarise_if()`** can be used to select those values for which a
summarization function is appropriate.
```{r}
summarise_if(metabric, is.numeric, mean, na.rm = TRUE)
```
It is possible to summarise using more than one function in which case a list of
functions needs to be provided.
```{r}
summarise_at(metabric, vars(ESR1, ERBB2, PGR), list(mean, sd))
```
Pretty neat but I'm not sure about those column headings in the output --
fortunately we have some control over these.
```{r}
summarise_at(metabric, vars(ESR1, ERBB2, PGR), list(average = mean, stdev = sd))
```
## Anonymous functions
The `mutate_` functions and `summarise_` functions work in a very similar
manner, very much in line with the coherent and consistent framework provided
by `dplyr` and the entire tidyverse. For example, we could use an anonymous
function in a `summarise_at()` operation applied to multiple variables. In the
assignment from last week, we asked you to compute the correlation of the
expression for FOXA1 against all other genes to see which was most strongly
correlated. Here is how we could do this in a single `summarise_at()` statement
using an anonymous function.
```{r}
summarise_at(metabric, vars(ESR1:MLPH, -FOXA1), ~ cor(., FOXA1))
```
Notice how we selected all genes between ESR1, the first gene column in our
data frame, and MLPH, the last gene column, but then excluded FOXA1 as we're not
all that interested in the correlation of FOXA1 with itself (we know the answer
is 1).
---
# Faceting with ggplot2
Finally, let's change track completely and take a look at a very useful feature
of ggplot2 -- **faceting**.
Faceting allows you to split your plot into subplots, or facets, based on one
or more categorical variables. Each of the subplots displays a subset of the
data.
There are two faceting functions, **`facet_wrap()`** and **`facet_grid()`**.
Let's create a scatter plot of GATA3 and ESR1 expression values where we're
displaying the PR positive and PR negative patients using different colours.
This is a very similar to a plot we created last week.
```{r scatter_plot_2}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PR_status)) +
geom_point(size = 0.5, alpha = 0.5)
```
An alternative is to use faceting with **`facet_wrap()`**.
```{r scatter_plot_3}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(PR_status))
```
This produces two plots, side-by-side, one for each of the categories in the
`PR_status` variable, with a banner across the top of each for the category.
The variable(s) used for faceting are specified using `vars()` in a similar way
to the selection of variables for `mutate_at()` and `summarise_at()`.
We can still use separate colours if we prefer things to be, well, colourful.
```{r scatter_plot_4}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PR_status)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(PR_status))
```
Faceting is usually better than displaying groups using different colours when
there are more than two or three groups when it can be difficult to really tell
which points belong to each group. A case in point is for the 3-gene
classification in the GATA3 vs ESR1 scatter plot we created last week. Let's
create a faceted version of that plot.
```{r scatter_plot_5}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(`3-gene_classifier`))
```
This helps explain why the function is called `facet_wrap()`. When it has too
many subplots to fit across the page, it wraps around to another row. We can
control how many rows or columns to use with the `nrow` and `ncol` arguments.
```{r scatter_plot_6}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(`3-gene_classifier`), nrow = 1)
```
```{r scatter_plot_7}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = `3-gene_classifier`)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(`3-gene_classifier`), ncol = 2)
```
We can combine faceting on one variable with a colour aesthetic for another
variable. For example, let's show the tumour stage status using faceting and the HER2
status using colours.
```{r scatter_plot_8}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = HER2_status)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(Neoplasm_histologic_grade))
```
Instead of this we could facet on more than variable.
```{r scatter_plot_9}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_wrap(vars(Neoplasm_histologic_grade, HER2_status))
```
Faceting on two variables is usually better done using the other faceting
function, **`facet_grid()`**. Note the change in how the formula is written.
```{r scatter_plot_10}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_grid(vars(Neoplasm_histologic_grade), vars(HER2_status))
```
Again we can use colour aesthetics alongside faceting to add further information
to our visualization.
```{r scatter_plot_11}
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_grid(vars(Neoplasm_histologic_grade), vars(HER2_status))
```
Finally, we can use a `labeller` to change the labels for each of the
categorical values so that these are more meaningful in the context of this
plot.
```{r scatter_plot_12}
grade_labels <- c("1" = "Grade I", "2" = "Grade II", "3" = "Grade III")
her2_status_labels <- c("Positive" = "HER2 positive", "Negative" = "HER2 negative")
#
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_grid(vars(Neoplasm_histologic_grade),
vars(HER2_status),
labeller = labeller(
Neoplasm_histologic_grade = grade_labels,
HER2_status = her2_status_labels
)
)
```
This would certainly be necessary if we were to use ER and HER2 status on one
side of the grid.
```{r scatter_plot_13}
er_status_labels <- c("Positive" = "ER positive", "Negative" = "ER negative")
#
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = PAM50)) +
geom_point(size = 0.5, alpha = 0.5) +
facet_grid(vars(Neoplasm_histologic_grade),
vars(ER_status, HER2_status),
labeller = labeller(
Neoplasm_histologic_grade = grade_labels,
ER_status = er_status_labels,
HER2_status = her2_status_labels
)
)
```
---
# Summary
In this session we have covered the following concepts:
* Building up workflows by chaining operations together using the pipe operator
* Sorting rows based on values in one or more columns
* Modifying a data frame by either adding new columns or modifying existing ones
* Summarizing the values in one or more columns
* Faceting of ggplot2 visualizations
---
## Exercises
The week's exercises will test your ability to manipulate the METABRIC data set
by changing the values of existing columns or adding new columns by computing
new variables from existing ones.
We are expecting you to use the 5 main dplyr 'verb' functions: `select()`,
`filter()`, `arrange()`, `mutate()` and `summarize()`. Please use the pipe
operator, `%>%`, in cases where more than one operation is required to achieve
the outcome requested.
```{r message = FALSE, warning = FALSE}
library(tidyverse)
metabric <- read_csv("data/metabric_clinical_and_expression_data.csv")
```
---
**1. Investigate the subset of long-surviving breast cancer patients that didn't receive chemo or radiotherapy**
First obtain the subset of patients that received neither chemotherapy or
radiotherapy and survived for more than 20 years.
```{r}
patients_of_interest <- filter(metabric, Chemotherapy == "NO", Radiotherapy == "NO", Survival_time / 12 > 20)
# alternative using separate steps
patients_of_interest <- metabric %>%
filter(Chemotherapy == "NO") %>%
filter(Radiotherapy == "NO") %>%
mutate(Survival_time_years = Survival_time / 12) %>%
filter(Survival_time_years > 20)
#
patients_of_interest
```
Now look at the breakdown of these patients in terms of ER status. Count the
numbers of ER positive and ER negative patients in this subset. Calculate the
proportion that are ER positive.
```{r}
table(patients_of_interest$ER_status)
mean(patients_of_interest$ER_status == "Positive")
```
What does this tell us? Calculate the proportion of ER positive patients in the
whole cohort by way of a comparison.
```{r}
table(metabric$ER_status)
mean(metabric$ER_status == "Positive")
```
**2. Which patients have spent the largest proportion of their lives dealing with breast cancer?**
```{r}
### Hint
### Create Survival_time_years
### Add it to Age_at_diagnosis
### Create Proportion: Survival_time_years/Age
metabric %>%
mutate(Survival_time_years = Survival_time / 12) %>%
mutate(Age = Age_at_diagnosis + Survival_time_years) %>%
mutate(Proportion_of_life_since_diagnosis = Survival_time_years / Age) %>%
select(Patient_ID, Age, Age_at_diagnosis, Survival_time_years, Proportion_of_life_since_diagnosis) %>%
arrange(desc(Proportion_of_life_since_diagnosis)) %>%
mutate_if(is.numeric, round, digits = 2)
```
---
**3. Convert the expression values for each of the genes into standardized z-scores**
Some genes are generally expressed at higher levels than others. This can make
comparisons of changes between groups for a set of genes somewhat difficult,
particularly if the expression for those genes are on very different scales. The
expression values in our METABRIC are on a log~2~ scale which helps to reduce
the range of values but another method for representing expression measurements
is to standardize these to produce z-scores.
Standardization of a set of measurements involves subtracting the mean from each
and dividing by the standard deviation. This will produce values with a mean of
0 and a standard deviation of 1.
Create a modified version of the `metabric` data frame containing a new column
with the standardized expression values (z-scores) for the ESR1 gene.
```{r}
### Hint
### ESR1 - mean(ESR1)) / sd(ESR1)
metabric_z_scores <- mutate(metabric, ESR1_z_score = (ESR1 - mean(ESR1)) / sd(ESR1))
```
Check that you've done this correctly by calculating the mean and standard
deviation of your new z-score variable.
```{r}
summarise(metabric_z_scores, mean = mean(ESR1_z_score), sd = sd(ESR1_z_score))
```
Add another column to your modified `metabric` data frame containing a z-score
for GATA3 and then create a scatter plot of the z-scores of GATA3 against ESR1.
Modify your plot to facet by the PAM50 classification.
```{r}
metabric_z_scores <- mutate(metabric_z_scores, GATA3_z_score = (GATA3 - mean(GATA3)) / sd(GATA3))
ggplot(data = metabric_z_scores) +
geom_point(mapping = aes(x = GATA3_z_score, y = ESR1_z_score)) +
facet_wrap(vars(PAM50))
```
---
**4. Which Star Wars characters need to go on a diet?**
Compute the body mass index (BMI) of characters in the `starwars` tibble.
```{r}
starwars <- starwars %>%
mutate(height = height / 100) %>%
mutate(BMI = mass / height ^ 2)
```
Filter for human characters that are overweight (BMI > 25) and display in
decreasing order of BMI.
```{r}
starwars %>%
filter(species == "Human") %>%
filter(BMI > 25) %>%
arrange(desc(BMI)) %>%
select(name, height, mass, BMI)
```