# (PART\*) Program {-}
# Functions
**Learning objectives:**
We are going to learn about three useful types of functions:
- *Vector functions* take one or more vectors as input and return a vector as output.
- *Data frame functions* take a data frame as input and return a data frame as output.
- *Plot functions* take a data frame as input and return a plot as output.
```{r echo=FALSE, warning = FALSE}
library(tidyverse) |> suppressPackageStartupMessages()
library(nycflights13)
```
## Introduction
Functions are handy because they:
- automate repetitive tasks
- have a name that makes their purpose clear
- let you update the code in one place as things change
- are safer than copy and paste, so you won't replicate errors
The common theme for functions is to be consistent.
## When and how to write a function
Have you copied and pasted code more than twice? Consider a function!
Key steps in creating a function:
1. Pick a **name** that makes it clear what the function does.
2. **Arguments**, or input variable(s), go inside `function`, like so `function(arguments)`.
3. The **code** goes inside curly braces `{ }`, after `function()`.
4. Check your function with a few inputs to make sure it's working.
```{r eval=FALSE}
name <- function(arguments) {
  code
}
```
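As a minimal sketch of these steps, the hypothetical `double_it()` below (not part of the original chapter) pairs a descriptive name with a single argument `x`, puts the code inside the braces, and is then checked with a couple of inputs:

```{r}
# Hypothetical example: the name says what the function does
double_it <- function(x) {
  # The body: the value of the last expression is returned
  x * 2
}

# Check the function with a few inputs
double_it(2)
double_it(c(1, 2, 3))
```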
## Vector functions
```{r}
df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)
df |> mutate(
  a = (a - min(a, na.rm = TRUE)) /
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) /
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) /
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) /
    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)
# Can you spot the error in the code above?
```
## Writing a vector function
```{r,eval=FALSE}
(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
```
To make this a bit clearer we can replace the bit that varies with █:
```{r,eval=FALSE}
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
```
To turn this into a function you need three things:
- **A name**. Here we’ll use `rescale01` because this function rescales a vector to lie between 0 and 1.
- **The arguments**. We have just one argument that we’ll call `x` because this is the conventional name for a numeric vector.
- **The body**. The body is the code that’s repeated across all the calls.
## Using the `rescale01()` function
```{r}
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
```
```{r}
rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))
```
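One possible refinement, sketched here under the hypothetical name `rescale01_range()` so that it does not overwrite `rescale01()`, is to compute the minimum and maximum once with `range()` rather than calling `min()` twice and `max()` once. Setting `finite = TRUE` is an assumption about how infinite values should be treated: they are ignored when computing the range, so they remain infinite in the output.

```{r}
rescale01_range <- function(x) {
  # range() returns c(min, max); finite = TRUE ignores NA, NaN, Inf and -Inf
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01_range(c(-10, 0, 10))
rescale01_range(c(1, 2, 3, NA, 5))
rescale01_range(c(1:10, Inf))
```

Because the formula now lives in one place, a change like this only has to be made once, which is the main payoff of wrapping the repeated code in a function.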
## Using the `rescale01()` function (cont.)
Then you can rewrite the call to `mutate()` as:
```{r}
df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d),
)
```
## Other vector functions
Here, we want to strip percent signs, commas, and dollar signs from a string before converting it into a number:
```{r}
# https://twitter.com/NVlabormarket/status/1571939851922198530
clean_number <- function(x) {
  # Remember which inputs were percentages before stripping the sign
  is_pct <- str_detect(x, "%")
  num <- x |>
    str_remove_all("%") |>
    str_remove_all(",") |>
    str_remove_all(fixed("$")) |>
    as.numeric()
  if_else(is_pct, num / 100, num)
}
clean_number("$12,300")
clean_number("45%")
```
## Data frame functions
When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function.
Data frame functions work like dplyr verbs:
- they take a data frame as the first argument,
- some extra arguments that say what to do with it,
- and return a data frame or vector.
## The problem of indirection
When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection.
```{r}
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by(group_var) |>
    summarize(mean(mean_var))
}
```
```{r,error=TRUE}
diamonds |>
  grouped_mean(cut, carat)
```
## The problem of indirection explained
- To make the problem clearer, we can use a made-up data frame:
```{r}
df <- tibble(
  mean_var = 1,
  group_var = "g",
  group = 1,
  x = 10,
  y = 100
)
df |>
  grouped_mean(group, x)
df |>
  grouped_mean(group, y)
```
- Regardless of how we call `grouped_mean()` it always does `df |> group_by(group_var) |> summarize(mean(mean_var))`, instead of `df |> group_by(group) |> summarize(mean(x))` or `df |> group_by(group) |> summarize(mean(y))`.
- This is a problem of *indirection*, and it arises because dplyr uses **tidy evaluation** to allow you to refer to the names of variables inside your data frame without any special treatment.
## Tidy evaluation and embracing
- Tidy evaluation makes our data analyses very concise as you never have to say which data frame a variable comes from, but the downside comes when we want to wrap up repeated tidyverse code into a function.
- The solution to this problem is called **embracing** 🤗. Embracing a variable means wrapping it in double braces, so that (e.g.) `var` becomes `{{ var }}`.
```{r}
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}
df |>
  grouped_mean(group, x)
```
## When to embrace?
So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:
- **Data-masking:** this is used in functions like `arrange()`, `filter()`, and `summarize()` that *compute* with variables.
- **Tidy-selection:** this is used for functions like `select()`, `relocate()`, and `rename()` that *select* variables.
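
To make the two cases concrete, here is a minimal sketch with one wrapper of each kind. The helper names `filter_rows()` and `pick_cols()` and the `toy` data frame are hypothetical and exist only for illustration; in both cases the argument that refers to columns is embraced.

```{r}
library(dplyr)

# Data-masking: `condition` is computed with, so it is embraced inside filter()
filter_rows <- function(df, condition) {
  df |> filter({{ condition }})
}

# Tidy-selection: `cols` selects columns, so it is embraced inside select()
pick_cols <- function(df, cols) {
  df |> select({{ cols }})
}

toy <- tibble(g = c("a", "b", "b"), x = c(1, 2, 3), y = c(10, 20, 30))
toy |> filter_rows(x > 1)
toy |> pick_cols(c(g, y))
```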
## Common use cases
```{r}
summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}
diamonds |>
  summary6(carat)
diamonds |>
  group_by(cut) |>
  summary6(carat)
```
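Another common pattern is counting a variable and adding proportions. The sketch below assumes a helper named `count_prop()` and a small made-up `toy` data frame; `count()` is a data-masking function, so `var` is embraced.

```{r}
library(dplyr)

# Count the values of `var` and add each value's share of the total
count_prop <- function(df, var, sort = FALSE) {
  df |>
    count({{ var }}, sort = sort) |>
    mutate(prop = n / sum(n))
}

toy <- tibble(g = c("a", "b", "b"))
toy |> count_prop(g)
```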
## Plot functions
```{r eval=FALSE}
diamonds |>
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
diamonds |>
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.05)
```
You can take the code above and create a function, keeping in mind that `aes()` is a data-masking function and you'll need to embrace.
```{r}
histogram <- function(df, var, binwidth = NULL) {
  df |>
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth)
}
diamonds |>
  histogram(carat, 0.1)
```
Note that `histogram()` returns a ggplot2 plot, so you can still add components to it if you want. Just remember to switch from `|>` to `+`:
```{r eval=FALSE}
diamonds |>
  histogram(carat, 0.1) +
  labs(x = "Size (in carats)", y = "Number of diamonds")
```
## Adding more variables to plot functions
Here, we want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:
```{r}
# https://twitter.com/tyler_js_smith/status/1574377116988104704
linearity_check <- function(df, x, y) {
  df |>
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, color = "red", se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE)
}
starwars |>
  filter(mass < 1000) |>
  linearity_check(mass, height)
```
## Combining with other tidyverse
We can combine a dash of data manipulation with ggplot2, as seen below.
You'll notice we have to use a new operator here, `:=`, because we are generating the variable name based on user-supplied data. Variable names go on the left hand side of `=`, but R’s syntax doesn’t allow anything to the left of `=` except for a single literal name.
```{r}
sorted_bars <- function(df, var) {
  df |>
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}
diamonds |>
  sorted_bars(clarity)
```
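The `:=` operator is not specific to plots. As a sketch, the hypothetical `scale_by_max()` below overwrites whichever column the caller names with that column divided by its maximum; the helper name and the `toy` data are assumptions made for illustration.

```{r}
library(dplyr)

scale_by_max <- function(df, var) {
  df |>
    # `{{ var }} :=` names the output column after whatever the caller passed in
    mutate({{ var }} := {{ var }} / max({{ var }}, na.rm = TRUE))
}

toy <- tibble(x = c(1, 2, 4), y = c(10, 20, 40))
toy |> scale_by_max(x)
```

Writing `{{ var }} = ...` instead would be a syntax error, which is why `:=` is needed whenever the column name on the left comes from an argument.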
## Labeling
Here, we label the output with the variable and the bin width used in our previous histogram, using `rlang::englue()` to go under the covers of tidy evaluation. `rlang` is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools). `englue()` works similarly to `str_glue()`, so any value wrapped in `{ }` will be inserted into the string; it also understands `{{ }}`, which inserts the name of the embraced variable.
```{r}
histogram <- function(df, var, binwidth) {
  label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
  df |>
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}
diamonds |>
  histogram(carat, 0.1)
```
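To see what `englue()` produces without drawing a plot, the label can be built on its own. This is a sketch that assumes a recent rlang and the glue package are installed; `histogram_label()` is a hypothetical helper, not part of the chapter.

```{r}
# Hypothetical helper that only builds the label, so the string can be inspected
histogram_label <- function(var, binwidth) {
  # {{ var }} inserts the expression the caller supplied; {binwidth} inserts the value
  rlang::englue("A histogram of {{ var }} with binwidth {binwidth}")
}

histogram_label(carat, 0.1)
```

Note that `carat` is never evaluated here: `englue()` only needs the name the caller typed, so no data frame is required.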
## Style: Making functions readable
Be consistent in how you name and write functions.
**Names:**
- Function names should be verbs (action, state, or occurrence); argument names should be nouns (people, places, or things).
- Be consistent in using snake_case or camelCase.
- For sets of related functions, use a common prefix.
- Don't overwrite existing functions.
**Comments:**
- Use comments to explain the 'why' of the code.
- Use lines of `-` or `=` to break up code into sections.
## Summary
- In this chapter we learned how to write functions for three useful scenarios: **creating a vector**, **creating a data frame**, or **creating a plot**.
- To learn more about programming with tidy evaluation, see the useful recipes in [programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html) and [programming with tidyr](https://tidyr.tidyverse.org/articles/programming.html), and learn more about the theory in [What is data-masking and why do I need `{{`?](https://rlang.r-lib.org/reference/topic-data-mask.html)
- To learn more about reducing duplication in your ggplot2 code, read the [Programming with ggplot2](https://ggplot2-book.org/programming.html) chapter of the ggplot2 book.
- For more advice on function style, see the [tidyverse style guide](https://style.tidyverse.org/functions.html).
## Meeting Videos
### Cohort 5
`r knitr::include_url("https://www.youtube.com/embed/B5097Rbsafc")`
<details>
<summary> Meeting chat log </summary>
```
00:24:22 Jon Harmon (jonthegeek): Famous computer science quote: There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton
00:32:31 Jon Harmon (jonthegeek): > identical(1.0, 1L)
[1] FALSE
00:32:50 Jon Harmon (jonthegeek): > 1.0 == 1L
[1] TRUE
00:33:31 Jon Harmon (jonthegeek): identical(as.integer(1.0), 1L)
00:33:49 Jon Harmon (jonthegeek): identical(1.0, as.double(1L))
00:38:39 Njoki Njuki Lucy: is there a difference between ifelse() and if, else function?
00:39:32 Jon Harmon (jonthegeek): ifelse()
if … else if … else
00:40:15 Jon Harmon (jonthegeek): ifelse(c(TRUE, FALSE, TRUE), "yes", "no")
00:40:28 Jon Harmon (jonthegeek): > ifelse(c(TRUE, FALSE, TRUE), "yes", "no")
[1] "yes" "no" "yes"
00:40:52 Jon Harmon (jonthegeek): > ifelse(1:10 == 8, "it's 8", "it isn't")
[1] "it isn't" "it isn't" "it isn't" "it isn't" "it isn't" "it isn't" "it isn't"
[8] "it's 8" "it isn't" "it isn't"
00:41:12 Jon Harmon (jonthegeek): > ifelse(1:10 == 8, 8, NA)
[1] NA NA NA NA NA NA NA 8 NA NA
00:42:13 Ryan Metcalf: Possible Reference, Section 7.4, Missing Values. It makes a reference to `ifelse()` function: https://r4ds.had.co.nz/exploratory-data-analysis.html?q=ifelse()#missing-values-2
00:43:01 Jon Harmon (jonthegeek): if else
ifelse
if_else
00:43:17 Njoki Njuki Lucy: thank you!
00:43:22 Njoki Njuki Lucy: big time:)
00:50:35 Njoki Njuki Lucy: what exactly is the trim doing? I didn't understand
00:52:05 Jon Harmon (jonthegeek): > mean(c(1, 90:100), trim = 0)
[1] 87.16667
> mean(c(1, 90:100), trim = 0.1)
[1] 94.5
> mean(c(1, 90:100), trim = 0.5)
[1] 94.5
00:52:46 Jon Harmon (jonthegeek): > mean(1:10, trim = 0.5)
[1] 5.5
00:54:10 Njoki Njuki Lucy: okay, understood. thanks!
00:58:15 Jon Harmon (jonthegeek): myfun <- function(x, ...) {
mean(x, ...)
}
00:58:47 Jon Harmon (jonthegeek): > myfun(1:10, trim = 0.1)
[1] 5.5
00:59:13 Jon Harmon (jonthegeek): > myfun(1:10, trim = 0.1)
Error in myfun(1:10, trim = 0.1) : unused argument (trim = 0.1)
01:01:34 Jon Harmon (jonthegeek): myfun <- function(x, funname, ...) {
if (funname == "mean") {
mean(x, ...)
} else {
log(x, ...)
}
}
01:04:11 Jon Harmon (jonthegeek): myfun <- function(...) {
dots <- list(...)
names(dots)
}
myfun(a = 1)
01:05:28 Jon Harmon (jonthegeek): [1] "a"
01:05:49 Jon Harmon (jonthegeek): dots <- list(a = 1)
01:09:32 Jon Harmon (jonthegeek): myfun <- function(a, b) {
a
}
myfun(1:10, Sys.sleep(60))
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/rsRImj294pM")`
<details>
<summary> Meeting chat log </summary>
```
See Chapter 20 for the part of the log that's relevant to that chapter.
```
</details>
### Cohort 6
`r knitr::include_url("https://www.youtube.com/embed/jDsmsNUHfPE")`
<details>
<summary> Meeting chat log </summary>
```
00:23:48 Daniel Adereti: Range() function in R returns the maximum and minimum value of the vector and column of the dataframe in R. range() function of the column of dataframe
00:45:42 Daniel Adereti: My guess for the inf, -inf is just to assign the respective 1 and 0 to the inf and -inf
00:46:26 Daniel Adereti: and to rescale the x vector expressing all variables with inf as 1 and -inf as 0
00:58:38 Adeyemi Olusola: Thanks for the wonderful talk. Sorry, I have to drop off now. Thanks
00:58:41 Daniel Adereti: it might make sense to stop at 19.2
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/whu8LeXt0VE")`
<details>
<summary> Meeting chat log </summary>
```
00:03:15 Marielena Soilemezidi: Hello there! :)
00:03:51 Daniel: Hello!
00:32:16 Daniel: I think it aim to check if any of the vector characters == 8, if yes, it returns 8, if no, it returns "Not available"
00:44:24 Daniel: Hello all, please remember we need volunteers for next week's class: Vectors
```
</details>
### Cohort 7
`r knitr::include_url("https://www.youtube.com/embed/c5FtLt0bGRs")`
<details>
<summary> Meeting chat log </summary>
```
00:15:05 Oluwafemi Oyedele: start
00:58:23 Oluwafemi Oyedele: https://dplyr.tidyverse.org/articles/programming.html
00:58:37 Oluwafemi Oyedele: https://tidyr.tidyverse.org/articles/programming.html
00:58:47 Oluwafemi Oyedele: https://rlang.r-lib.org/reference/topic-data-mask.html
00:58:54 Oluwafemi Oyedele: https://ggplot2-book.org/programming.html
00:59:01 Oluwafemi Oyedele: https://style.tidyverse.org/functions.html
00:59:43 Oluwafemi Oyedele: stop
```
</details>
### Cohort 8
`r knitr::include_url("https://www.youtube.com/embed/10mBNkkLdo0")`
<details>
<summary> Meeting chat log </summary>
```
00:10:01 Abdou: Hi everyone
00:10:12 Shamsuddeen Hassan Muhammad: hello
00:10:31 Shamsuddeen Hassan Muhammad: Ahmad can hear me
00:10:37 Shamsuddeen Hassan Muhammad: Abdul we can hear you
00:10:46 Abdou: No
00:10:54 Shamsuddeen Hassan Muhammad: Can u hear me Abduol?
00:11:07 Abdou: No I can’t
00:11:12 Shamsuddeen Hassan Muhammad: Re-join
00:14:21 Ahmed Mamdouh: Start
00:47:09 Abdou: Stop
```
</details>