-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathtidy-data.qmd
570 lines (380 loc) · 24.1 KB
/
tidy-data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
# Tidy data {#tidy-data}
Data "shape" can be important when you are trying to work with and visualize data. In this chapter we'll discuss "tidy data" and how it helps us with both ggplot and other charting tools like [Datawrapper](https://www.datawrapper.de/).
## Learning goals for this chapter
- We'll explore what it means to have "tidy" data.
- We'll use `pivot_longer()`, `pivot_wider()` to shape our data for different purposes.
## The questions we'll answer
- Are candy colors evenly distributed within a standard package of M&M's? (i.e., average per color in a standard package)
- We'll plot the result as a column chart to show the average number of colored candies. We'll do it first in ggplot, then Datawrapper.
- Bonus 1: Who got the most candies in their bag?
- Bonus 2: What is the average number of candy in a bag?
## What is tidy data
"Tidy" data is well formatted so each variable is in a column, each observation is in a row and each value is a cell. Our first step in working with any data is to make sure we are "tidy".
![](images/tidy-example.png)
It's easiest to see the difference through examples. The data frame below is of tuberculosis reports from the World Health Organization.
- Each row is a set of observations (or case) from a single country for a single year.
- Each column describes a unique variable. The year, the number of cases and the population of the country at that time.
![](images/tidy-table-tidy.png)
Table2 below isn't tidy. The **count** column contains two different type of values.
![](images/tidy-table-nottidy.png)
When our data is tidy, it is easy to manipulate. We can use functions like `mutate()` to calculate new values for each case.
![](images/tidy-table-manipulate.png)
When our data is tidy, we can use the [tidyr](https://tidyr.tidyverse.org/) package to reshape the layout of our data to suit our needs. It gets loaded with `library(tidyverse)` so we won't need to load it separately.
### Wide vs long data
In the figure below, the table on the left is "wide". There are are multiple year columns describing the same variable. It might be useful if we want to calculate the difference of the values for two different years. It's less useful if we want plot on a graphic because we don't have a single "year" column to map as an x or y axes.
The table on the right is "long", in that each column describes a single variable. It is this shape we need when we want to plot values on a chart in ggplot. We can then set our "Year" column as an x-axis, our "n" column on our y-axis, and group by the "Country".
![Wide vs long](images/tidy-wide-vs-long.png)
**Neither shape is wrong**, they are just useful for different purposes. In fact, you'll find yourself pivoting the same data in different ways depending on your needs.
### Why we might want different shaped data
There are a myriad of reasons why you might need to reshape your data. Performing calculations on row-level data might be easier if it is wide. Grouping and summarizing calculations might be easier when it is long. ggplot graphics like long data, while Datawrapper sometimes wants wide data to make the same chart.
I find myself shaping data back and forth depending on my needs.
## The candy project
Let's visualize our goal here before we jump into the candy bowl.
Our first question is this: **Are there equal numbers of different-colored candies in a bag of M&M's?**
To visually communicate this, we'll build a chart similar to this one (though this one is about Skittles instead of M&M's.)
![Skittles example](images/skittles-chart-example.png)
We'll build this chart in both ggplot and Datawrapper.
### Prepare our candy project
We will use candy count data we've been collected in Reporting wth Data classes to explore this subject. You'll need to start a new project for this ...
1. Create a new project and call it: `yourname-candy`.
1. Update your `_quarto.yml` based on [Project setup](project-setup.qmd).
2. Create `data-raw` and `data-processed` folders as we normally do.
3. You can use your `index.qmd` file for this lesson ... just change the title to "Tidy data: Candy".
4. Create your setup block and load the `tidyverse` and `janitor`.
```{r setup}
#| echo: false
#| message: false
options(dplyr.summarise.inform = FALSE)
library(tidyverse)
library(janitor)
```
```{r}
#| label: test-data
#| include: false
# Using random test data until we get some from class
# This is hidden from printing (or should be)
# test_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTlrXKADzSsFeARpgcmtIu_5ngT5iiylqrgqua9CelFgzL8XMLXViSaqb8mTInc5K2JyIKF6oGOAAQ2/pub?gid=0&single=true&output=csv")
```
### Get the data
We'll download our data from a Google Sheet and save it onto our computer.
1. Add a Markdown section saying you'll download the data.
2. Add the chunk below and run it.
```{r}
#| label: download-date
download.file("https://docs.google.com/spreadsheets/d/e/2PACX-1vRCGayKLOy-52gKmEoPOj3ZKnOQVtCiooSloiCr-i_ci27e4n1CMPL0Z9s6MeFX9oQuN9E-HCFJnWjD/pub?gid=1456715839&single=true&output=csv", "data-raw/candy.csv", mode = "wb")
```
After you've run this, you can comment out the line of code so it doesn't re-download it each time you run the notebook.
### Import the data
OK, now we pull the data into our notebook. Nothing new here ... we import and clean the names.
1. Add a Markdown section noting that you are importing data.
2. Add this import chunk and run it.
```{r}
#| label: load-data
#| message: false
#| warning: false
candy_raw <- read_csv("data-raw/candy.csv") |> clean_names()
# peek at the data
candy_raw
```
This data comes from a Google Sheets document fed by a form that students have filled out, counting the colors of candies in a standard size bag of plain M&Ms.
**HINT:** Take note of the number of rows in this data. That is how many bags of candies are used in the analysis, which you might want to include in the chart. Also, these are 48g/1.7oz bags.
### Drop unneeded columns
For this exercise we don't need the `timestamp`, `candy_type` or `box_code` columns. We'll drop them so we can keep things simple.
1. Create new section noting you'll drop unneeded columns.
2. Create an R chunk and use `select()` to remove the columns noted above and save the result into a new data frame called `candy`. Remember you can negate a list of columns like this: `!c(col1, col2)`.
You've done this in the past, so you should be able to do it on your own.
```{r drop-cols}
#| label: drop-cols
#| code-fold: true
#| code-summary: "You got this! (But, just in case …)"
candy <- candy_raw |>
select(!c(timestamp, candy_type, box_code))
```
### Peek at the wide table
Let's take look closer at this data:
```{r peek}
candy |> head()
```
This is pretty well-formed data. This format would be useful to create a "total" column for each bag, but there are better ways to do this with **long** data. Same with getting our averages for each color.
### Where are we going with this data
Let's look at our plotting goal again:
![Skittle example](images/skittles-chart-example.png)
In our chart, we want one axis to have the candy color and the other to have the average number of candy.
To plot a chart like this in ggplot or Datawrapper, the data needs to be the same shape, like this:
| Color | Average |
|:-------|--------:|
| Green | 10.9 |
| Orange | 12.0 |
| Purple | 12.4 |
| Red | 11.4 |
| Yellow | 12 |
It will be easier to accomplish both of these tasks if our data were in the **long** format.
So, instead of this:
| first_name | last_name | red | green | orange | yellow | blue | brown |
|:-----------|:----------|----:|------:|-------:|-------:|-----:|------:|
| Christian | McDonald | 2 | 17 | 11 | 4 | 16 | 4 |
We want this:
| first_name | last_name | color | candies |
|:-----------|:----------|:-------|--------:|
| Christian | McDonald | red | 2 |
| Christian | McDonald | green | 17 |
| Christian | McDonald | orange | 11 |
| Christian | McDonald | yellow | 4 |
| Christian | McDonald | blue | 16 |
| Christian | McDonald | brown | 4 |
## The tidyr verbs
The two functions we'll use to reshape are data are:
- [pivot_longer()](https://tidyr.tidyverse.org/reference/pivot_longer.html) which "lengthens" data, increasing the number of rows and decreasing the number of columns.
- [pivot_wider()](https://tidyr.tidyverse.org/reference/pivot_wider.html) which "widens" data, increasing the number of columns and decreasing the number of rows.
Again, the best way to learn this is to present a problem and then solve it with explanation.
## Pivot longer
This visualization gives you an idea how `pivot_longer()` works.
![Pivot longer](images/pivot_longer_chart.png)
Each column of data chosen (the colored ones) is turned into it's own row of data. Supporting data (the grey columns) are duplicated.
The `pivot_longer()` function needs several arguments: **cols=**, **names_to=** and **values_to**. Below are two examples to pivot the example data shown above.
![pivot_longer code](images/pivot_longer_diagram.png)
- **`cols=`** is where you define a range of columns you want to pivot. *For our candy data we want the range `red:brown`*.
- **`names_to=`** allows you to name the new column filled by the column names. *For our candy data we want to name this "color" since that's what those columns described.*
- **`values_to=`** allows you to name the new column filled with the cell data. *For us we want to call this "count_candies" since these are the number of candies in each bag.*
There are a number of ways we can describe the `cols=` argument ... anything in [tidy-select](https://tidyr.tidyverse.org/reference/tidyr_tidy_select.html) works. You can see a bunch of [examples here](https://tidyr.tidyverse.org/reference/pivot_longer.html#ref-examples).
### Pivot our candy data longer
What we want here is six rows for each person's entry, with a column for "color" and a column for "count_candies".
We are using a range, naming the first "red" and the last column "brown" with `:` in between. This only works because those columns are all together. Otherwise we could list them all with `c(red, blue, green)` etc.
1. Add a note that you are pivoting the data
2. Add the chunk below and run it
```{r candy-pivot-long}
candy_long <- candy |>
pivot_longer(
cols = red:brown, # <1>
names_to = "color", # <2>
values_to = "count_candies" # <3>
)
candy_long |> head()
```
1. Sets which columns to pivot based on their names.
2. Sets the new column name for the former column names. This would default to "name" if we didn't set it.
3. Sets the new column name for the former values. This would default to "value" if we didn't set it.
### Get average candies per color
To get the average number of candies per each color, we can use our `candy_long` data and `group_by` color (which will consider all the **red** rows together, etc.) and use `summarize()` to get the mean.
This is very similar to the `sum()`s we did with military surplus, but you use `mean()` instead. One thing you have to watch for with `mean()` is you might need the argument `na.rm = TRUE` if there are missing or zero values, since you can't divide by zero. We'll include that here in case there are bags that don't have any of a specific color.
Save the resulting summary table into a new tibble called `candy_avg`.
```{r avg-bag}
candy_avg <- candy_long |>
group_by(color) |>
summarize(avg_candies = mean(count_candies, na.rm = TRUE)) # <1>
candy_avg
```
1. The na.rm hedges our bets in case some bags are missing a color.
### Round the averages
Let's **modify this summary** to round the averages to tenths so they will plot nicely on our chart.
::: callout-caution
### Rounding can be complicated
We use janitor's [`round_half_up()`](https://sfirke.github.io/janitor/reference/round_half_up.html) function because the base R [`round()`](https://www.rdocumentation.org/packages/base/topics/Round) function rounds half values to the nearest even number, e.g. both 11.5 and 12.5 would be rounded to 12. In journalism, we would usually round up at 5 or higher, so 11.5 to 12 and 12.5 to 13.
:::
We'll use the [`round_half_up()`](https://sfirke.github.io/janitor/reference/round_half_up.html) function, which needs the column to change, and then the number of digits past the decimal to include.
1. **Edit** your summarize function to add the `mutate()` function below.
```{r summary-mutate}
candy_avg <- candy_long |>
group_by(color) |>
summarize(avg_candies = mean(count_candies, na.rm = TRUE)) |>
mutate(
avg_candies = round_half_up(avg_candies, 1)
)
candy_avg
```
**BONUS POINT OPPORTUNITY:** Using a similar method to rounding above, you can also capitalize the names of the colors. You don't *have* to do this, but I'll give you a bonus point if you do:
- In your mutate, add a rule that updates `color` column using `str_to_title()`.
You can read more about [converting the case of a string here](https://stringr.tidyverse.org/reference/case.html). It's part of the [stringr](https://stringr.tidyverse.org/index.html) package, which is loaded with tidyverse.
### On your own: Plot the averages
Now I want you to use ggplot to create a bar chart that shows the average number of candies by color in a bag. This is very similar to the [Disney Princesses bar chart in Intro to ggplot](plots.qmd)).
1. Build a bar chart of average color using ggplot.
Some things to consider:
- I want the bars to be ordered by the highest average on top.
- I want you to have a good title, subtitle and byline, along with good axes names. Make sure a reader has all the information they need to understand what you are communicating with the chart, like the size and number of the packages considered.
- Include the values on the bars.
- Change the theme to something other than the default.
Here is what it should more or less look like, but with good text, etc:
```{r}
#| label: avg-plot
#| echo: false
#| message: false
#| warning: false
candy_avg |> mutate(color = str_to_title(color)) |>
ggplot(aes(x = reorder(color, avg_candies), y = avg_candies)) +
geom_col() +
coord_flip() +
geom_text(aes(label = avg_candies), hjust = 1.5, color = "white") + # plots votes text values on chart
labs(
title = "Give it a good title",
subtitle = str_wrap("Explain where the data came in your subtitle. You can wrap the text in an str_wrap() function so that the text will wrap into a new line like this one does instead off going off the chart."),
caption = "By Your Name",
x = "Do you even need an axis name here?",
y = "Provide a good axis name here",
)
```
**The numbers in the example above may not be up to date**, so don't let that throw you.
## Introducing Datawrapper
There are some other great charting tools that journalists use. My favorite is [Datawrapper](https://www.datawrapper.de/) and it is free for the level you need it.
Datawrapper is so easy I don't even have to teach you how to use it. They have [excellent tutorials](https://academy.datawrapper.de/).
What you do need is the data to plot, but you've already "shaped" it the way you need it. Your `candy_avg` tibble is what you need.
Here are the steps I want you to follow:
### Review how to make a bar chart
1. In a web browser, go to the [Datawrapper Academy](https://academy.datawrapper.de/)
2. Click on **Bar charts**
3. Choose [**How to create a bar chart**](https://academy.datawrapper.de/article/7-bar-chart)
The first thing to note there is they show you what they expect the data to look like. Your `candy_avg` tibble is just like this, but with Color and Candies.
You'll use these directions to create your charts so you might keep this open in its own tab.
### Start a chart
1. In a new browser tab, go to [datawrapper.de](https://www.datawrapper.de/) and click the big **Start creating** button.
2. Use the **Login/Sign Up** button along the top to create an account or log in if you have one.
3. The first screen you have is where you can **Upload data** or paste it into the window. We are going to upload the data, but we have to write out the data to your computer first.
### Export your data for Datawrapper
1. Go back to RStudio.
2. Create a new block and name it something about exporting
3. Take your data and pipe it into `write_csv()`. As an argument to write_csv(), give it a path and name of the file: We can write directly to our project folder as `candy_avg.csv`.
```{r}
#| label: write-candy
candy_avg |> write_csv("data-processed/candy_avg.csv")
```
This will save your data file onto your computer in your project folder.
#### posit.cloud export
If you are using posit.cloud, you will also need to **Export** your data from the cloud to get it onto your computer so you can import it into Datawrapper. Do this **only** if you are using the cloud version.
1. In the Files window (bottom right) go inside your `data-processed` folder.
2. Click on the checkbox next to the `candy_avg.csv` file.
3. Click on the **More** blue gear thing and choose **Export**.
4. You'll get a prompt to name the file (which you can keep the same) and then a **download** button. Click the button.
5. The file should go to your Downloads folder like anything else you download from the Internet.
6. When you go back to **Datawrapper** click on the **XLS/CSV upload** button and go find your file to import.
### Build the datawrapper graphic
1. Return to **Datawrapper** in your browser and click on the **XLS/CSV upload** button and go find your file to import it.
2. Once your data is either uploaded or copied into Datawrapper, you can click **Proceed** to go to the next step.
You can now follow the [Datawrapper Academy](https://academy.datawrapper.de/article/7-bar-chart) directions to finish your chart.
When you get to the Publish & Embed window, I want you to click the **Publish Now** button and then add the resulting *Link to your visualization:* URL to your R Notebook so I can find it for grading.
## Pivot wider
Now that you've pivoted data longer, I'd also like to show how you can pivot your table into a wide format. I'll sometimes do this to make an easier to read display or to shape data a specific way for Datawrapper or some other visualization software. Let's walk through he concept first.
As you can imagine, `pivot_wider()` does the opposite of `pivot_longer()`. When we pivot wider we move our data from a "long" format to a "wide" format. We create a new column based categories and values in the data.
![Long to wide](images/long_to_wide.png)
`pivot_wider()` needs two arguments:
- `names_from =` lets us define from which column (or columns) we are pulling values from to create the new column names. In the example above, this would be "V1".
- `values_from =` lets us say which column will be the values in the new arrangement. In the example above this would be "V2".
So the code to flip the data above would be:
``` r
df |>
pivot_wider(names_from = V1, values_from = V2)
```
### Pivot wider example
Let's do this with our data, taking our "long" data and making it "wide" again.
Here is our "long" data:
```{r}
#| label: candy-long-example
candy_long
```
... and now we flip it back to having a column for each color:
```{r}
#| label: candy-wide-example
candy_long |>
pivot_wider(
names_from = color, # <1>
values_from = count_candies # <2>
)
```
1. Within `pivot_wider()` we need to tell it from which column to pull the new column headers. i.e., where do we get the "names_from"?
2. The second thing it needs to know which variable has the data to fill these new columns. i.e., where do we get the "values_from"?
Let's practice some more.
### Choosing what to flip
Lets create a different look at our data so that we can spin it around in different ways.
When I buy candy for the class, we buy big boxes with many bags in it. There is a "box code" stamped on each package that denotes which box it was packaged in. We'll re-summarize our data so we get the color averages within each box. I do some other cleanup here, too, to remove bags that came from the wild.
1. Create a new section and note you are pivoting wider
2. Add a new chunk with the code below and run it
See annotations below the code to explain what is going on, in case you are interested.
```{r}
#| label: candy-box-color-prep
candy_box_color <- candy_raw |> # <1>
filter(box_code != "130FXWAC0336", box_code != "1524976SE") |> # <2>
select(!c(timestamp, candy_type)) |> # <3>
pivot_longer(cols = red:brown, names_to = "candy_color", values_to = "candy_count") |> # <4>
group_by(box_code, candy_color) |> # <5>
summarise(avg_candies = mean(candy_count) |> round_half_up(1)) # <6>
candy_box_color
```
1. We create a new object to put everything in, then go back to the original data to get the box code.
2. Removing some rows where there were only one or two bags. Not shown here is the research to find their values.
3. Removing the `timestamp` and `candy_type` variables.
4. Pivoting the data longer so we can group and summarize.
5. Group by the box and color before we ...
6. Summarize to get the average number of candies in each bag, within each box. We also round the result.
OK, now you have a new object called `candy_box_color`.
Let's pivot this data to shows a column for each box, which means that is where we draw our "names_from":
1. Add a note that you'll pivot our new data
2. Add the code below and run it
```{r}
#| label: pivot-box
candy_box_color |>
pivot_wider(names_from = box_code, values_from = avg_candies)
```
### Pivot wider on your own
Now I want you do apply the same `pivot_wider()` function to that same `candy_box_color` data, but to have a column for each color and a row for each box.
1. Start a new section and note this is pivot_wider on your own.
2. Start with the `candy_box_color` data, and then ...
3. Use `pivot_wider()` to make the data shaped like this
```{r}
#| label: wide-own-answer
#| include: false
candy_box_color |>
pivot_wider(names_from = candy_color, values_from = avg_candies)
```
| box_code | blue | brown | green | orange | red | yellow |
|:-------------|-----:|------:|------:|-------:|----:|-------:|
| \[box name\] | \# | \# | \# | \# | \# | \# |
## Bonus questions
More opportunities for bonus points on this assignment. These aren't plots, just data wrangling to find answers.
### Most/least candies
Answer me this: Who got the most candies in their bag? Who got the least?
I want a well-structured section (headline, text) with two chunks, one for the most and one for the least.
```{r}
#| label: most-candy-answer
#| echo: false
#| output: false
candy_long |>
group_by(first_name, last_name) |>
summarise(total_candy = sum(count_candies)) |>
arrange(total_candy |> desc()) |>
filter(total_candy > 60)
```
```{r}
#| label: least-candy-answer
#| echo: false
#| output: false
candy_long |>
group_by(first_name, last_name) |>
summarise(total_candy = sum(count_candies)) |>
arrange(total_candy) |>
filter(total_candy < 53)
```
### Average total candies in a bag
Answer me this: What is the average number of candy in a bag?
Again, well-structured section and include the code.
Hint: You need a total number of candies per person before you can get an average.
```{r}
#| label: avg-candy-bag-answer
#| echo: false
#| output: false
total_candies <- candy_long |>
group_by(first_name, last_name) |>
summarise(total_candy = sum(count_candies))
# hard way
total_candies |>
ungroup() |>
summarise(avg_bag = mean(total_candy))
# easier
total_candies |>
summary()
```
## Turn in your work
1. Make sure your notebook Renders.
2. Publish it to Quarto Publish. Include the published link at the top of your notebook.
3. Stuff your project and turn it into the Candy assignment in Canvas.
## What we learned
- We learned what "tidy data" means and why it is important. It is the best shape for data wrangling and plotting.
- We learned about [`pivot_longer()`](https://tidyr.tidyverse.org/reference/pivot_longer.html) and [`pivot_wider()`](https://tidyr.tidyverse.org/reference/pivot_wider.html) and we used them to transpose our data.
- We also used [`round_half_up()`](https://sfirke.github.io/janitor/reference/round_half_up.html) to round off some numbers, and you might have used `str_to_title()` to change the case of the color values.