[Factors] Update dataset to CES for the lab #163

Merged
merged 5 commits into from
Oct 8, 2024
42 changes: 22 additions & 20 deletions modules/Factors/lab/Factors_Lab.Rmd
@@ -13,41 +13,44 @@ library(tidyverse)

### 1.0

Load the Youth Tobacco Survey data and `select` "Sample_Size", "Education", and "LocationAbbr". Name this data "yts".
Load the CalEnviroScreen dataset and use `select` to choose the `CaliforniaCounty`, `ImpWaterBodies`, and `ZIP` variables. Then subset this data using `filter` to include only the California counties Amador, Napa, Ventura, and San Francisco. Name this data "ces".

`ImpWaterBodies`: measure of the number of pollutants across all impaired water bodies within a given distance of populated areas.

```{r}
yts <-
read_csv("https://daseh.org/data/Youth_Tobacco_Survey_YTS_Data.csv") %>%
select(Sample_Size, Education, LocationAbbr)
ces <-
  read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") %>%
  select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
  # %in% keeps every row whose county is one of the four names
  filter(CaliforniaCounty %in% c("Amador", "Napa", "Ventura", "San Francisco"))
```
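
When a filter needs to match several county names, `%in%` is the membership test to reach for. Comparing a column to a vector with `==` recycles the vector element by element and can silently drop rows. A minimal sketch on made-up data (the `toy` table and `keep` vector are hypothetical and not part of the lab):

```r
library(dplyr)

# Hypothetical toy data, not the real CalEnviroScreen file
toy <- tibble(
  CaliforniaCounty = c("Napa", "Amador", "Napa", "Ventura", "Kern", "Amador"),
  ImpWaterBodies = c(3, 0, 5, 2, 1, 4)
)
keep <- c("Amador", "Napa", "Ventura")

# `==` recycles `keep` along the column, so each row is compared
# with only one of the three names
by_equals <- toy %>% filter(CaliforniaCounty == keep)

# `%in%` asks, row by row, whether the county appears anywhere in `keep`
by_in <- toy %>% filter(CaliforniaCounty %in% keep)

nrow(by_equals)
nrow(by_in)
```

In this toy case the `==` version happens to match no rows at all, while `%in%` keeps every row except the one for Kern.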

### 1.1

Create a boxplot showing the difference in "Sample_Size" between Middle School and High School "Education". **Hint**: Use `aes(x = Education, y = Sample_Size)` and `geom_boxplot()`.
Create a boxplot showing the difference in the impaired water bodies measure (`ImpWaterBodies`) among Amador, Napa, San Francisco, and Ventura counties (`CaliforniaCounty`). **Hint**: Use `aes(x = CaliforniaCounty, y = ImpWaterBodies)` and `geom_boxplot()`.

```{r 1.1response}

```

### 1.2

Use `count` to count up the number of observations of data for each "Education" group.
Use `count` to count up the number of observations of data for each `CaliforniaCounty` group.

```{r 1.2response}

```

### 1.3

Make "Education" a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder "Education". Reorder this variable so that "Middle School" comes before "High School". Assign the output the name "yts_fct".
Make `CaliforniaCounty` a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder `CaliforniaCounty`. Reorder this variable so the order is now San Francisco, Ventura, Napa, and Amador. Assign the output the name "ces_fct".

```{r 1.3response}

```

### 1.4

Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different ordering in the plot and `count` table.
Repeat questions 1.1 and 1.2 using the "ces_fct" data. You should see different ordering in the plot and `count` table.

```{r 1.4response}

@@ -57,39 +60,38 @@ Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different o
# Practice on Your Own!

### P.1

Convert "LocationAbbr" (state) in "yts_fct" into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument.
Subset `ces_fct` so that it only includes data from Ventura county. Then convert `ZIP` (zip code) into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument. Assign the output the name "ces_Ventura".

```{r P.1response}

```

### P.2

We want to create a new column that contains the group-level median sample size.
We want to create a new column that contains the group-level median values for `ImpWaterBodies`.

- Using the "yts_fct" data, `group_by` "LocationAbbr".
- Then, use `mutate` to create a new column "med_sample_size" that is the median "Sample_Size".
- **Hint**: Since you have already done `group_by`, a median "Sample_Size" will automatically be created for each unique level in "LocationAbbr". Use the `median` function with `na.rm = TRUE`.
- Using the "ces_Ventura" data, group the data by `ZIP` using `group_by`.
- Then, use `mutate` to create a new column `med_ImpWaterBodies` that is the median of `ImpWaterBodies`.
- **Hint**: Since you have already done `group_by`, a median `ImpWaterBodies` will automatically be created for each unique level in `ZIP`. Use the `median` function with `na.rm = TRUE`.

```{r P.2response}

```

### P.3

We want to plot the "LocationAbbr" (state) by the "med_sample_size" column we created above. Using the `forcats` package, create a plot that:
We want to make a plot of the `med_ImpWaterBodies` column we created above in the `ces_Ventura` data, separated by `ZIP`. Using the `forcats` package, create a plot that:

- Has "LocationAbbr" on the x-axis
- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by "med_sample_size"
- Has "Sample_Size" on the y-axis
- Has `ZIP` on the x-axis
- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by `med_ImpWaterBodies`
- Has `med_ImpWaterBodies` on the y-axis
- Is a boxplot (`geom_boxplot`)
- Has the x axis label of `State`
- Has the x axis label of "Zipcode"
(Don't worry if you get a warning about not being able to plot `NA` values.)

Save your plot using `ggsave()` with a width of 10 and height of 3.

Which state has the largest median sample size?
Which ZIP code has the largest median measure of water pollution?

```{r P.3response}

86 changes: 45 additions & 41 deletions modules/Factors/lab/Factors_Lab_Key.Rmd
@@ -13,114 +13,118 @@ library(tidyverse)

### 1.0

Load the Youth Tobacco Survey data and `select` "Sample_Size", "Education", and "LocationAbbr". Name this data "yts".
Load the CalEnviroScreen dataset and use `select` to choose the `CaliforniaCounty`, `ImpWaterBodies`, and `ZIP` variables. Then subset this data using `filter` to include only the California counties Amador, Napa, Ventura, and San Francisco. Name this data "ces".

`ImpWaterBodies`: measure of the number of pollutants across all impaired water bodies within a given distance of populated areas.

```{r}
yts <-
read_csv("https://daseh.org/data/Youth_Tobacco_Survey_YTS_Data.csv") %>%
select(Sample_Size, Education, LocationAbbr)
ces <-
  read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") %>%
  select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
  # %in% keeps every row whose county is one of the four names
  filter(CaliforniaCounty %in% c("Amador", "Napa", "Ventura", "San Francisco"))
```

### 1.1

Create a boxplot showing the difference in "Sample_Size" between Middle School and High School "Education". **Hint**: Use `aes(x = Education, y = Sample_Size)` and `geom_boxplot()`.
Create a boxplot showing the difference in the impaired water bodies measure (`ImpWaterBodies`) among Amador, Napa, San Francisco, and Ventura counties (`CaliforniaCounty`). **Hint**: Use `aes(x = CaliforniaCounty, y = ImpWaterBodies)` and `geom_boxplot()`.

```{r 1.1response}
yts %>%
ggplot(mapping = aes(x = Education, y = Sample_Size)) +
ces %>%
ggplot(mapping = aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
geom_boxplot()
```

### 1.2

Use `count` to count up the number of observations of data for each "Education" group.
Use `count` to count up the number of observations of data for each `CaliforniaCounty` group.

```{r 1.2response}
yts %>%
count(Education)
ces %>%
count(CaliforniaCounty)
```

### 1.3

Make "Education" a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder "Education". Reorder this variable so that "Middle School" comes before "High School". Assign the output the name "yts_fct".
Make `CaliforniaCounty` a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder `CaliforniaCounty`. Reorder this variable so the order is now San Francisco, Ventura, Napa, and Amador. Assign the output the name "ces_fct".

```{r 1.3response}
yts_fct <-
yts %>% mutate(Education = factor(Education,
levels = c("Middle School", "High School")
ces_fct <-
ces %>% mutate(CaliforniaCounty = factor(CaliforniaCounty,
levels = c("San Francisco", "Ventura", "Napa", "Amador")
))
```
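
To see why setting `levels` changes later output, here is a small sketch on made-up data (the `toy` table and its `county` column are hypothetical). A character column is counted in alphabetical order, while a factor is counted in the order of its levels:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(county = c("Napa", "Amador", "Ventura", "Napa"))

# Character column: count() lists the groups alphabetically
toy %>% count(county)

# Factor column with explicit levels: count() follows the level order
toy_fct <- toy %>%
  mutate(county = factor(county, levels = c("Ventura", "Napa", "Amador")))

toy_fct %>% count(county)
levels(toy_fct$county)
```

The same level order is what `ggplot2` uses to arrange a discrete x-axis, which is why the boxplot order changes as well.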

### 1.4

Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different ordering in the plot and `count` table.
Repeat questions 1.1 and 1.2 using the "ces_fct" data. You should see different ordering in the plot and `count` table.

```{r 1.4response}
yts_fct %>%
ggplot(mapping = aes(x = Education, y = Sample_Size)) +
ces_fct %>%
ggplot(mapping = aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
geom_boxplot()

yts_fct %>%
count(Education)
ces_fct %>%
count(CaliforniaCounty)
```


# Practice on Your Own!

### P.1

Convert "LocationAbbr" (state) in "yts_fct" into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument.
Subset `ces_fct` so that it only includes data from Ventura county. Then convert `ZIP` (zip code) into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument. Assign the output the name "ces_Ventura".

```{r P.1response}
yts_fct <- yts_fct %>% mutate(LocationAbbr = factor(LocationAbbr))
ces_Ventura <- ces_fct %>%
filter(CaliforniaCounty == "Ventura") %>%
mutate(ZIP = factor(ZIP))
```
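
When `factor` is called without `levels =`, R takes the sorted unique values as the levels. A brief sketch with a few made-up ZIP codes (the `zips` vector is hypothetical, and it assumes `ZIP` is read in as a number):

```r
# Hypothetical numeric ZIP codes
zips <- c(93001, 91320, 93010, 91320)

zip_fct <- factor(zips)

# Levels are the sorted unique values, stored as text
levels(zip_fct)

# The underlying integer codes are positions in levels(), not ZIP codes
as.integer(zip_fct)
```

Because the levels are labels rather than numbers, plots and summaries treat each ZIP code as its own category.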

### P.2

We want to create a new column that contains the group-level median sample size.
We want to create a new column that contains the group-level median values for `ImpWaterBodies`.

- Using the "yts_fct" data, `group_by` "LocationAbbr".
- Then, use `mutate` to create a new column "med_sample_size" that is the median "Sample_Size".
- **Hint**: Since you have already done `group_by`, a median "Sample_Size" will automatically be created for each unique level in "LocationAbbr". Use the `median` function with `na.rm = TRUE`.
- Using the "ces_Ventura" data, group the data by `ZIP` using `group_by`.
- Then, use `mutate` to create a new column `med_ImpWaterBodies` that is the median of `ImpWaterBodies`.
- **Hint**: Since you have already done `group_by`, a median `ImpWaterBodies` will automatically be created for each unique level in `ZIP`. Use the `median` function with `na.rm = TRUE`.

```{r P.2response}
yts_fct <- yts_fct %>%
group_by(LocationAbbr) %>%
mutate(med_sample_size = median(Sample_Size, na.rm = TRUE))
ces_Ventura <- ces_Ventura %>%
group_by(ZIP) %>%
mutate(med_ImpWaterBodies = median(ImpWaterBodies, na.rm = TRUE))
```
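
The difference between `mutate` and `summarize` after `group_by` can be sketched on made-up data (the `toy` table and its values are hypothetical). `mutate` keeps every row and repeats the group median, whereas `summarize` returns one row per group:

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(
  ZIP = factor(c("93001", "93001", "93010", "93010", "93010")),
  ImpWaterBodies = c(2, 4, 1, NA, 7)
)

# mutate(): same number of rows, median repeated within each ZIP
with_median <- toy %>%
  group_by(ZIP) %>%
  mutate(med_ImpWaterBodies = median(ImpWaterBodies, na.rm = TRUE)) %>%
  ungroup()

# summarize(): collapses to one row per ZIP
per_zip <- toy %>%
  group_by(ZIP) %>%
  summarize(med_ImpWaterBodies = median(ImpWaterBodies, na.rm = TRUE))

with_median
per_zip
```

Without `na.rm = TRUE`, the median for the second ZIP code in this toy table would be `NA`.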

### P.3

We want to plot the "LocationAbbr" (state) by the "med_sample_size" column we created above. Using the `forcats` package, create a plot that:
We want to make a plot of the `med_ImpWaterBodies` column we created above in the `ces_Ventura` data, separated by `ZIP`. Using the `forcats` package, create a plot that:

- Has "LocationAbbr" on the x-axis
- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by "med_sample_size"
- Has "Sample_Size" on the y-axis
- Has `ZIP` on the x-axis
- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by `med_ImpWaterBodies`
- Has `med_ImpWaterBodies` on the y-axis
- Is a boxplot (`geom_boxplot`)
- Has the x axis label of `State`
- Has the x axis label of "Zipcode"
(Don't worry if you get a warning about not being able to plot `NA` values.)

Save your plot using `ggsave()` with a width of 10 and height of 3.

Which state has the largest median sample size?
Which ZIP code has the largest median measure of water pollution?

```{r P.3response}
library(forcats)

yts_fct_plot <- yts_fct %>%
ces_Ventura_plot <- ces_Ventura %>%
drop_na() %>%
ggplot(mapping = aes(
x = fct_reorder(
LocationAbbr, med_sample_size
ZIP, med_ImpWaterBodies
),
y = Sample_Size
y = med_ImpWaterBodies
)) +
geom_boxplot() +
labs(x = "State")
labs(x = "Zipcode")

ggsave(
filename = "yts_fct.png", # will save in working directory
plot = yts_fct_plot,
filename = "ces_Ventura.png", # will save in working directory
plot = ces_Ventura_plot,
width = 10, height = 3
)
```
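
What `fct_reorder` does to the levels can be checked without a plot. A minimal sketch with made-up values (the `zip` and `med` vectors are hypothetical):

```r
library(forcats)

# Hypothetical factor and a numeric value for each element
zip <- factor(c("93001", "93010", "91320"))
med <- c(5, 1, 3)

# Default levels: sorted order
levels(zip)

# Levels reordered by increasing `med`
levels(fct_reorder(zip, med))

# Levels reordered by decreasing `med`
levels(fct_reorder(zip, med, .desc = TRUE))
```

With increasing order, the last level on the x-axis is the one with the largest value, which makes the final question easy to answer from the plot.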