diff --git a/materials/sections/clean-wrangle-data.qmd b/materials/sections/clean-wrangle-data.qmd index 1eccbcdf..25c16d2d 100644 --- a/materials/sections/clean-wrangle-data.qmd +++ b/materials/sections/clean-wrangle-data.qmd @@ -21,12 +21,12 @@ Suppose you have the following `data.frame` called `length_data` with data about | year| length\_cm| |-----:|-----------:| -| 1990| 5.673318| -| 1991| 3.081224| -| 1991| 4.592696| -| 1992| 4.381523| -| 1992| 5.597777| -| 1992| 4.900052| +| 1990| 5.6| +| 1991| 3.0| +| 1991| 4.5| +| 1992| 4.3| +| 1992| 5.5| +| 1992| 4.9| Before thinking about the code, let's think about the steps we need to take to get to the answer (aka pseudocode). @@ -50,10 +50,10 @@ length_data %>% | site | 1990 | 1991 | ... | 1993 | |--------|------|------|-----|------| -| gold | 100 | 118 | ... | 112 | -| lake | 100 | 118 | ... | 112 | +| gold | 101 | 109 | ... | 112 | +| lake | 104 | 98 | ... | 102 | | ... | ... | ... | ... | ... | -| dredge | 100 | 118 | ... | 112 | +| dredge | 144 | 118 | ... | 145 | You are probably familiar with data in the above format, where values of the variable being observed are spread out across columns. In this example we have a different column per year. @@ -178,14 +178,15 @@ The code chunk you use to read in the data should look something like this: catch_original <- read_csv("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1") ``` -**Note for Windows users:** Keep in mind, if you want to replicate this workflow in your local computer you also need to use the `url()` function here with the argument `method = "libcurl"`. + + -It would look like this: + -```{r} -#| eval: false -catch_original <- read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1", method = "libcurl")) -``` + + + + ::: @@ -271,7 +272,7 @@ If you think of the assignment operator (`<-`) as reading like "gets", then the So you might think of the above chunk being translated as: -> The cleaned data frame gets the original data, and then a filter (of the original data), and then a select (of the filtered data). +> The cleaned data frame **gets** the original data, and **then** a filter (of the original data), and **then** a select (of the filtered data). The benefits to using pipes are that you don't have to keep track of (or overwrite) intermediate data frames. The drawbacks are that it can be more difficult to explain the reasoning behind each step, especially when many operations are chained together. It is good to strike a balance between writing efficient code (chaining operations), while ensuring that you are still clearly explaining, both to your future self and others, what you are doing and why you are doing it. @@ -565,6 +566,28 @@ sse_catch <- catch_long %>% head(sse_catch) ``` +::: {.callout-important} + +## `==` and `%in%` operators + +The `filter()` function performs a logical test across all the rows of a dataframe, and if that test is `TRUE` for a given row, it keeps that row. The `==` operator tests whether the left hand side and right hand side match - in the example above, does the value of the `Region` variable match the value `"SSE"`? + +But if you want to test whether a variable's value is within a set of possible values, *do not* use the `==` operator - it will very likely give false results! 
Instead, use the `%in%` operator:
+```{r}
+catch_long %>%
+  filter(Region == c("SSE", "ALU")) %>%
+  nrow()
+
+catch_long %>%
+  filter(Region %in% c("SSE", "ALU")) %>%
+  nrow()
+```
+
+This is because the `==` version "recycles" the vector of allowed values, so it tests whether the first row matches `"SSE"` (yep!), whether the second matches `"ALU"` (nope! this row gets dropped!), and then whether the third is `"SSE"` again and so on.
+
+Note that the `%in%` operator actually works for single values too, so you can never go wrong with that!
+:::
+
 ::: {.callout-note icon=false}
 
 ## Exercise
 
@@ -581,13 +604,13 @@ catch_million <- catch_long %>%
   filter(catch > 1000000)
 
 ## Chinook from SSE data
-chinook_see <- catch_long %>%
+chinook_sse <- catch_long %>%
   filter(Region == "SSE",
          species == "Chinook")
 
-## OR
-chinook_see <- catch_long %>%
-  filter(Region == "SSE" & species == "Chinook")
+## OR combine tests with & ("and") or | ("or")... also, we can swap == for %in%
+chinook_sse <- catch_long %>%
+  filter(Region %in% "SSE" & species %in% "Chinook")
 ```
 
 :::
 
@@ -700,7 +723,7 @@ mean_region <- catch_original %>%
   pivot_longer(-c(Region, Year), 
               names_to = "species", 
              values_to = "catch") %>%
-  mutate(catch = catch*1000) %>%
+  mutate(catch = catch * 1000) %>%
   group_by(Region) %>%
   summarize(mean_catch = mean(catch)) %>%
   arrange(desc(mean_catch))
@@ -708,6 +731,15 @@ mean_region <- catch_original %>%
 head(mean_region)
 ```
 
+## Write out the results with `readr::write_csv()`
+
+Now that we have performed all this data wrangling, we can save out the results for future use using `readr::write_csv()`.
+
+```{r}
+#| eval: false
+write_csv(mean_region, here::here("data/mean_catch_by_region.csv"))
+```
+
 We have completed our lesson on Cleaning and Wrangling data. Before we break, let's practice our Git workflow.
 
 
diff --git a/materials/sections/data-management-essentials.qmd b/materials/sections/data-management-essentials.qmd
index a4b40766..9f4b5ff7 100644
--- a/materials/sections/data-management-essentials.qmd
+++ b/materials/sections/data-management-essentials.qmd
@@ -139,11 +139,11 @@ The article *Ten Simple Rules for Creating a Good Data Management Plan* (@michen
 
 #### Define how the data will be organized
 
-- Once you know the data you will be using (rule #2) it is time to define how are you going to work with your data. Where will the raw data live? How are the different collaborators going to access the data? The needs vary widely from one project to another depending on the data. When drafting your DMP is helpful to focus on identifying what products and software you will be using. When collaborating with a team it is important to identify f there are any limitations to accessing any software or tool.
+- Once you know the data you will be using (rule #2), it is time to define how you are going to work with your data. Where will the raw data live? How are the different collaborators going to access the data? The needs vary widely from one project to another depending on the data. When drafting your DMP, it is helpful to focus on identifying what products and software you will be using. When collaborating with a team, it is important to identify whether there are any limitations to accessing any software or tool.
 
 - Resource
 
-    - [Here is an example](https://nceas.github.io/scicomp.github.io/tutorial_server.html) from the LTER Scientific Computing Support Team on working on NCEAS Server. 
+ - [Here is an example](https://lter.github.io/workshop-github/server.html) from the LTER Scientific Computing Support Team on working on NCEAS Server. #### Explain how the data will be documented @@ -283,7 +283,7 @@ So, **how does a computer organize all this information?** There are a number of - [Ecological Metadata Language (EML)](https://eml.ecoinformatics.org/) - [Geospatial Metadata Standards (ISO 19115 and ISO 19139)](https://www.fgdc.gov/metadata/iso-standards) - See [NOAA's ISO Workbook](http://www.ncei.noaa.gov/sites/default/files/2020-04/ISO%2019115-2%20Workbook_Part%20II%20Extentions%20for%20imagery%20and%20Gridded%20Data.pdf) -- [Biological Data Profile (BDP)](chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/https://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/biometadata/biodatap.pdf) +- [Biological Data Profile (BDP)](https://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/biometadata/biodatap.pdf) - [Dublin Core](https://www.dublincore.org/) - [Darwin Core](https://dwc.tdwg.org/) - [PREservation Metadata: Implementation Strategies (PREMIS)](https://www.loc.gov/standards/premis/) diff --git a/materials/sections/geospatial-vector-analysis.qmd b/materials/sections/geospatial-vector-analysis.qmd index a473a4a4..93b62e37 100644 --- a/materials/sections/geospatial-vector-analysis.qmd +++ b/materials/sections/geospatial-vector-analysis.qmd @@ -64,6 +64,7 @@ file.remove('shapefile_demo_data.zip') #| warning: false #| message: false library(readr) +library(here) library(sf) library(ggplot2) library(leaflet) @@ -75,13 +76,12 @@ library(dplyr) ## Exploring the data using `plot()` and `st_crs()` -First let's read in the shapefile of regional boundaries in Alaska using `read_sf()` and then create a basic plot of the data `plot()`. +First let's read in the shapefile of regional boundaries in Alaska using `read_sf()` and then create a basic plot of the data `plot()`. Here we're adding a `_sf` suffix to our object name, to remind us that this is a Simple Features object with spatial information. ```{r} #| eval: false - # read in shapefile using read_sf() -ak_regions <- read_sf("data/ak_regions_simp.shp") +ak_rgns_sf <- read_sf(here("data/ak_regions_simp.shp")) ``` ```{r read_shp_sf} @@ -89,22 +89,20 @@ ak_regions <- read_sf("data/ak_regions_simp.shp") #| message: false #| warning: false -library(here) - -# data is saved in a different folder than participants -ak_regions <- read_sf(here("data/shapefiles/ak_regions_simp.shp")) +# for quarto rendering, data is saved in a different folder than participants +ak_rgns_sf <- read_sf(here("data/shapefiles/ak_regions_simp.shp")) ``` ```{r} # quick plot -plot(ak_regions) +plot(ak_rgns_sf) ``` We can also examine its class using `class()`. ```{r} -class(ak_regions) +class(ak_rgns_sf) ``` `sf` objects usually have two types of classes: `sf` and `data.frame`. @@ -115,9 +113,9 @@ But, unlike a typical `data.frame`, an `sf` object has spatial metadata (`geomet ```{r} #| message: false -head(ak_regions) +head(ak_rgns_sf) -glimpse(ak_regions) +glimpse(ak_rgns_sf) ``` ### Coordinate Reference System (CRS) @@ -141,7 +139,7 @@ ESRI has a [blog post](https://www.esri.com/arcgis-blog/products/arcgis-pro/mapp You can view what `crs` is set by using the function `st_crs()`. ```{r} -st_crs(ak_regions) +st_crs(ak_rgns_sf) ``` This looks pretty confusing. Without getting into the details, that long string says that this data has a geographic coordinate system (WGS84) with no projection. 
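+
+If you just want the key facts out of that long string, the object returned by `st_crs()` can be indexed with `$`. A minimal sketch (these are standard `sf` accessors, though `$epsg` may return `NA` when no EPSG code can be identified from the metadata):
+
+```{r}
+#| eval: false
+# pull out just the EPSG code and the short proj4 summary, rather than the full WKT
+st_crs(ak_rgns_sf)$epsg
+st_crs(ak_rgns_sf)$proj4string
+```
+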
A convenient way to reference `crs` quickly is by using the EPSG code, a number that represents a standard projection and datum. You can check out a list of (lots of!) EPSG codes [here](http://spatialreference.org/ref/epsg/?page=1).
 
@@ -155,14 +153,14 @@ We will use multiple EPSG codes in this lesson. Here they are, along with their
 You will often need to transform your geospatial data from one coordinate system to another. The `st_transform()` function does this quickly for us. You may have noticed the maps above looked wonky because of the dateline. We might want to set a different projection for this data so it plots nicer. A good one for Alaska is called the Alaska Albers projection, with an EPSG code of [3338](http://spatialreference.org/ref/epsg/3338/).
 
 ```{r}
-ak_regions_3338 <- ak_regions %>%
+ak_rgns_3338_sf <- ak_rgns_sf %>%
     st_transform(crs = 3338)
 
-st_crs(ak_regions_3338)
+st_crs(ak_rgns_3338_sf)
 ```
 
 ```{r}
-plot(ak_regions_3338)
+plot(ak_rgns_3338_sf)
 ```
 
 Much better!
 
@@ -177,24 +175,26 @@ Since `sf` objects are data.frames, they play nicely with packages in the `tidyv
 
 ```{r select}
 # returns the names of all the columns in dataset
-colnames(ak_regions_3338)
+colnames(ak_rgns_3338_sf)
 ```
 
 ```{r}
-ak_regions_3338 %>%
+ak_rgns_3338_sf %>%
     select(region)
 ```
 
-Note the sticky geometry column! The geometry column will stay with your `sf` object even if it is not called explicitly.
+Note the sticky `geometry` column stays with the `region` column! The geometry column will stay with your `sf` object even if it is not called explicitly.
 
 ### `filter()`
 
+Recall that `==` is problematic if you're testing whether a variable might match multiple values - use `%in%` for that situation!
+
 ```{r}
-unique(ak_regions_3338$region)
+unique(ak_rgns_3338_sf$region)
 ```
 
 ```{r filter}
-ak_regions_3338 %>%
+ak_rgns_3338_sf %>%
     filter(region == "Southeast")
 ```
 
@@ -213,20 +213,22 @@ We have some population data, but it gives the population by city, not by region
 4. Save the spatial object you created using `write_sf()`
 :::
 
-**1. Read in `alaska_population.csv` using `read.csv()`**
+**1. Read in `alaska_population.csv` using `read_csv()`**
+
+Here we'll add a `_df` suffix to remind us that this is just a regular data frame, not a spatial data frame. It does contain spatial variables (longitude and latitude), but as far as it knows, those are just numbers, not recognized as spatial geometry... yet!
 
 ```{r}
 #| eval: false
 
 # read in population data
-pop <- read_csv("data/alaska_population.csv")
+pop_df <- read_csv(here("data/alaska_population.csv"))
 ```
 
 ```{r}
 #| echo: false
 #| message: false
 
-# data is saved in a different folder than participants
-pop <- read_csv(here("data/shapefiles/alaska_population.csv"))
+# for Quarto rendering, data is saved in a different folder than participants
+pop_df <- read_csv(here("data/shapefiles/alaska_population.csv"))
 ```
 
-**Turn `pop` into a spatial object**
+**Turn `pop_df` into a spatial object**
 
@@ -235,13 +237,15 @@ The `st_join()` function is a spatial left join. The arguments for both the left
 
 We can do this easily using the `st_as_sf()` function, which takes as arguments the coordinates and the `crs`. The `remove = F` specification here ensures that when we create our `geometry` column, we retain our original `lat` `lng` columns, which we will need later for plotting. Although it isn't said anywhere explicitly in the file, let's assume that the coordinate system used to reference the latitude longitude coordinates is WGS84, which has a `crs` number of 4326. 
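+
+Since we are assuming the CRS rather than reading it from a metadata record, a quick sanity check is reassuring: WGS84 longitudes should fall between -180 and 180, and latitudes between -90 and 90. A minimal sketch (the `lng` and `lat` column names come from the file itself):
+
+```{r}
+#| eval: false
+# rough plausibility check for unprojected longitude/latitude values
+range(pop_df$lng)
+range(pop_df$lat)
+```
+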
+Note that we're adding a `_sf` suffix to our new object, because now it is a Simple Features spatial data frame! + ```{r} -pop_4326 <- st_as_sf(pop, - coords = c('lng', 'lat'), - crs = 4326, - remove = F) +pop_4326_sf <- st_as_sf(pop_df, + coords = c('lng', 'lat'), + crs = 4326, + remove = F) -head(pop_4326) +head(pop_4326_sf) ``` **2. Join population data with Alaska regions data using `st_join()`** @@ -252,9 +256,9 @@ In this case, we want to find what region each city falls within, so we will use ```{r} #| eval: false -pop_joined <- st_join(pop_4326, - ak_regions_3338, - join = st_within) +pop_joined_sf <- st_join(pop_4326_sf, + ak_rgns_3338_sf, + join = st_within) ``` This gives an error! @@ -266,16 +270,16 @@ Error: st_crs(x) == st_crs(y) is not TRUE Turns out, this won't work right now because our coordinate reference systems are not the same. Luckily, this is easily resolved using `st_transform()`, and projecting our population object into Alaska Albers. ```{r} -pop_3338 <- st_transform(pop_4326, - crs = 3338) +pop_3338_sf <- st_transform(pop_4326_sf, + crs = 3338) ``` ```{r} -pop_joined <- st_join(pop_3338, - ak_regions_3338, - join = st_within) +pop_joined_sf <- st_join(pop_3338_sf, + ak_rgns_3338_sf, + join = st_within) -head(pop_joined) +head(pop_joined_sf) ``` ::: {.callout-caution icon="false"} @@ -286,26 +290,26 @@ There are many different types of joins you can do with geospatial data. Examine **3. Calculate the total population by region using `group_by()` and `summarize()`** -Next we compute the total population for each region. In this case, we want to do a `group_by()` and `summarize()` as this were a regular `data.frame`. Otherwise all of our point geometries would be included in the aggregation, which is not what we want. Our goal is just to get the total population by region. We remove the sticky geometry using `as.data.frame()`, on the advice of the `sf::tidyverse` help page. +Next we compute the total population for each region. In this case, we want to do a `group_by()` and `summarize()` as if this were a regular `data.frame`, without the spatial information - otherwise all of our point geometries would be included in the aggregation, which is not what we want. We remove the sticky geometry using `st_drop_geometry()`. Here we're adding a `_df` suffix because it's no longer a spatial data frame. ```{r} -pop_region <- pop_joined %>% - as.data.frame() %>% +pop_rgn_df <- pop_joined_sf %>% + st_drop_geometry() %>% group_by(region) %>% - summarise(total_pop = sum(population)) + summarize(total_pop = sum(population)) -head(pop_region) +head(pop_rgn_df) ``` And use a regular `left_join()` to get the information back to the Alaska region shapefile. Note that we need this step in order to regain our region geometries so that we can make some maps. ```{r} -pop_region_3338 <- left_join(ak_regions_3338, - pop_region, +pop_rgn_3338_sf <- left_join(ak_rgns_3338_sf, + pop_rgn_df, by = "region") # plot to check -plot(pop_region_3338["total_pop"]) +plot(pop_rgn_3338_sf["total_pop"]) ``` So far, we have learned how to use `sf` and `dplyr` to use a spatial join on two datasets and calculate a summary metric from the result of that join. @@ -319,11 +323,11 @@ The `group_by()` and `summarize()` functions can also be used on `sf` objects to Say we want to calculate the population by Alaska management area, as opposed to region. 
```{r} -pop_mgmt_3338 <- pop_region_3338 %>% +pop_mgmt_3338_sf <- pop_rgn_3338_sf %>% group_by(mgmt_area) %>% summarize(total_pop = sum(total_pop)) -plot(pop_mgmt_3338["total_pop"]) +plot(pop_mgmt_3338_sf["total_pop"]) ``` Notice that the region geometries were combined into a single polygon for each management area. @@ -331,11 +335,11 @@ Notice that the region geometries were combined into a single polygon for each m If we don't want to combine geometries, we can specify `do_union = F` as an argument. ```{r} -pop_mgmt_3338 <- pop_region_3338 %>% +pop_mgmt_3338_sf <- pop_rgn_3338_sf %>% group_by(mgmt_area) %>% summarize(total_pop = sum(total_pop), do_union = F) -plot(pop_mgmt_3338["total_pop"]) +plot(pop_mgmt_3338_sf["total_pop"]) ``` **4. Save the spatial object to a new file using `write_sf()`** @@ -344,7 +348,7 @@ Save the spatial object to disk using `write_sf()` and specifying the filename. ```{r plot} #| eval: false -write_sf(pop_region_3338, "data/ak_regions_population.shp") +write_sf(pop_rgn_3338_sf, here("data/ak_regions_population.shp")) ``` ## Visualize with `ggplot` @@ -355,7 +359,7 @@ We can plot `sf` objects just like regular data.frames using `geom_sf`. ```{r} #| message: false -ggplot(pop_region_3338) + +ggplot(pop_rgn_3338_sf) + geom_sf(aes(fill = total_pop)) + labs(fill = "Total Population") + scale_fill_continuous(low = "khaki", @@ -373,17 +377,17 @@ We can also plot multiple shapefiles in the same plot. Say if we want to visuali ```{r} #| echo: false -# data is saved in a different folder than participants -rivers_3338 <- read_sf(here("data/shapefiles/ak_rivers_simp.shp")) +# for Quarto rendering, data is saved in a different folder than participants +rivers_3338_sf <- read_sf(here("data/shapefiles/ak_rivers_simp.shp")) ``` ```{r} #| eval: false -rivers_3338 <- read_sf("data/ak_rivers_simp.shp") +rivers_3338_sf <- read_sf(here("data/ak_rivers_simp.shp")) ``` ```{r} -st_crs(rivers_3338) +st_crs(rivers_3338_sf) ``` Note that although no EPSG code is set explicitly, with some sleuthing we can determine that this is `EPSG:3338`. [This site](https://epsg.io) is helpful for looking up EPSG codes. @@ -391,11 +395,11 @@ Note that although no EPSG code is set explicitly, with some sleuthing we can de ```{r} ggplot() + - geom_sf(data = pop_region_3338, + geom_sf(data = pop_rgn_3338_sf, aes(fill = total_pop)) + - geom_sf(data = pop_3338, + geom_sf(data = pop_3338_sf, size = 0.5) + - geom_sf(data = rivers_3338, + geom_sf(data = rivers_3338_sf, aes(linewidth = StrOrder)) + scale_linewidth(range = c(0.05, 0.5), guide = "none") + @@ -416,15 +420,15 @@ The `ggspatial` package has a function that can add tile layers from a few prede Then we will add `ggspatial::annotation_map_tile()` function into `ggplot` to add a base map to our map. This can take a couple of minutes to load. ```{r} -pop_3857 <- st_transform(pop_3338, +pop_3857_sf <- st_transform(pop_3338_sf, crs = 3857) ``` ```{r} #| message: false -ggplot(data = pop_3857) + - ggspatial::annotation_map_tile(type = "osm", zoom = 4) + # higher zoom values are more detailed +ggplot(data = pop_3857_sf) + + ggspatial::annotation_map_tile(type = "osm", zoom = 4, progress = 'none') + # higher zoom values are more detailed geom_sf(aes(color = population), fill = NA) + scale_color_continuous(low = "darkkhaki", @@ -439,10 +443,10 @@ ggplot(data = pop_3857) + ## Potential way of plotting base maps with more basemap providers. Issue: cropping is right at the border of the western, norther, eastern and southern point. So plot looks funky. 
 
-pop_osm <- maptiles::get_tiles(pop_3857, crop = TRUE) # retrieve maptiles
+pop_osm <- maptiles::get_tiles(pop_3857_sf, crop = TRUE) # retrieve maptiles
 
-ggplot(data = pop_3857) + # pop polygon layer
+ggplot(data = pop_3857_sf) + # pop polygon layer
   tidyterra::geom_spatraster_rgb(data = pop_osm) + #add basemap
   geom_sf(aes(color = population), #add geometry
           fill = NA)+
@@ -473,19 +477,19 @@ epsg3338 <- leaflet::leafletCRS(
 
 You might notice that this looks familiar! The syntax is a bit different, but most of this information is also contained within the `crs` of our shapefile:
 
 ```{r}
-st_crs(pop_region_3338)
+st_crs(pop_rgn_3338_sf)
 ```
 
 Since `leaflet` requires that we use an unprojected coordinate system, let's use `st_transform()` yet again to get back to WGS84.
 
 ```{r}
-pop_region_4326 <- pop_region_3338 %>% 
+pop_rgn_4326_sf <- pop_rgn_3338_sf %>% 
     st_transform(crs = 4326)
 ```
 
 ```{r}
 m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
-    addPolygons(data = pop_region_4326,
+    addPolygons(data = pop_rgn_4326_sf,
                 fillColor = "gray",
                 weight = 1)
 
 m
 ```
 
@@ -495,11 +499,11 @@ m
 We can add labels, legends, and a color scale.
 
 ```{r}
-pal <- colorNumeric(palette = "Reds", domain = pop_region_4326$total_pop)
+pal <- colorNumeric(palette = "Reds", domain = pop_rgn_4326_sf$total_pop)
 
 m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
     addPolygons(
-        data = pop_region_4326,
+        data = pop_rgn_4326_sf,
         fillColor = ~ pal(total_pop),
         weight = 1,
         color = "black",
         fillOpacity = 1
     ) %>%
@@ -509,7 +513,7 @@ m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
     addLegend(
         position = "bottomleft",
         pal = pal,
-        values = range(pop_region_4326$total_pop),
+        values = range(pop_rgn_4326_sf$total_pop),
         title = "Total Population"
     )
 
 m
 ```
 
@@ -519,18 +523,18 @@ m
 We can also add the individual communities, with popup labels showing their population, on top of that!
 
 ```{r}
-pal <- colorNumeric(palette = "Reds", domain = pop_region_4326$total_pop)
+pal <- colorNumeric(palette = "Reds", domain = pop_rgn_4326_sf$total_pop)
 
 m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
     addPolygons(
-        data = pop_region_4326,
+        data = pop_rgn_4326_sf,
         fillColor = ~ pal(total_pop),
         weight = 1,
         color = "black",
         fillOpacity = 1
     ) %>%
     addCircleMarkers(
-        data = pop_4326,
+        data = pop_4326_sf,
         lat = ~ lat,
         lng = ~ lng,
         radius = ~ log(population / 500),
@@ -539,12 +543,12 @@ m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%
         fillOpacity = 1,
         weight = 0.25,
         color = "black",
-        label = ~ paste0(pop_4326$city, ", population ", comma(pop_4326$population))
+        label = ~ paste0(pop_4326_sf$city, ", population ", comma(pop_4326_sf$population))
     ) %>%
     addLegend(
         position = "bottomleft",
         pal = pal,
-        values = range(pop_region_4326$total_pop),
+        values = range(pop_rgn_4326_sf$total_pop),
         title = "Total Population"
     )
 
 m
 
diff --git a/materials/sections/git-collab-merge-conflicts.qmd b/materials/sections/git-collab-merge-conflicts.qmd
index c5736705..7e458552 100644
--- a/materials/sections/git-collab-merge-conflicts.qmd
+++ b/materials/sections/git-collab-merge-conflicts.qmd
@@ -38,7 +38,7 @@ The Collaborator will make changes to the repository and then `push` those chang
 
 The instructors will demonstrate this process in the next section.
 
-### Step 0: Owner adds a Collaborator to their repository on GitHub {.unnumbered}
+### Step 1: Owner adds a Collaborator to their repository on GitHub {.unnumbered}
 
 The Owner must change the settings of the remote repository and give the Collaborator access to the repository by inviting them as a collaborator. 
Once the Collaborator accepts the owner's invitation, they will have push access to the repository -- meaning they can contribute their own changes/commits to the Owner's repository. @@ -46,7 +46,7 @@ To do this, the owner will navigate to their remote repository on GitHub, then c -### Step 1: Collaborator clones the remote repository {.unnumbered} +### Step 2: Collaborator clones the remote repository {.unnumbered} In order to contribute, the Collaborator must **clone** the repository from the **Owner's** GitHub account (*Note: as a Collaborator, you won't see the repository appear under your profile's Repositories page*). To do this, the Collaborator should navigate to the Owner's repository on GitHub, then copy the clone URL. In RStudio, the Collaborator will create a new project from version control by pasting this clone URL into the appropriate dialog box (see the [earlier chapter](https://learning.nceas.ucsb.edu/2024-06-delta/session_05.html#exercise-2-clone-your-repository-and-use-git-locally-in-rstudio) introducing GitHub). @@ -54,12 +54,12 @@ In order to contribute, the Collaborator must **clone** the repository from the Frequent communication is SO important when collaborating! Letting one another know that you're about to make and push changes to the remote repo can help to prevent merge conflicts (and reduce headaches). **The easiest way to avoid merge conflicts is to ensure that you and your collaborators aren't working on the same file(s)/section(s) of code at the same time.** -### Step 2: Collaborator edits files locally {.unnumbered} +### Step 3: Collaborator edits files locally {.unnumbered} With the repo cloned locally, the Collaborator can now make changes to the `README.md` file, adding a line or statement somewhere noticeable near the top. Save the changes. -### Step 3: Collaborator `commit`s, `pull`s, and `push`s {.unnumbered} +### Step 4: Collaborator `commit`s, `pull`s, and `push`s {.unnumbered} It's recommended that all collaborators (including the repo Owner) follow this workflow when syncing changes between their local repo and the remote repo (in this example, the Collaborator is now following these steps): @@ -80,7 +80,7 @@ The **merge** part of `git pull` will fail if you have uncommitted changes in yo Remember, communication is key! The Owner now knows that they can pull those changes down to their local repo. -### Step 4: Owner `pull`s new changes from the remote repo to their local repo {.unnumbered} +### Step 5: Owner `pull`s new changes from the remote repo to their local repo {.unnumbered} The Owner can now open their local working copy of the code in RStudio, and `pull` to fetch and merge those changes into their local copy. @@ -91,7 +91,7 @@ in RStudio, and `pull` to fetch and merge those changes into their local copy. Did we mention that communication is important? :) -### Step 5: Owner edits, `commit`s, `pull`s (just in case!) and `push`es {.unnumbered} +### Step 6: Owner edits, `commit`s, `pull`s (just in case!) and `push`es {.unnumbered} Following the same workflow as the Collaborator did earlier: @@ -103,7 +103,7 @@ Following the same workflow as the Collaborator did earlier: Yes, this seems silly to repeat, yet again -- but it's also easy to forget in practice! 
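+
+For anyone following along in a terminal rather than the RStudio Git pane, the same edit-sync cycle looks roughly like this (a sketch only -- the file name and commit message are placeholders):
+
+```
+git add README.md                     # stage the edited file
+git commit -m "Describe your change"  # commit it locally
+git pull                              # pull first, in case the remote moved ahead
+git push                              # then publish your commit to GitHub
+```
+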
-### Step 6: Collaborator `pull`s new changes from the remote repo to their local repo {.unnumbered} +### Step 7: Collaborator `pull`s new changes from the remote repo to their local repo {.unnumbered} The Collaborator can now `pull` down those changes from the Owner, and all copies are once again fully synced. And just like that, you've successfully collaborated! @@ -155,9 +155,7 @@ git config pull.rebase false You will do the exercise twice, where each person will get to practice being both the Owner and the Collaborator roles. -- **Step 0:** Designate one person as the Owner and one as the Collaborator. - -**Round One:** +**Round One:** Designate one person as the **Owner** and one as the **Collaborator**. - **Step 1:** Owner adds Collaborator to `{FIRSTNAME}_test` repository (see Setup block above for detailed steps) - **Step 2:** Collaborator clones the Owner's `{FIRSTNAME}_test` repository @@ -168,10 +166,9 @@ You will do the exercise twice, where each person will get to practice being bot - **Step 6:** Owner edits the `README` file: - Under "Git Workflow", Owner adds the steps of the Git workflow we've been practicing - **Step 7:** Owner commits and pushes the `README` file with the new changes to GitHub -- **Step 8:** Collaborator pulls the `Owners` changes from GitHub -- **Step 9:** Go back to Step 0, switch roles, and then follow the steps in Round Two. +- **Step 8:** Collaborator pulls the Owner's changes from GitHub -**Round Two:** +**Round Two:** Swap **Owner** and **Collaborator** roles and repeat! - **Step 1:** Owner adds Collaborator to `{FIRSTNAME}_test` repository - **Step 2:** Collaborator clones the Owner's `{FIRSTNAME}_test` repository @@ -182,7 +179,7 @@ You will do the exercise twice, where each person will get to practice being bot - **Step 6:** Owner edits the `README` file: - Under "How to Create a Git Repository", Owner adds the high level steps for this workflow - **Step 7:** Owner commits and pushes the `README` file with the new changes to GitHub -- **Step 8:** Collaborator pulls the `Owners` changes from GitHub +- **Step 8:** Collaborator pulls the Owner's changes from GitHub **Hint:** If you don't remember how to create a Git repository, refer to the chapter [Intro to Git and GitHub](https://learning.nceas.ucsb.edu/2023-04-coreR/session_07.html) where we created two Git repositories ::: @@ -359,12 +356,10 @@ Note you will only need to complete the Setup and Git configuration steps again Now it's your turn. In pairs, intentionally create a merge conflict, and then go through the steps needed to resolve the issues and continue developing with the merged files. See the sections above for help with each of the steps below. You will do the exercise twice, where each person will get to practice being both the Owner and the Collaborator roles. -- **Step 0:** Designate one person as the Owner and one as the Collaborator. - -**Round One:** +**Round One:** Designate one person as the Owner and one as the Collaborator. Both open the Owner's `{FIRSTNAME}_test` project. 
- **Step 1:** Both Owner and Collaborator `pull` to ensure both have the most up-to-date changes -- **Step 2:** Owner edits the `README` file and makes a change to the title and commits **do not push** +- **Step 2:** Owner edits the `README` file and makes a change to the title and commits but **do not push yet!** - **Step 3:** **On the same line**, Collaborator edits the `README` file and makes a change to the title and commits - **Step 4:** Collaborator pushes the file to GitHub - **Step 5:** Owner pushes their changes and gets an error @@ -374,12 +369,11 @@ Now it's your turn. In pairs, intentionally create a merge conflict, and then go - **Step 9:** Owner pushes the resolved changes to GitHub - **Step 10:** Collaborator pulls the resolved changes from GitHub - **Step 11:** Both view commit history -- **Step 12:** Go back to Step 0, switch roles, and then follow the steps in Round Two. -**Round Two:** +**Round Two:** Swap Owner and Collaborator roles and repeat! Switch to the new Owner's `{FIRSTNAME}_test` project. - **Step 1:** Both Owner and Collaborator `pull` to ensure both have the most up-to-date changes -- **Step 2:** Owner edits the `README` file and makes a change to line 2 and commits **do not push** +- **Step 2:** Owner edits the `README` file and makes a change to line 2 and commits but **do not push yet!** - **Step 3:** **On the same line**, Collaborator edits the `README` file and makes a change to line 2 and commits - **Step 4:** Collaborator pushes the file to GitHub - **Step 5:** Owner pushes their changes and gets an error diff --git a/materials/sections/r-creating-functions.qmd b/materials/sections/r-creating-functions.qmd index 34556e35..3fb13fd2 100644 --- a/materials/sections/r-creating-functions.qmd +++ b/materials/sections/r-creating-functions.qmd @@ -80,7 +80,7 @@ Functions in R are a mechanism to process some input and return a value. Similar ```{r} #| label: f2c-function -fahr_to_celsius <- function(fahr) { +convert_f_to_c <- function(fahr) { celsius <- (fahr - 32) * 5/9 return(celsius) } @@ -91,7 +91,7 @@ By running this code, we have created a function and stored it in R's global env ```{r} #| label: demo-f2c-function -celsius1a <- fahr_to_celsius(fahr = airtemps[1]) +celsius1a <- convert_f_to_c(fahr = airtemps[1]) celsius1a celsius1 == celsius1a ``` @@ -101,11 +101,11 @@ Excellent. So now we have a conversion function we can use. Note that, because m ```{r} #| label: f2c-function-vector -celsius <- fahr_to_celsius(fahr = airtemps) +celsius <- convert_f_to_c(fahr = airtemps) celsius ``` -This takes a vector of temperatures in Fahrenheit, and returns a vector of temperatures in Celsius. Note also that we explicitly named the argument inside the function call (`fahr_to_celsius(fahr = airtemps)`), but in this simple case, R can figure it out if we didn't explicitly tell it the argument name (`fahr_to_celsius(airtemps)`). More on this later! +This takes a vector of temperatures in Fahrenheit, and returns a vector of temperatures in Celsius. Note also that we explicitly named the argument inside the function call (`convert_f_to_c(fahr = airtemps)`), but in this simple case, R can figure it out if we didn't explicitly tell it the argument name (`convert_f_to_c(airtemps)`). More on this later! 
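+
+As a quick illustration, these two calls are equivalent -- the first names the argument explicitly, the second relies on position:
+
+```{r}
+#| eval: false
+convert_f_to_c(fahr = airtemps)  # named argument
+convert_f_to_c(airtemps)         # positional argument
+```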
#### Your Turn: Create a Function that Converts Celsius to Fahrenheit {.unnumbered} @@ -113,9 +113,9 @@ This takes a vector of temperatures in Fahrenheit, and returns a vector of tempe #### Exercise -Create a function named `celsius_to_fahr` that does the reverse, it takes temperature data in Celsius as input, and returns the data converted to Fahrenheit. +Create a function named `convert_c_to_f` that does the reverse, it takes temperature data in Celsius as input, and returns the data converted to Fahrenheit. -Create the function `celsius_to_fahr` in a new code chunk or even a separate R Script file. +Create the function `convert_c_to_f` in a new code chunk or even a separate R Script file. Then use that formula to convert the `celsius` vector back into a vector of Fahrenheit values, and compare it to the original `airtemps` vector to ensure that your answers are correct. @@ -133,12 +133,12 @@ Don't peek until you write your own... ```{r} #| label: f2c-func-solution -celsius_to_fahr <- function(celsius) { +convert_c_to_f <- function(celsius) { fahr <- celsius * 9/5 + 32 return(fahr) } -result <- celsius_to_fahr(celsius) +result <- convert_c_to_f(celsius) airtemps == result ``` @@ -161,7 +161,8 @@ convert_temps <- function(fahr) { } t_vec <- c(-100, -40, 0, 32, 98.6, 212) -temps_df <- data.frame(convert_temps(t_vec)) +temps_list <- convert_temps(fahr = t_vec) +temps_df <- data.frame(convert_temps(fahr = t_vec)) ``` ```{r} @@ -218,8 +219,8 @@ This kind of function should take a vector (or multiple vectors) and return a *s #| label: func-mutate-example data.frame(f = t_vec) %>% - mutate(c = fahr_to_celsius(fahr = f), - f2 = celsius_to_fahr(celsius = c)) + mutate(c = convert_f_to_c(fahr = f), + f2 = convert_c_to_f(celsius = c)) ``` Why wouldn't our `convert_temps()` function work here? @@ -249,7 +250,9 @@ Let's make a function that can take a dataframe and calculate a new column that ```{r} calc_hotcold <- function(df, thresh = 70) { + ### error check: if(!'fahr' %in% names(df)) stop('The data frame must have a column called `fahr`!') + out_df <- df %>% mutate(hotcold = ifelse(fahr > thresh, 'hot', 'cold')) diff --git a/materials/sections/r-practice-collaborator-clean-wrangle.qmd b/materials/sections/r-practice-collaborator-clean-wrangle.qmd index 83ce4571..2fe1694e 100644 --- a/materials/sections/r-practice-collaborator-clean-wrangle.qmd +++ b/materials/sections/r-practice-collaborator-clean-wrangle.qmd @@ -73,8 +73,10 @@ carp_20_traps <- lobster_traps %>% filter(SITE == "CARP" & TRAPS > 20) ``` -# Question 4 +::: callout-note +### Question 4 Find the maximum number of commercial trap floats using `max()` and group by `SITE` and `MONTH`. Think about how you want to treat the `NA` values in `TRAPS` (Hint: check the arguments in `max()`). Check your output. +::: ```{r} #| echo: false diff --git a/materials/sections/r-practice-owner-clean-wrangle.qmd b/materials/sections/r-practice-owner-clean-wrangle.qmd index 51b49cd8..6c799faf 100644 --- a/materials/sections/r-practice-owner-clean-wrangle.qmd +++ b/materials/sections/r-practice-owner-clean-wrangle.qmd @@ -76,9 +76,10 @@ aque_70mm <- lobster_abundance %>% filter(SITE == "AQUE" & SIZE_MM >= 70) ``` - -# Question 4 +::: callout-note +### Question 4 Find the maximum carapace length using `max()` and group by `SITE` and `MONTH`. Think about how you want to treat the NA values in `SIZE_MM` (Hint: check the arguments in `max()`). Check your output. 
+::: ```{r} #| code-summary: "Answer" diff --git a/materials/sections/visualization-delta.qmd b/materials/sections/visualization-delta.qmd index 4d99f026..5dae88c4 100644 --- a/materials/sections/visualization-delta.qmd +++ b/materials/sections/visualization-delta.qmd @@ -185,7 +185,7 @@ Now, let's plot total daily visits by restoration location. We will show this by ## Option 1 - data and mapping called in the ggplot() function ggplot(data = daily_visits_loc, - aes(x = restore_loc, y = daily_visits))+ + aes(x = restore_loc, y = daily_visits)) + geom_col() @@ -203,7 +203,7 @@ ggplot() + They all will create the same plot: -(Apologies for the crumble text on the x-axis, we will learn how to make this look better soon) +(Apologies for the jumbled text on the x-axis, we will learn how to make this look better soon) ```{r esc_plot} #| echo: false @@ -276,7 +276,7 @@ Let's go back to our base bar graph. What if we want our bars to be blue instead ggplot(data = daily_visits_loc, aes(x = restore_loc, y = daily_visits, - fill = "blue"))+ + fill = "blue")) + geom_col() ``` @@ -289,7 +289,7 @@ What we really wanted to do was just change the color of the bars. If we want do ```{r fill_blue_geom} ggplot(data = daily_visits_loc, - aes(x = restore_loc, y = daily_visits))+ + aes(x = restore_loc, y = daily_visits)) + geom_col(fill = "blue") ``` @@ -300,7 +300,7 @@ What if we did want to map the color of the bars to a variable, such as `visitor ggplot(data = daily_visits_loc, aes(x = restore_loc, y = daily_visits, - fill = visitor_type))+ + fill = visitor_type)) + geom_col() ``` @@ -311,7 +311,7 @@ ggplot(data = daily_visits_loc, - If you want to map a variable onto a graph aesthetic (e.g., point color should be based on a specific region), put it within `aes()`. -- If you want to update your plot base with a constant (e.g. “Make ALL the points BLUE”), you can add the information directly to the relevant geom_ layer. +- If you want to update your plot base with a constant (e.g. “Make ALL the points BLUE”), you can add the information directly to the relevant `geom_` layer outside the `aes()` call. ::: @@ -322,22 +322,20 @@ ggplot(data = daily_visits_loc, We have successfully plotted our data. But, this is clearly not a nice plot. Let's work on making this plot look a bit nicer. 
We are going to: - Add a title, subtitle and adjust labels using `labs()` -- Flip the x and y axis to better read the graph using `coord_flip()` +- Flip the x and y axis to make it a sideways column plot and make the labels easier to read - Include a built in theme using `theme_bw()` ```{r theme_bw_plot} ggplot(data = daily_visits_loc, - aes(x = restore_loc, y = daily_visits, - fill = visitor_type))+ - geom_col()+ - labs(x = "Restoration Location", - y = "Number of Visits", + aes(y = restore_loc, x = daily_visits, fill = visitor_type)) + + geom_col() + + labs(x = "Number of Visits", + y = "Restoration Location", fill = "Type of Visitor", title = "Total Number of Visits to Delta Restoration Areas by visitor type", - subtitle = "Sum of all visits during July 2017 and March 2018")+ - coord_flip()+ + subtitle = "Sum of all visits during July 2017 and March 2018") + theme_bw() @@ -364,18 +362,16 @@ Let's look at an example of a `theme()` call, where we change the position of th ```{r} ggplot(data = daily_visits_loc, - aes(x = restore_loc, y = daily_visits, - fill = visitor_type))+ - geom_col()+ - labs(x = "Restoration Location", - y = "Number of Visits", + aes(y = restore_loc, x = daily_visits, fill = visitor_type)) + + geom_col() + + labs(x = "Number of Visits", + y = "Restoration Location", fill = "Type of Visitor", title = "Total Number of Visits to Delta Restoration Areas by visitor type", - subtitle = "Sum of all visits during study period")+ - coord_flip()+ - theme_bw()+ + subtitle = "Sum of all visits during study period") + + theme_bw() + theme(legend.position = "bottom", - axis.ticks.y = element_blank()) ## note we mention y-axis here + axis.ticks.y = element_blank()) ``` @@ -396,15 +392,13 @@ So now our code will look like this: ```{r} ggplot(data = daily_visits_loc, - aes(x = restore_loc, y = daily_visits, - fill = visitor_type))+ - geom_col()+ - labs(x = "Restoration Location", - y = "Number of Visits", + aes(y = restore_loc, x = daily_visits, fill = visitor_type)) + + geom_col() + + labs(x = "Number of Visits", + y = "Restoration Location", fill = "Type of Visitor", title = "Total Number of Visits to Delta Restoration Areas by visitor type", - subtitle = "Sum of all visits during study period")+ - coord_flip()+ + subtitle = "Sum of all visits during study period") + my_theme ``` @@ -415,7 +409,7 @@ ggplot(data = daily_visits_loc, What changes do you expect to see in your plot by adding the following line of code? Discuss with your neighbor and then try it out! -`scale_y_continuous(breaks = seq(0,120, 20))` +`scale_x_continuous(breaks = seq(0,120, 20))` ::: @@ -425,16 +419,14 @@ What changes do you expect to see in your plot by adding the following line of c #| code-summary: "Answer" ggplot(data = daily_visits_loc, - aes(x = restore_loc, y = daily_visits, - fill = visitor_type))+ - geom_col()+ - labs(x = "Restoration Location", - y = "Number of Visits", + aes(y = restore_loc, x = daily_visits, fill = visitor_type)) + + geom_col() + + labs(x = "Number of Visits", + y = "Restoration Location", fill = "Type of Visitor", title = "Total Number of Visits to Delta Restoration Areas by visitor type", - subtitle = "Sum of all visits during study period")+ - coord_flip()+ - scale_y_continuous(breaks = seq(0,120, 20))+ + subtitle = "Sum of all visits during study period") + + scale_x_continuous(breaks = seq(0,120, 20)) + my_theme ``` @@ -443,16 +435,14 @@ Finally we are going to expand the bars all the way to the axis line. 
In other w
 
 ```{r}
 ggplot(data = daily_visits_loc,
-       aes(x = restore_loc, y = daily_visits,
-           fill = visitor_type))+
-    geom_col()+
-    labs(x = "Restoration Location",
-         y = "Number of Visits",
+       aes(y = restore_loc, x = daily_visits, fill = visitor_type)) +
+    geom_col() +
+    labs(x = "Number of Visits",
+         y = "Restoration Location",
          fill = "Type of Visitor",
          title = "Total Number of Visits to Delta Restoration Areas by visitor type",
-         subtitle = "Sum of all visits during study period")+
-    coord_flip()+
-    scale_y_continuous(breaks = seq(0,120, 20), expand = c(0,0))+
+         subtitle = "Sum of all visits during study period") +
+    scale_x_continuous(breaks = seq(0,120, 20), expand = c(0,0)) +
     my_theme
 ```
 
@@ -461,33 +451,34 @@ ggplot(data = daily_visits_loc,
 
 #### Reordering things
 
-`ggplot()` loves putting things in alphabetical order. But more frequent than not, that's not the order you actually want things to be plotted. One way to do this is to use the `fct_reorder()` function from the `forcats` package. `forcats` provides tools for working with categorical variables. In this case, we want to reorder our categorical variable of `restore_loc` based on the total number of visits.
+`ggplot()` loves putting things in alphabetical order. But more frequently than not, that's not the order you actually want things to be plotted. One way to change the order is to use the `fct_reorder()` function from the `forcats` package. `forcats` provides tools for working with categorical variables. In this case, we want to reorder our categorical variable of `restore_loc` based on the total number of visits.
 
-The first thing we need to do is to add a column to our data with the _total number of visits_ by location. This will be our "sorting" variable.
+The first thing we need to do is to add a column to our data with the _total number of visits_ by location. This will be our "sorting" variable. Then we use `fct_reorder()` to reorder the `restore_loc` variable according to our sorting variable.
 
 ```{r}
 daily_visits_totals <- daily_visits_loc %>%
     group_by(restore_loc) %>%
     mutate(n = sum(daily_visits)) %>%
-    ungroup()
+    ungroup() %>%
+    mutate(restore_loc = fct_reorder(restore_loc, n))
 
 head(daily_visits_totals)
+levels(daily_visits_totals$restore_loc) ### not alphabetical any more!
 ```
 
-Next, we will run the code for our plot adding the `fct_reorder()` function.
+Next, we will run the code for our plot. Since `restore_loc` is now a factor reordered by `n`, `ggplot()` will plot the bars in that order automatically.
 
 ```{r}
 ggplot(data = daily_visits_totals,
-       aes(x = fct_reorder(restore_loc, n), y = daily_visits,
-           fill = visitor_type))+
-    geom_col()+
-    labs(x = "Restoration Location",
-         y = "Number of Visits",
+       aes(x = daily_visits, y = restore_loc,
+           fill = visitor_type)) +
+    geom_col() +
+    labs(x = "Number of Visits",
+         y = "Restoration Location",
          fill = "Type of Visitor",
          title = "Total Number of Visits to Delta Restoration Areas by visitor type",
-         subtitle = "Sum of all visits during study period")+
-    coord_flip()+
-    scale_y_continuous(breaks = seq(0,120, 20), expand = c(0,0))+
+         subtitle = "Sum of all visits during study period") +
+    scale_x_continuous(breaks = seq(0,120, 20), expand = c(0,0)) +
     my_theme
 ```
 
 What if you want to plot the other way around? In this case, from least to greatest? We add `desc()` to the variable we are sorting by. 
```{r} +daily_visits_totals <- daily_visits_loc %>% + group_by(restore_loc) %>% + mutate(n = sum(daily_visits)) %>% + ungroup() %>% + mutate(restore_loc = fct_reorder(restore_loc, desc(n))) + ggplot(data = daily_visits_totals, - aes(x = fct_reorder(restore_loc, desc(n)), y = daily_visits, - fill = visitor_type))+ - geom_col()+ - labs(x = "Restoration Location", - y = "Number of Visits", + aes(x = daily_visits, y = restore_loc, + fill = visitor_type)) + + geom_col() + + labs(x = "Number of Visits", + y = "Restoration Location", fill = "Type of Visitor", title = "Total Number of Visits to Delta Restoration Areas by visitor type", - subtitle = "Sum of all visits during study period")+ - coord_flip()+ - scale_y_continuous(breaks = seq(0,120, 20), expand = c(0,0))+ + subtitle = "Sum of all visits during study period") + + scale_x_continuous(breaks = seq(0,120, 20), expand = c(0,0)) + my_theme - ``` - - #### Colors @@ -525,17 +513,16 @@ The last thing we will do to our plot is change the color. To do this we are goi ```{r} ggplot(data = daily_visits_totals, - aes(x = fct_reorder(restore_loc, desc(n)), y = daily_visits, - fill = visitor_type))+ - geom_col()+ - scale_fill_viridis_d()+ - labs(x = "Restoration Location", - y = "Number of Visits", + aes(x = daily_visits, y = restore_loc, + fill = visitor_type)) + + geom_col() + + scale_fill_viridis_d() + + labs(x = "Number of Visits", + y = "Restoration Location", fill = "Type of Visitor", title = "Total Number of Visits to Delta Restoration Areas by visitor type", - subtitle = "Sum of all visits during study period")+ - coord_flip()+ - scale_y_continuous(breaks = seq(0,120, 20), expand = c(0,0))+ + subtitle = "Sum of all visits during study period") + + scale_x_continuous(breaks = seq(0,120, 20), expand = c(0,0)) + my_theme @@ -567,18 +554,18 @@ The default behavior of facet wrap is to put all facets on the same x and y scal ```{r} facet_plot <- ggplot(data = daily_visits_totals, aes(x = visitor_type, y = daily_visits, - fill = visitor_type))+ - geom_col()+ + fill = visitor_type)) + + geom_col() + facet_wrap(~restore_loc, scales = "free_y", ncol = 5, - nrow = 2)+ - scale_fill_viridis_d()+ + nrow = 2) + + scale_fill_viridis_d() + labs(x = "Type of visitor", y = "Number of Visits", title = "Total Number of Visits to Delta Restoration Areas", - subtitle = "Sum of all visits during study period")+ - theme_bw()+ + subtitle = "Sum of all visits during study period") + + theme_bw() + theme(legend.position = "bottom", axis.ticks.x = element_blank(), axis.text.x = element_blank())
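+
+# The assignment above stores the plot without printing it; display it by name:
+facet_plot
+
+# Optionally save it to disk -- ggsave() infers the file type from the
+# extension (this path is just an example):
+# ggsave("facet_plot.png", plot = facet_plot, width = 10, height = 6)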