diff --git a/19-joins.Rmd b/19-joins.Rmd index 12b15865..50e620ad 100644 --- a/19-joins.Rmd +++ b/19-joins.Rmd @@ -1,261 +1,392 @@ # Joins - -```{r 21-01, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE} +```{r 19-01, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE} +oldopt <- options(pillar.print_max = 4, pillar.print_min = 4) library(nycflights13) library(tidyverse) ``` **Learning objectives:** -- Use mutating and filtering joins to combine data. -- Identify **keys**, or varisbles, to connect a pair of data frames. +- Identify keys to connect a pair of data frames +- Use mutating and filtering joins to combine data - Understand how joins work and understand the output -- Discuss a family of joins that provides more flexibiility in matching keys +- Understand how various key matching conditions work ------------ +## What? {-} -## Introduction +- Joining two data frames `x` and `y`: combining the information from both data frames to create a new data frame, by **matching rows in `x` to rows in `y` based on one or more common variables** (keys) -- It's rare for data analysis to use only a single data frame so we often use multiple, called **relational data**. +## What? {-} -- The idea behind joins is these data relationships are defined between a pair of tables through keys. +- A join is the operation of joining. +Joins can be classified by several criteria: + - is the information of data frame `y` included in the result? (mutating vs filtering joins) + - what happens with non-matching rows? (inner vs outer joins / semi-join vs anti-join) + - how are key matching conditions defined? (equality vs inequality vs no restrictions) -- Every join involves a pair of keys: a primary key and a foreign key. +## Keys {-} -- To explore these keys, we'll look at the data frames in the nycflight13 package. +Keys = the variables used to connect a pair of data frames in a join. 
+- `x$key` must match `y$key` for a row in `x` to be matched to a row in `y` +- **compound key**: key that consists of > 1 variable -## nycflights13 +## Keys {-} -- The nycflights13 package contains airline on-time data for all flights departing NYC in 2013, as well as useful 'metadata' on airlines, airports, weather, and -planes in various other data frames. - - - `airlines` contains data about each flight - - - `airlines` contains data about each airline +Every join involves a **pair of keys**: one key in each data frame. - - `airports` records data about each airport +They typically play a different role depending on the data frame they belong to: - - `weather` records data about the weather at the origin airports. +- one is a data frame's **primary key**: the _variable_ or the _set of variables_ that **_uniquely identifies_** each observation +- the other is called a **foreign key**: + - it _corresponds_ to the primary key (same meaning, same number of variables) + - its values can be repeated - - `planes` records data about each plane. +![](images/19_equality_match.png) -- The relationship between these data frames can be seen below. +## Keys {-} -```{r 21-03, echo=FALSE, fig.align='center', fig.cap='nycfilghts13 package', out.width='100%'} -knitr::include_graphics('images/nycfilghts13.png') -``` - -## Keys +Primary & foreign key relationships in the nycflights13 package: -- A **primary key** uniquely identifies an observation in its own table. +![](images/19_relational.png) -- A **foreign key** uniquely identifies an observation in another table. +## Keys {-} -- When more than one variable is needed, the key is called a **compound key**. +Tips: -- `flights$tailnum` is a foreign key that corresponds to the primary key `planes$tailnum`. +- joining is easiest if primary and foreign key have the same name +- check the primary keys! 
+ - each value must occur only once + - there must be no missing values -- `flights$carrier` is a foreign key that corresponds to the primary key `airlines$carrier`. +## Keys {-} -- `flights$origin` is a foreign key that corresponds to the primary key `airports$faa`. +**Surrogate key**: a single variable added to reflect a compound primary key; makes life easier -- `flights$dest` is a foreign key that corresponds to the primary key `airports$faa`. +```{r} +flights |> + count(time_hour, carrier, flight) |> + filter(n > 1) +flights2 <- flights |> + mutate(id = row_number(), .before = 1) +flights2 +``` -- `flights$origin-flights$time_hour` is a compound foreign key that corresponds to the compound primary key `weather$origin-weather$time_hour`. -- The primary and foreign keys in the nycflights13 package almost always have the same names and almost every variable name used in multiple tables has the same meaning in each place, which makes joining them much easier, with the exception of `year`, year means year of departure in `flights` and year of manufacturer in `planes`. +## Mutating joins {-} -## Validating keys +**Mutating joins** add columns from data frame `y`. -- It’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to `count()` the primary keys and look for entries where `n` is greater than one. +- Inner join: only keep matching rows. +- Outer join: also keep non-matching rows from `x` (left join), `y` (right join) or both (full join). 
-```{r} -planes |> - count(tailnum) |> - filter(n > 1) +## Mutating joins {-} +![](images/19_inner.png) -weather |> - count(time_hour, origin) |> - filter(n > 1) +## Mutating joins {-} +![](images/19_left.png) + +## Mutating joins {-} + +![](images/19_right.png) + +## Mutating joins {-} + +![](images/19_full.png) + +## Mutating joins: examples {-} + +```{r} +flights2 <- flights |> + select(year, time_hour, origin, dest, tailnum, carrier) +flights2 ``` -- You should also check for missing values in your primary keys — if a value is missing then it can’t identify an observation! +## Mutating joins: examples {-} ```{r} -planes |> - filter(is.na(tailnum)) +flights2 |> + left_join(airlines) +``` -weather |> - filter(is.na(time_hour) | is.na(origin)) +## Mutating joins: examples {-} + +```{r} +flights2 |> + inner_join(airlines) ``` +## Mutating joins: extras {-} + +`left_join()`, `right_join()` and `inner_join()` have an argument `unmatched =` -## Surrogate keys +- defaults to `"drop"`: drop non-matching records from (respectively) `y`, `x` or both `x` and `y` +- can be set to `"error"` to verify that all records from `y`, `x` or both `x` and `y` are kept, if that is what you expect -- A **surrogate key** is a custom made key where it is possible to identify unique information, such as the number of rows in a table, and it is made if a table lacks a primary key. +## Mutating joins: extras {-} -- For flights, the combination of time_hour, carrier, and flight seems reasonable simple numeric surrogate key using the row number. +All mutating join functions have an argument `keep =` -- For example, `flights2$flights_id` is a surrogate key because it is custom made and uniquely identifies each observation in the flights table. 
+- defaults to `NULL`: equi joins retain only the key from `x`, while non-equi joins retain both keys +- can be set to `TRUE` to force both keys to be retained in the output ```{r} -flights2 <- flights |> - mutate(id = row_number(), .before = 1) -flights2 +flights2 |> + inner_join(airlines, keep = TRUE) ``` +## Relationships in mutating joins {-} -## Basic joins +The _relationship_ describes how many rows in `x` each value of `y$key` is _expected_ to match (**one** or **many**), _and vice-versa_, giving rise to 4 possible combinations: -- Once you understand how data frames are connected via keys, we can start using joins to better understand the flights dataset. +- one-to-many +- many-to-one +- one-to-one +- many-to-many -- dplyr provides six join functions: `left_join()`, `inner_join()`, `right_join()`, `full_join()`, `semi_join()`, and `anti_join()`. +Which relationship applies follows from whether the keys are primary or foreign in `x` and `y`. -- They all have the same interface: they take a pair of data frames (`x` and `y`) and return a data frame. The order of the rows and columns in the output is primarily determined by `x`. +## Relationships in mutating joins {-} +**One-to-many**: primary key is in `x`. -```{r 21-twotables, echo=FALSE, fig.align='center', fig.cap='Two tables', out.width='30%'} -knitr::include_graphics("images/twotables.png") -``` +Example with left join: +![](images/19_one-to-many.png) -## Mutating joins +## Relationships in mutating joins {-} -- A mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. +**Many-to-one**: primary key is in `y`. -- Like `mutate()`, the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones. +Example with left join or inner join: -- There are four types of mutating joins, but there’s one that you’ll use almost all of the time: `left_join()`. 
+![](images/19_many-to-one.png) -```{r} -flights2 <- flights |> - select(year, time_hour, origin, dest, tailnum, carrier) -flights2 +## Relationships in mutating joins {-} + +**One-to-one**: each row in `x` matches at most one row in `y`, and vice-versa. + +![](images/19_right.png) + +## Relationships in mutating joins {-} + +**Many-to-many**: no primary keys involved! + +![](images/19_many-to-many.png) + +This gives a warning by default, since it can be unintended. + +To avoid the warning (an intentional many-to-many), explicitly set `relationship = "many-to-many"`. + +## Relationships in mutating joins {-} + +The `relationship` argument in mutating joins allows you to take control over the expected relationship. + +From `help("mutate-joins")`: + +> In production code, it is best to preemptively set `relationship` to whatever relationship you expect to exist between the keys of `x` and `y`, as this forces an error to occur immediately if the data doesn't align with your expectations. + +## Filtering joins {-} + +**Filtering joins** filter `x` based on (non-)matching rows in `y`. + +They never duplicate rows! +## Filtering joins {-} + +![](images/19_semi.png) + +## Filtering joins {-} + +![](images/19_anti.png) + +## Filtering joins: examples {-} + +```{r} +flights2 |> + semi_join(weather) ``` +## Filtering joins: examples {-} + ```{r} flights2 |> - left_join(airlines) + anti_join(weather) ``` -## Left Joins +## Specifying join keys and their matching conditions {-} + +Without specifying a join key: the 'natural join' is applied (messages in previous slides). + +BUT: -- The left join is special because the output will always have the same rows as `x`, the data frame you’re joining to. +- this will miss key pairs with different names +- this will assume any pair of identical variable names to be part of the key pair -- The primary use of `left_join()` is to add in additional metadata. 
For example, we can use `left_join()` to add the full airline name to the `flights2` data: +## Specifying join keys and their matching conditions {-} + +E.g. this is NOT what we intend (`year` has a different meaning in `flights` vs `planes`): ```{r} flights2 |> - left_join(airlines) + left_join(planes) ``` -## Specifying join keys +## Specifying join keys and their matching conditions {-} --By default, `left_join()` will use all variables that appear in both data frames as the join key, the so called natural join. +So it is strongly recommended to define the join key explicitly using `join_by()`: -- This is a useful heuristic, but it doesn’t always work, like when joining `flights` and `planes` which each have a year column but they mean different things. +- `inner_join(x, y, by = join_by(...))` +- `left_join(x, y, by = join_by(...))` +- `right_join(x, y, by = join_by(...))` +- `full_join(x, y, by = join_by(...))` +- `semi_join(x, y, by = join_by(...))` +- `anti_join(x, y, by = join_by(...))` + +## Specifying join keys and their matching conditions {-} + +When both keys have the same name(s): just provide the key name(s). ```{r} -flights2 |> - left_join(planes) +flights2 |> + left_join(planes, join_by(tailnum)) ``` -- We get a lot of missing matches because our join is trying to use `tailnum` and `year` as a compound key. In this case, we only want to join on tailnum so we need to provide an explicit specification with `join_by()`, where `join_by(tailnum)` is short for `join_by(tailnum == tailnum)`. +## Specifying join keys and their matching conditions {-} +When both keys have the same name(s): just provide the key name(s). ```{r} -flights2 |> - left_join(planes, join_by(tailnum)) +flights2 |> + semi_join(weather, join_by(origin, time_hour)) ``` +## Specifying join keys and their matching conditions {-} -- You can also specify different join keys in each table. 
For example, there are two ways to join the `flight2` and `airports` table: either by `dest` or `origin`: +`join_by(var1)` is shorthand for `join_by(var1 == var1)`. +`join_by(var1, var2)` is shorthand for `join_by(var1 == var1, var2 == var2)`. -```{r} -flights2 |> - left_join(airports, join_by(dest == faa)) +The expression form in `join_by()` defines: +- _each_ key's name -- these can differ between the `x` and `y` data frame +- the matching condition that defines the match between `x$key` and `y$key`: equality (`==`) or inequality (`<`, `<=` etc, and helpers) -flights2 |> - left_join(airports, join_by(origin == faa)) +## Some `join_by()` examples {-} +Join airport attributes without losing flights: + +```{r} +flights2 |> + left_join(airports, join_by(origin == faa)) ``` -## Equi Joins +## Some `join_by()` examples {-} -- A left join is often called an equi join because it describes the relationship between the two tables where the keys are equal. +Join airport attributes without losing flights: -- `inner_join()`, `right_join()`, `full_join()` are similar to `left_join()` in that respect, but the difference is which rows they keep: +```{r} +flights2 |> + left_join(airports, join_by(dest == faa)) +``` - - left join keeps all the rows in `x`, - - - the right join keeps all rows in `y`, - - - the full join keeps all rows in either `x` or `y`, and - - - the inner join only keeps rows that occur in both `x` and `y` - -- Equi joins are the most common type of join, so we’ll typically omit the equi prefix, and just say “inner join” rather than “equi inner join”. 
- +## Some `join_by()` examples {-} -```{r 21-outerjoins, echo=FALSE, fig.align='center', fig.cap='Outer joins', out.width='60%'} -knitr::include_graphics("images/outerjoins.png") +```{r} +df <- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max")) +df ``` +Doing a self-join with an inequality matching condition to get every unique pair of names: -```{r 21-innerjoin, echo=FALSE, fig.align='center', fig.cap='inner join', out.width='100%'} -knitr::include_graphics("images/innerjoin.png") +```{r} +df |> inner_join(df, join_by(id < id)) |> print(n = Inf) ``` -## Filtering joins -- The primary action of a filtering join is to filter the rows, and unlike mutating joins, never duplicate rows. +## Key matching conditions {-} -- There are two types: semi-joins and anti-joins. +- Equality condition: so-called '**equi joins**'. + - You will use this most of the time. + - All others are sometimes called 'non-equi joins' +- Inequality conditions: + - **inequality joins**: `join_by(id < id)` + - **rolling joins** (only closest): `join_by(closest(id < id))` + - **overlap joins**: to set interval conditions + - `between(x, y_lower, y_upper)` is short for `x >= y_lower, x <= y_upper` + - `within(x_lower, x_upper, y_lower, y_upper)` is short for `x_lower >= y_lower, x_upper <= y_upper` + - `overlaps(x_lower, x_upper, y_lower, y_upper)` is short for `x_lower <= y_upper, x_upper >= y_lower` +- No conditions: **cross joins**. +Use the separate function `cross_join()`. - - Semi-joins keep all rows in `x` that have a match in `y` - - anti-joins, it's inverse, return all rows in `x` that don’t have a match in `y`. 
+## Key matching conditions {-} -```{r 21-semijoin, echo=FALSE, fig.align='center', fig.cap='Semi-join', out.width='60%'} -knitr::include_graphics("images/semijoin.png") -``` +![](images/19_lt.png) +## Key matching conditions {-} -```{r 21-antijoin, echo=FALSE, fig.align='center', fig.cap='Anti-join', out.width='60%'} -knitr::include_graphics("images/antijoin.png") +**`cross_join(x, y)`** + +![](images/19_cross.png) + +## Key matching conditions: examples {-} + +```{r include=FALSE} +set.seed(123) +employees <- tibble( + name = sample(babynames::babynames$name, 100), + birthday = ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1) +) +parties <- tibble( + q = 1:4, + party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")), + start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")), + end = ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31")) +) ``` -## Non-equi joins +```{r} +employees +parties +``` -- In equi joins the x keys and y are always equal, so we only need to show one in the output, but that isn't always the case. +## Key matching conditions: examples {-} -- dplyr helps by identifying four particularly useful types of non-equi join: +When is the birthday party for each employee? - - Cross joins match every pair of rows. - - Inequality joins use <, <=, >, and >= instead of ==. - - Rolling joins are similar to inequality joins but only find the closest match. - - Overlap joins are a special type of inequality join designed to work with ranges. 
+```{r} +employees |> + inner_join( + parties, + join_by(between(birthday, start, end)), + unmatched = "error" + ) +``` +## Key matching conditions: examples {-} -```{r 19-cross, echo=FALSE, fig.align='center', fig.cap='Anti-join', out.width='60%'} -knitr::include_graphics("images/19_cross_join.png") +```{r} +df <- tibble(name = c("John", "Simon", "Tracy", "Max")) +df ``` -```{r 19-inequality, echo=FALSE, fig.align='center', fig.cap='Anti-join', out.width='60%'} -knitr::include_graphics("images/19_inequality_join.png") +Generate permutations in a self-join with `cross_join()`: + +```{r} +df |> cross_join(df) ``` -```{r 19-rolling, echo=FALSE, fig.align='center', fig.cap='Anti-join', out.width='60%'} -knitr::include_graphics("images/19_rolling_join.png") + + +```{r include=FALSE} +options(oldopt) ``` + ## Meeting Videos ### Cohort 5 diff --git a/images/19_anti.png b/images/19_anti.png new file mode 100644 index 00000000..15011572 Binary files /dev/null and b/images/19_anti.png differ diff --git a/images/19_cross.png b/images/19_cross.png new file mode 100644 index 00000000..0a90a86d Binary files /dev/null and b/images/19_cross.png differ diff --git a/images/19_cross_join.png b/images/19_cross_join.png deleted file mode 100644 index f476196a..00000000 Binary files a/images/19_cross_join.png and /dev/null differ diff --git a/images/19_equality_match.png b/images/19_equality_match.png new file mode 100644 index 00000000..aa6da9da Binary files /dev/null and b/images/19_equality_match.png differ diff --git a/images/19_full.png b/images/19_full.png new file mode 100644 index 00000000..b0c63c1b Binary files /dev/null and b/images/19_full.png differ diff --git a/images/19_inequality_join.png b/images/19_inequality_join.png deleted file mode 100644 index d7a33f7b..00000000 Binary files a/images/19_inequality_join.png and /dev/null differ diff --git a/images/19_inner.png b/images/19_inner.png new file mode 100644 index 00000000..7c6f9a89 Binary files /dev/null and 
b/images/19_inner.png differ diff --git a/images/19_left.png b/images/19_left.png new file mode 100644 index 00000000..4efb093f Binary files /dev/null and b/images/19_left.png differ diff --git a/images/19_lt.png b/images/19_lt.png new file mode 100644 index 00000000..7c8b6a79 Binary files /dev/null and b/images/19_lt.png differ diff --git a/images/19_many-to-many.png b/images/19_many-to-many.png new file mode 100644 index 00000000..e2eba256 Binary files /dev/null and b/images/19_many-to-many.png differ diff --git a/images/19_many-to-one.png b/images/19_many-to-one.png new file mode 100644 index 00000000..f64dddc7 Binary files /dev/null and b/images/19_many-to-one.png differ diff --git a/images/19_one-to-many.png b/images/19_one-to-many.png new file mode 100644 index 00000000..0c25fbaf Binary files /dev/null and b/images/19_one-to-many.png differ diff --git a/images/19_relational.png b/images/19_relational.png new file mode 100644 index 00000000..40cc9b1c Binary files /dev/null and b/images/19_relational.png differ diff --git a/images/19_right.png b/images/19_right.png new file mode 100644 index 00000000..5d8c6cdf Binary files /dev/null and b/images/19_right.png differ diff --git a/images/19_rolling_join.png b/images/19_rolling_join.png deleted file mode 100644 index 21006299..00000000 Binary files a/images/19_rolling_join.png and /dev/null differ diff --git a/images/19_semi.png b/images/19_semi.png new file mode 100644 index 00000000..b76f2115 Binary files /dev/null and b/images/19_semi.png differ diff --git a/images/antijoin.png b/images/antijoin.png deleted file mode 100644 index 85b98641..00000000 Binary files a/images/antijoin.png and /dev/null differ diff --git a/images/innerjoin.png b/images/innerjoin.png deleted file mode 100644 index b2a51709..00000000 Binary files a/images/innerjoin.png and /dev/null differ diff --git a/images/nycfilghts13.png b/images/nycfilghts13.png deleted file mode 100644 index 0435cc9b..00000000 Binary files 
a/images/nycfilghts13.png and /dev/null differ diff --git a/images/outerjoins.png b/images/outerjoins.png deleted file mode 100644 index f19fd6ef..00000000 Binary files a/images/outerjoins.png and /dev/null differ diff --git a/images/semijoin.png b/images/semijoin.png deleted file mode 100644 index d30ffa5d..00000000 Binary files a/images/semijoin.png and /dev/null differ diff --git a/images/semijoin2.png b/images/semijoin2.png deleted file mode 100644 index d81b9b5f..00000000 Binary files a/images/semijoin2.png and /dev/null differ diff --git a/images/twotables.png b/images/twotables.png deleted file mode 100644 index f6879e0a..00000000 Binary files a/images/twotables.png and /dev/null differ diff --git a/images/venn_diagram.png b/images/venn_diagram.png deleted file mode 100644 index e759a85e..00000000 Binary files a/images/venn_diagram.png and /dev/null differ
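Follow-up note on the slides in this change: the `relationship` argument and the rolling join (`closest()`) are both described, but neither has a runnable example. The sketch below is one possible illustration, not part of the diff. It assumes dplyr >= 1.1.0, and the tibbles `orders`, `customers` and `promos` are hypothetical toy data invented for this note.

```r
library(dplyr)

# Hypothetical toy data (not from nycflights13)
orders <- tibble(
  order_id = 1:4,
  customer = c("a", "b", "a", "c"),
  day      = c(2, 5, 9, 12)
)
customers <- tibble(
  customer = c("a", "b", "c"),
  city     = c("Ghent", "Leuven", "Namur")
)
promos <- tibble(
  promo = c("winter", "spring", "summer"),
  start = c(1, 6, 10)
)

# Many-to-one: each order is expected to match at most one customer.
# Stating the expectation makes the join fail with an error if
# `customers$customer` ever contains duplicates.
with_city <- orders |>
  left_join(customers, join_by(customer), relationship = "many-to-one")

# Rolling join: for each order, keep only the closest promo that
# started on or before the order day.
with_promo <- orders |>
  left_join(promos, join_by(closest(day >= start)))

with_city
with_promo
```

Because the second join is a non-equi join, both key columns (`day` and `start`) are retained in the output by default, which matches the `keep = NULL` behaviour described in the slides.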