-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path15.apis.qmd
174 lines (102 loc) · 10.8 KB
/
15.apis.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "15. APIs"
author: "Patrick Spauster"
format:
html:
code-copy: true
toc: true
toc-location: right
editor: visual
---
## Video Tutorial
```{=html}
<div style="max-width:608px"><div style="position:relative;padding-bottom:66.118421052632%"><iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/1674401/sp/167440100/embedIframeJs/uiconf_id/23435151/partner_id/1674401?iframeembed=true&playerId=kaltura_player&entry_id=1_vpkmj6pj&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[LeadWithHLSOnFlash]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_wkx3phqq" width="608" height="402" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="15. APIs" style="position:absolute;top:0;left:0;width:100%;height:100%"></iframe></div></div>
```
```{r include=F}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
rows.print=5
)
options(tibble.max_extra_cols = 5, tibble.print_max = 5)
library(tidyverse)
```
## Downloading Data from the web
You can download datasets directly from the web with read_csv().
```{r warning=F, message=F}
read_csv('https://raw.githubusercontent.com/Statology/Miscellaneous/main/basketball_data.csv')
```
But sometimes getting data stored on the web is more complicated, or you are accessing massive databases that would take forever to read the whole thing into R.
When downloading data from [NYC Open Data](https://opendata.cityofnewyork.us/) and other sources, you can use a special set of commands to filter and manipulate the data before downloading it to your local computer. These commands, unlike ones you'd write in an R script, are actually written **in the URL itself**. This kind of command is called an "API Call"--- API stands for "Application programming interface" and is just a special system and language for users to communicate with the back end data servers. API Calls to APIs use a special syntax, which will be included in an API's documentation. For NYC's Open data, that syntax is very similar to a popular language called SQL or Structured Query Language.
Some APIs will require registration of a "token" which will be an input in your URL. We'll go over those with an example using the census API.
## Writing API Calls to NYC Open Data
Using API Calls is particularly useful when you're dealing with a \*\*huge dataset\*\* that would otherwise be a hassle to download in full. For this example, we're going to use the [HPD's Housing Maintinance Code Violations](https://data.cityofnewyork.us/Housing-Development/Housing-Maintenance-Code-Violations/wvxf-dwi5) dataset on NYC Open Data, which has \~7 million rows.
### Grabbing the "API Endpoint"
Let's first grab the beginning part of the URL that we're going to use to write our API Call. This is called the "API Endpoint" and can be found by clicking the "API" tab on the NYC Open Data page for the dataset you're working with.
Before copying the URL, make sure you set the data format toggle from "JSON" to "CSV", as that is the format we're going to want our data in.
For our example, our endpoint looks like this:
[`https://data.cityofnewyork.us/resource/wvxf-dwi5.csv`](https://data.cityofnewyork.us/resource/wvxf-dwi5.csv)
### Writing up your API Call
Copy the API Endpoint into a text editor - or an R script! - (I prefer [Sublime Text](https://www.sublimetext.com/) or [Visual Studio Code](https://code.visualstudio.com/)--- others like Word or Pages have a tendency to "auto-correct" certain letters and syntax which may mess you up).
Now, to initiate our query, we are going to add `?$query=` to the end of our URL:
```
https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=
```
### Adding your Query
At the end our URL, we can now add some special code to filter the violations data for our download.
To do this, we're going to want to first take a look at the [API Documentation](https://dev.socrata.com/foundry/data.cityofnewyork.us/wvxf-dwi5) for our dataset of choice, which can be found by clicking on the "API" tab again on the dataset's Open Data page and clicking the "API Docs" button. Specifically, this documentation gives us a run down of all of the columns in the data and how we can reference them by name in our API call.
For this example, we want to look at the most serious (class C) HPD Violations within the past month. So, we're going to write out our query as such:
**`SELECT *`** -- this selects all columns of the data
**`WHERE inspectiondate>='2021-06-01T00:00:01.000' AND inspectiondate<'2021-09-01T00:00:01.000' AND class='C'`** -- this filters only rows where the `inspectiondate` value is between June 1st and Aug 31st, and the class of the violation is `'C'`. The `AND` operator here allows us to include multiple filtering conditions at once, and could even include conditions on other columns. Note the special format that the dates come in... we were able to spot this by looking at the [Documentation](https://dev.socrata.com/foundry/data.cityofnewyork.us/wvxf-dwi5).
**`LIMIT 100000`** --- this sets the maximum number of downloadable rows to 100,000. It's good practice to set a limit here so we don't accidentially try downloading millions of rows at once. **Note: if you don't specify, the default limit is just 1,000 rows!**
You can find more information on the types of queries you can write on the [Socrata Developers Portal](https://dev.socrata.com/docs/queries/query.html) (Socrata is the special "flavor" of API that NYC Open Data uses).
### Running our API Call
We add the above pieces in that order to our URL:
`https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=SELECT * WHERE inspectiondate>='2021-06-01T00:00:01.000' AND inspectiondate<'2021-09-01T00:00:01.000' AND class='C' LIMIT 100000`
Now, you can copy this full url into your browser and press ENTER--- your special download should begin!
We can also use string operators to create and quickly modify the different components of our API call.
```{r}
base_url <- "https://data.cityofnewyork.us/resource/wvxf-dwi5.csv"
inspectiondate_range <- c("2021-06-01T00:00:01.000","2021-09-01T00:00:01.000")
class <- "C"
limit <- c("100000")
full_api_call <- paste0(base_url, "?$query=SELECT * WHERE inspectiondate>='",inspectiondate_range[1],"' AND inspectiondate<'",inspectiondate_range[2],"' AND class='",class,"' LIMIT ",limit)
```
### Importing your data directly into R Studio
Once you have your data downloaded via API Call, you can feel free to import it into your R project like any other CSV. If you want to use the URL you created to import it directly, you can do that as well:
```{r message=FALSE, warning=FALSE, paged.print=FALSE}
library(tidyverse)
library(fs)
# R doesn't like weird characters like spaces and carats, so we need the `URLencode` function here to encode those symbols properly
url_hpd_viol <- URLencode("https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=SELECT * WHERE inspectiondate>='2021-06-01T00:00:01.000' AND inspectiondate<'2021-09-01T00:00:01.000' AND class='C' LIMIT 100000")
# Now, we can use our formatted url inside our `read_csv` function
summer_hpd_viols <- read_csv(url_hpd_viol)
```
### Note: Always check the size of your output
Sometimes, the limit on your API Call may make your data export smaller than your desired outcome, and you won't necessarily be notified. Therefore, it is always very important to **check the number of rows** of your data from your API Call before proceeding with analysis--- if the number of rows matches the exact number of your limit (or is 1000, the default limit), it's very likely that your data got cut off and you don't have the complete set of data that you wanted.
The below example illustrates this problem and shows how to diagnose. For the example, imagine that we didn't include a `LIMIT` clause in our API Call query:
`https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=SELECT * WHERE inspectiondate>='2021-06-01T00:00:01.000' AND inspectiondate<'2021-09-01T00:00:01.000'`
```{r}
url_viol_no_limit = URLencode("https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=SELECT * WHERE inspectiondate>='2021-06-01T00:00:01.000' AND inspectiondate<'2021-09-01T00:00:01.000'")
summer_violations_cut_off <- read_csv(url_viol_no_limit)
# Using the head() function won't actually reveal the cut-off problem:
head(summer_violations_cut_off)
```
By looking at the head of your dataset, things appear to be fine. However, let's use the `nrow()` function to get a sense of how many rows we have:
```{r}
nrow(summer_violations_cut_off)
```
Given that our data output is 1,000 rows, which is exactly the default limit for API Calls to NYC Open Data, it's very likely that our data got cut off and there are more rows within our filtering conditions that we want.
Our next step would be to increase our LIMIT in our API CAll until get a number of outputs rows below the limit value. In our first example, you can see we've done just that--- our LIMIT was set to 100,000 rows and we only received around 20K or so rows. Safe to say we got all of the rows that fit our filtering criteria...
## Other APIs
Thousands of sites have APIs that work similarly. Each API will have specific documentation and formatting. Once you read up on the documentation for one, the others will be easier to decipher. Here are some examples
- [Spotify](https://developer.spotify.com/documentation/web-api)
- [Pro Publica's Nonprofit Explorer](https://projects.propublica.org/nonprofits/api)
- [Bureau of Labor Statistics (BLS)](https://www.bls.gov/bls/api_features.htm)
## API Wrappers
A number of developers have also created R packages that make certain APIs easier to use. The example we will look at next, tidycensus, is an example of a wrapper for the US Census API. It will construct API calls based on parameters we set in a function. Here are some other API wrappers.
- [RSocrata](https://dev.socrata.com/connectors/rsocrata.html) (Many Open Data Portals, including NYC's use Socrata's data system)
- [BLS](https://github.com/mikeasilva/blsAPI)
- [Google Sheets](https://googlesheets4.tidyverse.org/)
- [Datawrapper](https://blog.datawrapper.de/why-i-created-an-r-library-to-use-datawrappers-api/)