---
title: "Web Scraping"
author: "Daniel Chen"
date: ""
output:
  md_document:
    variant: markdown_github
    toc: true
---
# Ways of getting information from the web
- Download data manually
- Use an API (Application Programming Interface)
- Scrape it
- Check the site's Terms and Conditions (T&C) before scraping
# Getting Tables from websites
`XML::readHTMLTable()` is useful for pulling tables out of web pages such as Wikipedia articles.

This example gets the US state abbreviations from:
https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations
```{r}
library(RCurl)
library(XML)
library(testthat)
library(stringr)
# `url` is the name of a base R function, so use a different variable name
wiki_url <- RCurl::getURL("https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations")
```
```{r}
tables <- XML::readHTMLTable(wiki_url)
class(tables)
length(tables)
```
```{r}
abbrevs <- tables[[1]]
head(abbrevs)
```
```{r}
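# Drop the leading header rows; row 11 was the first data row when this was written
us <- abbrevs[11:nrow(abbrevs), ]
head(us)
# Sanity check: the first remaining row should be the United States entry
first_value <- stringr::str_trim(as.character(us[1, 1]))
testthat::expect_equal(object = first_value, expected = 'United States of America')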
```
```{r, error=TRUE}
# testthat::expect_equal(object = stringr::str_trim((as.character(us[1, 1]))), expected = 'does not match')
```
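The hard-coded row number `11` stops working as soon as the page gains or loses a header row. A minimal sketch of finding the first data row by its content instead is shown here. The `abbrevs_demo` data frame and its values are made up for illustration and stand in for the scraped table; only base R is used.

```r
# Toy stand-in for the scraped table (hypothetical values, not the real Wikipedia data)
abbrevs_demo <- data.frame(
  name = c("Name", "Codes", " United States of America ", "Alabama"),
  code = c("ISO", "ANSI", "US", "AL"),
  stringsAsFactors = FALSE
)

# Trim whitespace, then find the first row whose first column matches the expected label
first_col <- trimws(as.character(abbrevs_demo[[1]]))
start_row <- which(first_col == "United States of America")[1]

# Fail early with a clear message if the label is missing
if (is.na(start_row)) stop("Could not find the 'United States of America' row")

# Keep everything from that row onward
us_demo <- abbrevs_demo[start_row:nrow(abbrevs_demo), ]
```

Searching for a known label keeps the subsetting step tied to the data rather than to the page layout on a particular day.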
# Scraping Websites
https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/
```{r}
library(rvest)
if (interactive()) {
  data_location <- 'data/working'
} else {
  data_location <- '../../data/working'
}
```
# Scraping
IMDB Top Rated Movies:
http://www.imdb.com/chart/top?ref_=nv_mv_250_6
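SelectorGadget (http://selectorgadget.com/) helps find the CSS selector for the part of the page you want.
Selectors usually target a CSS class (`.class-name`) or an id (`#id-name`).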
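To see how class and id selectors behave without hitting a live site, here is a small sketch on a hand-written page. The markup, the names `Actor A` and `Actor B`, and the rating value are all made up for illustration and are not IMDB's real HTML; `minimal_html()` is the `rvest` helper for building a page from a string.

```r
library(rvest)

# Tiny hand-written page (hypothetical markup, not IMDB's real HTML)
page <- minimal_html('
  <div id="titleCast">
    <span class="itemprop">Actor A</span>
    <span class="itemprop">Actor B</span>
  </div>
  <p class="rating">8.1</p>
')

# ".itemprop" selects every element with class "itemprop"
actors <- page %>%
  html_nodes(".itemprop") %>%
  html_text()

# "#titleCast .itemprop" selects class "itemprop" only inside the element with id "titleCast"
actors_in_cast <- page %>%
  html_nodes("#titleCast .itemprop") %>%
  html_text()

# html_node() (singular) returns only the first match
rating <- page %>%
  html_node(".rating") %>%
  html_text() %>%
  as.numeric()
```

`html_nodes()` returns every match, while `html_node()` returns just the first one, which is why the rating is extracted with the singular form.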
```{r}
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
```
```{r}
# Rating
lego_movie %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()
```
```{r}
# First page of actors
lego_movie %>%
  html_nodes(".itemprop .itemprop") %>%
  html_text()
```
```{r}
lego_movie %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
```
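The same `html_nodes("table")` plus `html_table()` pattern can be tried offline. This is a minimal sketch on a made-up table; the column names and cell values are assumptions for illustration and do not reflect the real IMDB page.

```r
library(rvest)

# Hand-written page with one table (hypothetical content)
page <- minimal_html('
  <table>
    <tr><th>Actor</th><th>Character</th></tr>
    <tr><td>Actor A</td><td>Hero</td></tr>
    <tr><td>Actor B</td><td>Villain</td></tr>
  </table>
')

# Take the first table on the page and convert it to a data frame
cast <- page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
```

Because the first row is made of `<th>` cells, `html_table()` uses it as the column names rather than as data.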
```{r}
lego_movie %>%
  html_nodes(".primary_photo , .ellipsis, .character, #titleCast .itemprop, #titleCast .loadlate")
```
```{r}
# more manual way
lego_movie %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_nodes("tr") %>%
  html_nodes("span") %>%
  html_text()
```