Gaston Sanchez
- Work with the package "stringr"
- String manipulation
- More regular expressions
- A bit of data cleaning
- Second contact with "plotly"
- Making some maps
- Write your descriptions, explanations, and code in an Rmd (R markdown) file.
- Name this file as lab10-first-last.Rmd, where first and last are your first and last names (e.g. lab10-gaston-sanchez.Rmd).
- Knit your Rmd file as an html document (default option).
- Submit your Rmd and html files to bCourses, in the corresponding lab assignment.
- Due date displayed in the syllabus (see github repo).
So far we’ve been working with data sets that have already been cleaned and can be imported in R ready to be analyzed.
Today we are going to deal with a “messy” dataset. Most real life data sets require a pre-processing phase, and most of the time spent in any data analysis project will involve getting the data in the right shape. So it is extremely important that you gain skills for cleaning raw data.
For this lab, you will be using the R packages "stringr" and "plotly":
# you may need to install the packages
# install.packages("stringr")
# install.packages("plotly")
library(dplyr)
library(stringr)
library(plotly)
Also, while you work on this lab you may want to look at the cheat sheets for:
Before downloading the data, create a folder for this lab:
- Create a new directory, e.g. lab10
- cd to lab10
The data file for this lab is on the github repo:
https://raw.githubusercontent.com/ucb-stat133/stat133-spring-2018/master/data/mobile-food-sf.csv
The original source is the Mobile Food Facility Permit data, which comes from DataSF (SF Open Data):
https://data.sfgov.org/Economy-and-Community/Mobile-Food-Schedule/jjew-r69b
Download a copy of the file to your working directory. Here’s the code with R’s download.file() function (you could also use curl from the command line):
github <- "https://raw.githubusercontent.com/ucb-stat133/stat133-fall-2018/master/"
datafile <- "data/mobile-food-sf.csv"
download.file(paste0(github, datafile), destfile = "mobile-food-sf.csv")
Once you’ve downloaded the data file, you can read it in R:
dat <- read.csv('mobile-food-sf.csv', stringsAsFactors = FALSE)
The variables are:
DayOfWeekStr
starttime
endtime
PermitLocation
optionaltext
ColdTruck
Applicant
Location
Let’s begin using functions from the package "plotly", which allows you to produce nice interactive graphics rendered in html form. Keep in mind that plotly graphs will work as long as your output document is in html format (i.e. knitting html_document is okay). However, plotly graphs WON’T work when knitting github_document.
Consider the variable DayOfWeekStr, which contains the day of the week in string format. Let’s calculate the frequencies (i.e. counts) of the categories in this column and visualize them with a bar-chart:
day_freqs <- table(dat$DayOfWeekStr)
barplot(day_freqs, border = NA, las = 3)
An alternative bar-chart can be obtained with "plotly". You can use the function plot_ly() in a similar way to base-R plot():
plot_ly(x = names(day_freqs),
y = day_freqs,
type = 'bar')
Interestingly, you can also use plot_ly() in a similar way to ggplot(). To use plot_ly() in this way, the data to be graphed must be in a data.frame (or tibble):
# day frequencies table
day_counts <- dat %>%
select(DayOfWeekStr) %>%
group_by(DayOfWeekStr) %>%
summarise(count = n()) %>%
arrange(desc(count))
day_counts
## # A tibble: 7 x 2
## DayOfWeekStr count
## <chr> <int>
## 1 Friday 1105
## 2 Wednesday 1095
## 3 Thursday 1090
## 4 Tuesday 1081
## 5 Monday 1080
## 6 Saturday 533
## 7 Sunday 263
Having obtained day_counts, you can pass it to plot_ly(), map the columns DayOfWeekStr and count to the x and y attributes, and set the argument type = 'bar':
plot_ly(day_counts,
x = ~DayOfWeekStr,
y = ~count,
type = 'bar')
Notice the use of the tilde "~" to specify the mapping, that is, linking a visual attribute with a column name from a data frame.
To order the bars in increasing order, you need to reorder() the values on the x-axis:
plot_ly(day_counts,
x = ~reorder(DayOfWeekStr, count),
y = ~count,
type = 'bar')
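To see why reorder() changes the order of the bars, it can help to look at the factor it returns. The sketch below rebuilds the counts from the output above by hand (the name day_counts_toy is made up for illustration, so the code runs without the data file) and inspects the factor levels. plot_ly() appears to follow the level order of a factor when laying out a categorical axis, which would explain the result.

```r
# counts copied from the day_counts output above
day_counts_toy <- data.frame(
  DayOfWeekStr = c("Friday", "Wednesday", "Thursday", "Tuesday",
                   "Monday", "Saturday", "Sunday"),
  count = c(1105, 1095, 1090, 1081, 1080, 533, 263),
  stringsAsFactors = FALSE
)

# reorder() returns a factor whose levels are sorted by the second argument
day_factor <- reorder(day_counts_toy$DayOfWeekStr, day_counts_toy$count)
levels(day_factor)

# negate the scores to get decreasing order instead
day_factor_desc <- reorder(day_counts_toy$DayOfWeekStr, -day_counts_toy$count)
levels(day_factor_desc)
```

The same trick with the minus sign can be used inside the formula, e.g. x = ~reorder(DayOfWeekStr, -count).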
Let’s begin processing the values in column starttime. The goal is to obtain new times in 24 hr format. For example, a starting time of 10AM will be transformed to 10:00. Likewise, a starting time of 1PM will be transformed to 13:00.
We are going to be manipulating character strings. Hence, I recommend that you start working on a small subset of values. Figure out how to get the answers working on this subset, and then generalize to the entire data set.
Consider the first starting time, which has a value of 10AM. To get a better feeling of string manipulation, let’s create a toy string with this value:
# toy string
time1 <- '10AM'
To get the time and period values, you can use str_sub():
# hour
str_sub(time1, start = 1, end = 2)
## [1] "10"
# period
str_sub(time1, start = 3, end = 4)
## [1] "AM"
Your turn: What about times where the hour has just one digit? For example: 9AM, or 8AM? Create the following vector times and try to subset the hour and the periods with str_sub():
times <- c('12PM', '10AM', '9AM', '8AM', '2PM')
# subset time
# subset period
#
One nice thing about str_sub() is that it allows you to specify negative values for the start and end positions. Run the command below and see what happens:
# period
str_sub(times, start = -2)
The tricky part with the vector times is the extraction of the hour. One solution is to “remove” the characters AM or PM from each time. You can do this with the substitution function str_replace():
str_replace(times, pattern = 'AM|PM', replacement = '')
## [1] "12" "10" "9" "8" "2"
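Instead of removing what you don't want, you can also describe what you do want with a regular expression and pull it out with str_extract(). The following is a minimal sketch on the same toy vector; the object names hour_chr and period_chr are made up for illustration.

```r
library(stringr)

times <- c('12PM', '10AM', '9AM', '8AM', '2PM')

# one or more digits at the start of the string
hour_chr <- str_extract(times, pattern = '^[0-9]+')
hour_chr

# the letters AM or PM at the end of the string
period_chr <- str_extract(times, pattern = '(AM|PM)$')
period_chr
```

Note that both results are character vectors, so the hours still need as.numeric() before doing any arithmetic with them.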
So far you’ve managed to get the hour value and the period (AM or PM). Now:
- Using times, create a numeric vector hours containing just the number time (i.e. hour).
- Using times, create a character vector periods containing the period, e.g. AM or PM.
- Use plot_ly() to make a barchart of the counts for AM and PM values.
- Write R code to create a vector start24 that contains the hour in 24hr scale.
- Add two columns start and end to the data frame dat, containing the starting and ending hour respectively (columns must be "numeric").
- With the starting and ending hours, calculate the duration, and add one more column duration to the data frame dat.
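One possible way to approach the 24hr conversion is sketched below. The helper to_24h() is hypothetical (it is not part of "stringr"), and it assumes every value looks like '10AM' or '2PM', with no minutes and no spaces. The value '12AM' was added to the toy vector to show the edge case at midnight.

```r
library(stringr)

# hypothetical helper: '1PM' -> 13, '12PM' -> 12, '12AM' -> 0
to_24h <- function(x) {
  hour <- as.numeric(str_extract(x, '^[0-9]+'))
  period <- str_extract(x, '(AM|PM)$')
  hour <- hour %% 12                      # 12AM and 12PM both become 0 here
  hour + ifelse(period == 'PM', 12, 0)    # then PM times get 12 added
}

toy_times <- c('12PM', '10AM', '9AM', '8AM', '2PM', '12AM')
toy_start24 <- to_24h(toy_times)
toy_start24
```

With numeric start and end hours available, the duration is then just the difference between the two.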
Another interesting column in the data is Location. If you look at this column, you will see values like the following string loc1:
loc1 <- "(37.7651967350509,-122.416451692902)"
The goal is to split Location
into latitude and longitude. The first
value corresponds to latitude, while the second value corresponds to
longitude.
First we need to remove the parentheses. The issue here is that the characters ( and ) have special meanings; recall they are metacharacters. So in R you need to escape them by prepending two backslashes: \\( and \\)
# "remove" opening parenthesis
str_replace(loc1, pattern = '\\(', replacement = '')
## [1] "37.7651967350509,-122.416451692902)"
# "remove" closing parenthesis
str_replace(loc1, pattern = '\\)', replacement = '')
## [1] "(37.7651967350509,-122.416451692902"
You can also combine both patterns in a single call. But be careful:
str_replace(loc1, pattern = '\\(|\\)', replacement = '')
## [1] "37.7651967350509,-122.416451692902)"
str_replace() replaces only the first occurrence of ( or ). However, the location values contain both opening and closing parentheses. To replace them all, you have to use str_replace_all():
str_replace_all(loc1, pattern = '\\(|\\)', replacement = '')
## [1] "37.7651967350509,-122.416451692902"
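An alternative to escaping with backslashes is to put the parentheses inside a character class, where they are treated as literal characters. The sketch below shows this on loc1; the object names no_parens and no_parens2 are made up for illustration.

```r
library(stringr)

loc1 <- "(37.7651967350509,-122.416451692902)"

# inside [ ] the parentheses are literal, so no backslashes are needed
no_parens <- str_replace_all(loc1, pattern = '[()]', replacement = '')
no_parens

# str_remove_all() is shorthand for replacing with an empty string
no_parens2 <- str_remove_all(loc1, pattern = '[()]')
no_parens2
```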
Now we need to get rid of the comma. You could replace it with an empty string, but then you will end up with one long string like this:
lat_lon <- str_replace_all(loc1, pattern = '\\(|\\)', replacement = '')
str_replace(lat_lon, pattern = ',', replacement = '')
## [1] "37.7651967350509-122.416451692902"
Instead of replacing the comma, what we need to use is str_split():
# string split in stringr
str_split(lat_lon, pattern = ',')
## [[1]]
## [1] "37.7651967350509" "-122.416451692902"
Notice that str_split() returns a list.
Let’s define a vector with more location values, so we can start generalizing our code:
locs <- c(
"(37.7651967350509,-122.416451692902)",
"(37.7907890558203,-122.402273431333)",
"(37.7111991003088,-122.394693339395)",
"(37.7773000262759,-122.394812784799)",
NA
)
- Use str_split() to create a list lat_lon containing the latitude and the longitude values of locs.
Assuming that you have lat_lon, to retrieve the latitude and longitude values, you can use the lapply() function, and then specify an anonymous function to get the first element (for the latitude):
lat <- lapply(lat_lon, function(x) x[1])
Create a list lon by using lapply() with an anonymous function to extract the longitude value (i.e. the second element).
To convert from a list to a vector, use unlist():
lat <- as.numeric(unlist(lat))
lon <- as.numeric(unlist(lon))
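The lapply() plus unlist() combination can also be collapsed into a single step with vapply(), which returns a vector directly and checks that every element produces exactly one value. This is only a sketch of an alternative, using a shortened version of locs; the names lat_chr and lon_chr are made up for illustration.

```r
library(stringr)

locs <- c("(37.7651967350509,-122.416451692902)",
          "(37.7907890558203,-122.402273431333)",
          NA)

# remove the parentheses, then split on the comma
lat_lon <- str_split(str_replace_all(locs, '\\(|\\)', ''), pattern = ',')

# vapply() returns a character vector with one value per list element
lat_chr <- vapply(lat_lon, function(x) x[1], character(1))
lon_chr <- vapply(lat_lon, function(x) x[2], character(1))

lat <- as.numeric(lat_chr)
lon <- as.numeric(lon_chr)
```

The missing location stays missing: both lat and lon end up with NA in the last position.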
Add two more columns, lat and lon, to the data frame dat.
Now that you have two vectors latitude and longitude, and the corresponding columns lat and lon in the data frame dat, let’s try to plot those coordinates on a map.
A naive option would be to graph the locations with plot():
plot(dat$lon, dat$lat, pch = 19, col = "#77777744")
A similar scatterplot can be obtained with plot_ly() which, as we’ve seen, can be used like base-R plot():
# default scatterplot
plot_ly(x = lon, y = lat)
but it’s recommended to specify the arguments type = 'scatter' and mode = 'markers':
# scatterplot with explicit type and mode
plot_ly(x = lon, y = lat, type = 'scatter', mode = 'markers')
Because lon and lat are also in the data frame dat, you could use plot_ly() in a similar (although not identical) way to ggplot():
plot_ly(data = dat, x = ~lon, y = ~lat, type = 'scatter', mode = 'markers')
Notice the use of the tildes next to the x and y arguments. This is what plot_ly() uses to map the values of a column (in a data.frame) to the attributes of a graphical element.
Although the previous calls show the dots with the right latitude and longitude coordinates, there are no visual cues that let us perceive the information in a geographical way.
Instead of displaying a naked plot(), we can use the package "RgoogleMaps", which is one of the several packages available in R to plot maps.
# install.packages("RgoogleMaps")
library(RgoogleMaps)
To get a map, use the function GetMap(), which requires center and zoom specifications. The center is a vector with the latitude and longitude coordinates. The argument zoom refers to the zoom level.
# coordinates for center of the map
center <- c(mean(dat$lat, na.rm = TRUE), mean(dat$lon, na.rm = TRUE))
# zoom value
zoom <- min(MaxZoom(range(dat$lat, na.rm = TRUE),
range(dat$lon, na.rm = TRUE)))
# san francisco map
map1 <- GetMap(center=center, zoom=zoom, destfile = "san-francisco.png")
The code above downloads a static map from the Google server and saves it in the specified destination file. To make a plot you have to use PlotOnStaticMap():
PlotOnStaticMap(map1, dat$lat, dat$lon, col = "#ed4964", pch=20)
Another useful package for plotting maps is "ggmap". As you may guess, "ggmap" follows the graphing approach of "ggplot2".
As usual, you need to install the package:
# remember to install ggmap
install.packages("ggmap")
library(ggmap)
It is possible that you will run into some issues with "ggmap" (and "ggplot2"). Apparently, there are a couple of conflicting bugs in some versions of these packages. If you encounter cryptic errors, you may switch to an older version of "ggplot2":
# skip this part (come back if you run into some error messages)
# (go back to a previous version of ggplot)
devtools::install_github("hadley/[email protected]")
devtools::install_github("dkahle/ggmap")
Here I’m assuming that the data frame dat already includes columns lat and lon:
# add variables 'lat' and 'lon' to the data frame
dat$lat <- latitude
dat$lon <- longitude
Because some rows have missing values in the geographical coordinates, we can get rid of them with na.omit():
# let's get rid of rows with missing values
dat <- na.omit(dat)
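Keep in mind that na.omit() drops a row if any of its columns is missing, not only the coordinates. If you wanted to keep rows that are missing, say, the food description but do have a location, one option is complete.cases() restricted to the two coordinate columns. The sketch below uses a small made-up data frame toy (its values are invented for illustration and do not come from the real data).

```r
# toy data frame standing in for dat (values are made up)
toy <- data.frame(
  Applicant = c("A", "B", "C", "D"),
  optionaltext = c("Tacos", NA, "Burritos", "Coffee"),
  lat = c(37.76, 37.79, NA, 37.77),
  lon = c(-122.41, -122.40, NA, -122.39),
  stringsAsFactors = FALSE
)

# na.omit() removes rows B and C: an NA in any column drops the row
nrow(na.omit(toy))

# keep every row that has both coordinates, whatever the other columns hold
toy_geo <- toy[complete.cases(toy[, c("lat", "lon")]), ]
nrow(toy_geo)
```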
In order to plot a map with ggmap(), we need to define the region of the map via the function make_bbox():
# ggmap typically asks you for a zoom level,
# but we can try using ggmap's make_bbox function:
sbbox <- make_bbox(lon = dat$lon, lat = dat$lat, f = .1)
sbbox
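To get an intuition for what a bounding box with some padding looks like, here is a base-R sketch. It is an assumption that make_bbox() computes something along these lines (the range of each coordinate, extended by the fraction f), so treat this as an illustration of the idea and not as the function's actual implementation. The coordinates are made up.

```r
# made-up coordinates roughly in the San Francisco area
toy_lon <- c(-122.42, -122.40, -122.39)
toy_lat <- c(37.71, 37.77, 37.79)
f <- 0.1

# pad each range by the fraction f of its width
lon_range <- range(toy_lon, na.rm = TRUE)
lat_range <- range(toy_lat, na.rm = TRUE)
bbox_sketch <- c(
  left   = lon_range[1] - f * diff(lon_range),
  bottom = lat_range[1] - f * diff(lat_range),
  right  = lon_range[2] + f * diff(lon_range),
  top    = lat_range[2] + f * diff(lat_range)
)
bbox_sketch
```

A larger f leaves more margin around the outermost points.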
Now that you have the object sbbox, the next step is to get a map with get_map(). This function gets a map from Google by default.
# get a 'terrain' map
sf_map <- get_map(location = sbbox, maptype = "terrain", source = "google")
Having obtained the sf_map object, we can finally use ggmap() to plot some dots with our lat and lon coordinates:
ggmap(sf_map) +
geom_point(data = dat,
mapping = aes(x = lon, y = lat),
color = "red", alpha = 0.2, size = 1)
The data table dat contains a column optionaltext describing the types of food and meals served by the food trucks. Let’s take a look at the first 3 elements:
dat$optionaltext[1:3]
## [1] "Tacos, Burritos, Tortas, Quesadillas, Mexican Drinks, Aguas Frescas"
## [2] "Cold Truck: sandwiches, drinks, snacks, candy, hot coffee"
## [3] "Cold Truck: Pre-packaged Sandwiches, Various Beverages, Salads, Snacks"
Notice that the first element (i.e. the first food truck) prepares Tacos, Burritos, Tortas, Quesadillas, etc.
What if you want to identify all locations that have burritos? This is where regular expressions come in very handy. Again, always start small: select the first 10 elements of optionaltext
foods <- dat$optionaltext[1:10]
- Use str_detect() (or equivalently grepl()) to match "Burritos" and "burritos".
- If you use grepl(), you can use ignore.case = TRUE to match both.
- Try another pattern: e.g. "tacos", or "quesadillas".
- Now create a data frame burritos by subsetting (i.e. filtering) the data frame to get only those rows that match "burritos".
- Use the lat and lon coordinates in burritos to display a map of locations with burritos (see map below).
- Experiment with other types of foods.
- Challenge: try to use faceting to show one type of food per facet (e.g. one facet for burritos, another for quesadillas, another one for tacos, etc).
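As a starting point for the matching step, here is a minimal sketch on the three descriptions displayed earlier (the name toy_foods is made up so it does not overwrite your foods vector). In "stringr", case-insensitive matching is done by wrapping the pattern in regex() with ignore_case = TRUE.

```r
library(stringr)

# the three descriptions shown above
toy_foods <- c(
  "Tacos, Burritos, Tortas, Quesadillas, Mexican Drinks, Aguas Frescas",
  "Cold Truck: sandwiches, drinks, snacks, candy, hot coffee",
  "Cold Truck: Pre-packaged Sandwiches, Various Beverages, Salads, Snacks"
)

# regex(..., ignore_case = TRUE) is stringr's counterpart to ignore.case
has_burritos <- str_detect(toy_foods, regex("burritos", ignore_case = TRUE))
has_burritos

# the base-R version gives the same logical vector
has_burritos_base <- grepl("burritos", toy_foods, ignore.case = TRUE)

# a logical vector can be used directly for subsetting
toy_foods[str_detect(toy_foods, regex("sandwiches", ignore_case = TRUE))]
```

The same logical vector can be used to subset the rows of a data frame, which is one way to build the burritos data frame.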