Portfolio for Clarkson Fall 2022 DS 241
The following is a general description of each of the files in this portfolio along with what data they rely on and what skills they address. All data files relied on are found in the data_raw directory.
Each of the files in this section includes only exploratory data analysis and data cleaning/preparation; no modeling is done, and no evidence beyond simple visualizations is generated to support any conclusions. However, general observations are made from the resulting visualizations at several points.
- NYC Flights:
This was the first analysis we did in this class. It shows basics such as:
- loading R packages
- manipulating data with `dplyr::filter`
- making bar plots with `ggplot`, including `facet_wrap`
- constructing lists and using the `%in%` operator
- BOT: (Relies on BOT.zip)
This file builds on the NYC Flights analysis by taking data from the U.S. Bureau of Transportation Statistics. In addition to the skills from the NYC Flights analysis, it shows the following:
- reading data from a .csv file, with help from the here and janitor packages
- use of the `weight` aesthetic in `ggplot2::geom_bar` to count from a data column rather than the number of rows
- use of additional `dplyr` functions such as `mutate`, `select`, `group_by`, and `summarise`
- plotting `geom_histogram` and `geom_line` with `ggplot2`
- Flights Over Year: (Relies on all files in airline_data)
This file builds on the BOT analysis, specifically examining the departures from LaGuardia Airport. In addition to the skills from the BOT analysis, it shows the following:
- making more visually appealing plots with `ggplot2` by adding titles and axis labels
- use of the `rbind` function to combine rows from different dataframes (in this case, needed to combine data across years)
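As a minimal sketch of the `rbind` step, assuming two yearly extracts with identical columns: the data frames, column names, and counts below are made up for illustration and are not taken from the files in airline_data.

```r
# Hypothetical yearly extracts; the real files have many more columns and rows.
flights_2019 <- data.frame(year = 2019, origin = c("LGA", "JFK"), departures = c(120, 95))
flights_2020 <- data.frame(year = 2020, origin = c("LGA", "JFK"), departures = c(60, 40))

# rbind() stacks rows; both data frames must share the same column names.
flights_all <- rbind(flights_2019, flights_2020)

# Keep only the LaGuardia rows across both years.
lga <- flights_all[flights_all$origin == "LGA", ]
```

Because `rbind` matches data frame columns by name, it is worth cleaning the column names of each yearly file the same way before combining them.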
For the rest of the portfolio, the five main `dplyr` functions (`filter`, `mutate`, `select`, `group_by`, and `summarise`) are used fluently when needed.
- MA 132 Enrollment: (Relies on clarkson_math_enrollment.csv)
This file uses enrollment data from the Clarkson University mathematics department to build a model to predict enrollment in MA 132 (Calculus II) in the Spring 2022 semester. It demonstrates the following skills:
- Basic R Skills
  - use of the `dplyr::distinct` function
  - use of `startsWith`, `substr`, `strtoi`, and `nchar` to work with strings
  - use of `cbind` to join dataframes
  - plotting `geom_point` with `ggplot2` and using `geom_smooth(method = lm)` to draw a regression line over it
  - use of `lm` for linear and multiple regression
- Data Science Process
  - Generating a model based on understanding of enrollment patterns (e.g., many Calculus I students in the fall will enroll in Calculus II in the spring).
  - Revising this model to include additional real-world influences and determine a better fit for the data.
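The modeling idea can be sketched with base R's `lm`. The enrollment counts below are made up for illustration (they are not the Clarkson data), and the second predictor, `calc2_fall`, is a hypothetical stand-in for whichever additional influences the revised model actually uses.

```r
# Made-up enrollment counts, for illustration only.
enroll <- data.frame(
  calc1_fall   = c(400, 420, 390, 450, 430),
  calc2_fall   = c(120, 130, 110, 140, 125),
  calc2_spring = c(310, 330, 300, 355, 335)
)

# First model: spring Calculus II enrollment explained by fall Calculus I enrollment.
fit_simple <- lm(calc2_spring ~ calc1_fall, data = enroll)

# Revised model: add a second (hypothetical) predictor, giving a multiple regression.
fit_multi <- lm(calc2_spring ~ calc1_fall + calc2_fall, data = enroll)

# Predict spring enrollment for a hypothetical fall semester.
new_fall <- data.frame(calc1_fall = 440, calc2_fall = 135)
pred_simple <- predict(fit_simple, newdata = new_fall)
pred_multi  <- predict(fit_multi, newdata = new_fall)
```

Calling `summary(fit_simple)` and `summary(fit_multi)` shows the coefficients and fit statistics for each model, which is one way to judge whether the revision improved the fit.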
The following files demonstrate the beginnings of spatial analysis. I say "beginnings" because we didn't necessarily work with spatial data in an authentic way (see the next section), but we still made spatial considerations.
- Denny's / La Quinta Lab 4: (Relies on states.csv)
This file demonstrates many of the same skills as in the first section (data cleaning, exploratory analysis).
- One important skill to note is the use of the `case_when` function combined with `dplyr::mutate`.
- Denny's / La Quinta Lab 5: (Relies on FastFoodRestaurants.csv)
This file pursues further analysis on the spatial questions regarding the proximity of Denny's and La Quinta, and demonstrates the following skills:
- use of `mean` and `median` to compute basic summary statistics
- use of `full_join` to join two dataframes
- definition and use of the Haversine distance function to compute the distance between two points on the Earth given latitude and longitude coordinates
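For reference, a Haversine distance function can be sketched in base R as follows. This is a generic version, not necessarily the exact definition used in the lab; the function name, argument order, and the 6371 km mean Earth radius are assumptions.

```r
# Haversine great-circle distance in kilometers.
# Inputs are in decimal degrees; radius_km defaults to the mean Earth radius.
haversine <- function(long1, lat1, long2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  d_lat  <- (lat2 - lat1) * to_rad
  d_long <- (long2 - long1) * to_rad
  a <- sin(d_lat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(d_long / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# One degree of latitude along a meridian, starting at the equator.
one_degree_km <- haversine(0, 0, 0, 1)
```

The function is vectorized, so it can be applied to whole columns of coordinates, for example after joining every Denny's location to every La Quinta location.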
Analysis of the Washington D.C. Capital Bikeshare data provides a good summary of all of the skills described previously. The final part of this analysis is not included in the portfolio because it is the final project for the course; see the repository here. The data file used for these analyses contains all Capital Bikeshare rides from September 2022.
All files rely on 202209-capitalbikeshare-tripdata.zip.
- Bikeshare General:
In addition to using many of the skills described in the above section to explore the bikeshare dataset, this file also shows the following:
- Basic R Skills
  - use of `ggplot2::geom_violin` / `ggplot2::geom_density` to illustrate/compare two distributions
  - use of `ggplot2::geom_step` to create a stairstep plot, focusing exactly on when changes occur
  - use of additional `dplyr` and `tidyr` functions such as `pivot_longer`, `slice_sample`, `arrange`, and `rename`
  - significant work with time data, including use of functions such as:
    - `as.POSIXct`
    - `%within%`
    - `interval`
    - `ggplot2::scale_x_datetime`
- Data Science Process
  - Developing an algorithm to count the number of bikes out over time, given that each row in the dataset represents one ride and has a start and end time.
  - Starting to develop this algorithm on a small sample of the dataset for working efficiency, then extending it to the whole dataset once it works.
  - Visualizing the result of the algorithm, verifying that it makes logical sense, and making some observations based on the result.
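One common way to count bikes out over time is to treat each ride as two events, +1 at its start and -1 at its end, and take a running total. The base R sketch below illustrates that idea and is not necessarily the exact algorithm in the file; the three rides are made up, and the column names `started_at` and `ended_at` are assumptions.

```r
# Hypothetical sample of rides: one row per ride, with start and end times.
rides <- data.frame(
  started_at = as.POSIXct(c("2022-09-01 08:00:00", "2022-09-01 08:05:00",
                            "2022-09-01 08:10:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2022-09-01 08:20:00", "2022-09-01 08:15:00",
                            "2022-09-01 08:30:00"), tz = "UTC")
)

# Reshape to one row per event: each ride adds a bike at its start
# and removes one at its end.
events <- data.frame(
  time   = c(rides$started_at, rides$ended_at),
  change = c(rep(1, nrow(rides)), rep(-1, nrow(rides)))
)

# Sort events chronologically; the running total is the number of bikes out.
events <- events[order(events$time), ]
events$bikes_out <- cumsum(events$change)
```

Plotting `bikes_out` against `time` as a step plot shows the count changing exactly at each event. Developing the logic on a small sample like this makes it easy to check by hand before running it on the full month of rides.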
- Bikeshare w/ Census Data:
This file builds on the general bikeshare analysis by loading geographic data on the U.S. census tracts in Washington D.C. and creating spatial visualizations based on those:
- Basic R Skills
  - use of `get_acs` from the `tidycensus` package to load census data
  - use of functions in the `sf` package to manipulate spatial data, such as `st_as_sf`, `st_intersects`, `st_crs`, `st_transform`
  - use of functions in the `tmap` package to visualize spatial data, such as `tmap_mode`, `tm_shape`, `tm_polygons`
- Data Science Process
  - Identifying outliers in a dataset based on a visualization, recognizing that they hurt the effectiveness of the visualization, and removing them accordingly. (In this case, the outlier was the "mall" region of D.C.)
- Bikeshare w/ LODES Data:
This file builds on the census tract-level bikeshare spatial analysis by incorporating LEHD Origin-Destination Employment Statistics (LODES) data from the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) program.
- use of the `lehdr::grab_lodes` function to get LODES data
- use of `tidyr::pivot_wider` and `left_join` to restructure data frames
- Bikeshare w/ Weather Data:
This file uses a slightly different bikeshare dataset, one from 2011/2012 that is baked into the `dsbox` (Data Science in a Box) package, to model the effect of weather patterns on bikeshare usage.
- Basic R Skills
  - use of the `fct_relevel` function to manually reorder factor levels
  - use of the `ifelse` function to evaluate conditions when cleaning data
  - use of regression functions: `linear_reg`, `set_engine("lm")`, `fit` (different from the functions used in the MA 132 Enrollment Analysis, but produce the same results)
- Data Science Process
  - Clean data to get desired variables before modeling.
  - Visualize data before modeling to get a general sense of the situation.
  - Develop a series of models by adding and removing certain variables.
  - Interpret the meaning of the slope/intercept parameters and the R^2 values of a model, and make observations about the data based on these interpretations.
  - Examine R^2 values to compare the effectiveness of different models.
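The model-comparison step can be sketched with base R's `lm`, which fits the same kind of linear model as the `linear_reg` / `set_engine("lm")` / `fit` pipeline. The built-in `mtcars` dataset stands in for the bikeshare data here, so the variables and models are illustrative assumptions and not the ones used in the file.

```r
# mtcars ships with R and is used only as a stand-in dataset.
cars <- mtcars

# ifelse() recodes a 0/1 column into readable labels during cleaning.
cars$transmission <- ifelse(cars$am == 1, "manual", "automatic")

# A series of models: start with one predictor, then add another.
fit_1 <- lm(mpg ~ wt, data = cars)
fit_2 <- lm(mpg ~ wt + transmission, data = cars)

# Collect R^2 and adjusted R^2 for a side-by-side comparison.
r2 <- c(wt_only = summary(fit_1)$r.squared,
        wt_plus_transmission = summary(fit_2)$r.squared)
adj_r2 <- c(wt_only = summary(fit_1)$adj.r.squared,
            wt_plus_transmission = summary(fit_2)$adj.r.squared)
```

R^2 never decreases when a predictor is added to a least-squares model, so adjusted R^2 is the fairer number to compare when the models have different numbers of variables.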