Data Programming for the Social Sciences (DP4SS)

A gentle introduction to data programming
in R for social science audiences.

Jesse Lecy
&
Jamison Crawford

Attribution · NonCommercial · ShareAlike

GitHub

Source Code

This textbook is being developed by adapting lecture notes and resources from a graduate-level introductory course in data science offered at the Watts College of Public Service at Arizona State University.

Comments and suggestions are welcome!

CONTENTS:


Your Data Science Toolkit

You will need three tools to manage your data science projects: a data programming language (R), a project management interface (R Studio), and a way to create data-driven documents (R Markdown).

Core R

What is R? [ video ]
Packages

R Studio

Installing R and R Studio
Tour of R Studio

Data-Driven Docs

Automation & Flexibility
The Importance of Reproducibility
Formats link
Gallery link

Markdown

R Markdown Formats overview
- Headers and Chunks link
- Knitting link
- Customization

Getting Started

These are some useful resources and guides for learning how to program if you are new to R or data programming.

Starting to Code

RMD File Styles and Knitting Tips
Style Guides

Getting Help

Help files
Error messages
Discussion boards

The Learning Curve

Vocabulary and verbs
Learning to Learn R

Using R

Functions, variables, and operators are the core components of any functional programming language. These first chapters are foundational for everything moving forward.

R as a Calculator

Mathematical Operators
Objects
Assignment
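
As a minimal sketch of the three topics above, the lines below use R as a calculator and then store results in objects. The wage figures and object names are made up for illustration.

```r
# Arithmetic follows the usual order of operations
3 + 4 * 2        # multiplication happens before addition
2 ^ 3            # exponentiation
7 %/% 2          # integer division
7 %% 2           # remainder (modulo)

# Assignment stores a value in a named object for later use
hourly_wage  <- 15
hours_worked <- 40
weekly_pay   <- hourly_wage * hours_worked
weekly_pay
```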

Functions

Input-Output Devices
Arguments
Values
Returns
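
One way to picture a function as an input-output device is sketched below. The function `percent_change()` is hypothetical and written only for this example: arguments go in, a single value comes back.

```r
# A hypothetical helper: two required arguments and one optional argument
percent_change <- function(old, new, digits = 1) {
  change <- 100 * (new - old) / old   # intermediate value, local to the function
  round(change, digits)               # the last expression is the return value
}

# Arguments can be matched by name or by position
growth  <- percent_change(old = 50, new = 60)
decline <- percent_change(80, 60, digits = 0)
growth
decline
```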

Logical Operators

Logical operators
- equal
- not equal
- greater than or less than
- opposite of
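
The operators listed above can be tried on a small vector. This is an illustrative sketch; the `age` values are invented.

```r
age <- c(17, 25, 42, 65, 70)

is_65     <- age == 65    # equal
not_65    <- age != 65    # not equal
over_40   <- age > 40     # greater than
at_most25 <- age <= 25    # less than or equal to
not_over  <- !over_40     # opposite of: flips TRUE and FALSE

over_40
not_over
```

Each comparison returns a logical vector with one TRUE or FALSE per element of `age`.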

Special Operators

Unique values
Duplicates
Missing values (NA)
Maximum
Minimum
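
A short sketch of the base R functions that correspond to the topics above, using an invented vector that contains a repeat and a missing value:

```r
x <- c(3, 7, 3, NA, 10, 7)

uniq <- unique(x)        # each distinct value once (NA counts as a value)
dups <- duplicated(x)    # TRUE for the second and later copies of a value
miss <- is.na(x)         # TRUE where a value is missing

# max() and min() return NA if any value is missing, unless NAs are removed
max(x)
hi <- max(x, na.rm = TRUE)
lo <- min(x, na.rm = TRUE)
```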

One-Dimensional Datasets

Vectors are the building blocks of analysis in R. Vectors come in a variety of flavors; we cover the four most salient data types here: numeric, character, factor (categorical), and logical (boolean).

Vectors

Vector Types
- Numeric (v)
- Character (s)
- Factor (ordered vs unordered) (f)
- Logical (true/false) (L)
Checking vector types
- data class
- data mode
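
The four vector types can be built and inspected as in the sketch below. The object names follow the (v), (s), (f), (L) shorthand used in the list above; the values are invented.

```r
v <- c(1.5, 2, 10)                        # numeric
s <- c("ann", "bob", "cy")                # character
f <- factor(c("low", "high", "low"))      # unordered factor
L <- c(TRUE, FALSE, TRUE)                 # logical

# An ordered factor needs its levels listed from lowest to highest
f_ord <- factor(c("low", "high", "low"),
                levels = c("low", "high"), ordered = TRUE)

class(v); class(s); class(f); class(L)    # the data class of each vector
mode(f)                                   # factors are stored as integer codes
f_ord < "high"                            # comparisons work on ordered factors
```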

Converting Data Type

Casting
- explicit casting
- implicit casting (coercion)
Information loss
Care with factors
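
The sketch below illustrates explicit casting, implicit coercion, information loss, and the usual trap with factors. All values are invented for the example.

```r
# Explicit casting: text that is not a number becomes NA (information loss)
s <- c("10", "20", "thirty")
nums <- suppressWarnings(as.numeric(s))

# Implicit casting (coercion): mixing types silently converts to the most general type
mixed <- c(1, "a", TRUE)      # everything becomes character
two   <- TRUE + TRUE          # logicals are treated as 1 and 0 in arithmetic

# Care with factors: as.numeric() returns the level codes, not the labels
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)
values <- as.numeric(as.character(f))   # convert to character first to keep the labels
```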

Variable Transformations

Linear transformations
- vectorized functions
- recycling rules
Recoding values
- find and replace
- recoding factors
Floors and ceilings
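
A rough sketch of the transformations listed above. The data are invented, and "floors and ceilings" is interpreted here as capping values with `pmin()` and `pmax()`, which is an assumption about what the chapter covers.

```r
income <- c(25000, 48000, 103000, 76000)

# Vectorized functions operate on every element at once
income_k <- income / 1000

# Recycling: the shorter vector is reused to match the longer one
recycled <- c(1, 2, 3, 4) * c(10, 100)

# Recoding values: find and replace with a logical selector
status <- c("M", "S", "M", "D")
status[status == "M"] <- "Married"

# Recoding factors: rename the levels (levels are sorted alphabetically: "hi", "lo")
f <- factor(c("lo", "hi", "lo"))
levels(f) <- c("High", "Low")

# Ceilings and floors: cap values at an upper or lower bound
capped  <- pmin(income, 100000)
floored <- pmax(income, 30000)
```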

Two-Dimensional Datasets

Vectors typically represent individual variables in the social science context. A dataset contains IDs for individuals, and multiple measures from each individual. Typically data is organized so that columns represent distinct variables and rows represent individuals in the dataset. This spreadsheet representation of data is operationalized as data frames in R. Here you learn how to construct and manipulate data frames.

Dataframes

Creating data frames from vectors
- rows and columns
the $ operator
Checking and changing class types
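
As an illustration of the topics above, the sketch below builds a small data frame from three vectors. The variable names and values are hypothetical.

```r
id    <- 1:4
age   <- c(23, 35, 41, 29)
group <- c("treat", "control", "treat", "control")

# Each vector becomes a column; each position becomes a row
dat <- data.frame(id, age, group, stringsAsFactors = FALSE)
nrow(dat)
ncol(dat)

# The $ operator extracts one column as a vector
dat$age

# Checking and changing the class of a column
class(dat$group)
dat$group <- factor(dat$group)
class(dat$group)
```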

Dataframe Subsets

Filter rows and select columns
- the [] operator
- dplyr::filter and dplyr::select
Reorder rows or columns
- sort() versus order()
- dplyr::arrange
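
The base R versions of these operations can be sketched as follows, with the dplyr equivalents noted in comments. The data frame is invented.

```r
dat <- data.frame(name = c("Ana", "Ben", "Cai", "Dee"),
                  age  = c(34, 21, 45, 29),
                  stringsAsFactors = FALSE)

# Filter rows: a logical test goes before the comma
older <- dat[dat$age > 30, ]          # dplyr::filter( dat, age > 30 )

# Select columns: names or positions go after the comma
names_only <- dat[, "name", drop = FALSE]   # dplyr::select( dat, name )

# sort() returns the sorted values; order() returns the positions that would sort them
sort(dat$age)
order(dat$age)

# Use order() to reorder the rows of the whole data frame
sorted <- dat[order(dat$age), ]       # dplyr::arrange( dat, age )
```

The distinction matters because `sort()` only rearranges a single vector, while the positions from `order()` can be used to rearrange every column together.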

Dataframe Constructors

Building data objects:
- data.frame() vs cbind() and rbind()
Variable transformations in df's
- assignment inside a df: dat$x_squared <- dat$x * dat$x
- dplyr::mutate vs dplyr::transmute()
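
A small sketch of why the choice of constructor matters, using invented vectors. It uses base R only; the dplyr version of the last step is mentioned in a comment.

```r
x <- c(1, 2, 3)
y <- c("a", "b", "c")

# cbind() on vectors builds a matrix, so mixed types are coerced to character
m <- cbind(x, y)

# data.frame() keeps each column's own type
d <- data.frame(x, y, stringsAsFactors = FALSE)

# rbind() appends rows when the column names match
d2 <- rbind(d, data.frame(x = 4, y = "d", stringsAsFactors = FALSE))

# Variable transformation stored as a new column
d2$x_squared <- d2$x * d2$x      # dplyr::mutate( d2, x_squared = x * x )
```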

Matrices and Lists

Matrix
Lists
Conversions:
- matrix to df
- list to df
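
The conversions above might look like the following sketch. The object names and values are hypothetical.

```r
# Matrix: one data type, two dimensions
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
m_df <- as.data.frame(m)

# List: elements can differ in type; equal-length elements convert to columns
lst <- list(id = 1:3, score = c(90, 85, 77))
l_df <- as.data.frame(lst)

# A list of one-row data frames can be stacked with do.call() and rbind()
rows <- list(data.frame(id = 1, score = 90),
             data.frame(id = 2, score = 85))
stacked <- do.call(rbind, rows)
```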

Data IO

Data import and export [ input / output ].

Navigation

Working directories
- paths: Windows vs. Linux
- current working directory: getwd()
- change working directory: setwd()
- check files in directory: dir()
- create new folder: dir.create("name")
Unzip files: unzip("filename")
Delete files tutorial
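
The navigation functions above can be tried safely in a temporary folder, as in this sketch. The folder and file names are made up, and the original working directory is restored at the end.

```r
old_wd <- getwd()                               # current working directory

# file.path() builds paths with separators that work on Windows and Linux
project <- file.path(tempdir(), "dp4ss-demo")
dir.create(project, showWarnings = FALSE)       # create a new folder
setwd(project)                                  # change working directory

dir.create("data")
writeLines("hello", "notes.txt")
contents <- dir()                               # check files in the directory
contents

unlink("notes.txt")                             # delete a file
still_there <- file.exists("notes.txt")

setwd(old_wd)                                   # go back to where we started
```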

Built-In Datasets

Core R datasets
Datasets in packages
Packages that are data
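
As a brief illustration, the `mtcars` dataset ships with core R in the `datasets` package, so it can be used without importing anything. The sketch below also lists what else is available in that package.

```r
# Built-in datasets are available by name
head(mtcars)
dim(mtcars)

# List the datasets that a package provides
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])
```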

Importing Data into R

Read options
Copy and paste from Excel
Using rdata format
Read from csv or tsv
Read text files
Import from Excel
Import from common format (foreign package)
Import from the web (RCurl)
Import from GitHub
Import from DropBox
[ tutorial ]

Exporting Data

Write options
- CSV
- R Data Sets (RDS)
- CSV vs RDS
- Tables
- RData Format
- SPSS or Stata
Copy to Clipboard
Copy to Excel
[ tutorial ]
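
One practical difference between CSV and RDS can be seen in a round trip, sketched below with an invented data frame written to temporary files. CSV stores plain text, so a factor comes back as character; RDS stores the R object itself.

```r
dat <- data.frame(id = 1:3, group = factor(c("a", "b", "a")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Export
write.csv(dat, csv_path, row.names = FALSE)
saveRDS(dat, rds_path)

# Import
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
from_rds <- readRDS(rds_path)

class(from_csv$group)   # the factor was flattened to text
class(from_rds$group)   # the factor survived intact
```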

APIs

What is an API?
Examples
- Census
- Socrata
- Twitter
[ Demo with DataUSA API ]

Data Wrangling (dplyr)

Data wrangling is the process of preparing data for analysis, which includes reading data into R from a variety of formats, cleaning data, tidying datasets, creating subsets and filters, transforming variables, grouping data, and joining multiple datasets.

The goal of data wrangling is to create a rodeo dataset (clean and well-structured) that is ready for the big show (modeling and visualization)!

Slicing Datasets

Subset operator []
- by position
- by name
- by logical vector
- with recycling
Selector vectors
Subset by row
- dat[ row_selector , ]
- dplyr::filter( dat, row_selector )
Subset by column
- dat[ , column_selector ]
- dplyr::select( dat, column_selector )
Reorder
- with index
- order / match
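
The subset operator accepts several kinds of selector vectors. The sketch below shows each one on a small named vector with invented values, then uses `match()` to reorder.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

by_position <- x[c(1, 3)]            # numeric index
by_name     <- x[c("b", "d")]        # character index
by_logical  <- x[x > 15]             # logical vector, one TRUE/FALSE per element
by_recycle  <- x[c(TRUE, FALSE)]     # short logical vector is recycled: every other element

# Reorder to follow a target ordering: match() finds each target's position in x
target    <- c("d", "a", "c", "b")
reordered <- x[match(target, names(x))]
```

The same selectors work on rows and columns of a data frame, placed before or after the comma.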

Data Wrangling Recipes

Pipe operator
Window vs summary functions
dplyr cheat sheet

Combining Datasets

merge() and match()
join functions in dplyr: inner_join(), left_join(), right_join(), full_join()
inner, outer, right, left
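
The four join types can be sketched in base R with `merge()`, using two invented tables that share an `id` column. The last line shows `match()` as a lookup that pulls one column across.

```r
people <- data.frame(id = c(1, 2, 3, 4),
                     name = c("Ana", "Ben", "Cai", "Dee"),
                     stringsAsFactors = FALSE)
scores <- data.frame(id = c(2, 3, 5), score = c(88, 92, 75))

inner <- merge(people, scores, by = "id")                 # ids found in both tables
left  <- merge(people, scores, by = "id", all.x = TRUE)   # keep every row of people
right <- merge(people, scores, by = "id", all.y = TRUE)   # keep every row of scores
outer <- merge(people, scores, by = "id", all = TRUE)     # keep every row of both

# match() returns the row of scores that corresponds to each person (NA if none)
matched_score <- scores$score[match(people$id, scores$id)]
```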

Explore and Describe

The first step in the data science process is to get to know your data through descriptive analysis and exploratory analysis that searches for useful patterns or trends. We accomplish this here through summary statistics, and in a later section through visualization.

Summarizing Vectors

Counting things:
- sum( logical statement )
Counting missing data:
- sum( is.na(x) )
Categorical data:
- table( f1, f2 )
- prop.table() and margin.table()
Numeric data: min, max, mean, median, summary, quantile
- all vectors at once: summary( data.frame )
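
The summaries listed above can be sketched on a few invented vectors:

```r
age      <- c(23, 35, NA, 41, 29, 35)
sex      <- factor(c("f", "m", "f", "f", "m", "m"))
employed <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)

# Counting things: sum() of a logical statement counts the TRUEs
n_over_30 <- sum(age > 30, na.rm = TRUE)
n_missing <- sum(is.na(age))

# Categorical data: counts, then row proportions and row totals
tab <- table(sex, employed)
prop.table(tab, margin = 1)
margin.table(tab, 1)

# Numeric data
avg_age <- mean(age, na.rm = TRUE)
summary(age)
quantile(age, na.rm = TRUE)

# All vectors at once
summary(data.frame(age, sex, employed))
```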

Summarizing Groups of Vectors

table( f1, f2 )
ftable( row.vars=c("f1","f2"), col.vars="f3" )
Function over groups: tapply( v1, f1, FUN ) or dplyr::group_by() + summarise()
Functions over levels of numeric data: tapply( v1, cut( v2, breaks ), FUN )
tapply( v1, INDEX=list(f1,f2), FUN ) or dplyr::group_by() + summarise()
aggregate( dat, by=list(f1), FUN )
https://cran.r-project.org/web/packages/DescTools/vignettes/DescToolsCompanion.pdf
Relationship between v1 and v2: cor() or visually with pairs()
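
A sketch of the group summaries above in base R. The data frame, its column names, and the age breaks are all invented for the example.

```r
dat <- data.frame(
  income = c(30, 45, 52, 38, 61, 47),
  sex    = c("f", "m", "f", "m", "f", "m"),
  region = c("east", "east", "west", "west", "east", "west"),
  age    = c(22, 37, 45, 29, 58, 33)
)

# Function over the groups defined by one factor
by_sex <- tapply(dat$income, dat$sex, mean)

# Function over the groups defined by two factors
by_both <- tapply(dat$income, list(dat$sex, dat$region), mean)

# Function over levels of a numeric variable: cut() turns age into intervals
by_age <- tapply(dat$income, cut(dat$age, breaks = c(0, 30, 50, 100)), mean)

# aggregate() returns the same kind of summary as a data frame
agg <- aggregate(dat$income, by = list(sex = dat$sex), FUN = mean)

# Relationship between two numeric variables
r <- cor(dat$income, dat$age)
```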

Efficient Analysis With Groups

As you become proficient with descriptive analysis you will want to find ways to be more efficient. Unless you learn how to scale data exploration and modeling you will not be able to quickly identify patterns in your data. The most efficient way to scale your analysis is to understand the dimensionality or internal problem space in your data, and use apply functions in R to replicate analysis over many groups at once.

Groups

Logical statements
- define group criteria
- TRUE signifies membership
Group constructors
- from categorical variables
- from numeric variables
- from strings
- from missing values
Compound logical statements: AND and OR
Casting logical vectors

Group Structure

Combining factors and numeric data for analysis
Faceting in plots

Counting Group Members

Mathematical operators with logical vectors
- counts of members: sum( L1 )
- proportions of members: mean( L1 )
Conditional proportions
- subset then tabulate
- logical statement in numerator and denominator
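
Because TRUE counts as 1 and FALSE as 0, group membership can be counted with ordinary arithmetic. The sketch below uses two invented logical vectors and computes a conditional proportion both ways described above.

```r
female   <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
employed <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)

n_female    <- sum(female)      # count of members
prop_female <- mean(female)     # proportion of members

# Compound logical statements define groups with more than one criterion
n_both   <- sum(female & employed)    # AND: meets both criteria
n_either <- sum(female | employed)    # OR: meets at least one

# Conditional proportion: share of women who are employed
p_subset <- mean(employed[female])                   # subset, then tabulate
p_ratio  <- sum(female & employed) / sum(female)     # logical statements in numerator and denominator
```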

The Mathematics of Groups

Group structure
- generalizing logical statements
Group dimensionality
- how many unique groups are in the data?
- combinatorics of attributes
- total groups from f1 and f2 = nlevels(f1) * nlevels(f2)
Groups as problem spaces
- complexity theory
- search
- dimension reduction
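
The group-count arithmetic above can be checked on two small invented factors. Note that the product of the level counts is the number of possible groups; the number actually observed in the data may be smaller.

```r
f1 <- factor(c("a", "b", "a", "c"))
f2 <- factor(c("x", "y", "y", "x"))

# Possible groups: every combination of levels
possible <- nlevels(f1) * nlevels(f2)

# Observed groups: distinct combinations that appear in the data
observed <- nrow(unique(data.frame(f1, f2)))

# The contingency table has one cell per possible group; empty cells show as 0
table(f1, f2)
```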

Analysis with Groups

Contingency tables
- counts of members: f1 · f2
Statistics by group
- function applied over a group: v1 ~ f1 · f2
- apply() functions
- dplyr group_by() and summarize() functions

Latent Groups

clustering
unsupervised learning approaches

Visualization

For a great overview with examples of R code:

Wilke, C. O. (2019). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O'Reilly Media. FREE EBOOK

Principles of Visual Communication

Ground, figure, narrative (context, subject, action)
Tufte’s rules
Visual tragedies

Core Graphics Engine

plot() function
Arguments:
- plot point types
- colors
- size
- axis labels
- plot title

Customizing Graphics

Defining a canvas: xlim, ylim
Adding data
Type (point, line, both)
Symbols
Color
Size
Adding grids
Adding axes
Adding titles / axes labels
Adding data labels: text()
Margins
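
The steps above can be sketched as a plot built up layer by layer with core graphics. The data, labels, and colors are invented. The `pdf(NULL)` line is only there so the example runs without opening a window; in an interactive session it and `dev.off()` can be dropped.

```r
pdf(NULL)                                    # null graphics device for non-interactive runs

hours <- c(2, 4, 5, 7, 9)
score <- c(55, 62, 70, 81, 90)

par(mar = c(5, 5, 3, 1))                     # margins: bottom, left, top, right

# Define an empty canvas, then add each element separately
plot(hours, score, type = "n",
     xlim = c(0, 10), ylim = c(50, 100),
     axes = FALSE, xlab = "", ylab = "")
grid()                                                   # background grid
points(hours, score, type = "b",                         # "b" draws both points and lines
       pch = 19, col = "steelblue", cex = 1.5)           # symbol, color, size
axis(side = 1)                                           # x axis
axis(side = 2, las = 1)                                  # y axis with horizontal labels
title(main = "Study Time and Test Scores",
      xlab = "Hours studied", ylab = "Score")
text(hours, score, labels = score, pos = 3)              # data labels above each point

dev.off()
```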

Colors in R

select by name:
- pre-programmed palette
- popular packages
color theory
- value
- shade, tint, tone
- hue, saturation
- transparency
color values
- RGB codes vs Hex codes
color functions
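
A short sketch of how color names, RGB codes, hex codes, and transparency relate in base R. The choice of "steelblue" is arbitrary.

```r
# Select by name: colors() lists every named color R recognizes
head(colors())

# A named color expressed as RGB values on a 0-255 scale
rgb_values <- col2rgb("steelblue")
rgb_values

# The same color as a hex code
hex <- rgb(70, 130, 180, maxColorValue = 255)
hex

# Transparency: alpha.f scales opacity and adds an alpha channel to the hex code
faded <- adjustcolor("steelblue", alpha.f = 0.5)

# Color functions generate a palette of n colors
pal <- heat.colors(5)
```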

Advanced Plot Features

Custom fonts
Math symbols
Multiple plots (core graphics)
- incorrect: https://en.wikipedia.org/wiki/File:Smallmult.png#/media/File:Smallmult.png
Custom graph layouts

Grammar of Graphics and ggplot2

Grammar of graphics concept
ggplot overview

Animations

Dynamic Documents

R shiny

What makes documents dynamic?
Widgets
- input objects
- Widgets Gallery
Render functions
Reactive functions
[ tutorial ]

Dashboards in R

Principles of good dashboard design
Layouts
Sidebars
Value boxes
[ demo RMD ]

Customizing Styles

CSS: cascading style sheets
