title | author | output | ||||
---|---|---|---|---|---|---|
Data Cleaning |
Jill, Lauren, Rush, Will |
|
This file contains the code to clean the raw data in the file catalog sales data.csv
and generate two data files called clean_train.csv
and clean_test.csv
containing the preprocessed training and testing data sets respectively.
library(dplyr)
library(lubridate)
sales <- read.csv("catalog sales data.csv", stringsAsFactors = FALSE)
Adding an ID column to identify each customer uniquely
sales <- mutate(sales, id = as.integer(row.names(sales)))
Converting the date variables into correct format
sales <- mutate(sales, datead6 = mdy(datead6), datelp6 = mdy(datelp6))
As described in the problem statement, "If you run a histogram or frequency distribution of the datelp6 variable among only those with targdol > 0 you will see that, for the most part, datelp6 equals one of two distinct dates in the calendar year. It is as if the person who prepared the data did some strange rounding within six-month bins."
We handled this situation by creating a new variable as follows.
A variable lpurseason
which is 'fall' if the last day of purchase is after June and 'spring' otherwise.
sales <- sales %>%
mutate(lpurseason = ifelse(month(sales$datelp6)>6, 'fall', 'spring')) %>%
mutate(lpurseason = factor(lpurseason))
Again, as described in the problem statement, "there are also other inconsistencies in the data, e.g., falord + sprord
is not equal to ordhist
in about 9% of the cases Similarly, the year of latest purchase obtained from lpuryear
variable and from datelp6
variable do not always agree. Some of these errors result because when two variables measure the same thing, both are not updated". We will remove these inconsistencies here.
Updating NA values of lpuryear
with the year parsed from datelp6
. For the non NA values, first convert datelp6
back to normal 4-digit years for our convenience and compare it with datelp6_year
, and update one of them with the larger one.
sales$datelp6_year <- year(sales$datelp6)
sales[!is.na(sales$lpuryear),]$lpuryear <- ifelse(sales[!is.na(sales$lpuryear),]$lpuryear<3,as.integer(paste0('201',as.character(sales[!is.na(sales$lpuryear),]$lpuryear))),as.integer(paste0('200',as.character(sales[!is.na(sales$lpuryear),]$lpuryear))))
sales[is.na(sales$lpuryear),]$lpuryear <- sales[is.na(sales$lpuryear),]$datelp6_year
sales[sales$lpuryear < sales$datelp6,]$lpuryear <- sales[sales$lpuryear < sales$datelp6,]$datelp6_year
sales <- select(sales, -datelp6_year)
If the sum of falord
and sprord
are greater than ordhist
, update ordhist
. Otherwise, update either falord
or sprord
with respect to lpurseason
,
sales <- sales %>%
mutate(ordhist = ifelse(ordhist<falord+sprord, falord+sprord, ordhist)) %>%
mutate(sprord = ifelse(ordhist>falord+sprord & lpurseason == 'spring',
ordhist-falord, sprord)) %>%
mutate(falord = ifelse(ordhist>falord+sprord & lpurseason == 'fall',
ordhist-sprord, falord))
A variable responded
which indicates whether the customer responded or not. responded
= 1 if and only if targdol
> 0 else it is 0.
sales <- sales %>%
mutate(responded = as.integer(targdol>0)) %>%
mutate(responded = factor(responded))
Adding recency with bin sizes of 6 months.
sales$recency_bin <- 2*(2012 - sales$lpuryear)
sales[sales$lpurseason == "spring",]$recency_bin <- 2*(2012 - sales[sales$lpurseason == "spring",]$lpuryear) + 1
Adding customer lifetime and recency by calculating days from date
sales <- sales %>%
mutate(recency = as.integer(today() - datelp6)) %>%
mutate(lifetime = as.integer(today() - datead6))
Adding active percentage based on lifetime and recency. This variable is the proportion of the lifetime of a customer for which he was actively making purchases.
sales <- mutate(sales, active = (lifetime - recency)/lifetime)
Calculating average spend per order. If both slshist and ordhist are 0, NAs will be produced. We will replace those NAs by 0.
sales <- mutate(sales, avg_amount = slshist/ordhist) %>%
mutate(avg_amount = ifelse(is.na(avg_amount), 0, avg_amount))
Creating a boolean to indicate whether the average amount spent by a customer is greater than the mean average amount spent by all customers.
sales <- mutate(sales, large_avg = ifelse(avg_amount>mean(avg_amount), 1, 0))
Adding variable for season year
sales$lpurseason_year <- paste(sales$lpurseason, sales$lpuryear, sep=" ")
A boolean variable to indicate whether the customer made a purchase within the last three years.
sales <- mutate(sales, pur3yr = as.integer(year(datelp6)>=2010))
A boolean variable to indicate whether the customer spend more this year as compared to last year.
sales <- mutate(sales, slscmp = as.integer(slstyr>slslyr))
We think there might some correlation between the sum of the sales in the last few years with the total number of sales to date, i.e. slstyr + slslyr + sls2ago + sls3ago
might be correlated with slshist
.
with(sales, cor(slstyr + slslyr + sls2ago + sls3ago, slshist))
[1] 0.7071078
with(sales, cor(ordtyr + ordlyr + ord2ago + ord3ago, ordhist))
[1] 0.6081882
These correlations become significantly higher if we only consider the data where the customer responded
with(filter(sales, responded == 1), cor(slstyr + slslyr + sls2ago + sls3ago, slshist))
[1] 0.802798
with(filter(sales, responded == 1), cor(ordtyr + ordlyr + ord2ago + ord3ago, ordhist))
[1] 0.7543666
We can simply remove the ordhist
and slshist
variables. But we would be loosing some information if we do that. To remove multicollinearity and retain information, we add two more variables called sls4bfr
and ord4bfr
which are the total sales dollars and number of orders before 4 years. These can be obtained by subtracting the sum slstyr + slslyr + sls2ago + sls3ago
from slshist
and similarly for number of orders. Now we can remove the ordhist
and slshist
variables at the time of model building.
sales <- sales %>%
mutate(sls4bfr = slshist - slstyr - slslyr - sls2ago - sls3ago) %>%
mutate(ord4bfr = ordhist - ordtyr - ordlyr - ord2ago - ord3ago)
train <- filter(sales, train == 1)
test <- filter(sales, train == 0)
write.csv(train, file = "clean_train.csv", row.names=FALSE)
write.csv(test, file = "clean_test.csv", row.names=FALSE)