-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathdata_cleaning.Rmd
202 lines (140 loc) · 7.11 KB
/
data_cleaning.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
---
title: "Data Cleaning"
author: "Jill, Lauren, Rush, Will"
output:
html_document:
keep_md: TRUE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = "")
knitr::opts_chunk$set(fig.align = 'center')
knitr::opts_chunk$set(cache = TRUE)
```
This file contains the code to clean the raw data in the file `catalog sales data.csv` and generate two data files called `clean_train.csv` and `clean_test.csv` containing the preprocessed training and testing data sets respectively.
## Loading required libraries and reading the data
```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(lubridate)
sales <- read.csv("catalog sales data.csv", stringsAsFactors = FALSE)
```
## Data cleaning
#### ID column
Adding an ID column to identify each customer uniquely
```{r}
sales <- mutate(sales, id = as.integer(row.names(sales)))
```
#### Date variables
Converting the date variables into correct format
```{r}
sales <- mutate(sales, datead6 = mdy(datead6), datelp6 = mdy(datelp6))
```
### Removing inconsistencies in the data
As described in the problem statement, "If you run a histogram or frequency distribution of the datelp6 variable among only those with targdol > 0 you will see that, for the most part, datelp6 equals one of two distinct dates in the calendar year. It is as if the person who prepared the data did some strange rounding within six-month bins."
We handled this situation by creating a new variable as follows.
#### lpurseason
A variable `lpurseason` which is 'fall' if the last day of purchase is after June and 'spring' otherwise.
```{r}
sales <- sales %>%
mutate(lpurseason = ifelse(month(sales$datelp6)>6, 'fall', 'spring')) %>%
mutate(lpurseason = factor(lpurseason))
```
Again, as described in the problem statement, "there are also other inconsistencies in the data, e.g., `falord + sprord` is not equal to `ordhist` in about 9% of the cases Similarly, the year of latest purchase obtained from `lpuryear` variable and from `datelp6` variable do not always agree. Some of these errors result because when two variables measure the same thing, both are not updated". We will remove these inconsistencies here.
#### datelp6 and lpuryear
Updating NA values of `lpuryear` with the year parsed from `datelp6`. For the non NA values, first convert `datelp6` back to normal 4-digit years for our convenience and compare it with `datelp6_year`, and update one of them with the larger one.
```{r}
sales$datelp6_year <- year(sales$datelp6)
sales[!is.na(sales$lpuryear),]$lpuryear <- ifelse(sales[!is.na(sales$lpuryear),]$lpuryear<3,as.integer(paste0('201',as.character(sales[!is.na(sales$lpuryear),]$lpuryear))),as.integer(paste0('200',as.character(sales[!is.na(sales$lpuryear),]$lpuryear))))
sales[is.na(sales$lpuryear),]$lpuryear <- sales[is.na(sales$lpuryear),]$datelp6_year
sales[sales$lpuryear < sales$datelp6,]$lpuryear <- sales[sales$lpuryear < sales$datelp6,]$datelp6_year
sales <- select(sales, -datelp6_year)
```
#### falord, sprord and ordhist
If the sum of `falord` and `sprord` are greater than `ordhist`, update `ordhist`. Otherwise, update either `falord` or `sprord` with respect to `lpurseason`,
```{r}
sales <- sales %>%
mutate(ordhist = ifelse(ordhist<falord+sprord, falord+sprord, ordhist)) %>%
mutate(sprord = ifelse(ordhist>falord+sprord & lpurseason == 'spring',
ordhist-falord, sprord)) %>%
mutate(falord = ifelse(ordhist>falord+sprord & lpurseason == 'fall',
ordhist-sprord, falord))
```
### Creating new variables
#### Responded
A variable `responded` which indicates whether the customer responded or not. `responded` = 1 if and only if `targdol` > 0 else it is 0.
```{r}
sales <- sales %>%
mutate(responded = as.integer(targdol>0)) %>%
mutate(responded = factor(responded))
```
#### recency bin
Adding recency with bin sizes of 6 months.
```{r}
sales$recency_bin <- 2*(2012 - sales$lpuryear)
sales[sales$lpurseason == "spring",]$recency_bin <- 2*(2012 - sales[sales$lpurseason == "spring",]$lpuryear) + 1
```
#### lifetime and recency
Adding customer lifetime and recency by calculating days from date
```{r}
sales <- sales %>%
mutate(recency = as.integer(today() - datelp6)) %>%
mutate(lifetime = as.integer(today() - datead6))
```
#### active
Adding active percentage based on lifetime and recency. This variable is the proportion of the lifetime of a customer for which he was actively making purchases.
```{r}
sales <- mutate(sales, active = (lifetime - recency)/lifetime)
```
#### avg_amount
Calculating average spend per order. If both slshist and ordhist are 0, NAs will be produced. We will replace those NAs by 0.
```{r}
sales <- mutate(sales, avg_amount = slshist/ordhist) %>%
mutate(avg_amount = ifelse(is.na(avg_amount), 0, avg_amount))
```
#### large_avg
Creating a boolean to indicate whether the average amount spent by a customer is greater than the mean average amount spent by all customers.
```{r}
sales <- mutate(sales, large_avg = ifelse(avg_amount>mean(avg_amount), 1, 0))
```
#### lpurseason_year
Adding variable for season year
```{r}
sales$lpurseason_year <- paste(sales$lpurseason, sales$lpuryear, sep=" ")
```
#### pur3yr
A boolean variable to indicate whether the customer made a purchase within the last three years.
```{r}
sales <- mutate(sales, pur3yr = as.integer(year(datelp6)>=2010))
```
#### slscmp
A boolean variable to indicate whether the customer spend more this year as compared to last year.
```{r}
sales <- mutate(sales, slscmp = as.integer(slstyr>slslyr))
```
#### sls4bfr and ord4bfr
We think there might some correlation between the sum of the sales in the last few years with the total number of sales to date, i.e. `slstyr + slslyr + sls2ago + sls3ago` might be correlated with `slshist`.
```{r}
with(sales, cor(slstyr + slslyr + sls2ago + sls3ago, slshist))
with(sales, cor(ordtyr + ordlyr + ord2ago + ord3ago, ordhist))
```
These correlations become significantly higher if we only consider the data where the customer responded
```{r}
with(filter(sales, responded == 1), cor(slstyr + slslyr + sls2ago + sls3ago, slshist))
with(filter(sales, responded == 1), cor(ordtyr + ordlyr + ord2ago + ord3ago, ordhist))
```
We can simply remove the `ordhist` and `slshist` variables. But we would be loosing some information if we do that. To remove multicollinearity and retain information, we add two more variables called `sls4bfr` and `ord4bfr` which are the total sales dollars and number of orders before 4 years. These can be obtained by subtracting the sum `slstyr + slslyr + sls2ago + sls3ago` from `slshist` and similarly for number of orders. Now we can remove the `ordhist` and `slshist` variables at the time of model building.
```{r}
sales <- sales %>%
mutate(sls4bfr = slshist - slstyr - slslyr - sls2ago - sls3ago) %>%
mutate(ord4bfr = ordhist - ordtyr - ordlyr - ord2ago - ord3ago)
```
## Dividing the data into training and test set
```{r}
train <- filter(sales, train == 1)
test <- filter(sales, train == 0)
```
## Writing out clean data
```{r}
write.csv(train, file = "clean_train.csv", row.names=FALSE)
write.csv(test, file = "clean_test.csv", row.names=FALSE)
```