---
title: "KDD Logistic Regression"
output: html_notebook
---
# Loading data
Load the loan data, keep the columns used in the analysis, and count the missing values in each of them.
```{r}
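library(dplyr)
#rm(list=ls())
# Read the loan data; cells holding a single space are treated as missing.
data = read.csv("~/R/KDD/loan.csv", na.strings = " ")
dim(data)
names(data)
# Look up the column position of "term".
match("term", names(data))
# Keep the columns used in the analysis, selected by position.
data1 = data[, c(17, 3, 7, 9, 12, 13, 14, 6)]
# Count the missing values in each selected column.
sapply(data1, function(x) sum(is.na(x)))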
```
```{r}
dat1 = data1 %>% filter(!is.na(annual_inc) , !(home_ownership %in% c('NONE' , 'ANY')) , emp_length != 'n/a')
```
We want to convert `loan_status` to a binary variable (1 for default, 0 for non-default), but it has 10 different levels. Loans with status Current, and every other status not listed below, are removed. Therefore, we create a new variable called `loan_outcome`, where
`loan_outcome` = 1 if `loan_status` is 'Charged Off', 'Default', 'Late (16-30 days)', 'Late (31-120 days)', 'In Grace Period' or 'Does not meet the credit policy. Status:Charged Off', and `loan_outcome` = 0 if `loan_status` is 'Fully Paid'.
```{r}
# Statuses counted as default (1); "Fully Paid" is 0; everything else is marked "Other" and dropped.
default_statuses <- c("Charged Off", "Default", "Late (16-30 days)", "Late (31-120 days)", "Does not meet the credit policy. Status:Charged Off", "In Grace Period")
dat1$loan_outcome <- ifelse(dat1$loan_status %in% default_statuses, 1, ifelse(dat1$loan_status == "Fully Paid", 0, "Other"))
dat1 <- dat1 %>% filter(loan_outcome != "Other")
barplot(table(dat1$loan_outcome) , col = 'lightblue')
```
```{r}
library(ggplot2)
dat1 = dat1 %>%
  select(-loan_status) %>%
  filter(loan_outcome %in% c(0 , 1))
#data1$loan_outcome
dim(dat1)
names(dat1)
ggplot(dat1 , aes(x = grade , y = int_rate , fill = grade)) +
  geom_boxplot() +
  labs(y = 'Interest Rate' , x = 'Grade')
```
# Split dataset
```{r}
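# ifelse() returned character values ("0"/"1"); convert them to numeric for glm().
dat1$loan_outcome = as.numeric(dat1$loan_outcome)
# Draw 75% of the rows at random for training; the remaining rows form the test set.
n = nrow(dat1)
idx = sample(n , floor(0.75 * n) , replace = FALSE)
trainset = dat1[idx , ]
testset = dat1[-idx , ]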
```
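The split above is a simple random split. If defaults are rare, a stratified split keeps the default rate the same in both sets. The chunk below is only a sketch of that alternative on made-up data; `toy_df`, `toy_idx`, `toy_train` and `toy_test` are hypothetical names that are not used anywhere else in this notebook.

```{r}
# Illustrative toy data, not the loan data: 80 non-defaults and 20 defaults.
set.seed(1)  # arbitrary seed so the sketch is repeatable
toy_df <- data.frame(id = 1:100, outcome = rep(c(0, 1), times = c(80, 20)))

# Sample 75% of the row indices separately within each outcome class.
toy_idx <- unlist(lapply(
  split(seq_len(nrow(toy_df)), toy_df$outcome),
  function(rows) rows[sample.int(length(rows), floor(0.75 * length(rows)))]
))

toy_train <- toy_df[toy_idx, ]
toy_test  <- toy_df[-toy_idx, ]

# Both sets keep the original share of defaults.
prop.table(table(toy_train$outcome))
prop.table(table(toy_test$outcome))
```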
# Fit logistic regression
```{r}
# Logistic regression of loan_outcome on all remaining columns of the training set.
glm.model = glm(loan_outcome ~ . , data = trainset , family = binomial)
summary(glm.model)
```
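The coefficients reported by `summary()` are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to read. The chunk below sketches this on simulated data rather than on the loan data; all `sim_` names and the true coefficients (-1 and 0.8) are assumptions made up for the illustration.

```{r}
# Illustrative simulated data, not the loan data.
set.seed(42)  # arbitrary seed so the sketch is repeatable
sim_n <- 2000
sim_x <- rnorm(sim_n)                 # hypothetical standardised predictor
sim_p <- plogis(-1 + 0.8 * sim_x)     # true model: log-odds = -1 + 0.8 * x
sim_y <- rbinom(sim_n, size = 1, prob = sim_p)

sim_fit <- glm(sim_y ~ sim_x, family = binomial)

# Coefficients are log-odds; exp() converts them to odds ratios.
sim_or <- exp(coef(sim_fit))
sim_or

# Wald 95% confidence intervals, also on the odds-ratio scale.
exp(confint.default(sim_fit))
```

An odds ratio above 1 for a predictor means the odds of the outcome rise as that predictor increases, holding the others fixed. The same two calls could be applied to `glm.model`.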
# Performance of GLM:
```{r}
# Predicted default probabilities on the test set.
glm.pred1 = predict(glm.model , testset , type = "response")
length(glm.pred1)
# Classify a loan as default (1) when its predicted probability exceeds 0.5.
glm.probs1 = ifelse(glm.pred1 > 0.5 , 1 , 0)
# Rows are the actual outcomes, columns the predicted classes.
confusion_matrix_50_1 = table(testset$loan_outcome, glm.probs1)
confusion_matrix_50_1
# Accuracy: share of test loans classified correctly.
mean(glm.probs1==testset$loan_outcome)
ggplot(data.frame(glm.pred1) , aes(glm.pred1)) + geom_density(fill = 'lightblue' , alpha = 0.4) +labs(x = 'Predicted Probabilities on test set')
```
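Accuracy at a single 0.5 cutoff can hide how many defaults are missed, especially if defaults are the minority class. The chunk below sketches how sensitivity and specificity change with the cutoff, using a small hand-made example rather than the test set; `toy_truth`, `toy_prob` and `toy_metrics` are hypothetical names introduced only for this illustration.

```{r}
# Illustrative toy values, not model output: 3 defaults and 7 non-defaults.
toy_truth <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
toy_prob  <- c(0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05)

# Accuracy, sensitivity and specificity at a given probability cutoff.
toy_metrics <- function(truth, prob, threshold) {
  pred <- as.integer(prob > threshold)
  tp <- sum(pred == 1 & truth == 1)  # defaults correctly flagged
  fp <- sum(pred == 1 & truth == 0)  # non-defaults wrongly flagged
  fn <- sum(pred == 0 & truth == 1)  # defaults missed
  tn <- sum(pred == 0 & truth == 0)  # non-defaults correctly passed
  c(threshold   = threshold,
    accuracy    = (tp + tn) / length(truth),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# One row per cutoff.
toy_results <- t(sapply(c(0.25, 0.5), toy_metrics,
                        truth = toy_truth, prob = toy_prob))
toy_results
```

In this toy example both cutoffs give the same accuracy, but the lower cutoff catches every default at the cost of more false alarms. The same function could be called with `testset$loan_outcome` and `glm.pred1`.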