-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path10_load_policy_data.Rmd
106 lines (71 loc) · 2.47 KB
/
10_load_policy_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
---
title: "Life Policy Book Analysis: 10 - Load Data"
author: "Mick Cooney"
date: "29 October 2016"
output:
html_document:
toc: true
number_sections: true
fig_caption: yes
theme: cerulean
pdf_document: default
---
<!--
(Title:) Life Policy Book Anaysis: 10 - Load Data
Author: Mick Cooney
Date: 2016
Abstract: This worksheet deals with the basic loading of data for the
book of life insurance data
Keywords: life insurance, insurance, data generation
-->
```{r knit_opts, include = FALSE}
knitr::opts_chunk$set(tidy = FALSE, cache = FALSE)
library(tidyverse)
library(data.table)
library(dtplyr)
library(feather)
options(width = 80L)
source("custom_functions.R")
set.seed(42)
```
# Load Input Data
This worksheet loads up the data of the book of life insurance
policies and some very basic data exploration.
```{r setup_params, echo=TRUE}
data_snapshot_date <- as.Date('2015-12-31')
```
With the original dataset
```{r load_data, echo=TRUE}
lifebook_dt <- read_feather("data/lifeinsurance_book.feather")
survprot_dt <- lifebook_dt %>%
filter(prod_type == 'protection'
,prem_freq == 12
,prem_type == 'RP'
)
setDT(survprot_dt)
```
# Create Derived Values
Now that we have loaded the data we do a bit of feature engineering to calculate a few new and useful variables.
```{r do_feature_engineering, echo=TRUE, results='hide'}
survprot_dt <- survprot_dt %>%
mutate(age_at_data_snapshot = as.numeric((data_snapshot_date - dob_life1) / 365.25)
,age_at_policy_start = as.numeric((policy_startdate - dob_life1) / 365.25)
)
# Tried to use case_when here to keep things 'tidyverse', but the data.table idiom
# is concise and effective here
survprot_dt[policy_status == 'lapsed'
,month_reached := as.numeric(policy_statuschangedate - policy_startdate) / 30.4]
survprot_dt[policy_status == 'completed'
,month_reached := as.numeric(policy_duration * 12)]
survprot_dt[policy_status == 'inforce'
,month_reached := as.numeric(data_snapshot_date - policy_startdate) / 30.4]
survprot_dt[, month_reached := as.integer(round(month_reached, 0))]
```
# Analyse Lapse Data
```{r calculate_monthly_lapses, echo=TRUE}
data_lapse_month <- as.Date(format(min(survprot_dt$policy_startdate),"%Y-%m-01"))
month_dates <- seq(data_lapse_month, data_snapshot_date, by = 'month')
monthstart_dt <- survprot_dt %>%
mutate(month_start = format(policy_startdate, "%Y%m")) %>%
select(month_start, policy_id)
```