10_load_policy_data.Rmd

---
title: "Life Policy Book Analysis: 10 - Load Data"
author: "Mick Cooney"
date: "29 October 2016"
output:
  html_document:
    toc: true
    number_sections: true
    fig_caption: yes
    theme: cerulean
  pdf_document: default
---

<!--
(Title:) Life Policy Book Anaysis: 10 - Load Data

Author: Mick Cooney

Date: 2016

Abstract: This worksheet deals with the basic loading of data for the
book of life insurance data

Keywords: life insurance, insurance, data generation

-->

```{r knit_opts, include = FALSE}
knitr::opts_chunk$set(tidy = FALSE, cache = FALSE)

library(tidyverse)
library(data.table)
library(dtplyr)

library(feather)


options(width = 80L)

source("custom_functions.R")

set.seed(42)
```

# Load Input Data

This worksheet loads up the data of the book of life insurance
policies and some very basic data exploration.

```{r setup_params, echo=TRUE}
data_snapshot_date <- as.Date('2015-12-31')
```

With the original dataset


```{r load_data, echo=TRUE}
lifebook_dt <- read_feather("data/lifeinsurance_book.feather")

survprot_dt <- lifebook_dt %>%
    filter(prod_type == 'protection'
          ,prem_freq == 12
          ,prem_type == 'RP'
           )

setDT(survprot_dt)
```

# Create Derived Values

Now that we have loaded the data we do a bit of feature engineering to calculate a few new and useful variables.

```{r do_feature_engineering, echo=TRUE, results='hide'}
survprot_dt <- survprot_dt %>%
    mutate(age_at_data_snapshot = as.numeric((data_snapshot_date - dob_life1) / 365.25)
          ,age_at_policy_start  = as.numeric((policy_startdate - dob_life1) / 365.25)
           )

# Tried to use case_when here to keep things 'tidyverse', but the data.table idiom
# is concise and effective here
survprot_dt[policy_status == 'lapsed'
           ,month_reached := as.numeric(policy_statuschangedate - policy_startdate) / 30.4]

survprot_dt[policy_status == 'completed'
           ,month_reached := as.numeric(policy_duration * 12)]

survprot_dt[policy_status == 'inforce'
           ,month_reached := as.numeric(data_snapshot_date - policy_startdate) / 30.4]

survprot_dt[, month_reached := as.integer(round(month_reached, 0))]
```


# Analyse Lapse Data

```{r calculate_monthly_lapses, echo=TRUE}
data_lapse_month <- as.Date(format(min(survprot_dt$policy_startdate),"%Y-%m-01"))

month_dates <- seq(data_lapse_month, data_snapshot_date, by = 'month')

monthstart_dt <- survprot_dt %>%
    mutate(month_start = format(policy_startdate, "%Y%m")) %>%
    select(month_start, policy_id)

```