---
title: "Build the MTPL1 Frequency Model"
author: "Mick Cooney <[email protected]>"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    fig_caption: yes
    toc_depth: 3
    use_bookdown: yes
  html_document:
    fig_caption: yes
    theme: spacelab
    highlight: pygments
    number_sections: TRUE
    toc: TRUE
    toc_depth: 3
    toc_float:
      smooth_scroll: FALSE
  pdf_document: default
---

```{r import_libraries, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(
  tidy       = FALSE,
  cache      = FALSE,
  warning    = FALSE,
  message    = FALSE,
  fig.height = 8,
  fig.width  = 11
)

library(conflicted)
library(tidyverse)
library(scales)
library(cowplot)
library(magrittr)
library(rlang)
library(glue)
library(purrr)
library(furrr)
library(rsample)
library(rstan)
library(rstanarm)
library(posterior)
library(bayesplot)
library(tidybayes)

source("custom_functions.R")

resolve_conflicts(
  c("magrittr", "rlang", "dplyr", "readr", "purrr", "ggplot2", "rsample")
)

options(
  width    = 80L,
  warn     = 1,
  mc.cores = parallel::detectCores()
)

theme_set(theme_cowplot())

set.seed(42)

stan_seed <- 4242
```

In this workbook we switch our attention to building a frequency model for the
claims data. We build a number of different models and compare them in terms of
both accuracy (estimate of the mean) and precision (estimate of the
variance/dispersion).

All of our modelling is done within a Bayesian context, so rather than
estimating a single set of parameter values for our model we instead estimate
the joint posterior distribution of the parameters given the observed data.

We use prior predictive checks to set our priors, and then use Monte Carlo
simulation and posterior predictive checks to assess the quality of the various
models.

# Data and Setup

Before we do any modelling we need to load our data. Data exploration and
data cleaning have been performed in a previous workbook, so we simply load
the data as-is.

Some simple feature engineering may still be required, but as that is an
intrinsic part of the modelling process in most cases, we perform it here.

```{r load_mtpl1_dataset, echo=TRUE}
modelling1_data_tbl <- read_rds("data/modelling1_data_tbl.rds")
modelling1_data_tbl %>% glimpse()
```
This dataset will be the basis for all our subsequent work with the MTPL1 data.

For the purposes of effective model validation we need to construct a
"hold-out" or *testing* set. We subset the data now and do not investigate or
check this data until we have final models we wish to work with.

The size of this hold-out set is a matter of judgement: it is a trade-off
between retaining enough data for modelling and keeping the test set large
enough to properly assess the final models.

For now we sample this data at random, holding out 20% of it.

```{r construct_mtpl1_train_holdout, echo=TRUE}
mtpl1_split <- modelling1_data_tbl %>% initial_split(prop = 0.8)
mtpl1_training_tbl <- mtpl1_split %>% training()
mtpl1_training_tbl %>% glimpse()
mtpl1_testing_tbl <- mtpl1_split %>% testing()
mtpl1_testing_tbl %>% glimpse()
```
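
As a quick sanity check of the split (a sketch, assuming the `claim_count` and
`exposure` columns used in the models below), we can compare the policy counts,
total exposure and empirical claim frequency across the two subsets.

```{r check_mtpl1_split_summary, echo=TRUE}
list(training = mtpl1_training_tbl, testing = mtpl1_testing_tbl) %>%
  bind_rows(.id = "dataset") %>%
  group_by(dataset) %>%
  summarise(
    n_policies     = n(),
    total_exposure = sum(exposure),
    claim_freq     = sum(claim_count) / sum(exposure)
  )
```
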
# Our First Frequency Model

We start with a simple frequency model for the car insurance data, using prior
predictive checks to set our prior parameters. The first time we do this we
will discuss the method in more detail to explain what we are trying to
achieve. Once we are happy with our prior model we switch to conditioning it
on the data, check the *posterior shrinkage* as an estimate of how informative
our data has been for the model, and then use the output to guide our work.

## Constructing Our Prior Model

We start by building a simple model with a small number of parameters. Going
by our previous data exploration, we use `gas` and `cat_driver_age`. Later
models will use a smoothed predictor for our continuous variables where there
is a nonlinear effect, but for now we focus on the discretisations of those
variables for simplicity.

In formula notation, our model will look something like this:

```
claim_count ~ gas + cat_driver_age
```
Since `claim_count` is a count variable we will use some form of count
regression: either Poisson or Negative Binomial, and we will try both.

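Written out more explicitly, the Poisson version of this model looks something
like the sketch below (anticipating the log link and exposure offset used in
the fit later in this section, with the predictors entering as indicator
variables):

$$
\text{claim\_count}_i \sim \text{Poisson}(\lambda_i \, \text{exposure}_i),
\qquad
\log \lambda_i = \beta_0 + \beta_{\text{gas}} \, \text{gas}_i
  + \sum_{k} \beta_k \, [\text{cat\_driver\_age}_i = k],
$$

where $\text{gas}_i$ is a 0/1 indicator for the non-reference fuel type and the
sum runs over the non-reference driver-age categories.
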
Our idea is to have our parameters vary on a unit scale, so Normal priors
should be fine for the coefficients. This leaves the intercept, so we start
with a Normal prior there also and see what the effect is.

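As a rough illustration of what a standard Normal prior implies on the response
scale (a minimal sketch, not part of the model fit itself), we can simulate
intercept draws and look at the implied baseline claim rate per unit of
exposure under the log link:

```{r check_intercept_prior_scale, echo=TRUE}
# Simulate draws from a Normal(0, 1) prior on the intercept and transform
# through the log link to get implied baseline claim rates per unit exposure.
intercept_draws <- rnorm(10000, mean = 0, sd = 1)

quantile(exp(intercept_draws), probs = c(0.05, 0.50, 0.95))
```

Some of these implied rates are well above typical motor claim frequencies,
which is exactly the sort of behaviour the prior predictive check below helps
us surface and judge.
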
### Our First Prior Model

We use the `rstanarm` package to fit this model - this allows us to use
standard R model notation and formulas in a Bayesian context, and saves us the
tedious work of writing out the full Stan code for this problem.

To sample from the prior predictive distribution rather than conditioning on
the data, we set `prior_PD = TRUE` so the model ignores the observed outcomes
and simply draws from the priors.

```{r fit_first_prior_model, echo=TRUE}
fit_model_tbl <- modelling1_data_tbl %>% select(-sev_data)

mtpl1_freq1_prior_stanreg <- stan_glm(
  claim_count ~ gas + cat_driver_age,
  family   = poisson(),
  data     = fit_model_tbl,
  offset   = log(exposure),
  iter     = 1000,
  chains   = 8,
  QR       = TRUE,
  prior    = normal(location = 0, scale = 1),
  prior_PD = TRUE,
  seed     = stan_seed
)
```

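With the prior model fitted we can generate prior predictive simulations of the
claim counts and check whether the implied portfolio claim frequency looks
plausible. The sketch below is one way to do this, assuming the `exposure`
column used for the offset above; `posterior_predict()` on a model fitted with
`prior_PD = TRUE` returns draws from the prior predictive distribution.

```{r check_first_prior_model_predictive, echo=TRUE}
# Draw simulated claim counts from the prior predictive distribution.
freq1_prior_pred <- posterior_predict(mtpl1_freq1_prior_stanreg, draws = 100)

# Each row of the matrix is one simulated dataset: convert it to an implied
# portfolio-level claim frequency and inspect the distribution.
freq1_prior_freq_tbl <- tibble(
  sim_claim_freq = rowSums(freq1_prior_pred) / sum(fit_model_tbl$exposure)
)

ggplot(freq1_prior_freq_tbl, aes(x = sim_claim_freq)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  xlab("Simulated Portfolio Claim Frequency") +
  ylab("Number of Simulations")
```

If this distribution puts substantial weight on claim frequencies that are
implausible for motor insurance, we revisit the priors before conditioning the
model on the data.
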
# R Environment
```{r show_session_info, echo=TRUE, message=TRUE}
sessioninfo::session_info()
```