---
title: "Initial Exploration of the P/NBD Model"
author: "Mick Cooney <mickcooney@gmail.com>"
date: "Last updated: `r format(Sys.time(), '%B %d, %Y')`"
editor: source
execute:
  message: false
  warning: false
  error: false
format:
  html:
    light: superhero
    dark: darkly
    anchor-sections: true
    embed-resources: true
    number-sections: true
    smooth-scroll: true
    toc: true
    toc-depth: 3
    toc-location: left
    code-fold: true
    code-summary: "Show code"
---


```{r import_libraries}
#| echo: FALSE
#| message: FALSE

library(conflicted)
library(tidyverse)
library(scales)
library(cowplot)
library(directlabels)
library(magrittr)
library(rlang)
library(fs)
library(purrr)
library(furrr)
library(rsyslog)
library(glue)
library(cmdstanr)
library(brms)
library(posterior)
library(bayesplot)
library(tidybayes)


source("lib_utils.R")
source("lib_btyd.R")


conflict_lst <- resolve_conflicts(
  c("magrittr", "rlang", "dplyr", "readr", "purrr", "ggplot2", "MASS",
    "fitdistrplus")
  )


options(
  width = 80L,
  warn  = 1,
  mc.cores = parallel::detectCores()
  )


set.seed(42)
stanfit_seed <- 4000

open_syslog("initial_pnbd_models")

theme_set(theme_cowplot())

plan(multisession)
```

In this workbook we introduce the Buy-Till-You-Die (BTYD) models, starting with
a discussion of the underlying theory.


# Background Theory

Before we start working on fitting and using the various Buy-Till-You-Die
models, we first need to discuss the basic underlying theory and model.

In this model, we assume a customer becomes 'alive' to the business at the
first purchase and then makes purchases stochastically but at a steady rate
for a period of time, and then 'dies' - i.e. becomes inactive to the business -
hence the use of "Buy-Till-You-Die".

Thus, at a high level these models decompose into modelling the transaction
events using distributions such as the Poisson or Negative Binomial, and then
modelling the 'dropout' process using some other method.

A number of BTYD models exist and for this workbook we will focus on the
P/NBD model - the Pareto/Negative Binomial Distribution model (though we will
discuss the BG/NBD model also).

These models require only two pieces of information about each customer's
purchasing history: the "recency" (when the last transaction occurred) and
"frequency" (the count of transactions made by that customer in a specified
time period).

The notation used to represent this information is

$$
X = (x, \, t_x, \, T),
$$
where

$$
\begin{eqnarray*}
x   &=& \text{the number of repeat transactions},                        \\
T   &=& \text{the length of the observed time period},                   \\
t_x &=& \text{the time of the last transaction (from the first purchase)}.
\end{eqnarray*}
$$

From this summary data we can fit most BTYD models.
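
As a minimal sketch of how these summary statistics could be derived from raw
transactions, the following uses a small made-up transaction table. The column
names, the weekly time unit and the `obs_end` cutoff date are assumptions for
illustration only, and $t_x$ is measured from the first purchase, as in the
likelihood derivation later in this workbook.

```r
# Hypothetical transaction log: one row per purchase (illustrative data only).
tnx <- data.frame(
  customer_id = c("C1", "C1", "C1", "C2", "C3", "C3"),
  tnx_date    = as.Date(c(
    "2020-01-06", "2020-02-03", "2020-03-16",
    "2020-01-20",
    "2020-02-10", "2020-02-24"
  ))
)

# Assumed end of the observation window.
obs_end <- as.Date("2020-06-01")

# Compute (x, t_x, T) for one customer, in weeks from the first purchase.
summarise_customer <- function(d) {
  first_date <- min(d$tnx_date)
  last_date  <- max(d$tnx_date)

  data.frame(
    customer_id = d$customer_id[1],
    x           = nrow(d) - 1,                             # repeat transactions
    t_x         = as.numeric(last_date - first_date) / 7,  # time of last transaction
    T_cal       = as.numeric(obs_end   - first_date) / 7   # length of observation
  )
}

btyd_summary <- do.call(
  rbind,
  lapply(split(tnx, tnx$customer_id), summarise_customer)
)

print(btyd_summary)
```

A customer with a single purchase (such as `C2` here) has $x = 0$ and
$t_x = 0$.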

# BTYD Models

There are a number of different statistical approaches to building BTYD
models - relying on a number of different assumptions about how the various
recency, frequency and monetary values are modelled.

We now discuss a number of different ways of modelling this.


## Pareto/Negative-Binomial Distribution (P/NBD) Model

The P/NBD model relies on five assumptions:

  1. While active, the number of transactions made by a customer follows a
  Poisson process with transaction rate $\lambda$.
  1. Heterogeneity in $\lambda$ follows a Gamma distribution
  $\Gamma(\lambda \, | \, \alpha, r)$ with shape $r$ and rate $\alpha$. 
  1. Each customer has an unobserved 'lifetime' of length $\tau$. The point at
  which the customer becomes inactive is exponentially distributed with
  dropout rate $\mu$.
  1. Heterogeneity in dropout rates across customers follows a Gamma
  distribution $\Gamma(\mu \, | \, s, \beta)$ with shape parameter $s$ and
  rate parameter $\beta$.
  1. The transaction rate $\lambda$ and the dropout rate $\mu$ vary
  independently across customers.


We express this in mathematical notation as:

$$
\begin{eqnarray*}
\lambda &\sim& \Gamma(\alpha, r),    \\
\mu &\sim& \Gamma(s, \beta),         \\
\tau &\sim& \text{Exponential}(\mu)
\end{eqnarray*}
$$
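
To make the generative story concrete, here is a minimal base R sketch that
simulates customers under these assumptions. The function name
`simulate_pnbd_customer`, the hyperparameter values and the 52-week window are
all made up for illustration and are not taken from the synthetic datasets
used later.

```r
set.seed(101)

# Simulate the repeat-purchase times of one customer over a window of
# length T_obs, given transaction rate lambda and dropout rate mu.
simulate_pnbd_customer <- function(lambda, mu, T_obs) {
  tau     <- rexp(1, rate = mu)     # unobserved lifetime
  horizon <- min(tau, T_obs)        # purchases stop at dropout or window end

  # Poisson process: accumulate exponential inter-purchase gaps.
  times <- numeric(0)
  t_now <- rexp(1, rate = lambda)

  while (t_now <= horizon) {
    times <- c(times, t_now)
    t_now <- t_now + rexp(1, rate = lambda)
  }

  list(tau = tau, tnx_times = times)
}

# Heterogeneity: each customer's rates are drawn from Gamma distributions.
n_cust <- 5
lambda <- rgamma(n_cust, shape = 2, rate = 8)     # shape r, rate alpha
mu     <- rgamma(n_cust, shape = 1, rate = 10)    # shape s, rate beta

sims <- Map(simulate_pnbd_customer, lambda, mu, T_obs = 52)

# Repeat-purchase counts x for each simulated customer.
sapply(sims, function(s) length(s$tnx_times))
```

Note that the lifetime $\tau$ is never observed in practice; only the
transaction times inside the window are available to the model.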


## Beta-Geometric/Negative-Binomial Distribution (BG/NBD) Model

This model relies on a number of base assumptions, somewhat similar to the
P/NBD model but modelling lifetime with a different process:

  1. While active, the number of transactions made by a customer follows a
  Poisson process with transaction rate $\lambda$.
  1. Heterogeneity in $\lambda$ follows a Gamma distribution
  $\Gamma(\lambda \, | \, \alpha, r)$ with parameters shape $r$ and rate
  $\alpha$. 
  1. After any transaction, a customer becomes inactive with probability $p$.
  1. Heterogeneity in $p$ follows a Beta distribution $B(p \, | \, a, b)$ with
  shape parameters $a$ and $b$.
  1. The transaction rate $\lambda$ and the dropout probability $p$ vary
  independently across customers.


Note that it follows from the above assumptions that the probability of a
customer being 'alive' after any transaction is given by the Geometric
distribution, and hence the Beta-Geometric in the name.

To put this into more formal mathematical notation, we have:
 
$$
\begin{eqnarray*}
\lambda &\sim& \Gamma(\alpha, r),                 \\
P(\text{alive}, k) &\sim& \text{Geometric}(p, k), \\
p &\sim& \text{Beta}(a, b)
\end{eqnarray*}
$$
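
For contrast with the P/NBD model, the following is a rough base R sketch of
the BG/NBD dropout mechanism, where the customer can only become inactive
immediately after a transaction. The function name and parameter values are
illustrative assumptions, and the sketch treats the customer as active at the
start of the window.

```r
set.seed(102)

# Simulate repeat-purchase times for one customer under the BG/NBD story:
# after each purchase the customer drops out with probability p.
simulate_bgnbd_customer <- function(lambda, p, T_obs) {
  times <- numeric(0)
  t_now <- rexp(1, rate = lambda)

  while (t_now <= T_obs) {
    times <- c(times, t_now)
    if (runif(1) < p) break                   # dropout right after a purchase
    t_now <- t_now + rexp(1, rate = lambda)
  }

  times
}

# Heterogeneity in lambda (Gamma) and p (Beta); values are arbitrary.
n_cust <- 5
lambda <- rgamma(n_cust, shape = 2, rate = 8)
p      <- rbeta(n_cust, shape1 = 1, shape2 = 3)

bgnbd_sims <- Map(simulate_bgnbd_customer, lambda, p, T_obs = 52)

sapply(bgnbd_sims, length)
```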


# Initial P/NBD Models

We start by fitting the P/NBD model to our synthetic datasets before we try to
model real-life data.


## Load Short Time-frame Synthetic Data

We now want to load the short time-frame synthetic data.


```{r load_shortframe_synthetic_data}
#| echo: TRUE

customer_cohortdata_tbl   <- read_rds("data/synthdata_shortframe_cohort_tbl.rds")
customer_cohortdata_tbl   |> glimpse()

customer_simparams_tbl    <- read_rds("data/synthdata_shortframe_simparams_tbl.rds")
customer_simparams_tbl    |> glimpse()

customer_transactions_tbl <- read_rds("data/synthdata_shortframe_transactions_tbl.rds")
customer_transactions_tbl |> glimpse()

customer_fit_stats_tbl    <- read_rds("data/shortsynth_customer_fit_stats_tbl.rds")
customer_fit_stats_tbl    |> glimpse()

customer_valid_stats_tbl  <- read_rds("data/shortsynth_obs_validdata_tbl.rds")
customer_valid_stats_tbl  |> glimpse()

obs_fitdata_tbl           <- read_rds("data/shortsynth_obs_fitdata_tbl.rds")
obs_fitdata_tbl           |> glimpse()

obs_validdata_tbl         <- read_rds("data/shortsynth_obs_validdata_tbl.rds")
obs_validdata_tbl         |> glimpse()
```


We reproduce the visualisation of the transaction times we used in previous
workbooks.

```{r plot_customer_transaction_times}
#| echo: TRUE

plot_tbl <- customer_transactions_tbl |>
  group_nest(customer_id, .key = "cust_data") |>
  filter(map_int(cust_data, nrow) > 3) |>
  slice_sample(n = 30) |>
  unnest(cust_data)

ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
  geom_line() +
  geom_point() +
  labs(
      x = "Date",
      y = "Customer ID",
      title = "Visualisation of Customer Transaction Times"
    ) +
  theme(axis.text.y = element_text(size = 10))
```


## Load Derived Data

```{r load_derived_data}
#| echo: TRUE

obs_fitdata_tbl   <- read_rds("data/shortsynth_obs_fitdata_tbl.rds")
obs_validdata_tbl <- read_rds("data/shortsynth_obs_validdata_tbl.rds")

customer_fit_stats_tbl <- obs_fitdata_tbl |>
  rename(x = tnx_count)
```


## Load Subset Data

We also want to construct our data subsets for the purposes of speeding up our
validations.

```{r construct_customer_subset_data}
#| echo: TRUE

customer_subset_id <- read_rds("data/shortsynth_customer_subset_ids.rds")
customer_subset_id |> glimpse()

customer_fit_subset_tbl <- obs_fitdata_tbl |>
  filter(customer_id %in% customer_subset_id)

customer_fit_subset_tbl |> glimpse()


customer_valid_subset_tbl <- obs_validdata_tbl |>
  filter(customer_id %in% customer_subset_id)

customer_valid_subset_tbl |> glimpse()
```


We now use these datasets to set the start and end dates for our various
validation methods.


```{r set_start_end_dates}
dates_lst <- read_rds("data/shortsynth_simulation_dates.rds")

use_fit_start_date <- dates_lst$use_fit_start_date
use_fit_end_date   <- dates_lst$use_fit_end_date

use_valid_start_date <- dates_lst$use_valid_start_date
use_valid_end_date   <- dates_lst$use_valid_end_date
```

We now want to split the transaction datasets into two parts.

```{r split_transaction_data}
#| echo: true

customer_fit_transactions_tbl <- customer_transactions_tbl |>
  filter(
    customer_id %in% customer_subset_id,
    tnx_timestamp >= use_fit_start_date,
    tnx_timestamp <= use_fit_end_date
    )
  
customer_fit_transactions_tbl |> glimpse()


customer_valid_transactions_tbl <- customer_transactions_tbl |>
  filter(
    customer_id %in% customer_subset_id,
    tnx_timestamp >= use_valid_start_date,
    tnx_timestamp <= use_valid_end_date
    )
  
customer_valid_transactions_tbl |> glimpse()
```


## Derive the Log-likelihood Function

We now turn our attention to deriving the log-likelihood function for the
P/NBD model.

We assume that a given customer has made $x$ transactions after the initial
one over an observed period of time $T$, and we label the times of these
transactions $t_1$, $t_2$, ..., $t_x$.

![](img/clv_pnbd_arrow.png)


To model the likelihood for this observation, we need to consider two
possibilities: one for where the customer is still 'alive' at $T$, and one
where the customer has 'died' by $T$.

In the first instance, the likelihood is the product of the likelihoods of
each transaction time, multiplied by the probability of no further transaction
occurring in the interval $(t_x, T]$.

Because we are modelling the transaction counts as a Poisson process, this
corresponds to the times between events following an exponential distribution,
and so both the transaction times and the lifetime likelihoods use the
exponential.

This gives us:

$$
\begin{eqnarray*}
L(\lambda \, | \, t_1, t_2, ..., t_x, T, \, \tau > T)
  &=& \lambda e^{-\lambda t_1} \lambda e^{-\lambda(t_2 - t_1)} ...
      \lambda e^{-\lambda (t_x - t_{x-1})} e^{-\lambda(T - t_x)} \\
  &=& \lambda^x e^{-\lambda T}
\end{eqnarray*}
$$

and we can combine this with the likelihood of the lifetime of the customer
$\tau$ being greater than the observation window $T$,

$$
P(\tau > T \, | \, \mu) = e^{-\mu T}
$$

For the second case, the customer becomes inactive at some time $\tau$ in the
interval $(t_x, T]$, and so the likelihood is

$$
\begin{eqnarray*}
L(\lambda \, | \, t_1, t_2, ..., t_x, T, \, \text{inactive at } \tau \in (t_x, T])
  &=& \lambda e^{-\lambda t_1} \lambda e^{-\lambda(t_2 - t_1)} ...
      \lambda e^{-\lambda (t_x - t_{x-1})} e^{-\lambda(\tau - t_x)} \\
  &=& \lambda^x e^{-\lambda \tau}
\end{eqnarray*}
$$

In both cases we do not need the times of the individual transactions, and all
we need are the values $(x, t_x, T)$.

As we cannot observe $\tau$, we want to remove the conditioning on $\tau$ by
integrating it out.

$$
\begin{eqnarray*}
L(\lambda, \mu \, | \, x, t_x, T)
  &=& L(\lambda \, | \, t_1, t_2, ..., t_x, T, \, \tau > T) \, P(\tau > T \, | \, \mu) \; + \\
  & & \; \;\; \; \; \;\; \; \int^T_{t_x} L(\lambda \, | \, x, T, \text{ inactive at } \tau \in (t_x, T] ) \, f(\tau \, | \, \mu) \, d\tau \\
  &=& \lambda^x e^{-\lambda T} e^{-\mu T} +
      \lambda^x \int^T_{t_x} e^{-\lambda \tau} \mu e^{-\mu \tau} d\tau   \\
  &=& \lambda^x e^{-(\lambda + \mu)T} + \frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) t_x} -
      \frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) T} \\
  &=& \frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) t_x} +
      \frac{\lambda^{x+1}}{\lambda + \mu} e^{-(\lambda + \mu) T}
\end{eqnarray*}
$$

In Stan, we work with the log-likelihood rather than the likelihood, so we need
to take the log of this expression. This creates a problem, as we have no
numerically stable way to calculate $\log(a + b)$ directly when $a$ and $b$ are
very small. As this expression occurs a lot, Stan provides a `log_sum_exp()`
function, which is defined by

$$
\ln (a + b) = \text{LogSumExp}(\ln a, \, \ln b)
$$

We then use this `log_sum_exp()` function to calculate the log-likelihood for
the model. Note we use "LogSumExp" here to get around limitations in the
renderer.


$$
\begin{eqnarray*}
LL(\lambda, \mu \, | \, x, t_x, T)
  &=&
    \log \left(
      \frac{\lambda^x}{\lambda + \mu}
      \left( \mu e^{-(\lambda + \mu) t_x} + \lambda e^{-(\lambda + \mu) T} \right)
      \right) \\
  &=& x \log \lambda - \log(\lambda + \mu) \; + \\
  & & \;\;\;\;\;\;\; \text{LogSumExp}(\log \mu - (\lambda + \mu) \, t_x, \; \log \lambda - (\lambda + \mu) \, T)
\end{eqnarray*}
$$

This is the log-likelihood model we want to fit in Stan.
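
As a sanity check on the algebra, the sketch below implements the
individual-level P/NBD log-likelihood in base R, using the closed form
$\frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) t_x} + \frac{\lambda^{x+1}}{\lambda + \mu} e^{-(\lambda + \mu) T}$,
and compares it with direct numerical integration over the unobserved lifetime
$\tau$. The function names and the test values are illustrative assumptions;
this is not the Stan implementation used below.

```r
# Numerically stable log(exp(a) + exp(b)).
log_sum_exp <- function(a, b) {
  m <- pmax(a, b)
  m + log(exp(a - m) + exp(b - m))
}

# Closed-form log-likelihood for one customer with summary (x, t_x, T_cal).
pnbd_loglik <- function(lambda, mu, x, t_x, T_cal) {
  lpm <- lambda + mu

  x * log(lambda) - log(lpm) +
    log_sum_exp(log(mu) - lpm * t_x, log(lambda) - lpm * T_cal)
}

# Same likelihood from the two-case argument, integrating tau numerically.
pnbd_lik_numeric <- function(lambda, mu, x, t_x, T_cal) {
  # Case 1: still alive at T_cal.
  alive <- lambda^x * exp(-lambda * T_cal) * exp(-mu * T_cal)

  # Case 2: became inactive at some tau in (t_x, T_cal].
  died <- integrate(
    function(tau) lambda^x * exp(-lambda * tau) * mu * exp(-mu * tau),
    lower = t_x, upper = T_cal, rel.tol = 1e-10
  )$value

  alive + died
}

ll_closed  <- pnbd_loglik(0.25, 0.10, x = 3, t_x = 10, T_cal = 30)
ll_numeric <- log(pnbd_lik_numeric(0.25, 0.10, x = 3, t_x = 10, T_cal = 30))

all.equal(ll_closed, ll_numeric, tolerance = 1e-6)
```

The `log_sum_exp()` helper subtracts the larger argument before
exponentiating, which avoids underflow when both terms are very negative.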


# Fit Initial P/NBD Model

```{r create_log_initial_pnbd_model}
#| echo: TRUE

syslog(
  glue("Creating the first P/NBD model"),
  level = "INFO"
  )
```


We now construct our Stan model and prepare to fit it with our synthetic
dataset.

Before we start on that, we set a few parameters for the workbook to organise
our Stan code.

```{r setup_workbook_parameters}
#| echo: TRUE

stan_modeldir <- "stan_models"
stan_codedir  <-   "stan_code"
```

To start, we fit the model to the data for the 1,000 customers. We also need
to calculate the summary statistics for the validation period.


We start with the Stan model.

```{r display_pnbd_init_model_stancode}
#| echo: FALSE

read_lines("stan_code/pnbd_fixed.stan") |> cat(sep = "\n")
```

This model uses a few new features of Stan: named file includes and a
user-defined function, `calculate_pnbd_loglik`. We look at the included file of
utility functions here:

```{r display_util_functions_stancode}
#| echo: FALSE

read_lines("stan_code/util_functions.stan") |> cat(sep = "\n")
```


## Compile and Fit Stan Model

We now compile this model using `CmdStanR`.

```{r compile_pnbd_fixed_stanmodel}
#| echo: TRUE
#| results: "hide"

pnbd_fixed_stanmodel <- cmdstan_model(
  "stan_code/pnbd_fixed.stan",
  include_paths =   stan_codedir,
  pedantic      =           TRUE,
  dir           =  stan_modeldir
  )
```


We then use this compiled model with our data to produce a fit of the data.


```{r fit_pnbd_init_stanmodel}
#| echo: TRUE

stan_modelname <- "pnbd_init"
stanfit_prefix <- str_c("fit_", stan_modelname)
stanfit_seed   <- stanfit_seed + 1

stanfit_object_file <- glue("data/{stanfit_prefix}_stanfit.rds")


stan_data_lst <- customer_fit_stats_tbl |>
  select(customer_id, x, t_x, T_cal) |>
  compose_data(
    lambda_mn = 0.25,
    lambda_cv = 1.00,
    
    mu_mn     = 0.10,
    mu_cv     = 1.00,
    )

if(!file_exists(stanfit_object_file)) {
  pnbd_init_stanfit <- pnbd_fixed_stanmodel$sample(
    data            =                stan_data_lst,
    chains          =                            4,
    iter_warmup     =                          500,
    iter_sampling   =                          500,
    seed            =                 stanfit_seed,
    save_warmup     =                         TRUE,
    output_dir      =                stan_modeldir,
    output_basename =               stanfit_prefix,
    )
  
  pnbd_init_stanfit$save_object(stanfit_object_file, compress = "gzip")

} else {
  pnbd_init_stanfit <- read_rds(stanfit_object_file)
}

pnbd_init_stanfit$print()
```
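
The data list above passes `lambda_mn`, `lambda_cv`, `mu_mn` and `mu_cv` to the
model. Assuming these are the mean and coefficient of variation of Gamma priors
on $\lambda$ and $\mu$ (an assumption, as it depends on the Stan code), the
conversion to the usual shape and rate parameters can be sketched as follows.
The helper name `gamma_mean_cv_to_shape_rate` is hypothetical.

```r
# For Gamma(shape, rate): mean = shape / rate and cv = 1 / sqrt(shape),
# so shape = 1 / cv^2 and rate = shape / mean.
gamma_mean_cv_to_shape_rate <- function(mn, cv) {
  shape <- 1 / cv^2
  rate  <- shape / mn

  c(shape = shape, rate = rate)
}

lambda_prior <- gamma_mean_cv_to_shape_rate(mn = 0.25, cv = 1.00)
mu_prior     <- gamma_mean_cv_to_shape_rate(mn = 0.10, cv = 1.00)

lambda_prior
mu_prior
```

With a coefficient of variation of 1 the shape is 1, so under this assumption
both priors reduce to exponential distributions.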


We have some basic HMC-based validity statistics we can check.

```{r calculate_pnbd_init_hmc_diagnostics}
#| echo: TRUE

pnbd_init_stanfit$cmdstan_diagnose()
```


## Visual Diagnostics of the Sample Validity

Now that we have a sample from the posterior distribution we need to create a
few different visualisations of the diagnostics.

```{r plot_lambda_traceplots_warmup}
#| echo: TRUE

parameter_subset <- c(
  "lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
  "mu[1]",     "mu[2]",     "mu[3]",     "mu[4]"
  )

pnbd_init_stanfit$draws(inc_warmup = TRUE) |>
  mcmc_trace(
    pars     = parameter_subset,
    n_warmup = 500
    ) +
  ggtitle("Full Traceplots of Some Lambda and Mu Values")
```


As the warmup is skewing the y-axis somewhat, we repeat this process without
the warmup.

```{r plot_lambda_traceplots_nowarmup}
#| echo: TRUE

pnbd_init_stanfit$draws(inc_warmup = FALSE) |>
  mcmc_trace(pars = parameter_subset) +
  expand_limits(y = 0) +
  labs(
    x = "Iteration",
    y = "Value",
    title = "Traceplot of Sample of Lambda and Mu Values"
    ) +
  theme(axis.text.x = element_text(size = 10))
```

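A common MCMC diagnostic is $\hat{R}$, which compares the variance between
chains to the variance within each chain. Values close to 1 suggest the chains
are sampling from the same distribution.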

```{r plot_pnbd_init_parameter_rhat}
#| echo: TRUE

pnbd_init_stanfit |>
  rhat(pars = c("lambda", "mu")) |>
  mcmc_rhat() +
    ggtitle("Plot of Parameter R-hat Values")
```
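
To illustrate what $\hat{R}$ measures, the sketch below computes the classic
Gelman-Rubin version of the statistic in base R on simulated chains. This is a
simplified form for intuition only: Stan and the `posterior` package use a more
refined rank-normalised split-chain variant, so the numbers will not match
those tools exactly.

```r
set.seed(103)

# Classic (non-split) R-hat for a matrix of draws: rows are iterations,
# columns are chains.
rhat_classic <- function(draws) {
  n           <- nrow(draws)
  chain_means <- colMeans(draws)

  W <- mean(apply(draws, 2, var))   # average within-chain variance
  B <- n * var(chain_means)         # between-chain variance

  var_plus <- (n - 1) / n * W + B / n
  sqrt(var_plus / W)
}

# Four chains sampling the same distribution.
good_chains <- matrix(rnorm(500 * 4), nrow = 500, ncol = 4)

# Same draws, but one chain is stuck in a different region.
bad_chains      <- good_chains
bad_chains[, 4] <- bad_chains[, 4] + 2

rhat_good <- rhat_classic(good_chains)
rhat_bad  <- rhat_classic(bad_chains)

c(good = rhat_good, bad = rhat_bad)
```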

Related to this quantity is the concept of *effective sample size*, $N_{eff}$,
an estimate of the size of the sample from a statistical information point of
view.


```{r plot_pnbd_init_parameter_neffratio}
#| echo: TRUE

pnbd_init_stanfit |>
  neff_ratio(pars = c("lambda", "mu")) |>
  mcmc_neff() +
    ggtitle("Plot of Parameter Effective Sample Sizes")
```
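
As a rough illustration of effective sample size, the sketch below compares an
independent sample with a strongly autocorrelated one of the same length. The
estimator `ess_simple` is a deliberately simplified assumption (it truncates
the autocorrelation sum at the first non-positive lag) and is not the
estimator used by Stan or `bayesplot`.

```r
set.seed(104)

# Simplified ESS: N / (1 + 2 * sum of positive-lag autocorrelations),
# truncating the sum at the first non-positive autocorrelation.
ess_simple <- function(x, lag_max = 200) {
  rho <- as.numeric(acf(x, lag.max = lag_max, plot = FALSE)$acf)[-1]

  first_nonpos <- which(rho <= 0)[1]
  if (!is.na(first_nonpos)) rho <- rho[seq_len(first_nonpos - 1)]

  length(x) / (1 + 2 * sum(rho))
}

n_draws <- 4000

# Independent draws versus an AR(1) series with coefficient 0.9.
x_iid <- rnorm(n_draws)
x_ar  <- as.numeric(
  stats::filter(rnorm(n_draws), filter = 0.9, method = "recursive")
)

ess_iid <- ess_simple(x_iid)
ess_ar  <- ess_simple(x_ar)

c(iid = ess_iid, ar1 = ess_ar) / n_draws
```

The autocorrelated series carries far less information per draw, which is what
a low $N_{eff}$ ratio signals.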

Finally, we also want to look at autocorrelation in the chains for each
parameter.

```{r plot_pnbd_init_parameter_acf}
#| echo: TRUE

pnbd_init_stanfit$draws() |>
  mcmc_acf(pars = parameter_subset) +
    ggtitle("Autocorrelation Plot of Sample Values")
```

As before, this first fit has a comprehensive run of fit diagnostics, but for
the sake of brevity in later models we will show only the traceplots once we
are satisfied with the validity of the sample.


## Check Model Fit

As we are still working with synthetic data, we know the true values for each
customer and so we can check how good our model is at recovering the true
values on a customer-by-customer basis.

As in previous workbooks, we build our validation datasets and then check the
distribution of $q$-values for both $\lambda$ and $\mu$ across the customer
base.


```{r construct_pnbd_init_validation_qvalues}
#| echo: TRUE

pnbd_init_valid_lst <- create_pnbd_posterior_validation_data(
  stanfit       = pnbd_init_stanfit,
  data_tbl      = customer_fit_stats_tbl,
  simparams_tbl = customer_simparams_tbl
  )

pnbd_init_valid_lst$lambda_qval_plot |> plot()

pnbd_init_valid_lst$mu_qval_plot |> plot()
```

These plots suggest the model is recovering the parameters well, but we cannot
rely on this approach once we use real data, so we will stop using it now.
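
The idea behind the $q$-value check can be sketched on a toy problem. Here we
assume the $q$-value of a parameter is the proportion of posterior draws at or
below its true value; the internals of
`create_pnbd_posterior_validation_data()` are not shown in this workbook, so
this is an illustration of the general technique using a conjugate
Gamma-Poisson model, not the helper's actual code.

```r
set.seed(105)

n_cust  <- 500
n_draws <- 1000
T_obs   <- 52

# True rates and observed counts for each simulated customer.
lambda_true <- rgamma(n_cust, shape = 2, rate = 8)
x_obs       <- rpois(n_cust, lambda_true * T_obs)

# Conjugate posterior for each customer: Gamma(2 + x, 8 + T_obs).
# The q-value is the posterior quantile at which the true value sits.
q_lambda <- vapply(seq_len(n_cust), function(i) {
  post_draws <- rgamma(n_draws, shape = 2 + x_obs[i], rate = 8 + T_obs)
  mean(post_draws <= lambda_true[i])
}, numeric(1))

# For a well-calibrated model the q-values are roughly uniform on [0, 1].
table(cut(q_lambda, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE))
```

A pile-up of $q$-values near 0 or 1 would indicate posteriors that are biased
or too narrow.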


## Assess Model Fit Using Simulation

Rather than relying on knowing the 'true' answer, we instead will use our
posterior sample to generate data and compare this simulated data against the
data we fit. This procedure is similar to what we did before but now we focus
on in-sample data rather than using validation data.


```{r calculate_pnbd_init_simstats}
#| echo: TRUE

pnbd_stanfit <- pnbd_init_stanfit |>
  recover_types(customer_fit_stats_tbl)

pnbd_init_simstats_tbl <- construct_pnbd_posterior_statistics(
  stanfit         = pnbd_stanfit,
  fitdata_tbl     = customer_fit_subset_tbl
  )

pnbd_init_simstats_tbl |> glimpse()
```

We now want to write out the simulation stats to disk.

```{r write_fit_pnbd_init_simstats}
#| echo: TRUE

pnbd_init_simstats_tbl |>
  write_rds("data/pnbd_init_assess_model_simstats_tbl.rds", compress = "gz")
```


We then use these posterior statistics as inputs to our simulations to help
us assess the in-sample quality of fit.


```{r setup_simulation_pnbd_init_fitdata_transactions}
#| echo: TRUE

fit_label <- "pnbd_init"

precompute_dir <- glue("precompute/{fit_label}")

ensure_exists_precompute_directory(precompute_dir)


pnbd_init_fitsims_index_tbl <- pnbd_init_simstats_tbl |>
  mutate(
    start_dttm = first_tnx_date   |> as.POSIXct(),
    end_dttm   = use_fit_end_date |> as.POSIXct(),
    lambda     = post_lambda,
    mu         = post_mu,
    p_alive    = 1,      ### In-sample validation, so customer begins active
    tnx_mu     = 100,    ### We are not simulating tnx size, so put in defaults
    tnx_cv     = 1       ### 
    ) |>
  group_nest(customer_id, .key = "cust_params", keep = TRUE) |>
  mutate(
    sim_file = glue(
      "{precompute_dir}/sims_fit_{fit_label}_{customer_id}.rds"
      )
    )


pnbd_init_fitsims_index_tbl |> glimpse()
```

We now use this setup to generate our simulations.


```{r generate_simulation_pnbd_init_fitdata_transactions}
#| echo: TRUE

precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
  as.character() |>
  enframe(name = NULL, value = "sim_file")


runsims_tbl <- pnbd_init_fitsims_index_tbl |>
  anti_join(precomputed_tbl, by = "sim_file")


if(nrow(runsims_tbl) > 0) {
  pnbd_init_fitsims_index_tbl <- runsims_tbl |>
    mutate(
      chunk_data = future_map2_int(
        cust_params, sim_file,
        run_simulations_chunk,

        sim_func = generate_pnbd_validation_transactions,

        .options = furrr_options(
          globals  = c(
            "calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
            "generate_pnbd_validation_transactions"
            ),
          packages   = c("tidyverse", "fs"),
          scheduling = Inf,
          seed       = 421
          ),

        .progress = TRUE
        )
      )
}


pnbd_init_fitsims_index_tbl |> glimpse()
```

We now want to load up the summary statistics for each of our customers for
later analysis.


```{r retrieve_sim_stats}
#| echo: TRUE

retrieve_sim_stats <- ~ .x |>
  read_rds() |>
  select(draw_id, sim_data, sim_tnx_count, sim_tnx_last)
```

```{r retrieve_pnbd_init_fit_simstats}
#| echo: TRUE

pnbd_init_fit_simstats_tbl <- pnbd_init_fitsims_index_tbl |>
  mutate(
    sim_data = map(
      sim_file, retrieve_sim_stats,

      .progress = "pnbd_init_fit"
      )
    ) |>
  select(customer_id, sim_data) |>
  unnest(sim_data)

pnbd_init_fit_simstats_tbl |> glimpse()
```


We now use this data to check how well our model fits the data.


### Compare Counts of Multitransaction Customers

We start by checking high-level summary statistics in both the observed and
simulated data, beginning with the count of customers with more than one
transaction.

```{r compare_pnbd_init_fit_multitransaction_customer_counts}
#| echo: TRUE

obs_customer_count <- customer_fit_stats_tbl |>
  filter(x > 0) |>
  nrow()

sim_data_tbl <- pnbd_init_fit_simstats_tbl |>
  filter(sim_tnx_count > 0) |>
  count(draw_id, name = "sim_customer_count")

ggplot(sim_data_tbl) +
  geom_histogram(aes(x = sim_customer_count), bins = 50) +
  geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
  labs(
    x = "Count of Multi-transaction Customers",
    y = "Frequency",
    title = "Comparison Plot of Simulated vs Observed Customer Counts",
    subtitle = "(observed value in red)"
    )

```

The observed count of customers with at least one additional transaction after
the first is captured by this simulation.
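
One way to put a number on "captured" is to locate the observed value within
the simulated distribution. The sketch below does this for made-up values; the
vector `sim_counts` and the value `obs_count` are illustrative and are not
taken from the simulations above.

```r
# Hypothetical simulated customer counts, one per posterior draw.
sim_counts <- c(410, 425, 431, 440, 452, 418, 436, 429, 445, 433)

# Hypothetical observed count.
obs_count <- 430

# Proportion of simulated values at or above the observed value.
p_upper <- mean(sim_counts >= obs_count)

# Central 80% interval of the simulated values.
sim_interval <- quantile(sim_counts, probs = c(0.10, 0.90))

p_upper
sim_interval
```

A proportion very close to 0 or 1 would suggest the model systematically
over- or under-predicts the statistic.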


### Total Transaction Count

We now check the count of all transactions.

```{r compare_pnbd_init_fit_total_transaction_count}
#| echo: TRUE

obs_total_count <- customer_fit_stats_tbl |>
  pull(x) |>
  sum()

sim_data_tbl <- pnbd_init_fit_simstats_tbl |>
  count(draw_id, wt = sim_tnx_count, name = "sim_total_count")


ggplot(sim_data_tbl) +
  geom_histogram(aes(x = sim_total_count), bins = 50) +
  geom_vline(aes(xintercept = obs_total_count), colour = "red") +
  labs(
    x = "Count of Total Transactions",
    y = "Frequency",
    title = "Comparison Plot of Simulated vs Observed Total Counts",
    subtitle = "(observed value in red)"
    )

```

As before, these all look good. Our model is doing a good job capturing the
data.

### Transaction Count Quantiles

We now look at the quantiles for the transaction counts across each customer.

```{r compare_pnbd_init_fit_transaction_quantiles}
#| echo: TRUE

obs_quantiles_tbl <- customer_fit_stats_tbl |>
  reframe(
    prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
    prob_value = quantile(x, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
    )
    
sim_data_tbl <- pnbd_init_fit_simstats_tbl |>
  group_by(draw_id) |>
  summarise(
    p10 = quantile(sim_tnx_count, 0.10),
    p25 = quantile(sim_tnx_count, 0.25),
    p50 = quantile(sim_tnx_count, 0.50),
    p75 = quantile(sim_tnx_count, 0.75),
    p90 = quantile(sim_tnx_count, 0.90),
    p99 = quantile(sim_tnx_count, 0.99)
    ) |> 
  pivot_longer(
    cols = !draw_id,
    names_to  = "prob_label",
    values_to = "sim_prob_values"
    ) |>
  inner_join(obs_quantiles_tbl, by = "prob_label")

ggplot(sim_data_tbl) +
  geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
  geom_vline(aes(xintercept = prob_value), colour = "red") +
  facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
  labs(
    x = "Quantile of Counts",
    y = "Frequency",
    title = "Comparison Plots of Transaction Count Quantiles"
    )

```


### Check Overall Day-of-Week Transaction Patterns

We now want to check our assumption of each customer having a single, constant
transaction rate. A violation of this assumption should manifest as a
difference between the observed and simulated distributions of transactions
across days of the week (and possibly months of the year).

We will do this both by individual month and for the overall dataset.

```{r construct_fit_transaction_month_distributions}
#| echo: true

dow_props_fit_lst <- customer_fit_transactions_tbl |>
  calculate_dow_proportions()

tnx_fit_overall_dow_tbl   <- dow_props_fit_lst$overall
tnx_fit_yearmonth_dow_tbl <- dow_props_fit_lst$yearmonth


tnx_fit_overall_dow_tbl   |> glimpse()
tnx_fit_yearmonth_dow_tbl |> glimpse()
```
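
The helper `calculate_dow_proportions()` comes from the sourced library files
and is not shown here. As a rough idea of the overall calculation, the sketch
below computes day-of-week proportions in base R for a handful of made-up
transaction dates; the dates and variable names are assumptions for
illustration.

```r
# Hypothetical transaction dates (2021-03-01 is a Monday).
tnx_dates <- as.Date("2021-03-01") + c(0, 1, 1, 2, 5, 7, 8, 12, 14, 14)

# ISO weekday number (1 = Monday, ..., 7 = Sunday), independent of locale.
dow_num <- as.integer(format(tnx_dates, "%u"))

dow_label <- factor(
  dow_num,
  levels = 1:7,
  labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
)

# Proportion of transactions falling on each day of the week.
dow_prop <- prop.table(table(dow_label))

round(dow_prop, 2)
```

Over a long window, a constant transaction rate implies proportions of roughly
$1/7$ for each day, so strong weekday or weekend effects in the observed data
will stand out against the simulations.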

We now need to do the same thing for our simulations and then construct a
comparison plot.

```{r construct_fit_overall_dow_comparisons}
#| echo: true

sim_fit_tnxdata_tbl <- pnbd_init_fit_simstats_tbl |>
  select(customer_id, draw_id, sim_data) |>
  unnest(sim_data)

propdata_fit_tbl <- sim_fit_tnxdata_tbl |>
  group_nest(draw_id) |>
  mutate(
    prop_data = map(
      data, calculate_dow_proportions,
      
      .progress = "calculate_dow_proportions"
      ),
    overall_data   = map(prop_data, "overall"),
    yearmonth_data = map(prop_data, "yearmonth")
    )

propdata_fit_tbl |> glimpse()
```

We now want to compare the simulation data against the observed proportions.

```{r create_fit_overall_dow_proportion_plot}
#| echo: true

simplot_fit_tbl <- propdata_fit_tbl |>
  select(draw_id, overall_data) |>
  unnest(overall_data) |>
  group_by(dow_label) |>
  summarise(
    .groups = "drop",

    p10 = quantile(obs_prop, 0.10),
    p25 = quantile(obs_prop, 0.25),
    p50 = quantile(obs_prop, 0.50),
    p75 = quantile(obs_prop, 0.75),
    p90 = quantile(obs_prop, 0.90)
    )

ggplot(simplot_fit_tbl) +
  geom_errorbar(
    aes(x = dow_label, ymin = p10, ymax = p90),
    width = 0, linewidth = 1
    ) +
  geom_errorbar(
    aes(x = dow_label, ymin = p25, ymax = p75),
    width = 0, linewidth = 3
    ) +
  geom_point(
    aes(x = dow_label, y = obs_prop),
    data = tnx_fit_overall_dow_tbl, colour = "red"
    ) +
  expand_limits(y = 0) +
  labs(
    x = "Day of Week",
    y = "Proportion",
    title = "Comparison Plot of the Day of Week Proportions"
    )

```


### Check Monthly Day-of-Week Transaction Patterns

We now want to compare the day-of-week proportions for each month.

```{r create_fit_monthly_dow_proportion_plot}
#| echo: true

simplot_fit_tbl <- propdata_fit_tbl |>
  select(draw_id, yearmonth_data) |>
  unnest(yearmonth_data) |>
  group_by(dow_label, yearmonth_date) |>
  summarise(
    .groups = "drop",

    p10 = quantile(obs_prop, 0.10),
    p25 = quantile(obs_prop, 0.25),
    p50 = quantile(obs_prop, 0.50),
    p75 = quantile(obs_prop, 0.75),
    p90 = quantile(obs_prop, 0.90)
    )

ggplot(simplot_fit_tbl) +
  geom_ribbon(
    aes(x = yearmonth_date, ymin = p10, ymax = p90),
    alpha = 0.5
    ) +
  geom_ribbon(
    aes(x = yearmonth_date, ymin = p25, ymax = p75),
    alpha = 1.0
    ) +
  geom_point(
    aes(x = yearmonth_date, y = obs_prop),
    data = tnx_fit_yearmonth_dow_tbl, colour = "red"
    ) +
  facet_wrap(vars(dow_label)) +
  expand_limits(y = 0) +
  labs(
    x = "Month",
    y = "Proportion",
    title = "Comparison Plot of the Day of Week Proportions by Month"
    ) +
  theme(axis.text.x = element_text(angle = 20, size = 8, vjust = 0.5))

```


### Write to Disk

We write this data to disk.

```{r write_pnbd_init_fit_simstats_tbl}
#| echo: true

pnbd_init_fit_simstats_tbl |>
  write_rds("data/pnbd_init_assess_fit_simstats_tbl.rds", compress = "gz")
```


## Assess Out-of-Sample Data

We now repeat this exercise, but for the validation period of 2022.


```{r setup_simulation_pnbd_init_validdata_transactions}
#| echo: TRUE

precompute_dir <- glue("precompute/{fit_label}")

ensure_exists_precompute_directory(precompute_dir)

use_start <- use_valid_start_date |> as.POSIXct()
use_final <- use_valid_end_date   |> as.POSIXct()


pnbd_init_validsims_index_tbl <- pnbd_init_simstats_tbl |>
  mutate(
    start_dttm = use_start,
    end_dttm   = use_final,
    lambda     = post_lambda,
    mu         = post_mu,
    tnx_mu     = 1,      ### We are not simulating tnx size
    tnx_cv     = 1       ### 
    ) |>
  group_nest(customer_id, .key = "cust_params", keep = TRUE) |>
  mutate(
    sim_file = glue(
      "{precompute_dir}/sims_valid_{fit_label}_{customer_id}.rds"
      )
    )

pnbd_init_validsims_index_tbl |> glimpse()
```

We now can run these simulations to check how well our model captures
transactions out of the fitted data.


```{r generate_pnbd_init_validdata_transactions}
#| echo: TRUE

precomputed_tbl <- dir_ls(glue("{precompute_dir}")) |>
  as.character() |>
  enframe(name = NULL, value = "sim_file")


runsims_tbl <- pnbd_init_validsims_index_tbl |>
  anti_join(precomputed_tbl, by = "sim_file")


if(nrow(runsims_tbl) > 0) {
  pnbd_init_validsims_index_tbl <- runsims_tbl |>
    mutate(
      chunk_data = future_map2_int(
        cust_params, sim_file,
        run_simulations_chunk,

        sim_func = generate_pnbd_validation_transactions,

        .options = furrr_options(
          globals  = c(
            "calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
            "generate_pnbd_validation_transactions"
            ),
          packages   = c("tidyverse", "fs"),
          scheduling = Inf,
          seed       = 421
          ),

        .progress = TRUE
        )
      )
}

pnbd_init_validsims_index_tbl |> glimpse()
```

Now that we have generated our simulations we want to load the data from the
files and construct a dataset for use as part of the validation.


```{r retrieve_pnbd_init_valid_simstats}
#| echo: TRUE

pnbd_init_valid_simstats_tbl <- pnbd_init_validsims_index_tbl |>
  mutate(
    sim_data = map(
      sim_file, retrieve_sim_stats,

      .progress = "pnbd_init_valid"
      )
    ) |>
  select(customer_id, sim_data) |>
  unnest(sim_data)

pnbd_init_valid_simstats_tbl |> glimpse()
```


### Compare Counts of Multitransaction Customers

We start by checking high-level summary statistics in both the observed and
simulated data, beginning with the count of customers with more than one
transaction.

```{r compare_pnbd_init_valid_multitransaction_customer_counts}
#| echo: TRUE

obs_customer_count <- customer_valid_stats_tbl |>
  filter(tnx_count > 0) |>
  nrow()

sim_data_tbl <- pnbd_init_valid_simstats_tbl |>
  filter(sim_tnx_count > 0) |>
  count(draw_id, name = "sim_customer_count")

ggplot(sim_data_tbl) +
  geom_histogram(aes(x = sim_customer_count), bins = 50) +
  geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
  labs(
    x = "Count of Multi-transaction Customers",
    y = "Frequency",
    title = "Comparison Plot of Simulated vs Observed Customer Counts",
    subtitle = "(observed value in red)"
    )

```


### Total Transaction Count

We now check the count of all transactions.

```{r compare_pnbd_init_valid_total_transaction_count}
#| echo: TRUE

obs_total_count <- customer_valid_stats_tbl |>
  pull(tnx_count) |>
  sum()

sim_data_tbl <- pnbd_init_valid_simstats_tbl |>
  count(draw_id, wt = sim_tnx_count, name = "sim_total_count")


ggplot(sim_data_tbl) +
  geom_histogram(aes(x = sim_total_count), bins = 50) +
  geom_vline(aes(xintercept = obs_total_count), colour = "red") +
  labs(
    x = "Count of Total Transactions",
    y = "Frequency",
    title = "Comparison Plot of Simulated vs Observed Total Counts",
    subtitle = "(observed value in red)"
    )

```
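
The call to `count()` with a `wt` argument sums the weights within each group rather than counting rows. A small sketch on toy data (all values are made up) shows that it matches an explicit `group_by()` and `summarise()`.

```r
library(dplyr)
library(tibble)

# Toy simulation statistics: three customers across two draws
toy_simstats_tbl <- tibble(
  customer_id   = rep(c("C001", "C002", "C003"), times = 2),
  draw_id       = rep(1:2, each = 3),
  sim_tnx_count = c(0, 2, 1, 3, 0, 4)
)

# Weighted count: sums sim_tnx_count within each draw
count_tbl <- toy_simstats_tbl |>
  count(draw_id, wt = sim_tnx_count, name = "sim_total_count")

# Equivalent explicit form
summ_tbl <- toy_simstats_tbl |>
  group_by(draw_id) |>
  summarise(sim_total_count = sum(sim_tnx_count))

identical(count_tbl$sim_total_count, summ_tbl$sim_total_count)
```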


### Transaction Count Quantiles

We now look at the quantiles of the per-customer transaction counts, comparing
each observed quantile against its distribution across the simulation draws.

```{r compare_pnbd_init_valid_transaction_quantiles}
#| echo: TRUE

obs_quantiles_tbl <- customer_valid_stats_tbl |>
  filter(tnx_count > 0) |>   # apply the same filter used for the simulations
  reframe(
    prob_label = c("p10", "p25", "p50", "p75", "p90", "p99"),
    prob_value = quantile(tnx_count, probs = c(0.10, 0.25, 0.50, 0.75, 0.90, 0.99))
    )

sim_data_tbl <- pnbd_init_valid_simstats_tbl |>
  filter(sim_tnx_count > 0) |>
  group_by(draw_id) |>
  summarise(
    p10 = quantile(sim_tnx_count, 0.10),
    p25 = quantile(sim_tnx_count, 0.25),
    p50 = quantile(sim_tnx_count, 0.50),
    p75 = quantile(sim_tnx_count, 0.75),
    p90 = quantile(sim_tnx_count, 0.90),
    p99 = quantile(sim_tnx_count, 0.99)
    ) |>
  pivot_longer(
    cols = !draw_id,
    names_to  = "prob_label",
    values_to = "sim_prob_values"
    ) |>
  inner_join(obs_quantiles_tbl, by = "prob_label")

ggplot(sim_data_tbl) +
  geom_histogram(aes(x = sim_prob_values), binwidth = 1) +
  geom_vline(aes(xintercept = prob_value), colour = "red") +
  facet_wrap(vars(prob_label), nrow = 2, scales = "free") +
  labs(
    x = "Quantile of Counts",
    y = "Frequency",
    title = "Comparison Plots of Transaction Count Quantiles"
    )

```
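
The per-draw quantiles can also be produced in long format directly with `reframe()`, which avoids writing out one `summarise()` column per quantile. This is a sketch on toy data rather than a replacement for the code above: the `probs` vector and `toy_tbl` are illustrative.

```r
library(dplyr)
library(tibble)

# Named probabilities so labels and values stay in step
probs <- c(p10 = 0.10, p25 = 0.25, p50 = 0.50, p75 = 0.75, p90 = 0.90, p99 = 0.99)

# Toy per-customer counts for two draws (made-up values)
toy_tbl <- tibble(
  draw_id       = rep(1:2, each = 5),
  sim_tnx_count = c(1, 2, 3, 4, 5, 2, 2, 4, 6, 8)
)

# reframe() allows each group to return several rows: one per quantile
toy_quantiles_tbl <- toy_tbl |>
  group_by(draw_id) |>
  reframe(
    prob_label      = names(probs),
    sim_prob_values = quantile(sim_tnx_count, probs = probs, names = FALSE)
  )

toy_quantiles_tbl
```

The output has one row per draw and quantile label, matching the shape obtained from `pivot_longer()` in the chunk above.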


### Check Overall Day-of-Week Transaction Patterns

We now want to check our assumption that each customer transacts at a single,
constant rate. If this assumption is violated, the observed distribution of
transactions across days of the week (and possibly months of the year) will
differ from that produced by our simulations.

We do this both for the dataset overall and for each individual month.

```{r construct_valid_transaction_month_distributions}
#| echo: true

dow_props_valid_lst <- customer_valid_transactions_tbl |>
  calculate_dow_proportions()

tnx_valid_overall_dow_tbl   <- dow_props_valid_lst$overall
tnx_valid_yearmonth_dow_tbl <- dow_props_valid_lst$yearmonth


tnx_valid_overall_dow_tbl   |> glimpse()
tnx_valid_yearmonth_dow_tbl |> glimpse()
```
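
The helper `calculate_dow_proportions()` comes from `lib_btyd.R` and is not shown here. As a rough guide to the kind of calculation involved, the sketch below computes overall day-of-week proportions on toy data. It is an assumption about the general approach, not the helper's actual implementation, and the input column name `tnx_timestamp` is hypothetical.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Toy transactions; the column name tnx_timestamp is an assumption
toy_tnx_tbl <- tibble(
  customer_id   = c("C001", "C001", "C002", "C003"),
  tnx_timestamp = as.Date(c("2024-01-01", "2024-01-08", "2024-01-02", "2024-01-06"))
)

# Label each transaction with its day of week, count the transactions per
# day, and convert the counts to proportions of all transactions
toy_dow_tbl <- toy_tnx_tbl |>
  mutate(dow_label = wday(tnx_timestamp, label = TRUE, week_start = 1)) |>
  count(dow_label, name = "tnx_count") |>
  mutate(obs_prop = tnx_count / sum(tnx_count))

toy_dow_tbl
```

Two of the four toy transactions fall on the same weekday, so that day has a proportion of one half and the other two days a quarter each.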

We now need to do the same thing for our simulations and then construct a
comparison plot.

```{r construct_valid_overall_dow_comparisons}
#| echo: true

sim_valid_tnxdata_tbl <- pnbd_init_valid_simstats_tbl |>
  select(customer_id, draw_id, sim_data) |>
  unnest(sim_data)

propdata_valid_tbl <- sim_valid_tnxdata_tbl |>
  group_nest(draw_id) |>
  mutate(
    prop_data = map(
      data, calculate_dow_proportions,
      
      .progress = "calculate_dow_proportions"
      ),
    overall_data   = map(prop_data, "overall"),
    yearmonth_data = map(prop_data, "yearmonth")
    )

propdata_valid_tbl |> glimpse()
```

We now want to compare the simulation data against the observed proportions.

```{r create_valid_overall_dow_proportion_plot}
#| echo: true

simplot_valid_tbl <- propdata_valid_tbl |>
  select(draw_id, overall_data) |>
  unnest(overall_data) |>
  group_by(dow_label) |>
  summarise(
    .groups = "drop",

    p10 = quantile(obs_prop, 0.10),
    p25 = quantile(obs_prop, 0.25),
    p50 = quantile(obs_prop, 0.50),
    p75 = quantile(obs_prop, 0.75),
    p90 = quantile(obs_prop, 0.90)
    )

ggplot(simplot_valid_tbl) +
  geom_errorbar(
    aes(x = dow_label, ymin = p10, ymax = p90),
    width = 0, linewidth = 1
    ) +
  geom_errorbar(
    aes(x = dow_label, ymin = p25, ymax = p75),
    width = 0, linewidth = 3
    ) +
  geom_point(
    aes(x = dow_label, y = obs_prop),
    data = tnx_valid_overall_dow_tbl, colour = "red"
    ) +
  expand_limits(y = 0) +
  labs(
    x = "Day of Week",
    y = "Proportion",
    title = "Comparison Plot of the Day of Week Proportions"
    )

```
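
Alongside the plot, a simple tabular check is whether each observed proportion falls inside the simulated 10% to 90% band. The sketch below uses toy data (all values are made up). As in the workbook, the simulated proportions are stored in a column named `obs_prop`, because the same helper produces both the observed and simulated tables.

```r
library(dplyr)
library(tibble)

# Toy simulated proportions: five draws for each of two days
toy_sim_props_tbl <- tibble(
  dow_label = rep(c("Mon", "Tue"), each = 5),
  obs_prop  = c(0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.21, 0.22, 0.23, 0.24)
)

# Toy observed proportions
toy_obs_props_tbl <- tibble(
  dow_label = c("Mon", "Tue"),
  obs_prop  = c(0.15, 0.30)
)

# Summarise the draws into a band, then flag whether the observed
# proportion lies inside it
toy_check_tbl <- toy_sim_props_tbl |>
  group_by(dow_label) |>
  summarise(
    p10 = quantile(obs_prop, 0.10),
    p90 = quantile(obs_prop, 0.90)
  ) |>
  inner_join(toy_obs_props_tbl, by = "dow_label") |>
  mutate(in_band = obs_prop >= p10 & obs_prop <= p90)

toy_check_tbl
```

Days where `in_band` is `FALSE` are the ones worth inspecting more closely in the plot.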


### Check Monthly Day-of-Week Transaction Patterns

We now repeat this comparison month by month, with a separate panel for each
day of the week.

```{r create_valid_monthly_dow_proportion_plot}
#| echo: true

simplot_valid_tbl <- propdata_valid_tbl |>
  select(draw_id, yearmonth_data) |>
  unnest(yearmonth_data) |>
  group_by(dow_label, yearmonth_date) |>
  summarise(
    .groups = "drop",

    p10 = quantile(obs_prop, 0.10),
    p25 = quantile(obs_prop, 0.25),
    p50 = quantile(obs_prop, 0.50),
    p75 = quantile(obs_prop, 0.75),
    p90 = quantile(obs_prop, 0.90)
    )

ggplot(simplot_valid_tbl) +
  geom_ribbon(
    aes(x = yearmonth_date, ymin = p10, ymax = p90),
    alpha = 0.5
    ) +
  geom_ribbon(
    aes(x = yearmonth_date, ymin = p25, ymax = p75),
    alpha = 1.0
    ) +
  geom_point(
    aes(x = yearmonth_date, y = obs_prop),
    data = tnx_valid_yearmonth_dow_tbl, colour = "red"
    ) +
  facet_wrap(vars(dow_label)) +
  expand_limits(y = 0) +
  labs(
    x = "Month",
    y = "Proportion",
    title = "Comparison Plot of the Day of Week Proportions by Month"
    ) +
  theme(axis.text.x = element_text(angle = 20, size = 8, vjust = 0.5))

```


### Write Data to Disk

We now write the validation simulation statistics to disk.

```{r write_pnbd_init_valid_simstats_tbl}
#| echo: true

pnbd_init_valid_simstats_tbl |>
  write_rds("data/pnbd_init_assess_valid_simstats_tbl.rds", compress = "gz")
```


# R Environment {.unnumbered}

```{r show_session_info}
#| echo: TRUE
#| message: TRUE

options(width = 120L)
sessioninfo::session_info()
options(width = 80L)
```