
Document explaining model structures and development plans #114

Merged: 7 commits into main, Feb 5, 2025

Conversation

eschrom
Collaborator

@eschrom eschrom commented Feb 1, 2025

There are many directions we could take in refining existing models and/or introducing new model structures. I have written a document summarizing where we are now, some of the issues observed, and ideas for model development.

@eschrom eschrom requested review from swo and Fuhan-Yang February 1, 2025 01:27
Collaborator

@swo swo left a comment


I'm only half-way through, but wanted to get some thoughts started!

docs/model_development.md (outdated; resolved)
There are two broad types of models under consideration: autoregressive (AR) and full-curve (FC) models.

- AR: Incident uptake at time $t$, $u_t$, is a function of previous incident uptake value(s), among other predictors.
- FC: Cumulative uptake at time $t$, $c_t$, is a function of time, among other predictors.
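A minimal sketch of the two forms (function and parameter names are hypothetical; a Hill curve stands in for a generic FC model):

```python
import numpy as np

def ar_step(u_prev, alpha, beta, eps=0.0):
    """AR: incident uptake at time t is a (noisy) function of uptake at t-1."""
    return alpha + beta * u_prev + eps

def fc_hill(t, A, H, n):
    """FC: cumulative uptake at time t is an explicit function of time."""
    return A * t**n / (H**n + t**n)

# Roll an AR model forward three steps (noise omitted for clarity):
u = [0.05]
for _ in range(3):
    u.append(ar_step(u[-1], alpha=0.01, beta=0.8))

# Evaluate an FC model at arbitrary times:
c = fc_hill(np.array([30.0, 120.0, 300.0]), A=0.5, H=120.0, n=4.0)
```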
Collaborator

I think the more typical phrasing is "parametric curve fitting," but that might be jargon I made up

Collaborator Author

Well I definitely made up the "full curve" jargon haha. As long as it is clear what we are talking about, and how it differs from AR, the specific jargon doesn't matter to me.

docs/model_development.md (three resolved comment threads)
@swo swo self-requested a review February 4, 2025 14:55
Contributor

@Fuhan-Yang Fuhan-Yang left a comment


Thanks Ed for coming up with model ideas! My comments are research questions, which should not gate the model development.

docs/model_development.md (resolved)
## Forecast Uncertainty

Forecasting with AR models naturally produces a cone of uncertainty that expands into the future. Each draw from the posterior distribution is a unique combination of parameter values that defines a trajectory of uptake going forward (still with some stochastic influence on observations, from $\sigma$). All these trajectories sprout from the last observed data point and diverge as they move into the future.

Contributor

I'm still not clear on how forecast uncertainty compounds in the AR model. When the model is fitted, we get posterior matrices for the parameters $[\alpha, \beta_x, v]$. We draw X rows (random indices) from the posterior matrices to calculate $f_{t+1}$, giving X values of $f_{t+1}$. We then forecast $f_{t+2}$ using the parameter estimates (X samples? or the posterior mean) together with the X values of $f_{t+1}$. If each $f_{t+1}$ is paired with another X samples of $[\alpha, \beta_x, v]$, the number of samples increases exponentially as we forecast sequentially. Not sure whether the uncertainty will explode.

Collaborator

See my note above about distinguishing the observed uptake data from the latent uptakes. The cone of uncertainty expands because of the combinations of uncertainties on $\hat{u}_0$, $\beta_i$, and $\epsilon_t$

Collaborator Author

Even before we distinguish true latent uptake from observed uptake, it is worth clarifying how uncertainty compounds into the future when forecasting with our current AR model. Each sample $[\alpha, \beta_x, \sigma]$ from the posterior works like this: $[\alpha, \beta_x]$ defines a deterministic trajectory into the future, but at each timepoint along the way there is still uncertainty thanks to $\sigma$. So at each timepoint, one value is sampled to serve as the basis for projecting the next timepoint, and so on. There is no branching into X new samples at each step, so the number of trajectories stays constant while their spread grows.
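That sequential sampling can be sketched as follows (one trajectory per posterior draw; all names and the fake posterior draws are hypothetical):

```python
import numpy as np

def forecast_ar(u_last, alphas, betas, sigmas, horizon, seed=0):
    """For each posterior draw (alpha, beta, sigma), roll one trajectory forward.
    At every step, noise is sampled from N(0, sigma) and the sampled value
    becomes the basis for the next step, so uncertainty compounds over time
    without the number of trajectories ever growing."""
    rng = np.random.default_rng(seed)
    n_draws = len(alphas)
    traj = np.empty((n_draws, horizon))
    u = np.full(n_draws, u_last)   # every trajectory starts at the last observation
    for t in range(horizon):
        u = alphas + betas * u + rng.normal(0.0, sigmas)
        traj[:, t] = u
    return traj

# Fake posterior draws, just to illustrate the widening cone:
rng = np.random.default_rng(1)
alphas = rng.normal(0.01, 0.002, 500)
betas = rng.normal(0.9, 0.02, 500)
sigmas = np.abs(rng.normal(0.01, 0.002, 500))
traj = forecast_ar(0.05, alphas, betas, sigmas, horizon=10)
spread = traj.std(axis=0)   # grows with the horizon
```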

This diagram may help: [two diagram images, not reproduced]

$$
\begin{align*}
&\ldots \\
&A,~H,~n,~\sigma_A,~\sigma_H,~\sigma_n,~\sigma \sim \text{prior distributions}
\end{align*}
$$

Contributor

How do we ensure that $c_t$ is bounded by $[0,1]$ without bounding $A$, $H$, $n$, $\sigma_A$, $\sigma_H$, $\sigma_n$, and $\sigma$? Without any constraints, the support of these hyperparameters is all of $\mathbb{R}$.

Collaborator Author
@eschrom eschrom Feb 4, 2025

Good point! In principle, you are correct. In practice, I imagine that thin enough prior distributions for $A$ and $\sigma_A$ would prevent predicted values of $c_t$ from escaping their bounds.
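As a sanity check of this point: the Hill curve is bounded above by $A$, so $c_t$ stays in $[0,1]$ whenever the sampled $A$ does. A sketch (assuming, hypothetically, a logit-normal prior on $A$, which guarantees the bound rather than merely making violations rare):

```python
import numpy as np

def hill(t, A, H, n):
    """Hill curve for cumulative uptake: rises from 0 toward the plateau A."""
    return A * t**n / (H**n + t**n)

# The curve never exceeds A, so c_t is in [0, 1] exactly when A is.
# A "thin" Gaussian prior makes draws of A > 1 vanishingly rare;
# a transformed prior like the logit-normal below rules them out entirely.
rng = np.random.default_rng(0)
A_draws = 1.0 / (1.0 + np.exp(-rng.normal(0.0, 1.0, 1000)))  # in (0, 1)
t = np.linspace(1.0, 365.0, 50)
curves = hill(t[None, :], A_draws[:, None], 120.0, 4.0)
```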

$$
\begin{align*}
&u_t \sim N(\mu, \sigma) \\
&\mu = \alpha_s + \beta_{u,s}u_{t-1} + \beta_{t,s}t + \beta_{tu,s}tu_{t-1} \\
&\alpha_s \sim N(\alpha,~\sigma_{\alpha}) \\
&\ldots
\end{align*}
$$
Collaborator

I think this is conceptually equivalent to deciding when $t=0$ is?

Like, either you say "t=0 is always Sep 1, and there might have been some amount of uptake by that point, and we fit for that amount" or you say "there is some time, around Sep 1, before which there was practically no uptake, and we fit for that date"

Personally, I like having the t=0 date be the thing we fit for, because then you can have a wider prior (e.g., for normal flu seasons) or a very tight prior (e.g., when we know the exact rollout date for a Covid vaccine)
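A sketch of what fitting the t=0 date could look like (all dates, prior widths, and names hypothetical): the onset $t_0$ becomes a parameter, with a wide prior for a typical flu season and a tight prior when the rollout date is known.

```python
import numpy as np

def uptake_with_onset(t, t0, A, H, n):
    """Cumulative uptake that is exactly zero before the fitted onset date t0."""
    s = np.clip(t - t0, 0.0, None)   # time since rollout; zero before rollout
    return A * s**n / (H**n + s**n)

rng = np.random.default_rng(0)
# Wide prior on onset for a normal flu season (~early September, sd two weeks)...
t0_flu = rng.normal(244.0, 14.0, 1000)
# ...versus a tight prior when a Covid rollout date is known almost exactly.
t0_covid = rng.normal(258.0, 1.0, 1000)
```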

Collaborator

Another reason to prefer t=0 being the fitted variable is that you decouple the temporal question (when does the season start) from overall uptake. A season with higher overall uptake probably also has higher uptake at t=0 (i.e., $\alpha$), so you would get correlations there?

Collaborator

This is the same as #30 , I think

Collaborator Author

Yep! That is #30.

Collaborator Author
@eschrom eschrom Feb 4, 2025

On closer inspection, your comments are about #30 here, but this section is not. This section is about allowing the model parameters to differ by season, to account for different seasons having slightly different curve shapes. But I think some discussion on how rollout is handled is important to add.


And again, factors other than season, such as geographic area or demographic group, could be used to group the data.

## Forecasting Uncertainty
Collaborator

I think parametric curve fitting should still work... you get the greatest uncertainty on $\hat{u}$ in places where you don't have data, i.e., in the future. It's like any regression: error bars are largest as you project further away from the data.

Collaborator Author

I disagree. Think of the Hill model parameters: A = maximum uptake, H = half-maximal time, and n = steepness. Uncertainty at the end of the season depends mostly on uncertainty in A. Uncertainty at the middle of the season depends on uncertainty in A, H, and n.

Moreover, if we are fitting on 15 past seasons of data and the first half of this season, that means we have 16 data points per timepoint in early and midseason, and 15 data points per timepoint at the end of season. That's nearly the same amount of data to fit with, no matter where you are along the curve.

This is an old figure, but... The red curve here is a Hill model, fit to the flu data before ~Oct 2024, and then projecting ~Nov 2024 onward. (There are no hyperparameters for season here.) Note that the 95% credible interval starts somewhat wide, even right where the observations leave off, and it does not expand much into the future.

Collaborator

Good point: for a linear regression, the cone of uncertainty continues to grow as you move away from where you have data, but for vaccine uptake, which we know will be a sigmoid that plateaus at some value, we really only have some finite uncertainty about that final uptake

So maybe this actually is desirable behavior? The posterior uncertainty in A is really the thing that matters. And we actually do have, as you point out, for flu, a lot of information about what that's going to be.

Maybe another way I would read this is: the LIUM has more uncertainty about H (half-way point). Like, the LIUM cone (and even point estimates) suggest that uptake is still rising at day 365, while CHM is confident it will have turned over by day 200. So the "cone of uncertainty" isn't so much about A but about H?

I'm not sure if the shaded areas are uncertainty on the latent (i.e., predicted "true" uptake) or the predictions (i.e., including measurement error). Maybe what this is saying is that measurement error is large compared to uncertainty in the latent curve?

Collaborator Author
@eschrom eschrom Feb 5, 2025

I do not see this as desirable behavior. Even if the Hill model matched the data perfectly at the junction between data and forecast (i.e. at the forecast date), the wide uncertainty there says that cumulative uptake could plausibly go down ~5% in the week after the forecast date. That's clearly not right.

Remember, there is no distinction between latent true uptake and noisy observed uptake in these models, yet. And I believe that might be the root of the problem.

  • Because the autoregressive model only estimates the relationship between pairs of successive weeks, when forecasting, it trusts the last data point before the forecast date absolutely. Uncertainty compounds into the future from there.
  • Because the Hill model estimates the shape of a full curve, when forecasting, it does not trust the last data point before the forecast date any more than it trusts any other historical data point. Uncertainty captures variations across the historical curves, even on the forecast date itself.

I am soon to submit another PR which I hope will take a step toward illustrating this a bit better.

@eschrom
Collaborator Author

eschrom commented Feb 5, 2025

I think this doc has already done its job, which is to stimulate thoughtful discussion about the next steps in model development. In particular, I think the autoregressive model that @swo outlined, separating latent true uptake from noisy observed uptake, should be the next step.

So I plan to merge this doc and to edit it periodically in the future, whenever we need a catalyst for more model development brainstorming.

@eschrom eschrom merged commit a45956e into main Feb 5, 2025
2 checks passed
@eschrom eschrom deleted the ecs_modeldoc branch February 5, 2025 22:40
Fuhan-Yang pushed a commit that referenced this pull request Feb 7, 2025