Document explaining model structures and development plans #114
I'm only half-way through, but wanted to get some thoughts started!
There are two broad types of models under consideration: autoregressive (AR) and full-curve (FC) models.

- AR: Incident uptake at time $t$, $u_t$, is a function of previous incident uptake value(s), among other predictors.
- FC: Cumulative uptake at time $t$, $c_t$, is a function of time, among other predictors.
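
For readers comparing the two model types, it may help to recall that the two targets are related by a running sum: cumulative uptake $c_t$ is the sum of incident uptake $u_i$ up to time $t$. A minimal sketch of that conversion follows; the numbers are made up for illustration and are not project data.

```python
import numpy as np

# Hypothetical weekly incident uptake (fraction of the population vaccinated each week).
incident = np.array([0.02, 0.05, 0.08, 0.06, 0.03])

# Cumulative uptake c_t is the running sum of incident uptake u_t.
cumulative = np.cumsum(incident)

# Incident uptake is recovered by differencing; prepend=0.0 keeps the first week's value.
recovered = np.diff(cumulative, prepend=0.0)
```

So an AR forecast of $u_t$ can always be accumulated into $c_t$, and an FC forecast of $c_t$ can be differenced into $u_t$, which lets the two model types be compared on the same scale.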
I think the more typical phrasing is "parametric curve fitting," but that might be jargon I made up
Well I definitely made up the "full curve" jargon haha. As long as it is clear what we are talking about, and how it differs from AR, the specific jargon doesn't matter to me.
Thanks Ed for coming up with model ideas! My comments are research questions, which should not gate the model development.
## Forecast Uncertainty

Forecasting with AR models naturally produces a cone of uncertainty that expands into the future. Each draw from the posterior distribution is a unique combination of parameter values that defines a trajectory of uptake going forward (still with some stochastic influence on observations, from $\sigma$). All these trajectories sprout from the last observed data point and diverge as they move into the future.
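
The expanding cone described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses a simplified AR(1) mean, $\mu = \alpha + \beta_u u_{t-1}$, with no time terms and no season hierarchy, and the "posterior draws" are hypothetical stand-ins sampled from normal distributions rather than output from a fitted model. The function name `simulate_ar_forecast` and all numeric values are invented for illustration.

```python
import numpy as np

def simulate_ar_forecast(last_u, alpha, beta_u, sigma, horizon, rng):
    """Propagate every posterior draw forward through u_t ~ N(alpha + beta_u * u_{t-1}, sigma)."""
    n_draws = alpha.shape[0]
    paths = np.empty((n_draws, horizon))
    prev = np.full(n_draws, last_u)  # every trajectory starts at the last observed value
    for h in range(horizon):
        mu = alpha + beta_u * prev
        prev = rng.normal(mu, sigma)  # observation-level noise is added at each step
        paths[:, h] = prev
    return paths

rng = np.random.default_rng(0)
n_draws = 4000

# Hypothetical stand-ins for posterior draws of the parameters (assumed values).
alpha = rng.normal(0.005, 0.001, n_draws)
beta_u = rng.normal(0.8, 0.05, n_draws)
sigma = np.abs(rng.normal(0.004, 0.0005, n_draws))

# Assumed last observed incident uptake (0.04) and cumulative uptake (0.30).
incident_paths = simulate_ar_forecast(0.04, alpha, beta_u, sigma, 8, rng)
cumulative_paths = 0.30 + np.cumsum(incident_paths, axis=1)

# Width of the 95% interval on cumulative uptake at each forecast horizon.
lower, upper = np.percentile(cumulative_paths, [2.5, 97.5], axis=0)
width = upper - lower
```

Because each step feeds a noisy $u_{t-1}$ into the next mean, both parameter uncertainty and the per-step noise from $\sigma$ accumulate, so `width` is larger at the last horizon than at the first.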
I'm still not clear about how to compound the forecast uncertainty in the AR model. When the model is fitted, we get the posterior matrices for parameters [
See my note above about distinguishing the observed uptake data from the latent uptakes. The cone of uncertainty expands because of the combinations of uncertainties on
Even before we distinguish true latent uptake from observed uptake, it is worth clarifying how uncertainty compounds into the future when forecasting with our current AR model. From each
&A,~H,~n,~\sigma_A,~\sigma_H,~\sigma_n,~\sigma \sim \text{prior distributions} \\
\end{align*}
$$
How do we ensure that
Good point! In principle, you are correct. In practice, I imagine that thin enough prior distributions for
\begin{align*}
&u_t \sim N(\mu, \sigma) \\
&\mu = \alpha_s + \beta_{u,s}u_{t-1} + \beta_{t,s}t + \beta_{tu,s}tu_{t-1} \\
&\alpha_s \sim N(\alpha,~\sigma_{\alpha}) \\
I think this is conceptually equivalent to deciding when
Like, either you say "t=0 is always Sep 1, and there might have been some amount of uptake by that point, and we fit for that amount" or you say "there is some time, around Sep 1, before which there was practically no uptake, and we fit for that date"
Personally, I like having the t=0 date be the thing we fit for, because then you can have a wider prior (e.g., for normal flu seasons) or a very tight prior (e.g., when we know the exact rollout date for a Covid vaccine)
Another reason to prefer t=0 being the variable is that you decouple the temporal thing (when does the season start) from overall uptake. A season with higher overall uptake probably has higher uptake at t=0 (i.e.,
This is the same as #30, I think
Yep! That is #30.
On closer inspection, your comments are about #30 here, but this section is not. This section is about allowing the model parameters to differ by season, to account for different seasons having slightly different curve shapes. But I think some discussion on how rollout is handled is important to add.
And again, factors other than season, such as geographic area or demographic group, could be used to group the data.

## Forecasting Uncertainty
I think parametric curve fitting should still work... you get the greatest uncertainty on
I disagree. Think of the Hill model parameters: A = maximum uptake, H = half-maximal time, and n = steepness. Uncertainty at the end of the season depends mostly on uncertainty in A. Uncertainty at the middle of the season depends on uncertainty in A, H, and n.
Moreover, if we are fitting on 15 past seasons of data and the first half of this season, that means we have 16 data points per timepoint in early and midseason, and 15 data points per timepoint at the end of season. That's nearly the same amount of data to fit with, no matter where you are along the curve.
This is an old figure, but... The red curve here is a Hill model, fit to the flu data before ~Oct 2024, and then projecting ~Nov 2024 onward. (There are no hyperparameters for season here.) Note that the 95% credible interval starts somewhat wide, even right where the observations leave off, and it does not expand much into the future.
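
The behavior described in this comment can be reproduced qualitatively with a short sketch. It assumes the standard Hill form $c_t = A\,t^n / (H^n + t^n)$, which matches the parameter meanings given above (A = maximum uptake, H = half-maximal time, n = steepness) but is not confirmed by this thread as the exact form used in the project. The parameter draws are hypothetical, independent stand-ins for a posterior (a real posterior would likely be correlated), and the observation noise `sigma` and the time grid are invented values.

```python
import numpy as np

def hill(t, A, H, n):
    """Standard Hill curve: A is the plateau, H the half-maximal time, n the steepness."""
    return A * t**n / (H**n + t**n)

rng = np.random.default_rng(1)
n_draws = 4000

# Hypothetical, independent stand-ins for posterior draws (assumed values).
A = rng.normal(0.50, 0.03, n_draws)
H = rng.normal(90.0, 8.0, n_draws)
n = rng.normal(2.5, 0.2, n_draws)
sigma = 0.02  # hypothetical observation noise SD

# Days since season start; treat day 100 as the forecast date.
t = np.array([100.0, 150.0, 200.0, 300.0])

# One latent curve per draw, evaluated at every time point, plus observation noise.
latent = hill(t[None, :], A[:, None], H[:, None], n[:, None])
predicted = latent + rng.normal(0.0, sigma, latent.shape)

# Width of the 95% interval at each time point.
lower, upper = np.percentile(predicted, [2.5, 97.5], axis=0)
width = upper - lower
```

Under these assumptions the interval is already wide at the forecast date, where uncertainty in A, H, and n all contribute, and it stays at a similar width late in the season, where uncertainty in A dominates. Nothing in the curve itself anchors the forecast to the last observed data point.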
Good point: for a linear regression, the cone of uncertainty continues to grow as you move away from where you have data, but for vaccine uptake, which we know will be a sigmoid that plateaus at some value, we really only have some finite uncertainty about that final uptake
So maybe this actually is desirable behavior? The posterior uncertainty in A is really the thing that matters. And we actually do have, as you point out, for flu, a lot of information about what that's going to be.
Maybe another way I would read this is: the LIUM has more uncertainty about H (half-way point). Like, the LIUM cone (and even point estimates) suggest that uptake is still rising at day 365, while CHM is confident it will have turned over by day 200. So the "cone of uncertainty" isn't so much about A but about H?
I'm not sure if the shaded areas are uncertainty on the latent (i.e., predicted "true" uptake) or the predictions (i.e., including measurement error). Maybe what this is saying is that measurement error is large compared to uncertainty in the latent curve?
I do not see this as desirable behavior. Even if the Hill model matched the data perfectly at the junction between data and forecast (i.e. at the forecast date), the wide uncertainty there says that cumulative uptake could plausibly go down ~5% in the week after the forecast date. That's clearly not right.
Remember, there is no distinction between latent true uptake and noisy observed uptake in these models, yet. And I believe that might be the root of the problem.
- Because the autoregressive model only estimates the relationship between pairs of successive weeks, when forecasting, it trusts the last data point before the forecast date absolutely. Uncertainty compounds into the future from there.
- Because the Hill model estimates the shape of a full curve, when forecasting, it does not trust the last data point before the forecast date any more than it trusts any other historical data point. Uncertainty captures variations across the historical curves, even on the forecast date itself.
I am soon to submit another PR which I hope will take a step toward illustrating this a bit better.
I think this doc has already done its job, which is to stimulate thoughtful discussion about the next steps in model development. In particular, I think the autoregressive model that @swo outlined, separating latent true uptake from noisy observed uptake, should be the next step. So I plan to merge this doc and to edit it periodically in the future, whenever we need a catalyst for more model development brainstorming.
There are many directions we could take the refining of existing models and/or introduction of new model structures. I have made a document to summarize where we are now, some of the issues observed, and ideas for model development.