---
title: "Football prediction model in Stan"
author: "Guillem Hurault"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
html_notebook:
number_sections: yes
toc: yes
---
# Introduction
This repository contains the code of a personal project where I am implementing a simple "Dixon-Coles" model to predict the outcome of football games with the probabilistic programming language Stan.
As a disclaimer, I am not a particular fan of football, and the model presented here is far too simple to accurately model or predict the outcome of games, a fortiori to be used for betting (and if the goal were betting, designing models for individual sports where the outcomes are less uncertain, such as darts or horse racing, would probably be safer).
Having said that, this project was fun and a good way for me to work with "clean" data and learn about Bayesian workflow.
Notably, I see my main contribution in the quantities that the model can predict (cf. suffix `_test` in the Stan code) or that can be used for posterior predictive checking (cf. suffix `_rep`):
- the number of games won.
- the number of games lost.
- the number of games ending in a draw.
- the total number of goals scored.
- the goal difference.
- the number of points.
- the ranking.
In this notebook, I present an overview of what I have done in this project; it is directed at an audience with some familiarity with Bayesian modelling.
Before going into the details of the analysis, let's first initialise the notebook.
```{r message=FALSE}
set.seed(1559354162) # Reproducibility
library(HuraultMisc) # Personal function library
library(ggplot2)
library(cowplot)
library(ggtext)
library(rstan)
rstan_options(auto_write = TRUE) # Save compiled model
options(mc.cores = parallel::detectCores()) # Parallel computing
source("functions.R") # Utility functions
```
# Data
In this project, I am using publicly available [football data](http://football-data.co.uk/) of the 2018-2019 English Premier League season.
We will only focus on the total number of goals scored by the home team ("Full Time Home Goal" or FTHG in the data), the total number of goals scored by the away team ("Full Time Away Goal" or FTAG in the data) and the results ("Full Time Results" or FTR in the data), which can be "Home win" ("H" in the data), "Away win" ("A" in the data) or "Draw" ("D" in the data).
Each of the 20 teams of the Premier League plays the other teams twice, once at home and once away, for a total number of 380 games.
```{r}
df0 <- read.csv("Data/PremierLeague1819.csv")
# Processing
df <- df0[, c("Div", "Date", "HomeTeam", "AwayTeam", "HTHG", "HTAG", "FTHG", "FTAG", "FTR")]
df$FTR <- factor(df$FTR, levels = c("A", "D", "H"), ordered = TRUE)
# Teams
teams <- with(df, sort(unique(c(as.character(HomeTeam), as.character(AwayTeam)))))
# Associate a unique ID to each game
id <- game_id(teams)
df <- merge(df, id, by = c("HomeTeam", "AwayTeam"))
# Order by date
df$Date <- as.Date(df$Date, "%d/%m/%Y")
df <- df[order(df$Date), ]
heatmap_results(df) +
labs(title = "Full time results of the 2018/2019 English Premier League")
```
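For reference, `game_id()` used above is defined in `functions.R`; as an assumption on my part (this is not the actual definition), a minimal version would enumerate every ordered home/away pair and attach a unique index:
```{r}
# Hypothetical sketch of game_id() from functions.R (not the actual definition):
# one row per ordered (HomeTeam, AwayTeam) pair with a unique game index.
game_id_sketch <- function(teams) {
  id <- expand.grid(AwayTeam = teams, HomeTeam = teams, stringsAsFactors = FALSE)
  id <- id[id$HomeTeam != id$AwayTeam, c("HomeTeam", "AwayTeam")]
  id$Game <- seq_len(nrow(id))
  id
}
head(game_id_sketch(c("Arsenal", "Chelsea", "Liverpool")))
```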
In the English Premier League, a win is worth 3 points, a draw 1 point, and no points are awarded for losing a game.
The team with the highest number of points at the end of the season wins the championship.
The goal difference (number of goals scored minus number of goals conceded) is used to break ties when teams finish with an equal number of points.
This season, Manchester City won the Premier League with 98 points, followed very closely by Liverpool with 97 points.
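`football_stats()` below (from `functions.R`) computes these season summaries; purely as an illustration of the scoring rule, a minimal (hypothetical) points computation could look like:
```{r}
# Minimal sketch of the scoring rule (football_stats() in functions.R is the
# repository's actual implementation): 3 points per win, 1 per draw.
points_for_team <- function(df, team) {
  home <- df$HomeTeam == team
  away <- df$AwayTeam == team
  3 * sum(home & df$FTR == "H") + 3 * sum(away & df$FTR == "A") +
    sum((home | away) & df$FTR == "D")
}
points_for_team(df, "Man City") # team label as in the data; should be 98
```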
```{r}
(fstats <- football_stats(df)) # Football statistics
```
# Model
In our model, we assume that the numbers of goals scored by the two teams follow independent Poisson distributions.
For each game, if we index the home team by $h$ and the away team by $a$, then the rates $\lambda_h$ and $\lambda_a$ of the Poisson distribution are given by:
$$
\begin{aligned}
\log(\lambda_h) & = b + \mathit{attack_h} - \mathit{defence_a} + \mathit{advtg} \\
\log(\lambda_a) & = b + \mathit{attack_a} - \mathit{defence_h}
\end{aligned}
$$
Where:
- $b$ is the intercept, i.e. the logarithm of the average goal rate assuming the attack and defence abilities of the teams cancel out.
- $\mathit{attack_k}$ and $\mathit{defence_k}$ are the latent attack and defence abilities of the $k$-th team.
- $\mathit{advtg}$ is the home advantage.
Priors for the parameters were chosen to be weakly informative and to result in reasonable prior predictive distributions, as we will see in the next section:
- $b \sim \mathcal{N}(0, 0.5^2)$.
This prior can be understood by considering a situation where the two teams have the same underlying attack and defence abilities and there is no home advantage, resulting in an average goal rate of $\exp(b)$.
As a rule of thumb, if we consider that $\mathcal{N}(0, 0.5^2)$ ranges from -1 to 1 (approx. 95\% CI), then the average goal rate follows a log-normal distribution ranging from $\exp(-1) \approx 0.37$ to $\exp(1) \approx 2.72$.
- $\mathit{attack}_k$ and $\mathit{defence}_k$ follow the hierarchical prior:
- $\mathit{attack}_k \sim \mathcal{N}(0, \sigma^2)$
- $\mathit{defence}_k \sim \mathcal{N}(0, \sigma^2)$
- $\sigma \sim \mathcal{N}^{+} \Big( 0, \big( \log(5) / 2.3 / \sqrt{2}\big)^2 \Big)$.
This prior can be understood by considering that, if $\mathit{attack}_k$ and $\mathit{defence}_k$ are independent, then $\mathit{attack_h} - \mathit{defence_a} \sim \mathcal{N}\big( 0, (\sqrt{2} \sigma)^2 \big)$.
If we are at the upper tail of the distribution, for instance at the 99\% quantile ($z = 2.3$), this means that the home team would score $\exp(2.3 \sqrt{2} \sigma)$ times more goals than the global average.
Here we consider that the home team could score at most $5 = \exp(2.3 \sqrt{2} \sigma)$ times more goals than the average, hence the scale of the prior on $\sigma$.
- $\mathit{advtg} \sim \mathcal{N}(0.5, 0.25^2)$.
Here, we assume that the home advantage is positive (i.e. that, everything else being equal, a team is more likely to score at home than away), but that it is unlikely to be very large.
As a rule of thumb, $\mathcal{N}(0.5, 0.25^2)$ ranges from 0 to 1 (approx. 95\% CI), meaning that at best, a team would score $\exp(1) \approx 2.7$ times more goals at home.
The model is implemented in [`Model/DC_model.stan`](Model/DC_model.stan).
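To make the generative process concrete, here is a minimal R sketch (not taken from the Stan file) that simulates one game given made-up parameter values:
```{r}
# Minimal sketch of the generative process with hypothetical parameter values
# (the actual model is in Model/DC_model.stan).
sigma_ability <- log(5) / 2.3 / sqrt(2) # prior scale derived above, approx. 0.49
simulate_game <- function(b, attack, defence, advtg, h, a) {
  lambda_h <- exp(b + attack[h] - defence[a] + advtg)
  lambda_a <- exp(b + attack[a] - defence[h])
  c(home_goals = rpois(1, lambda_h), away_goals = rpois(1, lambda_a))
}
# A hypothetical 3-team league: team 1 hosts team 2
simulate_game(b = 0.1,
              attack = rnorm(3, 0, sigma_ability),
              defence = rnorm(3, 0, sigma_ability),
              advtg = 0.4,
              h = 1, a = 2)
```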
# Prior predictive check
In this section, I perform a prior predictive check to confirm that our choice of priors results in simulated data that appears reasonable.
Let's first prepare the ground to run MCMC.
```{r}
compiled_model <- stan_model("Model/DC_model.stan")
# MCMC options
n_chains <- 4
n_it <- 2000
# Parameters of interest
param_pop <- c("b", "home_advantage", "sigma_ability")
param_rep <- c("win_rep", "draw_rep", "lose_rep",
"goal_tot_rep", "goal_diff_rep", "point_rep")
param_ind <- c("attack", "defence", param_rep)
param_obs <- c("home_goals_rep", "away_goals_rep")
param <- c(param_pop, param_ind, param_obs)
```
Then, we can simulate data from the prior predictive distribution by running Stan without evaluating the likelihood (cf. the `run` flag in the data list below: with `run = 0` the likelihood is skipped and sampling targets the prior).
```{r message=FALSE, warning=FALSE}
# Characteristics of the data to generate
n_teams <- 20
teams_simu <- LETTERS[1:n_teams]
id_simu <- game_id(teams_simu)
data_prior <- list(
N_teams = n_teams,
N_games = n_teams * (n_teams - 1),
home_goals = rep(1, n_teams * (n_teams - 1)), # doesn't matter
away_goals = rep(1, n_teams * (n_teams - 1)), # doesn't matter
home_id = sapply(id_simu[["HomeTeam"]], function(x) {which(x == teams_simu)}),
away_id = sapply(id_simu[["AwayTeam"]], function(x) {which(x == teams_simu)}),
run = 0
)
fit_prior <- sampling(compiled_model,
data = data_prior,
pars = param,
iter = n_it,
chains = n_chains)
par_prior <- extract_parameters(fit_prior, param, param_ind, param_obs, teams_simu, id_simu$Game, data_prior) # Store parameters for later use
```
We can check the distribution of each individual parameter:
```{r}
plot(fit_prior, pars = c(param_pop, paste0(param_ind[1:2], "[1]")), plotfun = "hist")
```
We can also inspect, for example, the number of goals scored by the home team in a random game (all teams and games are interchangeable at this point).
```{r}
goals <- extract(fit_prior, pars = c("home_goals_rep[1]"))[[1]]
summary(goals)
hist(goals, breaks = 40)
hist(goals[goals < 20], breaks = 20)
quantile(goals, probs = c(.25, .5 , .75, .9, .99, .999))
```
Although the prior distribution of goals has most of its mass at small values (e.g. $< 5$), it has a long tail: for instance, the probability of the home team scoring more than 20 goals in a single Premier League game is `r signif(mean(goals >= 20), 3)`.
While this probability is small, the probability that this happens at least once during the 380 games of a season, i.e. $1 - (1 - p)^{380}$, is `r signif(1 - dbinom(0, 380, mean(goals >= 20)), 3)`, which might be considered unrealistic.
This would suggest making changes to the model, but we will continue with it for illustration purposes.
We can also look at the distribution of the number of games won, lost or ending in a draw for a random team, but we do not detect anything unrealistic:
```{r}
pl <- lapply(c("win", "lose", "draw"),
function(x) {
otc <- extract(fit_prior, pars = paste0(x, "_rep[1]"))[[1]]
otc <- factor(otc, levels = 0:(2 * (n_teams - 1)))
otc <- table(otc) / length(otc)
ggplot(data = data.frame(otc), aes(x = otc, y = Freq)) +
geom_bar(stat = "identity") +
scale_x_discrete(breaks = seq(1, 2 * (n_teams - 1), 2)) +
labs(x = paste0("Number of ", x), y = "Prior probability") +
theme_bw(base_size = 15)
})
plot_grid(plotlist = pl, ncol = 1)
```
# Fake data check
In this section, we evaluate whether the algorithm "works", i.e. whether we can retrieve the parameters of the model from the data, when we know the parameters of the true data-generating mechanism.
To do this, we take a single draw from the prior predictive distribution and fit the model to the simulated data.
```{r}
draw <- 2019 # Index of the prior draw used as the ground truth
# True parameters
true_param_pop <- lapply(extract(fit_prior, pars = param_pop), function(x) {x[draw]})
true_param_ind <- lapply(extract(fit_prior, pars = param_ind), function(x) {x[draw, ]})
true_param <- rbind(
do.call(rbind,
lapply(1:length(true_param_ind),
function(i) {
data.frame(Variable = names(true_param_ind)[i],
True = true_param_ind[[i]],
Team = teams_simu)
})),
do.call(rbind,
lapply(1:length(true_param_pop),
function(i) {
data.frame(Variable = names(true_param_pop)[i],
True = true_param_pop[[i]],
Team = NA)
}))
)
# Fake data
fd <- cbind(id_simu,
data.frame(FTHG = extract(fit_prior, pars = "home_goals_rep")[[1]][draw, ],
FTAG = extract(fit_prior, pars = "away_goals_rep")[[1]][draw, ],
FTR = NA))
fd$FTR[fd$FTHG == fd$FTAG] <- "D"
fd$FTR[fd$FTHG > fd$FTAG] <- "H"
fd$FTR[fd$FTHG < fd$FTAG] <- "A"
fd$FTR <- factor(fd$FTR, levels = c("A", "D", "H"), ordered = TRUE)
```
We can visualise the outcome of these simulated games:
```{r}
heatmap_results(fd)
```
And we can also compute some statistics about this fake data:
```{r}
(fstats_fake <- football_stats(fd))
```
Let's now fit the model with the fake data:
```{r message=FALSE, warning=FALSE}
data_fake <- list(
N_teams = n_teams,
N_games = n_teams * (n_teams - 1),
home_goals = fd$FTHG,
away_goals = fd$FTAG,
home_id = sapply(fd[["HomeTeam"]], function(x) {which(x == teams_simu)}),
away_id = sapply(fd[["AwayTeam"]], function(x) {which(x == teams_simu)}),
run = 1
)
fit_fake <- sampling(compiled_model,
data = data_fake,
pars = param,
iter = n_it,
chains = n_chains)
par_fake <- extract_parameters(fit_fake, param, param_ind, param_obs, teams_simu, fd$Game, data_fake)
```
First, we should check the MCMC diagnostics: nothing to worry about.
```{r}
check_hmc_diagnostics(fit_fake)
pairs(fit_fake, pars = param_pop)
plot(fit_fake, pars = param_pop, plotfun = "trace")
print(fit_fake, pars = param_pop)
```
Then, we can check whether the posterior estimates are "close enough" to the true parameters:
```{r}
(ce <- check_estimates(par_fake, true_param, param_pop, param_ind[1:2]))
```
Visually, they appear to be, but we can also quantify this by computing, for example, the 90% coverage probability, i.e. the proportion of parameters falling inside their 90% credible interval.
Here the coverage is `r signif(ce$Coverage, 2)`, which is close enough to what it should be, i.e. 90%.
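For reference, `check_estimates()` comes from `functions.R`; a minimal sketch of the coverage computation (with hypothetical inputs, not the actual implementation) could be:
```{r}
# Hypothetical sketch of a 90% coverage computation (check_estimates() in
# functions.R is the repository's version).
# post: draws-by-parameters matrix; truth: vector of true parameter values.
coverage_90 <- function(post, truth) {
  lo <- apply(post, 2, quantile, probs = 0.05)
  hi <- apply(post, 2, quantile, probs = 0.95)
  mean(truth >= lo & truth <= hi)
}
```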
Finally, we can perform posterior predictive checks to detect discrepancies between the observed (here, fake) data and the posterior replications.
We can investigate several summary statistics, such as the number of games won, lost or drawn for a random team, as well as the total number of points or even the final rank.
From the plots, we cannot visually identify any issues with the posterior replications.
```{r fig.height = 10, fig.width = 10}
PPC_football_stats(fit_fake, "win_rep", fstats_fake, teams_simu)
PPC_football_stats(fit_fake, "lose_rep", fstats_fake, teams_simu)
PPC_football_stats(fit_fake, "goal_tot_rep", fstats_fake, teams_simu)
PPC_football_stats(fit_fake, "point_rep", fstats_fake, teams_simu)
PPC_football_stats(fit_fake, "rank_rep", fstats_fake, teams_simu)
```
NB: The fake data check can be repeated over many draws from the prior predictive distribution to make sure the model can recover different realisations, a process called Simulation-Based Calibration.
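A minimal sketch of such a loop, under the assumption that one rank statistic per simulation is enough for illustration (not run here because of the computation time):
```{r eval=FALSE}
# Hypothetical SBC loop for the intercept b: for each prior draw, refit the
# model on the corresponding simulated dataset and record the rank of the
# true b among the posterior draws; the ranks should be roughly uniform.
prior_draws <- rstan::extract(fit_prior, pars = c("b", "home_goals_rep", "away_goals_rep"))
ranks_b <- sapply(1:100, function(s) {
  data_s <- modifyList(data_prior,
                       list(home_goals = prior_draws$home_goals_rep[s, ],
                            away_goals = prior_draws$away_goals_rep[s, ],
                            run = 1))
  fit_s <- sampling(compiled_model, data = data_s, pars = "b",
                    iter = n_it, chains = n_chains, refresh = 0)
  sum(rstan::extract(fit_s, pars = "b")[[1]] < prior_draws$b[s])
})
hist(ranks_b, main = "SBC ranks for b (should be approximately uniform)")
```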
# Model fitting
Having confirmed in the previous section that the model can be fitted, we now train it on the data from the 2018/2019 season of the English Premier League.
```{r message=FALSE, warning=FALSE}
data_fit <- list(
N_teams = length(teams),
N_games = nrow(df),
home_goals = df[["FTHG"]],
away_goals = df[["FTAG"]],
home_id = sapply(df[["HomeTeam"]], function(x) {which(x == teams)}),
away_id = sapply(df[["AwayTeam"]], function(x) {which(x == teams)}),
run = 1
)
fit <- sampling(compiled_model,
data = data_fit,
pars = param,
iter = n_it,
chains = n_chains)
par <- extract_parameters(fit, param, param_ind, param_obs, teams, df$Game, data_fit)
```
First, we inspect the convergence diagnostics: nothing to worry about.
```{r}
check_hmc_diagnostics(fit)
pairs(fit, pars = param_pop)
plot(fit, pars = param_pop, plotfun = "trace")
```
Now we can look at the parameter estimates: the population parameters (i.e. parameters that are shared across teams) and the attack and defence abilities of each team.
The coefficient plot for the population parameters shows that the priors seem weakly informative enough to "include" the posteriors.
In addition, we notice that Manchester City and Liverpool have the best a posteriori attack and defence abilities of the season, which is consistent with the fact that they finished first and second, respectively.
```{r}
HuraultMisc::plot_prior_posterior(par_prior, par, param_pop) +
labs(title = "<b>Posterior</b> vs <b style='color:#E69F00'>prior</b> estimates (mean and 90% CI)",
subtitle = "Population parameters",
y = "") +
theme(plot.title = element_markdown(),
plot.title.position = "plot",
legend.position = "none")
plot_abilities(par)
```
We can also look at the posterior predictive distribution.
For concision, I am not plotting the posterior probabilities for the number of wins, losses, draws, goals or points, but we will look at the posterior ranks.
In the following plot, the size of each colour bar represents the probability of the team finishing at the given rank.
For instance, the posterior probability of Manchester City finishing first is slightly above 50%, and around 30% for finishing second.
Similarly, the posterior probability of Liverpool finishing first can be visually approximated to be 40%, with a similar probability for finishing second.
```{r warning=FALSE}
stackhist_rank(compute_rank(fit, "rep"), teams)
```
# Model validation
While the fit can help us understand, a posteriori, what was going on during the season, it is also interesting to know to what extent the model is predictive.
Since we are dealing with time-series data and want to predict the future from the past, standard cross-validation techniques such as K-fold cross-validation are not appropriate.
Instead, we implement forward chaining, where the model is trained on the data from the first week and tested on the remaining weeks, then trained on the data of the first two weeks and tested on the remaining weeks, etc.
```{r}
HuraultMisc::illustrate_forward_chaining()
```
We can evaluate the performance of the model at predicting the full time results using the Ranked Probability Score (RPS), a proper scoring rule measuring the accuracy of probabilistic forecasts of ordinal outcomes (here, Away win < Draw < Home win).
It would also be possible to evaluate the model on its ability to predict the number of goals, for instance, but I will not show these results here.
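For a single game with outcomes ordered as above, a minimal RPS computation looks as follows (an illustrative sketch; the metrics in this notebook come from `compute_metrics()` in `functions.R`):
```{r}
# Minimal sketch of the Ranked Probability Score for one game.
# forecast: probabilities for the ordered outcomes c("A", "D", "H");
# outcome: the observed full time result, one of "A", "D", "H".
rps <- function(forecast, outcome) {
  obs <- as.numeric(c("A", "D", "H") == outcome)
  sum((cumsum(forecast) - cumsum(obs))^2) / (length(forecast) - 1)
}
rps(c(0.2, 0.3, 0.5), "H") # confident and correct: low (better) score
rps(c(0.5, 0.3, 0.2), "H") # confident and wrong: high score
```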
The following code implements the forward chaining.
Since the task is embarrassingly parallel, it is convenient to distribute the fits across the available cores.
```{r message=FALSE, warning=FALSE}
n_cluster <- floor(parallel::detectCores() / n_chains)
# Training unit
df[["WeekNumber"]] <- strftime(df[["Date"]], format = "%Y-%V")
weeks <- unique(df[["WeekNumber"]])
# Update parameter of interest
param_test <- c("win_test", "draw_test", "lose_test",
"goal_tot_test", "goal_diff_test", "point_test")
param_ind <- c("attack", "defence", param_test)
param_obs <- c("home_goals_test", "away_goals_test")
param <- c(param_pop, param_ind, param_obs)
format_stan_data <- function(df) {
list(
N_teams = length(teams),
N_games = nrow(df),
home_goals = df[["FTHG"]],
away_goals = df[["FTAG"]],
home_id = sapply(df[["HomeTeam"]], function(x) {which(x == teams)}),
away_id = sapply(df[["AwayTeam"]], function(x) {which(x == teams)}),
run = 1
)
}
library(foreach)
library(doParallel)
duration <- Sys.time()
cl <- makeCluster(n_cluster)
registerDoParallel(cl)
writeLines(c(""), "log.txt")
out <- foreach(w = 1:(length(weeks) - 1)) %dopar% {
source("functions.R")
library(rstan)
rstan_options(auto_write = TRUE) # Save compiled model
options(mc.cores = parallel::detectCores()) # Parallel computing
sink("log.txt", append = TRUE)
cat(paste("Starting training at week ", w, " \n", sep = ""))
df_train <- df[df$WeekNumber <= weeks[w], ]
df_test <- df[df[["WeekNumber"]] > weeks[w], ]
data_stan <- format_stan_data(df_train)
fit <- sampling(compiled_model,
data = data_stan,
pars = param,
iter = n_it,
chains = n_chains)
# Parameters
par <- extract_parameters(fit, param = c(param_pop, param_ind), param_ind, param_obs, teams, df_train[["Game"]], data_stan)
par$WeekNumber <- weeks[w]
par$ProportionGamePlayed <- nrow(df_train) / nrow(df)
# Rank
rk <- compute_rank(fit, "test")
rk <- do.call(rbind,
lapply(1:length(teams),
function(i) {
tmp <- table(factor(rk[, i], levels = 1:length(teams))) / nrow(rk)
data.frame(Team = teams[i],
Rank = names(tmp),
Probability = as.numeric(tmp))
}))
rk <- HuraultMisc::factor_to_numeric(rk, "Rank")
rk$WeekNumber <- weeks[w]
rk$ProportionGamePlayed <- nrow(df_train) / nrow(df)
# Metrics
pred <- process_predictions(fit, id)
m <- compute_metrics(pred = pred, act = df, test_game = df_test[["Game"]], var = "FTR")
m$WeekNumber <- weeks[w]
m$ProportionGamePlayed <- nrow(df_train) / nrow(df)
list(Performance = m, Parameters = par, Rank = rk)
}
stopCluster(cl)
(duration = Sys.time() - duration)
m <- do.call(rbind, lapply(out, function(x) {x$Performance}))
par <- do.call(rbind, lapply(out, function(x) {x$Parameters}))
rk <- do.call(rbind, lapply(out, function(x) {x$Rank}))
```
We can now plot the predictive performance of the model as a function of the training week, or equivalently of the proportion of games played in the season.
```{r}
ggplot(data = subset(m, Metric == "RPS"),
aes(x = ProportionGamePlayed, y = Mean, ymin = Mean - SE, ymax = Mean + SE)) +
geom_pointrange() +
scale_y_continuous(limits = c(0, NA)) +
labs(y = "RPS", title = "RPS learning curve (lower the better)") +
theme_bw(base_size = 15)
```
Although the RPS improves slightly, it does not improve by much, which suggests a limitation of such a simple model.
We can investigate this by plotting how the beliefs in the teams' abilities change with time.
```{r fig.width=10, fig.height=10}
tmp <- subset(par, Variable %in% c("attack", "defence"))
pl4 <- lapply(teams, function(x) {
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
ggplot(data = subset(tmp, Team == x),
aes(x = ProportionGamePlayed, y = Mean, ymin = `5%`, ymax = `95%`, colour = Variable, fill = Variable)) +
geom_line() +
geom_ribbon(alpha = 0.5) +
scale_colour_manual(values = cbbPalette) +
scale_fill_manual(values = cbbPalette) +
labs(title = x, y = "Ability", colour = "", fill = "") +
coord_cartesian(ylim = c(-1, 1)) +
theme_bw(base_size = 15)
})
plot_grid(get_legend(pl4[[1]] + theme(legend.position = "top")),
plot_grid(plotlist = lapply(pl4, function(p) {p + theme(legend.position = "none")}),
nrow = 5),
nrow = 2, rel_heights = c(.05, .95))
```
Even though the abilities are learnt as more data comes in, they remain uncertain, which could explain the previous result.
It can also be interesting to see how our predictions change with time, for instance, how the uncertainty over the final ranking evolves as more games are played.
The following plot depicts, as a heatmap, the predicted rank at the end of the season as a function of the number of games played, for each team.
As we could expect, the predictions are quite uncertain early in the season, but become more confident as fewer games remain to be played and the model obtains better estimates of its parameters.
```{r fig.width=10, fig.height=10}
# Add first week ranking
rk <- rbind(data.frame(expand.grid(Team = teams,
Rank = 1:length(teams)),
Probability = 1 / length(teams),
WeekNumber = strftime(min(df[["Date"]]) - 7, format = "%Y-%V"),
ProportionGamePlayed = 0), # add first week
rk)
# Add last week ranking
tmp <- data.frame(expand.grid(Team = teams, Rank = 1:length(teams)),
Probability = 0,
WeekNumber = weeks[length(weeks)],
ProportionGamePlayed = 1)
for (i in 1:nrow(fstats)) {
id <- which((tmp[["Team"]] == fstats[i, "Team"]) & (tmp[["Rank"]] == fstats[i, "rank"]))
tmp[id, "Probability"] <- 1
}
rk <- rbind(rk, tmp)
pl2 <- lapply(teams,
function(x) {
ggplot(data = subset(rk, Team == x),
aes(x = factor(ProportionGamePlayed), y = Rank, fill = Probability)) +
geom_tile() +
scale_fill_viridis_c() +
scale_y_continuous(expand = c(0, 0), breaks = 1:length(teams)) +
scale_x_discrete(expand = c(0, 0), breaks = c(0, 0.5, 1)) +
labs(title = x, x = "Proportion of games played*") + # * not exactly but close
theme_classic(base_size = 15)
})
plot_grid(get_legend(pl2[[1]] + theme(legend.position = "top")),
plot_grid(plotlist = lapply(pl2, function(x) {x + theme(legend.position = "none")}),
nrow = 5),
nrow = 2, rel_heights = c(.05, .95))
```
For example, we can look at what the predictions look like at the middle of the season.
```{r message=FALSE, warning=FALSE}
mid_week <- weeks[ceiling(length(weeks) / 2)] # middle week of the season
df_train <- df[df[["WeekNumber"]] <= mid_week, ]
test_game <- df[df[["WeekNumber"]] > mid_week, "Game"]
data_fit <- format_stan_data(df_train)
fit <- sampling(compiled_model,
data = data_fit,
pars = param,
iter = n_it,
chains = n_chains)
```
```{r message=FALSE, warning=FALSE, fig.width=10, fig.height=10}
PPC_football_stats(fit, "win_test", fstats, teams)
PPC_football_stats(fit, "lose_test", fstats, teams)
PPC_football_stats(fit, "point_test", fstats, teams)
stackhist_rank(compute_rank(fit, "test"), teams)
```
The predictions look reasonable with respect to the outcomes observed at the end of the season; however, there is still a lot of uncertainty at mid-season.
Interestingly, the last plot shows the model predicting that Liverpool will win the championship with a probability of approximately 80%, when in the end, it is Manchester City that wins!
# Conclusion
The model was successfully fit to the data and can give valuable insights into the teams' abilities.
However, its predictive performance appears limited, and we would need to move away from this simple "Dixon-Coles" model to make more accurate and practically useful predictions.
Keeping in mind that I am far from being a domain expert, I could suggest the following:
- Predict the number of goals for each half-time period. This would effectively double the amount of data we are using and provide more accurate estimates of the teams' abilities, and hopefully, better predictions.
- Use a zero-inflated model to describe the fact that "no goals scored" happens more often than expected.
- Assume correlated attack and defence abilities.
- Model the fact that the latent abilities might change with time. For example, we could use a Random Walk or an Exponential Smoothing model to describe the evolution of abilities (see the sketch after this list).
- Allow a team-dependent home advantage. Some teams might have a bigger home advantage (better fans?), while other teams could be less sensitive to this effect.
- If the data is available, model the team abilities as a combination of the abilities of each player.
- etc.
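As an illustration of the random-walk idea above, a minimal simulation (hypothetical, with a made-up drift parameter, and not implemented in this repository) could look like:
```{r}
# Hypothetical sketch: attack abilities drifting as a Gaussian random walk
# across the 38 match weeks (sigma_walk is a made-up drift scale).
n_weeks <- 38
sigma_walk <- 0.05
attack_t <- matrix(NA, nrow = n_weeks, ncol = n_teams)
attack_t[1, ] <- rnorm(n_teams, 0, 0.3)
for (t in 2:n_weeks) {
  attack_t[t, ] <- attack_t[t - 1, ] + rnorm(n_teams, 0, sigma_walk)
}
matplot(attack_t, type = "l", lty = 1,
        xlab = "Week", ylab = "Attack ability",
        main = "Simulated random-walk abilities")
```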
Nonetheless, the Bayesian framework is useful for making complex predictions that go beyond which team will win a specific game, such as the final ranking at the end of the season.