# Maximum Likelihood Estimation
```{r, echo=FALSE}
# Unattach any packages that happen to already be loaded. In general this is unnecessary
# but is important for the creation of the book to not have package namespaces
# fighting unexpectedly.
pkgs = names(sessionInfo()$otherPkgs)
if( length(pkgs) > 0 ){
pkgs = paste('package:', pkgs, sep = "")
for( i in 1:length(pkgs)){
detach(pkgs[i], character.only = TRUE, force=TRUE)
}
}
# Set my default chunk options
knitr::opts_chunk$set( fig.height=3, message=FALSE, warning=FALSE )
```
```{r, message=FALSE, warning=FALSE}
library(tidyverse) # dplyr, tidyr, ggplot2
```
### Learning Outcomes{-}
* Explain how the probability mass/density function $f(x|\theta)$ indicates what data regions are more probable.
* Explain how the likelihood function $\mathcal{L}(\theta|x)$ is defined if we know the probability function.
* Explain how the likelihood function $\mathcal{L}(\theta|x)$ is used to find the maximum likelihood estimate of $\theta$.
* For a given sample of data drawn from a distribution, find the maximum likelihood estimate for the distribution parameters using R.
## Introduction
The goal of statistical modeling is to take data that have some general trend along with some unexplainable variability, and to say something intelligent about the trend. For example, consider the simple regression model
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i \;\;\;\;\;\; \textrm{where} \;\; \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$$
```{r, echo=FALSE, warning=FALSE, message=FALSE}
n = 35
data <- tibble(x=runif(n, 0,1)) %>%
mutate( Ey = 1 + 2*x,
y = Ey + rnorm(n(), sd=.2) )
ggplot(data, aes(x=x, y=y)) +
geom_point() +
geom_smooth( method=lm, se=FALSE ) +
labs(x='Explanatory', y='Response', title='Simple Regression') +
annotate('label', x=.5, y=2.75, color='blue',
label=latex2exp::TeX('$y_i = \\beta_0 + \\beta_1 x_i + \\epsilon_i$') )
```
There is a general increasing trend in the response (i.e. the $\beta_0 + \beta_1 x_i$ term) and then some unexplainable noise (the $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$ part).
While it has been convenient to write the model in this form, it is also possible to write the simple regression model as
$$y_i \stackrel{ind}{\sim} N(\, \beta_0 + \beta_1x_i,\; \sigma^2 \,)$$
This model contains three parameters, $\beta_0, \beta_1$, and $\sigma$, but it isn't at all obvious how to estimate those three values. In this chapter, we'll develop a mechanism for taking observed data sampled from some distribution parameterized by some $\beta$, $\lambda$, $\sigma$, or $\theta$ and then estimating those parameters.
## Distributions
Depending on what values the data can take on (integers, positive values, etc.) and the shape of the distribution of values, we might choose to model the data using one of several different distributions. Next we'll quickly introduce the mathematical relationship between the parameters and the probable data values for several distributions.
### Poisson
The Poisson distribution is used to model the number of events that happen in some unit of time or space, so it is appropriate for outcomes that can only be non-negative integers. This distribution is parameterized by $\lambda$, which represents the expected number of events (defined as the average over an infinitely large number of draws). Because $\lambda$ represents the average number of events, the $\lambda$ parameter must be greater than zero.
The function that defines the relationship between the parameter $\lambda$ and which data values are most probable is called the *probability mass function* for discrete random variables and the *probability density function* in the continuous case. Either way, these functions are traditionally notated using $f(x)$.
$$f(x|\lambda) = \frac{e^{-\lambda} \lambda^x}{x!} \;\;\;\; \textrm{for}\;\; x \in \{0,1,2,\dots\}$$
```{r, echo=FALSE, warning=FALSE, message=FALSE}
data <- tibble(x=0:10) %>%
mutate(upr = dpois(x, lambda = 3.5), lwr=0) %>%
pivot_longer(lwr:upr, names_to='Type', values_to='y')
ggplot(data, aes(x=x, y=y)) +
geom_point(data=data%>%filter(Type=='upr')) +
geom_path(aes(group=x)) +
annotate('path', x=c(3.5,3.5), y=c(.25,0), color='blue') +
annotate('text', x=3.5, y=0.27, label=latex2exp::TeX('$\\lambda$'), color='blue') +
scale_x_continuous(breaks = seq(0,10,by=2)) +
labs(x='X', y=latex2exp::TeX('$f(x \\,|\\, \\lambda = 3.5)$'),
title=latex2exp::TeX('Poisson $(\\, \\lambda = 3.5 \\,)$') )
```
The notation $f(x|\lambda)$ is read as "f of $x$ *given* $\lambda$" and denotes that this function describes which values of the data $X$ are most probable, and that it depends on the parameter value. The notation emphasizes that if we were to change the parameter value (to, say, $\lambda = 10$), then a different set of data values would become more probable. In the above example with $\lambda = 3.5$, the most probable outcome is $3$, but we wouldn't be surprised to observe a value of $x=1,2,$ or $4$. However, from this graph we see that $x=10$ or $x=15$ would be highly improbable.
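In R, the Poisson mass function is available as `dpois()`, so we can check these statements directly (the values below correspond to the figure above):
```{r}
# f(x | lambda = 3.5) for a few outcomes: x = 3 is the most probable,
# while x = 10 is already highly improbable.
dpois( c(1, 3, 4, 10), lambda = 3.5 )
```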
### Exponential
The Exponential distribution can be used to model outcomes that take on positive real values whose distribution is right-skewed. We will parameterize this distribution using $\beta$ as the mean of the distribution.
$$f(x|\beta) = \frac{1}{\beta} e^{-x/\beta} \;\;\;\; \textrm{for} \;\; x > 0$$
```{r, echo=FALSE, warning=FALSE, message=FALSE}
data <- tibble(x=seq(0,20, by=0.01)) %>%
mutate(y = dexp(x, rate= 1/3.5))
ggplot(data, aes(x=x, y=y)) +
geom_line() +
annotate('path', x=c(3.5,3.5), y=c(.25,0), color='blue') +
annotate('text', x=3.5, y=0.27, label=latex2exp::TeX('$\\beta$'), color='blue') +
scale_x_continuous(breaks = seq(0,20,by=5)) +
labs(x='X', y=latex2exp::TeX('$f(x \\,|\\, \\beta = 3.5)$'),
title=latex2exp::TeX('Exponential $(\\, \\beta = 3.5 \\,)$') )
```
In this distribution, the region near zero is the most probable outcome, and larger observations are less probable.
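R's `dexp()` function is parameterized by the *rate*, which is $1/\beta$ under our mean parameterization, so evaluating the density requires that conversion:
```{r}
beta <- 3.5
dexp( c(0.5, 3.5, 15), rate = 1/beta )  # density is largest near zero and decays
```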
### Normal
The Normal distribution is extremely common in statistics and can be used to model continuous variables. It is parameterized by a center parameter $\mu$ and a spread parameter $\sigma$.
$$f(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left[ -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 \right]$$
where $\exp [w] = e^w$ is just a notational convenience.
```{r, echo=FALSE, warning=FALSE, message=FALSE}
data <- tibble( x= seq(3,9,by=0.01)) %>%
mutate(y = dnorm(x, mean=6, sd=1))
ggplot(data, aes(x=x, y=y)) +
geom_line() +
annotate('path', x=c(6,6), y=c(.42,0), color='blue') +
annotate('text', x=6, y=.45, label=latex2exp::TeX('$\\mu$'), color='blue') +
annotate('path', x=c(6,7), y=c(.24,.24), color='red') +
annotate('text', x=6.5, y=.22, label=latex2exp::TeX('$\\sigma$'), color='red') +
labs(x='X', y=latex2exp::TeX('$f(\\, x \\,|\\, \\mu = 6, \\, \\sigma = 1 \\,)$'),
title=latex2exp::TeX('Normal $(\\mu = 6, \\; \\sigma = 1)$') )
```
All of these distributions (and there are many, many more in common use) have some mathematical function that defines how probable a region of response values is, and that function depends on the parameters. Importantly, the regions of $X$ with the highest $f(x|\theta)$ are the most probable data values.
There are many additional mathematical details that go into these density functions, but the important aspect is that they tell us which data values are most probable given some parameter value $\theta$.
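As a quick sanity check that the Normal formula above matches R's built-in density function, we can evaluate the density by hand and compare it to `dnorm()`:
```{r}
mu <- 6; sigma <- 1
x  <- 7
1/(sigma*sqrt(2*pi)) * exp( -1/2*((x - mu)/sigma)^2 )  # density formula by hand
dnorm(x, mean = mu, sd = sigma)                        # built-in equivalent
```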
## Likelihood Function
As a researcher, I am not particularly interested in saying "If $\mu=3$ and $\sigma=2$, then I'm likely to observe approximately 95% of my data between $-1$ and $7$." Instead, I want to make an inference about which values of $\mu$ and $\sigma$ are most concordant with the observed data that I've collected. The probability density function $f(x|\mu,\sigma)$ is still the mathematical link between the data and the parameters, and we will continue to use that function, but we'll re-interpret which quantity is *known*.
The *Likelihood* function is just the probability density (or mass) function $f(x|\theta)$ re-interpreted to be a function where the data is the known quantity and we are looking to see what parameter values are consistent with the data.
### Poisson
Suppose that we have observed a single data point drawn from a Poisson($\lambda$) and we don't know what $\lambda$ is. We first write down the likelihood function
$$\begin{aligned}
\mathcal{L}(\lambda|x) = f(x|\lambda) = \frac{e^{-\lambda} \lambda^x}{x!}
\end{aligned}$$
If we have observed $x=4$, then $\mathcal{L}(\lambda|x=4)$ is the following function
```{r, echo=FALSE, fig.height=5, warning=FALSE, message=FALSE}
data <- tibble(lambda=seq(0,15, by=0.01)) %>%
mutate(L = dpois(4, lambda = lambda))
P1 <- ggplot(data, aes(x=lambda, y=L)) +
geom_line() +
annotate('path', x=c(4,4), y=c(.195, 0), color='blue') +
annotate('text', x=4, y=.215, label=latex2exp::TeX('$\\hat{\\lambda}$'), color='blue') +
labs( x = latex2exp::TeX('$\\lambda$'),
y = latex2exp::TeX('$L(\\, \\lambda \\,|\\, x=4 \\,)$'),
title = latex2exp::TeX('Poisson Likelihood given $x=4$') )
P2 <- ggplot(data, aes(x=lambda, y=log(L) )) +
geom_line() +
annotate('path', x=c(4,4), y=c(-20, -1.5), color='blue') +
annotate('text', x=4, y=.5, label=latex2exp::TeX('$\\hat{\\lambda}$'), color='blue') +
scale_y_continuous(limits=c(-20,2)) +
labs( x = latex2exp::TeX('$\\lambda$'),
y = latex2exp::TeX('$logL(\\, \\lambda \\,|\\, x=4 \\,)$'),
title = latex2exp::TeX('Poisson log-Likelihood given $x=4$') )
cowplot::plot_grid(P1, P2, nrow=2)
```
Our best estimate for $\lambda$ is the value that maximizes $\mathcal{L}(\lambda|x)$. We could find this value in two different ways. First, we could solve analytically: take the derivative, set it equal to zero, and solve for $\lambda$. This process is often made mathematically simpler (and computationally more stable) by instead maximizing the log of the likelihood function. This is equivalent because the log function is monotonically increasing: if $a < b$ then $\log(a) < \log(b)$, so the same $\lambda$ maximizes both functions. It is simpler because taking logs turns products into sums and reduces the need for the chain rule when taking derivatives. Second, we could find the value of $\lambda$ that maximizes the likelihood using numerical methods. Again, because the log function makes everything nicer, in practice we'll always maximize the log-likelihood. Many optimization routines are designed to find function *minimums*, so to use them we'll actually minimize the *negative* log-likelihood, which is simply $-\log \mathcal{L}()$.
Numerical solvers are convenient, but they are only accurate to the tolerance you specify. In this case where $x=4$, the exact maximum likelihood estimate is $\hat{\lambda}=4$.
```{r}
x <- 4
neglogL <- function(param){
dpois(x, lambda=param) %>%
log() %>% # take the log
prod(-1) %>% # multiply by -1
return()
}
# The optimize() function searches an interval for an extremum; by default it
# finds the minimum. Minimizing the negative log-likelihood over lambda values
# in [0, 20] is therefore the same as maximizing the likelihood.
optimize(neglogL, interval=c(0,20) )
```
But what if we have multiple observations from this Poisson distribution? If the observations are *independent*, then the probability mass or probability density functions $f(x_i|\theta)$ can just be multiplied together.
$$\begin{aligned}
\mathcal{L}(\lambda|\textbf{x}) = \prod_{i=1}^{n}f(x_i|\lambda) = \prod_{i=1}^n\frac{e^{-\lambda} \lambda^{x_i}}{x_i!}
\end{aligned}$$
So, suppose we have observed $\textbf{x} = \{4,6,3,3,2,4,3,2\}$. We could maximize this function using either calculus or numerical methods and discover that the maximum occurs at $\hat{\lambda} = \bar{x} = 3.375$.
```{r, echo=FALSE, fig.height=5, warning=FALSE, message=FALSE}
x <- c(4,6,3,3,2,4,3,2)
data <-
expand.grid( lambda=seq(0,15, by=0.01), x=x) %>%
mutate(L = dpois(.$x, lambda = .$lambda)) %>%
group_by(lambda) %>%
summarize( L = prod(L) )
P1 <-
ggplot(data, aes(x=lambda, y=L)) +
geom_line() +
annotate('path', x=c(3.375,3.375), y=c(.96e-6, 0), color='blue') +
annotate('text', x=3.375, y=1.05e-6, label=latex2exp::TeX('$\\hat{\\lambda}$'), color='blue') +
labs( x = latex2exp::TeX('$\\lambda$'),
y = latex2exp::TeX('$L(\\, \\lambda \\,|\\, \\textbf{x} \\,)$'),
title = latex2exp::TeX('Poisson Likelihood given $\\textbf{x}$') )
P2 <-
ggplot(data, aes(x=lambda, y=log(L) )) +
geom_line() +
annotate('path', x=c(3.375,3.375), y=c(-13.75, -100), color='blue') +
annotate('text', x=3.375, y=-5, label=latex2exp::TeX('$\\hat{\\lambda}$'), color='blue') +
scale_y_continuous(limits=c(-100,-0)) +
labs( x = latex2exp::TeX('$\\lambda$'),
y = latex2exp::TeX('$logL(\\, \\lambda \\,|\\, \\textbf{x} \\,)$'),
title = latex2exp::TeX('Poisson log-Likelihood given $\\textbf{x}$') )
cowplot::plot_grid(P1, P2, nrow=2)
```
If we are using the log-likelihood, then the multiplication becomes a sum, and we'll define and subsequently optimize our negative log-likelihood function like this:
```{r}
x <- c(4,6,3,3,2,4,3,2)
neglogL <- function(param){
dpois(x, lambda=param) %>%
log() %>% sum() %>% prod(-1) %>%
return()
}
optimize(neglogL, interval=c(0,20) )
```
### Exponential
We next consider data sampled from the Exponential distribution. Recall that the Exponential distribution is parameterized by a single parameter $\beta$, which is the expectation of the distribution; the variance is $\beta^2$. We might therefore consider using *either* the sample mean or the sample standard deviation as an estimator. It turns out that the sample mean is the maximum likelihood estimator in this case. For a concrete example, suppose that we had observed $\textbf{x}=\{15.6, 2.03, 9.12, 1.54, 3.69\}$
The likelihood is
$$\mathcal{L}(\beta | \textbf{x}) = \prod_{i=1}^n f(x_i|\beta) = \prod_{i=1}^n \frac{1}{\beta} e^{-x_i/\beta}$$
A bit of calculus will show that this is maximized at $\hat{\beta} = \bar{x} = 6.39$.
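To see why, note that the log-likelihood is
$$\log \mathcal{L}(\beta|\textbf{x}) = -n\log\beta - \frac{1}{\beta}\sum_{i=1}^n x_i$$
and setting its derivative with respect to $\beta$ equal to zero gives
$$-\frac{n}{\beta} + \frac{1}{\beta^2}\sum_{i=1}^n x_i = 0 \quad \implies \quad \hat{\beta} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$
We can also see this numerically by following the same process as before.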
```{r, echo=FALSE, fig.height=5, warning=FALSE, message=FALSE}
x <- c(15.6, 2.03, 9.12, 1.54, 3.69)
data <-
expand.grid( beta=seq(.01,30, by=0.01), x=x) %>%
mutate(L = dexp(.$x, rate = 1/ .$beta)) %>%
group_by(beta) %>%
summarize( L = prod(L), .groups='drop' ) %>%
as.data.frame()
P1 <-
ggplot(data, aes(x=beta, y=L)) +
geom_line() +
annotate('path', x=rep(mean(x),2), y=c(6.29e-7, 0), color='blue') +
annotate('text', x=mean(x), y=6.70e-7, label=latex2exp::TeX('$\\hat{\\beta}$'), color='blue') +
# scale_y_continuous(limits=c(0,1.55e-7)) +
labs( x = latex2exp::TeX('$\\beta$'),
y = latex2exp::TeX('$L(\\, \\beta \\,|\\, \\textbf{x} \\,)$'),
title = latex2exp::TeX('Exponential Likelihood given $\\textbf{x}$') )
P2 <-
ggplot(data, aes(x=beta, y=log(L) )) +
geom_line() +
annotate('path', x=rep(mean(x),2), y=c(-14.3, -30), color='blue') +
annotate('text', x=mean(x), y=-13.2, label=latex2exp::TeX('$\\hat{\\beta}$'), color='blue') +
scale_y_continuous(limits=c(-30,-13.0)) +
labs( x = latex2exp::TeX('$\\beta$'),
y = latex2exp::TeX('$logL(\\, \\beta \\,|\\, \\textbf{x} \\,)$'),
title = latex2exp::TeX('Exponential log-Likelihood given $\\textbf{x}$') )
cowplot::plot_grid(P1, P2, nrow=2)
```
```{r}
x <- c(15.6, 2.03, 9.12, 1.54, 3.69)
neglogL <- function(param){
dexp(x, rate=1/param, log = TRUE) %>% # We defined beta as 1/rate
sum() %>%
prod(-1) %>%
return()
}
optimize(neglogL, interval=c(0,30) )
```
### Normal
We finally consider the case where we have observations coming from a distribution that has multiple parameters. The Normal distribution is parameterized by a mean $\mu$ and spread $\sigma$. Suppose that we had observed $x_i \stackrel{iid}{\sim} N(\mu, \sigma^2)$ and saw $\textbf{x} = \{5, 8, 9, 7, 11, 9\}$.
As usual we can calculate the likelihood function
$$\mathcal{L}(\mu,\sigma | \textbf{x}) = \prod_{i=1}^n f(x_i|\mu,\sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \exp \left[ -\frac{1}{2}\frac{(x_i-\mu)^2}{\sigma^2} \right]$$
Again using calculus, it can be shown that the maximum likelihood estimators in this model are
$$\hat{\mu} = \bar{x} = 8.16666$$
$$\hat{\sigma}_{mle} = \sqrt{ \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2 } = 1.8634$$
which is somewhat unexpected because the typical estimator we use has a $\frac{1}{n-1}$ multiplier.
```{r}
x <- c(5, 8, 9, 7, 11, 9)
xbar <- mean(x)                           # MLE of mu
s2   <- sum( (x - xbar)^2 ) / length(x)   # MLE of sigma^2: note the 1/n divisor
s    <- sqrt(s2)                          # MLE of sigma
```
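Note that R's built-in `sd()` function uses the $\frac{1}{n-1}$ divisor, so it does not exactly equal the maximum likelihood estimate:
```{r}
c( mle = s, sd = sd(x) )   # 1/n divisor versus 1/(n-1) divisor
```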
```{r, echo=FALSE, fig.height=5, warning=FALSE, message=FALSE}
x <- c(5, 8, 9, 7, 11, 9)
data <-
expand.grid( mu=seq(5,12, by=0.01), sigma=seq(1, 3.0, by=0.01), x=x) %>%
mutate(L = dnorm(.$x, mean=.$mu, sd=.$sigma)) %>%
group_by(mu, sigma) %>%
summarize( L = prod(L) )
P1 <-
ggplot(data, aes(x=mu, y=sigma, z=L)) +
geom_contour(color='black') +
geom_vline( xintercept=xbar, color='blue') +
geom_hline( yintercept=s, color='red' ) +
annotate('text', x=8.3, y=1, label=latex2exp::TeX('$\\hat{\\mu}$'), color='blue') +
annotate('text', x=6, y=1.8, label=latex2exp::TeX('$\\hat{\\sigma}_{mle}$'), color='red') +
scale_x_continuous(limits=c(5, 12 )) +
scale_y_continuous(limits=c(1, 2.5)) +
labs( x = latex2exp::TeX('$\\mu$'),
y = latex2exp::TeX('$\\sigma$'),
title = latex2exp::TeX('Normal Likelihood given $\\textbf{x}$') )
P1
```
```{r}
x <- c(5, 8, 9, 7, 11, 9)
neglogL <- function(param){
dnorm(x, mean=param[1], sd=param[2]) %>%
log() %>% sum() %>% prod(-1) %>%
return()
}
# Multi-parameter optimization uses the optim() function, which by default
# searches for a minimum. The first argument is an initial guess to start the
# algorithm; so long as the starting point isn't completely unreasonable, the
# numerical algorithm should be fine.
optim(c(5,2), neglogL )
```
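The output of `optim()` is a list whose `par` element holds the parameter estimates and whose `value` element holds the minimized negative log-likelihood. As a quick check against the closed-form estimates computed earlier (`fit` is just an illustrative name):
```{r}
fit <- optim(c(5,2), neglogL)
fit$par      # numerical estimates of (mu, sigma)
c(xbar, s)   # closed-form MLEs computed earlier
```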
## Discussion
1. How could the numerical maximization happen? Assume we have a 1-dimensional parameter space, we have a reasonable starting estimate, and the function to be maximized is continuous, smooth, and $\ge 0$ for all $x$. *While knowing the derivative function $f'(x)$ would allow us to be much more clever, let's think about how to do the maximization by just evaluating $f(x)$ for different values of $x$; a minimal grid-search sketch follows this list.*
2. Convince yourself that if $x_0$ is the value of $x$ that maximizes $f(x)$, then it is also the value that maximizes $\log( f(x) )$. This will rest on the idea that $\log()$ is a strictly increasing function.
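As a starting point for the first question, here is one minimal sketch of a naive grid search; the example function, interval, and grid resolution are all just illustrative choices:
```{r}
# Naive maximization: evaluate the function on a fine grid of candidate values
# and keep the point where it is largest.
f    <- function(lambda){ dpois(4, lambda = lambda) }  # example: L(lambda | x = 4)
grid <- seq(0, 20, by = 0.001)                         # candidate parameter values
grid[ which.max( f(grid) ) ]                           # should be approximately 4
```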
## Exercises
1. The $\chi^2$ distribution is parameterized by its degrees of freedom parameter $\nu$, which corresponds to the mean of the distribution ($\nu$ must be $>0$). The density function $f(x|\nu)$ can be accessed in R using `dchisq(x, df=nu)`.
a) For different values of $\nu$, plot the density function $f(x|\nu)$. You might consider $\nu = 5, 10,$ and $20$. The valid range of $x$ values is $[0,\infty)$, so select an upper plotting limit large enough to show the shape of each curve.
b) Suppose that we've observed $x = \{9,7,7,6,10,7,9\}$. Calculate the sample mean and standard deviation.
c) Graphically show that the maximum likelihood estimator of $\nu$ is $\bar{x} = 7.857$.
d) Show that the maximum likelihood estimator of $\nu$ is $\bar{x} = 7.857$ using a numerical maximization function.
2. The Beta distribution is often used when dealing with data that are proportions. It is parameterized by *two* parameters, usually called $\alpha$ and $\beta$ (both of which must be greater than zero). The mean of this distribution is given by $$E[X]=\frac{\alpha}{\alpha+\beta}$$
while the spread of the distribution is inversely related to the magnitude of $\alpha$ and $\beta$. The density function $f(x|\alpha,\beta)$ can be accessed in R using the `dbeta(x, alpha, beta)` function.
a) For different values of $\alpha$ and $\beta$, plot the density function $f(x|\alpha, \beta)$. You might consider keeping the ratio between $\alpha$ and $\beta$ constant and just increasing their magnitude. The valid range of $x$ values is $[0,1]$.
b) Suppose that we've observed $x = \{0.25, 0.42, 0.45, 0.50, 0.55\}$. Calculate the sample mean and standard deviation.
c) Graphically show that the maximum likelihood estimators of $\alpha$ and $\beta$ are approximately $9.5$ and $12.5$.
d) Show that the maximum likelihood estimators of $\alpha$ and $\beta$ are approximately $9.5$ and $12.5$ using a numerical maximization function.