-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathprobability.tex
180 lines (162 loc) · 8.22 KB
/
probability.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
\chapter{Probability}\label{sec:probability}
This course will use probability theory quite a lot, but we will often use
a fairly informal notation. Bayesian
statistics is really just probability theory used for a particular purpose,
to describe uncertainty. The two most important rules of probability are given below,
for reference.
\section{The Product Rule}
The first important rule of probability is the
product rule. This tells us how to calculate the probability that any two
propositions or hypotheses, $A$ and $B$, are {\bf both} true.
The probability of $A$ {\bf and}
$B$, will be denoted $P(A, B)$. This can be calculated by first finding the
probability that $A$ is true, and then multiplying by the probability that $B$
is true {\it given} that $A$ is true.
\begin{eqnarray}
P(A, B) &=& P(A)P(B|A)\label{product1}
\end{eqnarray}
We could also have done this the other way around: first finding the
probability that $B$ is true and then the probability that $A$ is true given
that $B$ is true:
\begin{eqnarray}
P(A, B) &=& P(B)P(A|B).\label{product2}
\end{eqnarray}
When using the product rule (or any rule of probability, for that matter), you
must ensure that the statements to the right of the ``given'' sign (or the
absence of any statements) are consistent throughout. For example,
$P(A, B|C) = P(A|C)P(B|A, C)$ is a valid use of the product rule, since
``given $C$'' is part of the background information in all of the terms.
You may be familiar with the idea of a {\it tree diagram} from earlier
statistics courses or maybe even high school. A tree diagram is a helpful way
to work with the product rule. If you find tree diagrams helpful, feel free to
use them, although tree diagrams themselves will not be examinable.
An example tree diagram is given in Figure~\ref{fig:tree_diagram}.
\begin{figure}[!ht]
\begin{center}
\includegraphics[scale=0.9]{Figures/tree_diagram.pdf}
\caption{\it A tree diagram.\label{fig:tree_diagram}}
\end{center}
\end{figure}
The product rule can also be applied to more than two propositions, like so:
\begin{eqnarray}
P(A, B, C) &=& P(A)P(B|A)P(C|B,A).
\end{eqnarray}
You can also apply the product rule in a situation where there is a common
statement in the ``given part'' of all probabilities in the expression. For
example, the following is also valid:
\begin{eqnarray}
P(A, B, C | D) &=& P(A |D)P(B|A,D)P(C|B,A,D).
\end{eqnarray}
In fact, it's best to regard even ``unconditional'' probabilities such
as $P(A)$ as being conditional on some prior information $I$, which is just
left out to keep the notation simple.
\subsection{Bayes' Rule}
Looking at Equations~\ref{product1} and~\ref{product2}, they are both equations
for the same
thing, $P(A,B)$. Therefore we can equate the right hand sides. Doing this gives
a result known as Bayes' rule:
\begin{eqnarray}
P(A|B) &=& \frac{P(A)P(B|A)}{P(B)}. \label{bayes}
\end{eqnarray}
Bayes' rule will be used extensively throughout this course. You will need to
know it and know how to use it!
\section{The Sum Rule}
The sum rule is the second important rule of probability. A general statement
of the sum rule is
\begin{align}
P(A \vee B) &= P(A) + P(B) - P(A, B).\label{eqn:sum_rule}
\end{align}
where $\vee$ means logical {\em or}.
The sum rule is often used to calculate the
probability of some statement
$A$ when we only know the probability of $A$ conditional on some
some other statement $B$. Then we
can use the sum rule like this:
\begin{eqnarray}
P(A) &=& P(A, B) + P(A, \neg B)\\
&=& P(B)P(A|B) + P(\neg B)P(A|\neg B).
\end{eqnarray}
where the $\neg$ symbol means ``not'', i.e. $\neg B$ is the statement that
$B$ is false. To understand
this formula, imagine we want to know the probability of $A$. There are
two mutually exclusive ways that could happen: via $B$ being true, or via $B$
being false. The first way has probability $P(B)P(A|B)$, and the second way
has probability $P(\neg B)P(A|\neg B)$.
If, instead of just two mutually exclusive and exhaustive
pathways $B$ and $\neg B$, there are many, such as $B_1, ..., B_n$.
Then the sum rule takes the form
\begin{eqnarray}
P(A) &=& \sum_{i=1}^n P(B_i)P(A|B_i).
\end{eqnarray}
As an exercise, you can try proving this version of the sum rule
starting from the simpler version of Equation~\ref{eqn:sum_rule}.
In Bayesian statistics the sum rule is most often used to calculate the
marginal likelihood $P(D)$, and to marginalise out ``nuisance
parameters'' from the posterior distribution. In STATS 331 we will mostly use
MCMC to do the latter.
\section{Random Variables}
Throughout this course I will use the term ``probability distribution'' to
refer to both the probability mass function for a discrete random variable, and
the probability density function for a continuous random variable. I will also
use a common shorthand notation.
\subsection{Discrete Random Variables}
We will also see quite a lot of random variables in this course (although
without using that terminology very much, as I consider the word ``random'' to be
worse than useless). A discrete
random variable is just a quantity $X$ that has a countable number of possible
values. A discrete random variable has a {\it probability mass function}
that tells you the probability as a function of the {\it possible} values $x$.
For example, the equation for the Poisson distribution (a useful discrete
distribution) is:
\begin{eqnarray}
P(X=x) &=& \frac{\lambda^x e^{-\lambda}}{x!}\label{eq:poisson}
\end{eqnarray}
for $x \in \{0, 1, 2, 3, ...\}$. The actual random variable is named $X$, and
$x$ is just used so we can write the probabilities ($P(X=0), P(X=1), ...)$ as
a formula.
\subsection{Continuous Random Variables}
Continuous random variables are those where the set of possibilities is
continuous. For example, with a normal distribution, technically any real value
is possible. Therefore it doesn't make sense to ask, for example, the probability
that $X=1.32$. The answer is zero because the total probability of 1 has to be
spread among an infinite number of possibilities. Instead, we can ask the probability
that $X$ is in some region that has a nonzero size. In general, if $X$ has a
probability density function $f(x)$, then the probability that $X \in [a, b]$ is:
\begin{eqnarray}
P(a \leq X \leq b) &=& \int_a^b f(x) \, dx.
\end{eqnarray}
Note the lower case $x$ in the probability density function. This is analogous
to the lower case $x$ in the probability mass function of a discrete random
variable. Note that I won't often get you to do an integral analytically. One
of the major reasons why MCMC is so awesome is that you can get away without
having to do hard integrals!
\subsection{Shorthand Notation}
The notation of random variables can be cumbersome. For example, consider
inferring a (discrete) parameter $Y$ from (discrete) data $X$. Bayes' rule gives us:
\begin{eqnarray}
P(Y=y | X=x) &=& \frac{P(Y=y)P(X=x|Y=y)}{P(X=x)}.
\end{eqnarray}
That's very verbose, so instead we use the shorthand:
\begin{eqnarray}
p(y | x) &=& \frac{p(y)p(x|y)}{p(x)}.
\end{eqnarray}
In this notation we don't distinguish between the symbol for the random variable
and the dummy variables that allow us to write the probability distribution as
a formula, we just use the lower case for everything. Despite this simplification,
everything still works. Just read
$p(y|x)$ as ``the probability distribution for $y$ given $x$'' and everything
will be fine. Astute readers may have noticed that when I gave the Poisson formula
(Equation~\ref{eq:poisson}), ``given $\lambda$'' was implicit throughout.
To be clear when we are talking about a simple probability and when we are
talking about the probability distribution for a variable, I will use upper case
$P$ for the former and lower-case $p$ for the latter.
\section{Useful Probability Distributions}
In the course we will study Bayesian models which involve the following
discrete probability distributions: general (all probabilities given explicitly,
such as in a Bayes' Box), binomial, Poisson, discrete uniform,
negative binomial, multinomial.
We may also use the following continuous distributions: uniform, normal,
Cauchy, student-$t$, beta, log-uniform, exponential, gamma, Dirichlet.
Wikipedia is an excellent resource for looking up all the properties of these
distributions, but I will also give various properties (e.g. the mean of a beta
distribution) and describe the distributions when we need them.