\section{Analysis Phase}\label{ht}
In the analysis phase, we determine whether social influence bias is statistically significant by analyzing the spread of ratings around the median for the participants that changed their ratings.
There are three principal challenges in testing this hypothesis.
The first challenge is that parametric significance tests comparing two sample means, such as the two-sample t-test and z-test, are known to
perform poorly for multimodal and discrete distributions.
Another significance test that is commonly applied to compare spreads of distributions is the F-test, which is also known to perform poorly for many non-normal distributions \cite{markowski1990conditions}.
Furthermore, this test is usually used to test the spread of data around the mean, which aligns with the median, the parameter of interest in the CRC, only under very special conditions such as normal distributions.
The discreteness of our data leads to multimodal distributions, which are poorly suited to these testing methods.
The second challenge is that there is a natural tendency for ratings to concentrate around the median even without a bias.
Consider the following participant behavioral model.
Suppose that participants are not accustomed to a slider-based input.
We can model the first rating that the participant leaves as uniformly randomly anywhere on the slider.
As the participant begins to understand how to use the slider, their use becomes more accurate, ultimately settling on a rating from our observed distribution of final ratings.
Under this model, in which the first rating is uniformly random and the second rating is a sample from the observed distribution, we would see a strong regression towards the median even if there is no causal link with seeing the median.
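To make this concrete under an assumed normalization (the slider range $[0,1]$ is our assumption for illustration only), a uniformly random first rating $U$ has an expected absolute deviation from a median $m$ of
\begin{equation}
\mathbb{E}|U - m| = \int_0^1 |u - m|\,du = \frac{m^2 + (1-m)^2}{2}
\end{equation}
which is at least $1/4$ for every $m \in [0,1]$.
Any final-rating distribution whose expected absolute deviation from $m$ falls below this value would therefore show apparent movement towards the median under this model, without any influence from the displayed median.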
The third challenge is that the median $m_i$ changes as ratings arrive and thus can be different for each participant.
The median rating is calculated over all prior participants and thus is dependent on when the participant submitted their first rating.
In practice, the median will eventually converge for a large number of participants, but it would be incorrect to measure concentration around a final median.
To address these three challenges, we propose a nonparametric model based on the Wilcoxon statistic to test the hypothesis that the group of participants that changed their ratings are more tightly centered around the median value that those participants observed.
Our tests compare absolute deviations around the median for $P_n$, $P_c$, and $R$, which, as a relative comparison, controls for the natural tendency for ratings to group around the median.
Furthermore, compared to a direct test of correlation, this approach is more robust to the effects of alternate models such as the one described in our second challenge (see Section \ref{exp-robust}).
\subsection{Non-parametric Significance Test}
Recall that $P_n$ is the set of participants that did not change their ratings and $P_c$ is the set of participants that changed their ratings.
We define the sets $X_c$ and $X_n$ of absolute deviations of the final rating from the observed median for each group:
\begin{equation}
X_c = \{|m[j] - g_f[j]|\}\text{ }\forall j \in P_c
\end{equation}
\begin{equation}
X_n = \{|m[j] - g_f[j]|\}\text{ }\forall j \in P_n
\end{equation}
For the purposes of hypothesis testing, we ignore the sign of the deviation.
However, in Section \ref{changemod}, where we build a predictive model for the changes, we include the sign.
Now, for the set $X_c$, we calculate the Wilcoxon rank-sum statistic.
We assign a rank to each of the absolute deviations in the union set $\textbf{X} = X_c \cup X_n$ (i.e., the largest deviation has rank 1 and the smallest has rank $|X_c \cup X_n|$).
For $X_c$, we sum the ranks $R_j$ of the deviations within its set:
\begin{equation}
W_c = \sum_{j \in P_c} R_j
\end{equation}
The \emph{Null Hypothesis} is that absolute deviations in $X_c$ are the same size as those in $X_n$.
Under this null hypothesis, $median(X_n) = median(X_c)$, and the ranks will be evenly distributed between the two groups.
Therefore, the null expected value and variance of $W$ are:
\begin{equation}
\mathbb{E}(W) = \frac{(|\textbf{X}| + 1)\cdot |X_c|}{2}
\end{equation}
\begin{equation}
var(W) = \frac{(|\textbf{X}| + 1)\cdot |X_c| \cdot |X_n|}{12}
\end{equation}
At significance level $\alpha$, we test whether our calculated $W_c$ is consistent with the null distribution.
In other words, the test calculates the probability that a random subset of users (ignoring the categorization into $P_n$ and $P_c$) would have a rank-sum at least as extreme as the one observed.
A significant result means that, for the participants that changed their ratings, the changed ratings are more tightly centered around the median they observed.
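As a purely illustrative check of the formulas above, consider a hypothetical toy example (the values are invented for illustration and are not taken from our data).
Let $X_c = \{1, 2\}$ and $X_n = \{3, 4, 5\}$, with no ties.
Ranking the largest deviation first, the deviations $5, 4, 3, 2, 1$ receive ranks $1, \ldots, 5$, so
\begin{equation}
W_c = 4 + 5 = 9, \quad \mathbb{E}(W) = \frac{(5 + 1)\cdot 2}{2} = 6, \quad var(W) = \frac{(5 + 1)\cdot 2 \cdot 3}{12} = 3
\end{equation}
Assuming the usual normal approximation to the null distribution (reasonable only for larger samples, and requiring a tie correction for discrete data), the standardized statistic would be
\begin{equation}
z = \frac{W_c - \mathbb{E}(W)}{\sqrt{var(W)}} = \frac{9 - 6}{\sqrt{3}} \approx 1.73
\end{equation}
For samples this small, the exact null distribution is preferable: of the $\binom{5}{2} = 10$ equally likely rank assignments for $X_c$, only one attains $W_c \geq 9$, giving a one-sided p-value of $1/10$.
Under this ranking convention, a large $W_c$ corresponds to small deviations in $X_c$.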
For many distributions, the Wilcoxon statistic is more robust than the t-statistic, as it uses ranks rather than the actual values, making it more resilient to outliers.
Even in the case where the data is normally distributed, the optimal condition for the t-test, the asymptotic relative efficiency of the Wilcoxon rank-sum statistic compared to the typically used t-statistic is $\frac{3}{\pi}\approx 95.5\%$.
We trade off a small amount of efficiency in the normally distributed case for increased efficiency and robustness in many non-normal distributions (e.g., $3\times$ more efficient for the exponential distribution).
Recommender system data is almost always collected from discrete inputs which are usually not normally distributed.
The same analysis can be used to test $X_c$ against the absolute deviations in the reference survey $X_r$:
\begin{equation}
X_r = \{|m[j] - g_i[j]|\}\text{ }\forall j \in R
\end{equation}
or for initial vs. final ratings in the change group $X_c'$:
\begin{equation}
X_c' = \{|m[j] - g_i[j]|\}\text{ }\forall j \in P_c
\end{equation}
\subsection{Quantifying Concentration of Ratings}
In addition to testing for social influence bias, we can also estimate by how much the absolute deviations differ.
The Wilcoxon statistic can be inverted to estimate a most likely \emph{shift parameter} $\Delta$, that is, a shift $\Delta$ in the distribution of absolute deviations $X_c$ that maximally aligns them with $X_n$.
In other words, $X_c + \Delta$ is the shifted set most consistent with the null hypothesis (no social influence bias), so $\Delta$ measures the distance from this hypothesis.
An intuitive interpretation of $\Delta$ is that it measures how much our deviations have to be increased so that the no social influence bias hypothesis is the most likely conclusion.
Since $X_c$ is a set of absolute deviations, $\Delta$ tells us how much more concentrated $X_c$ is than $X_n$ around the observed medians.
This parameter is relevant to the design of recommendation algorithms that use similarity (e.g., clustering or nearest neighbors), as it characterizes how much closer to the median participants are on average.
We refer to \cite{lehmann2006nonparametrics} for the derivation of $\Delta$ and its confidence interval:
\begin{equation}
D = \{x_n[j] - x_c[i]\} \text{ } \forall j \in X_n, i \in X_c
\end{equation}
\begin{equation}
\Delta = median(D)
\end{equation}
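As a hypothetical illustration of this estimator (the values are invented and are not taken from our data), let $X_c = \{1, 2\}$ and $X_n = \{3, 4, 5\}$.
The six pairwise differences, taken as a multiset, are
\begin{equation}
D = \{3-1,\ 4-1,\ 5-1,\ 3-2,\ 4-2,\ 5-2\} = \{2, 3, 4, 1, 2, 3\}
\end{equation}
Sorting gives $1, 2, 2, 3, 3, 4$, so $\Delta = median(D) = 2.5$, taking the median of an even number of values as the average of the two middle ones.
In this toy case, the deviations in $X_c$ would have to be increased by about $2.5$ to align with those in $X_n$.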
%\subsection{Discussion About the Grade Distribution}
%In Figure \ref{dist-1}, we show the distribution of absolute deviations for the Marriage Rights issue.
%\begin{figure}[ht!]
% \centering
% \includegraphics[scale=0.40]{../plots/absolute-deviations.eps}
% \caption{\textbf{TODO}}
% \label{dist-1}
%\end{figure}
%We see that the distribution is multimodal and discrete.
%Parametric tests such as the z-test and the t-test have been shown to have weaker statistical power in many families of multimodal distributions such as mixtures of Gaussians \cite{???}.
%Rank-based tests tend to be more robust to the multimodality and in fact don't depend on the actual values on the relative frequency of the ranks in the test set.