
Update paper, example and AE
tgsmith61591 committed Jun 22, 2017
1 parent 5c15e37 commit b1338ff
Showing 3 changed files with 52 additions and 51 deletions.
2 changes: 1 addition & 1 deletion doc/smrt.tex
@@ -53,7 +53,7 @@ \section{Introduction}

A dataset may be considered imbalanced if its classification labels are disproportionately represented across classes. While class imbalance is very common and manifests itself in varying degrees, of particular interest are the cases in which one class---the majority class---is significantly more present than one or more minority classes, which are represented at a much smaller ratio. This can detrimentally impact a learning algorithm's ability to estimate a generalizable decision boundary. Consider, for instance, a medical test to determine whether a patient suffers from a rare disease. The dataset may be 99.8\% composed of negative observations with only 0.2\% positive examples. Even the most na\"ive classifier can achieve 99.8\% classification accuracy in this case by simply learning to always predict the negative class \citep{lewis1994heterogeneous}. However, such a test has no utility, since it will never accurately predict the condition of interest. This is by no means an isolated case, either; many machine learning domains---such as fraud detection, network security, and spam filtering---frequently face some level of class imbalance. This paper examines some of the pitfalls of training machine learning models on such datasets and presents a remedying class-balancing technique.

- Throughout this paper, we focus on inducing classification algorithms on a given training set, $X \in \mathbb{R}^{m \times n}$, with a corresponding set of class labels, $y \in \{0, 1, ..., c\}$ in which one or more of the minority class labels is/are represented at a significantly smaller proportion than that of one or more majority class labels. As noted by countless studies and the aforementioned medical test example, classifier efficacy often cannot meaningfully be expressed or assessed via conventional, cost-insensitive metrics such as accuracy (or the percentage of testing observations properly identified by the learner). This greatly complicates the classification task since such metrics will offer misleadingly optimistic scores on an otherwise ineffective classifier. Therefore, what makes class imbalance a particularly interesting and relevant problem is the frequent tangible cost with which misclassification of rare events is typically associated. The real-world impact of such errors can be especially perilous in the medical domain, where diagnostic datasets are especially susceptible to class disparity, as high risk examples (e.g., instances of rare diseases) tend to constitute the minority class \citep{rahman2013addressing}.
+ Throughout this paper, we focus on inducing classification algorithms on a given training set, $X \in \mathbb{R}^{m \times n}$, with a corresponding set of class labels, $y \in \{0, 1, ..., c\}$ in which one or more of the minority class labels is/are represented at a significantly smaller proportion than that of one or more majority class labels. As noted by countless studies and the aforementioned medical test example, classifier efficacy often cannot meaningfully be measured via conventional, cost-insensitive metrics such as accuracy (or the percentage of testing observations properly identified by the learner). This greatly complicates the classification task since such metrics will offer misleadingly optimistic scores on an otherwise ineffective classifier. Therefore, what makes class imbalance a particularly interesting and relevant problem is the frequent tangible cost with which misclassification of rare events is typically associated. The real-world impact of such errors can be especially perilous in the medical domain, where diagnostic datasets are especially susceptible to class disparity, as high risk examples of interest (e.g., instances of rare diseases) tend to constitute the minority class \citep{rahman2013addressing}.

Section 2 presents previous work to which our approach may be compared. Section 3 introduces generative models, and more specifically, variational auto-encoders. Section 4 outlines the details of our technique. Section 5 details the specifics of our experiments and the performance of our technique compared with other common class imbalance solutions. \\
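
The introduction's figure above is easy to check: a constant majority-class predictor reaches 99.8% accuracy on a 99.8/0.2 split. Below is a minimal Python sketch (an editorial illustration, not part of this commit; the sample counts are chosen to match the quoted ratio).

    import numpy as np
    from sklearn.metrics import accuracy_score

    # 4,990 negative observations and 10 positives: a 99.8% / 0.2% split.
    y_true = np.array([0] * 4990 + [1] * 10)

    # The "most naive" classifier: always predict the majority (negative) class.
    y_pred = np.zeros_like(y_true)

    print(accuracy_score(y_true, y_pred))  # 0.998, despite never detecting a positive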

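The revised paragraph argues that cost-insensitive metrics mislead on imbalanced data. A short scikit-learn sketch makes this concrete (again illustrative, not from the commit): a majority-class baseline scores near-perfect accuracy while minority-class recall and F1 are zero.

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, f1_score, recall_score

    # An imbalanced two-class problem mirroring the paper's setup.
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.998, 0.002], random_state=42)

    # Cost-insensitive baseline: always predict the most frequent class.
    baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
    y_pred = baseline.predict(X)

    print(accuracy_score(y, y_pred))  # ~0.998: misleadingly optimistic
    print(recall_score(y, y_pred))    # 0.0: no minority example is ever recovered
    print(f1_score(y, y_pred))        # 0.0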
97 changes: 49 additions & 48 deletions examples/MNIST example.ipynb

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions smrt/autoencode/autoencoder.py
@@ -137,7 +137,7 @@ class AutoEncoder(BaseAutoEncoder):
learning_function : str, optional (default='rms_prop')
The optimizing function for training. Default is ``'rms_prop'``, which will use
- the ``tf.train.RMSPropOptimizer``. Can be one of { ``'adadelta'``, ``'adagrad'``,
+ the ``tf.train.RMSPropOptimizer``. Can be one of {``'adadelta'``, ``'adagrad'``,
``'adagrad-da'``, ``'adam'``, ``'momentum'``, ``'proximal-sgd'``, ``'proximal-adagrad'``,
``'rms_prop'``, ``'sgd'``}
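
The strings above name TensorFlow 1.x optimizers. One plausible way such a lookup could be implemented is sketched below; only the 'rms_prop' / tf.train.RMSPropOptimizer pairing is confirmed by the docstring, and the rest of the mapping is assumed from the tf.train class names rather than taken from smrt's source.

    import tensorflow as tf  # TensorFlow 1.x, contemporary with this 2017 commit

    # Assumed string-to-class lookup; only 'rms_prop' is confirmed above.
    LEARNING_FUNCTIONS = {
        'adadelta': tf.train.AdadeltaOptimizer,
        'adagrad': tf.train.AdagradOptimizer,
        'adagrad-da': tf.train.AdagradDAOptimizer,
        'adam': tf.train.AdamOptimizer,
        'momentum': tf.train.MomentumOptimizer,
        'proximal-sgd': tf.train.ProximalGradientDescentOptimizer,
        'proximal-adagrad': tf.train.ProximalAdagradOptimizer,
        'rms_prop': tf.train.RMSPropOptimizer,
        'sgd': tf.train.GradientDescentOptimizer,
    }

    def get_optimizer_class(name):
        """Resolve a ``learning_function`` string to a tf.train optimizer class."""
        try:
            return LEARNING_FUNCTIONS[name]
        except KeyError:
            raise ValueError('unknown learning_function: %r' % name)

    # Some classes need extra constructor arguments (e.g., MomentumOptimizer
    # requires `momentum`; AdagradDAOptimizer requires `global_step`), so the
    # class is returned here rather than instantiated.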
@@ -334,7 +334,7 @@ class VariationalAutoEncoder(BaseAutoEncoder):
learning_function : str, optional (default='rms_prop')
The optimizing function for training. Default is ``'rms_prop'``, which will use
- the ``tf.train.RMSPropOptimizer``. Can be one of { ``'adadelta'``, ``'adagrad'``,
+ the ``tf.train.RMSPropOptimizer``. Can be one of {``'adadelta'``, ``'adagrad'``,
``'adagrad-da'``, ``'adam'``, ``'momentum'``, ``'proximal-sgd'``, ``'proximal-adagrad'``,
``'rms_prop'``, ``'sgd'``}
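For context, a hypothetical construction of the class this hunk documents. Only the learning_function argument and its accepted values are confirmed by the docstring; the import path and the scikit-learn-style fit call are assumptions for illustration.

    import numpy as np
    from smrt.autoencode import VariationalAutoEncoder  # assumed import path

    # Toy minority-class matrix; in practice, the minority rows of the training set.
    X_minority = np.random.rand(10, 20).astype(np.float32)

    # Only `learning_function` and its accepted strings are documented above;
    # the `fit` interface is assumed to follow scikit-learn conventions.
    vae = VariationalAutoEncoder(learning_function='adam')
    vae.fit(X_minority)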
