Merge pull request scikit-learn#2222 from FedericoV/Out_of_core_example
[MRG] Added comparison of other classifiers using partial fit methods
ogrisel committed Sep 18, 2013
2 parents 8765d95 + e18d11a commit 9cb3b96
Showing 5 changed files with 280 additions and 60 deletions.
3 changes: 3 additions & 0 deletions doc/modules/naive_bayes.rst
@@ -190,6 +190,9 @@ incrementally as done with other classifiers as demonstrated in
Contrary to the ``fit`` method, the first call to ``partial_fit`` needs to be
passed the list of all the expected class labels.

For an overview of available strategies in scikit-learn, see also the
:ref:`out-of-core learning <scaling_strategies>` documentation.

.. note::

The ``partial_fit`` method call of naive Bayes models introduces some
124 changes: 124 additions & 0 deletions doc/modules/scaling_strategies.rst
@@ -0,0 +1,124 @@
.. _scaling_strategies:

==================
Scaling Strategies
==================

For some applications the amount of examples, features (or both) and/or the
speed at which they need to be processed are challenging for traditional
approaches. In these cases scikit-learn has a number of options you can
consider to make your system scale.

Scaling with instances using out-of-core learning
=================================================

Out-of-core (or "external memory") learning is a technique used to learn from
data that cannot fit in a computer's main memory (RAM).

Here is a sketch of a system designed to achieve this goal:

1. a way to stream instances
2. a way to extract features from instances
3. an incremental algorithm

Streaming instances
-------------------
Basically, 1. may be a reader that yields instances from files on a hard
drive, a database, a network stream, etc. However, details on how to achieve
this are beyond the scope of this documentation.
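
For illustration, here is a minimal sketch of such a reader, assuming a plain
text file with one instance per line (the file name and the helper itself are
purely illustrative)::

  def iter_minibatches(path, minibatch_size=1000):
      """Lazily yield lists of raw lines read from ``path``."""
      minibatch = []
      with open(path) as source:
          for line in source:
              minibatch.append(line)
              if len(minibatch) == minibatch_size:
                  yield minibatch
                  minibatch = []
      if minibatch:
          yield minibatch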

Extracting features
-------------------
\2. could be any relevant way to extract features among the
different :ref:`feature extraction <feature_extraction>` methods supported by
scikit-learn. However, when working with data that needs vectorization and
where the set of features or values is not known in advance one should take
explicit care. A good example is text classification where unknown terms are
likely to be found during training. It is possible to use a stateful
vectorizer if making multiple passes over the data is reasonable from an
application point of view. Otherwise, one has to use a stateless feature
extractor, which makes the task harder. Currently the preferred way to do this is to
use the so-called :ref:`hashing trick<feature_hashing>` as implemented by
:class:`sklearn.feature_extraction.FeatureHasher` for datasets with categorical
variables represented as lists of Python dicts or
:class:`sklearn.feature_extraction.text.HashingVectorizer` for text documents.
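
For instance, a stateless text vectorizer can be instantiated once and then
used to transform each mini-batch of documents independently, without ever
seeing the full corpus (a minimal sketch; the ``n_features`` value is only
illustrative)::

  from sklearn.feature_extraction.text import HashingVectorizer

  vectorizer = HashingVectorizer(n_features=2 ** 18)
  # ``transform`` is stateless: no prior ``fit`` on the whole corpus is needed
  X = vectorizer.transform(["first document", "second document"])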

Incremental learning
--------------------
Finally, for 3. we have a number of options inside scikit-learn. Although not
all algorithms can learn incrementally (i.e. without seeing all the instances
at once), all estimators implementing the ``partial_fit`` API are candidates.
Indeed, the ability to learn incrementally from a mini-batch of instances
(sometimes called "online learning") is key to out-of-core learning as it
guarantees that at any given time there will be only a small number of
instances in the main memory. Choosing a good size for the mini-batch that
balances relevancy and memory footprint could involve some tuning [1]_.
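
Schematically, out-of-core learning with such an estimator boils down to
repeatedly calling ``partial_fit`` on successive mini-batches, for instance
with :class:`sklearn.cluster.MiniBatchKMeans` (a minimal sketch reusing the
hypothetical ``iter_minibatches`` reader and the stateless ``vectorizer``
sketched above)::

  from sklearn.cluster import MiniBatchKMeans

  clusterer = MiniBatchKMeans(n_clusters=10)
  for text_batch in iter_minibatches('documents.txt'):
      # vectorize and learn from one mini-batch at a time
      X_batch = vectorizer.transform(text_batch)
      clusterer.partial_fit(X_batch)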

Here is a list of incremental estimators for different tasks:

- Classification
+ :class:`sklearn.naive_bayes.MultinomialNB`
+ :class:`sklearn.naive_bayes.BernoulliNB`
+ :class:`sklearn.linear_model.Perceptron`
+ :class:`sklearn.linear_model.SGDClassifier`
+ :class:`sklearn.linear_model.PassiveAggressiveClassifier`
- Regression
+ :class:`sklearn.linear_model.SGDRegressor`
+ :class:`sklearn.linear_model.PassiveAggressiveRegressor`
- Clustering
+ :class:`sklearn.cluster.MiniBatchKMeans`
- Decomposition / Feature Extraction
+ :class:`sklearn.decomposition.MiniBatchDictionaryLearning`
+ :class:`sklearn.cluster.MiniBatchKMeans`

For classification, it is important to note that although a stateless feature
extraction routine may be able to cope with new/unseen attributes, the
incremental learner itself may be unable to cope with new/unseen target
classes. In this case you have to pass all the possible classes to the first
``partial_fit`` call using the ``classes=`` parameter.
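
For instance, a minimal sketch with :class:`sklearn.linear_model.SGDClassifier`
(``iter_labeled_minibatches`` is a hypothetical reader yielding pairs of raw
documents and labels, and the label values are purely illustrative)::

  import numpy as np
  from sklearn.linear_model import SGDClassifier

  all_classes = np.array([0, 1])  # every label that may ever appear in the stream
  classifier = SGDClassifier()
  for i, (text_batch, y_batch) in enumerate(iter_labeled_minibatches()):
      X_batch = vectorizer.transform(text_batch)
      if i == 0:
          # the first call must announce the full set of expected classes
          classifier.partial_fit(X_batch, y_batch, classes=all_classes)
      else:
          classifier.partial_fit(X_batch, y_batch)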

Another aspect to consider when choosing a proper algorithm is that not all of
them put the same importance on each example over time. Namely, the
`Perceptron` is still sensitive to badly labeled examples even after many
examples, whereas the `SGD*` and `PassiveAggressive*` families are more robust
to this kind of artifact. Conversely, the latter also tend to give less
importance to remarkably different, yet properly labeled examples when they
come late in the stream, as their learning rate decreases over time.

Examples
--------
Finally, we have a full-fledged example of
:ref:`example_applications_plot_out_of_core_classification.py`. It is aimed at
providing a starting point for people wanting to build out-of-core learning
systems and demonstrates most of the notions discussed above.

It also shows the evolution of the performance of the different algorithms
with the number of processed examples.

.. |accuracy_over_time| image:: ../auto_examples/applications/images/plot_out_of_core_classification_1.png
:target: ../auto_examples/applications/plot_out_of_core_classification.html
:scale: 80

.. centered:: |accuracy_over_time|

Now looking at the computation time of the different parts, we see that the
vectorization is much more expensive than learning itself. Of the different
algorithms, `MultinomialNB` is the most expensive, but its overhead can be
mitigated by increasing the size of the mini-batches (exercise: change
`minibatch_size` to 100 and 10000 in the program and compare).

.. |computation_time| image:: ../auto_examples/applications/images/plot_out_of_core_classification_3.png
:target: ../auto_examples/applications/plot_out_of_core_classification.html
:scale: 80

.. centered:: |computation_time|


Notes
-----

.. [1] Depending on the algorithm, the mini-batch size may or may not influence
   results. SGD*, PassiveAggressive*, and discrete NaiveBayes are truly online
   and are not affected by batch size. Conversely, the convergence rate of
   MiniBatchKMeans is affected by the batch size. Also, its memory
   footprint can vary dramatically with batch size.
1 change: 1 addition & 0 deletions doc/user_guide.rst
@@ -66,3 +66,4 @@
model_selection.rst
data_transforms.rst
Dataset loading utilities <datasets/index.rst>
modules/scaling_strategies.rst
2 changes: 2 additions & 0 deletions doc/whats_new.rst
@@ -2283,3 +2283,5 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Mikhail Korobov: http://kmike.ru/pages/about/

.. _Kyle Kastner: http://kastnerkyle.github.io

.. _@FedericoV: https://github.com/FedericoV/