Update README
PetrochukM committed Oct 23, 2019
1 parent d337bb9 commit fd14acf
Showing 2 changed files with 98 additions and 43 deletions.
132 changes: 93 additions & 39 deletions README.md
<p align="center"><img width="55%" src="docs/_static/img/logo.svg" /></p>

<h3 align="center">Supporting Rapid Prototyping with a Deep Learning NLP Toolkit&nbsp;&nbsp;
<a href="https://twitter.com/intent/tweet?text=Supporting%20rapid%20prototyping%20for%20research,%20PyTorch-NLP%20has%20LAUNCHED,%20a%20deep%20learning%20natural%20language%20processing%20(NLP)%20toolkit!%20&url=https://github.com/PetrochukM/PyTorch-NLP&hashtags=pytorch,nlp,research">
<img style='vertical-align: text-bottom !important;' src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social" alt="Tweet">
</a>
</h3>
<h3 align="center">Basic Utilities for PyTorch NLP Software</h3>

PyTorch-NLP, or `torchnlp` for short, is a library of basic utilities for PyTorch Natural Language Processing (NLP). `torchnlp` extends PyTorch to provide you with basic text data processing functions.

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pytorch-nlp.svg?style=flat-square)
[![Codecov](https://img.shields.io/codecov/c/github/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://codecov.io/gh/PetrochukM/PyTorch-NLP)
[![Downloads](http://pepy.tech/badge/pytorch-nlp)](http://pepy.tech/project/pytorch-nlp)
[![Documentation Status](https://img.shields.io/readthedocs/pytorchnlp/latest.svg?style=flat-square)](http://pytorchnlp.readthedocs.io/en/latest/?badge=latest&style=flat-square)
[![Build Status](https://img.shields.io/travis/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://travis-ci.org/PetrochukM/PyTorch-NLP)
[![Twitter: PetrochukM](https://img.shields.io/twitter/follow/MPetrochuk.svg?style=social)](https://twitter.com/MPetrochuk)

_Logo by [Chloe Yeo](http://www.yeochloe.com/)_

## Installation 🐾

Make sure you have Python 3.5+ and PyTorch 1.0+. You can then install `pytorch-nlp` using pip:

    pip install pytorch-nlp
Or to install the latest code via:

    pip install git+https://github.com/PetrochukM/PyTorch-NLP.git

## Docs

The complete documentation for PyTorch-NLP is available via [our ReadTheDocs website](https://pytorchnlp.readthedocs.io).

## Get Started

Within an NLP data pipeline, you'll want to implement these basic steps:

### Load Your Data 🐿

Load the IMDB dataset, for example:

```python
from torchnlp.datasets import imdb_dataset

# Load the imdb training dataset
train = imdb_dataset(train=True)
train[0] # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}
```
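
If you need more than one split, the same loader can return several at once. A minimal sketch, assuming the standard `torchnlp` dataset API in which requesting multiple splits returns a tuple:

```python
# Request both splits in one call; a tuple of datasets is returned.
train, test = imdb_dataset(train=True, test=True)
```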

Load a custom dataset, for example:

```python
from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

open(directory_path / train_file_path)
```

Don't worry, we'll handle caching for you!

### Text To Tensor

Tokenize and encode your text as a tensor. For example, a `WhitespaceEncoder` breaks
text into terms whenever it encounters a whitespace character.

```python
from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
```
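
The encoder also supports the reverse mapping, which is handy for inspecting model output. A minimal sketch; the exact token indices depend on the vocabulary built above:

```python
# Decode a tensor of token indices back into a string.
encoder.decode(encoded_data[0])  # RETURNS: "now this ain't funny"
```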

### Tensor To Batch

With your loaded and encoded data in hand, you'll want to batch your dataset.

```python
import torch
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
```

PyTorch-NLP builds on top of PyTorch's existing `torch.utils.data.sampler`, `torch.stack` and `default_collate` to support sequential inputs of varying lengths!
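
To see the padding helper in isolation, here is a minimal sketch; depending on your `torchnlp` version the return value may be a plain tuple or a namedtuple, but both unpack the same way:

```python
import torch
from torchnlp.encoders.text import stack_and_pad_tensors

# Pad two variable-length sequences to a common length and stack them;
# `padded` is a 2x3 tensor and `lengths` records the original sizes [3, 2].
padded, lengths = stack_and_pad_tensors([torch.tensor([1, 2, 3]), torch.tensor([4, 5])])
```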

### You're Good To Go!

With your batch in hand, you can use PyTorch to develop and train your model using gradient descent.
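
For instance, a minimal training loop over the batches built above might look like the following sketch; the model and objective are hypothetical stand-ins, not part of `torchnlp`:

```python
import torch

# A toy model, for illustration only; any PyTorch module that accepts
# the padded batch tensor would do.
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for padded, lengths in batches:
    optimizer.zero_grad()
    prediction = model(padded.unsqueeze(-1))  # treat each value as a 1-d feature
    loss = prediction.sum()  # stand-in objective, for illustration only
    loss.backward()
    optimizer.step()
```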

### Last But Not Least

PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗

#### Pre-Trained Word Vectors

Now that you've computed your vocabulary, you may want to make use of
pre-trained word vectors, like so:

```python
import torch
from torchnlp.word_to_vector import GloVe

vocab = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
```
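
From here, one common pattern is to drop those weights into a standard PyTorch embedding layer:

```python
# Initialize an embedding layer from the pre-trained weight matrix.
embedding = torch.nn.Embedding.from_pretrained(embedding_weights)
```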

#### Neural Network Layers

For example, from the neural network package, apply the state-of-the-art `LockedDropout`:

```python
import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)

# Apply a LockedDropout to `input_`
dropout(input_)  # RETURNS: torch.FloatTensor (6x3x10)
```
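
Unlike standard dropout, `LockedDropout` samples one mask and reuses it at every time step. A hypothetical sketch of placing it between two recurrent layers (the layer sizes are made up for illustration):

```python
import torch
from torchnlp.nn import LockedDropout

lstm_first = torch.nn.LSTM(10, 10)
lstm_second = torch.nn.LSTM(10, 10)
dropout = LockedDropout(0.5)

x = torch.randn(6, 3, 10)  # (sequence length, batch size, features)
output, _ = lstm_first(x)
output, _ = lstm_second(dropout(output))  # same dropout mask at each step
```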

#### Metrics

Compute common NLP metrics such as the BLEU score.

```python
from torchnlp.metrics import get_moses_multi_bleu

hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]

# Compute BLEU score with the official BLEU perl script
get_moses_multi_bleu(hypotheses, references, lowercase=True)  # RETURNS: 47.9
```

AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.

## Authors

- [Michael Petrochuk](https://github.com/PetrochukM/) — Developer
- [Chloe Yeo](http://www.yeochloe.com/) — Logo Design

## Citing

9 changes: 5 additions & 4 deletions torchnlp/download.py
def download_file_maybe_extract(url, directory, filename=None, extension=None, check_files=[]):
    """ Download the file at ``url`` to ``directory``. Extract to ``directory`` if tar or zip.

    Args:
        url (str or Path): Url of file.
        directory (str): Directory to download to.
        filename (str, optional): Name of the file to download; otherwise, a filename is extracted
            from the url.
        extension (str, optional): Extension of the file; otherwise, attempts to extract extension
            from the filename.
        check_files (list of str or Path): Check if these files exist, ensuring the download
            succeeded. If these files exist before the download, the download is skipped.

    Returns:
        (str): Filename of the downloaded file.
    """
    if filename is None:
        filename = _get_filename_from_url(url)

    directory = str(directory)
    filepath = os.path.join(directory, filename)
    check_files = [os.path.join(directory, str(f)) for f in check_files]

    if len(check_files) > 0 and _check_download(*check_files):
        return filepath
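
With the change above, callers can pass `pathlib.Path` values for the directory and check files alike. A minimal sketch; the URL and file names here are hypothetical, for illustration only:

```python
from pathlib import Path

from torchnlp.download import download_file_maybe_extract

# `directory` and `check_files` may now be `str` or `Path`; the helper
# normalizes them internally. Hypothetical URL and paths.
download_file_maybe_extract(
    url='http://example.com/dataset.zip',
    directory=Path('data/'),
    check_files=[Path('dataset/train.txt')])
```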
