Update README
PetrochukM committed Oct 23, 2019
1 parent d337bb9 commit fd14acf
Showing 2 changed files with 98 additions and 43 deletions.
132 changes: 93 additions & 39 deletions README.md
<p align="center"><img width="55%" src="docs/_static/img/logo.svg" /></p>

<h3 align="center">Supporting Rapid Prototyping with a Deep Learning NLP Toolkit&nbsp;&nbsp;
<a href="https://twitter.com/intent/tweet?text=Supporting%20rapid%20prototyping%20for%20research,%20PyTorch-NLP%20has%20LAUNCHED,%20a%20deep%20learning%20natural%20language%20processing%20(NLP)%20toolkit!%20&url=https://github.com/PetrochukM/PyTorch-NLP&hashtags=pytorch,nlp,research">
<img style='vertical-align: text-bottom !important;' src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social" alt="Tweet">
</a>
</h3>
<h3 align="center">Basic Utilities for PyTorch NLP Software</h3>

PyTorch-NLP, or `torchnlp` for short, is a library of basic utilities for PyTorch Natural Language Processing (NLP). `torchnlp` extends PyTorch to provide you with basic text data processing functions.

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pytorch-nlp.svg?style=flat-square)
[![Codecov](https://img.shields.io/codecov/c/github/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://codecov.io/gh/PetrochukM/PyTorch-NLP)
[![Downloads](http://pepy.tech/badge/pytorch-nlp)](http://pepy.tech/project/pytorch-nlp)
[![Documentation Status](https://img.shields.io/readthedocs/pytorchnlp/latest.svg?style=flat-square)](http://pytorchnlp.readthedocs.io/en/latest/?badge=latest&style=flat-square)
[![Build Status](https://img.shields.io/travis/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://travis-ci.org/PetrochukM/PyTorch-NLP)
[![Twitter: PetrochukM](https://img.shields.io/twitter/follow/MPetrochuk.svg?style=social)](https://twitter.com/MPetrochuk)

_Logo by [Chloe Yeo](http://www.yeochloe.com/)_

## Installation 🐾

Make sure you have Python 3.5+ and PyTorch 1.0+. You can then install `pytorch-nlp` using pip:

    pip install pytorch-nlp
Or to install the latest code via:

    pip install git+https://github.com/PetrochukM/PyTorch-NLP.git

## Docs

The complete documentation for PyTorch-NLP is available via [our ReadTheDocs website](https://pytorchnlp.readthedocs.io).

## Get Started

Within an NLP data pipeline, you'll want to implement these basic steps:

### Load Your Data 🐿

Load the IMDB dataset, for example:

```python
from torchnlp.datasets import imdb_dataset

# Load the imdb training dataset
train = imdb_dataset(train=True)
train[0] # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}
```
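
If you need more than one split, the same loader can return several at once. A minimal sketch, assuming the standard `torchnlp` dataset API in which requesting multiple splits returns a tuple:

```python
# Request both splits in one call; a tuple of datasets is returned.
train, test = imdb_dataset(train=True, test=True)
```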

Load a custom dataset, for example:

```python
from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

open(directory_path / train_file_path)
```

Don't worry, we'll handle caching for you!

### Text To Tensor

Tokenize and encode your text as a tensor. For example, a `WhitespaceEncoder` breaks
text into terms whenever it encounters a whitespace character.

```python
from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
```
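
The encoder also supports the reverse mapping, which is handy for inspecting model output. A minimal sketch; the exact token indices depend on the vocabulary built above:

```python
# Decode a tensor of token indices back into a string.
encoder.decode(encoded_data[0])  # RETURNS: "now this ain't funny"
```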

### Tensor To Batch

With your loaded and encoded data in hand, you'll want to batch your dataset.

```python
import torch
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
```

PyTorch-NLP builds on top of PyTorch's existing `torch.utils.data.sampler`, `torch.stack` and `default_collate` to support sequential inputs of varying lengths!
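
To see the padding helper in isolation, here is a minimal sketch; depending on your `torchnlp` version the return value may be a plain tuple or a namedtuple, but both unpack the same way:

```python
import torch
from torchnlp.encoders.text import stack_and_pad_tensors

# Pad two variable-length sequences to a common length and stack them;
# `padded` is a 2x3 tensor and `lengths` records the original sizes [3, 2].
padded, lengths = stack_and_pad_tensors([torch.tensor([1, 2, 3]), torch.tensor([4, 5])])
```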

### You're Good To Go!

With your batch in hand, you can use PyTorch to develop and train your model using gradient descent.
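
For instance, a minimal training loop over the batches built above might look like the following sketch; the model and objective are hypothetical stand-ins, not part of `torchnlp`:

```python
import torch

# A toy model, for illustration only; any PyTorch module that accepts
# the padded batch tensor would do.
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for padded, lengths in batches:
    optimizer.zero_grad()
    prediction = model(padded.unsqueeze(-1))  # treat each value as a 1-d feature
    loss = prediction.sum()  # stand-in objective, for illustration only
    loss.backward()
    optimizer.step()
```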

### Last But Not Least

PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗

#### Pre-Trained Word Vectors

Now that you've computed your vocabulary, you may want to make use of
pre-trained word vectors, like so:

```python
import torch
from torchnlp.word_to_vector import GloVe

vocab = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
```
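
From here, one common pattern is to drop those weights into a standard PyTorch embedding layer:

```python
# Initialize an embedding layer from the pre-trained weight matrix.
embedding = torch.nn.Embedding.from_pretrained(embedding_weights)
```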

#### Neural Network Layers

For example, from the neural network package, apply the state-of-the-art `LockedDropout`:

```python
import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)

# Apply a LockedDropout to `input_`
dropout(input_)  # RETURNS: torch.FloatTensor (6x3x10)
```
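
Unlike standard dropout, `LockedDropout` samples one mask and reuses it at every time step. A hypothetical sketch of placing it between two recurrent layers (the layer sizes are made up for illustration):

```python
import torch
from torchnlp.nn import LockedDropout

lstm_first = torch.nn.LSTM(10, 10)
lstm_second = torch.nn.LSTM(10, 10)
dropout = LockedDropout(0.5)

x = torch.randn(6, 3, 10)  # (sequence length, batch size, features)
output, _ = lstm_first(x)
output, _ = lstm_second(dropout(output))  # same dropout mask at each step
```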

#### Metrics

Compute common NLP metrics such as the BLEU score.

```python
from torchnlp.metrics import get_moses_multi_bleu

hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]

# Compute BLEU score with the official BLEU perl script
get_moses_multi_bleu(hypotheses, references, lowercase=True)  # RETURNS: 47.9
```

AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.

## Authors

- [Michael Petrochuk](https://github.com/PetrochukM/) — Developer
- [Chloe Yeo](http://www.yeochloe.com/) — Logo Design

## Citing

9 changes: 5 additions & 4 deletions torchnlp/download.py
def download_file_maybe_extract(url, directory, filename=None, extension=None, check_files=[]):
    """ Download the file at ``url`` to ``directory``. Extract to ``directory`` if tar or zip.

    Args:
        url (str or Path): Url of file.
        directory (str): Directory to download to.
        filename (str, optional): Name of the file to download; otherwise, a filename is extracted
            from the url.
        extension (str, optional): Extension of the file; otherwise, attempts to extract extension
            from the filename.
        check_files (list of str or Path): Check if these files exist, ensuring the download
            succeeded. If these files exist before the download, the download is skipped.

    Returns:
        (str): Filename of the downloaded file.
    """
    if filename is None:
        filename = _get_filename_from_url(url)

    directory = str(directory)
    filepath = os.path.join(directory, filename)
    check_files = [os.path.join(directory, str(f)) for f in check_files]

    if len(check_files) > 0 and _check_download(*check_files):
        return filepath
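
With the change above, callers can pass `pathlib.Path` values for the directory and check files alike. A minimal sketch; the URL and file names here are hypothetical, for illustration only:

```python
from pathlib import Path

from torchnlp.download import download_file_maybe_extract

# `directory` and `check_files` may now be `str` or `Path`; the helper
# normalizes them internally. Hypothetical URL and paths.
download_file_maybe_extract(
    url='http://example.com/dataset.zip',
    directory=Path('data/'),
    check_files=[Path('dataset/train.txt')])
```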
