An example of how to use R in Jupyter notebooks and then make a Binder environment to run them interactively on the web. This repo was inspired by a Tweet in a discussion about Episode 7 of The Bayes Factor podcast.
Disclaimer: I am a physicist and primarily a Python and C++ programmer and I don't use/know R. This repo is just what I know from being able to read code and understanding how Jupyter works.
Before learning how to set up R in Jupyter, first go check out how cool it is in Binder! Just click the "launch Binder" badge above.
- Requirements
- Setup and Installation
- Enable package dependency management with packrat
- Using papermill with Jupyter
- Testing Jupyter notebooks with pytest
- Automating testing alongside development with CI
- Setting up a Binder environment
- Preservation and DOI with Zenodo
- R Markdown in Jupyter with jupytext
- Further Reading and Resources
- Acknowledgements
Before you can begin you need to make sure that you have the following in your working environment
- Install Jupyter
  - If you aren't familiar with Python and pip then just follow the instructions for installing with Anaconda
- Install R
- Install the R kernel for Jupyter (IRkernel)
Enable package dependency management with packrat
The first step in any project should be making sure that the library dependencies are specified and reproducible. This can be done in R with the packrat library.
First install packrat
R -e 'install.packages("packrat")'
Then from your project directory initialize Packrat for your project
R -e 'packrat::init()'
which will determine the R libraries used in your project and then build the list of dependencies for you. This effectively creates an isolated computing environment (known as a "virtual environment" in other programming languages).
Running packrat::init() results in the directory packrat being created with the files init.R, packrat.lock, and packrat.opts inside of it. It will also create or edit a project .Rprofile and edit any existing .gitignore file. The following files should be kept under version control for your project to be reproducible anywhere:
.Rprofile
packrat/init.R
packrat/packrat.lock
packrat/packrat.opts
As you work on your project and use more libraries you can update your dependencies managed by packrat with (inside the R environment)
packrat::snapshot()
which updates packrat/packrat.lock and packrat/packrat.opts. You can also check if you have introduced new dependencies with
packrat::status()
If you remove libraries from use that were managed by packrat you can check this with
packrat::status()
and then remove them from packrat with
packrat::clean()
to ensure the minimal environment necessary for reproducibility is kept. Checking the status should now show
packrat::status()
Up to date.
If you have a packrat.lock file but the environment it describes doesn't exist yet, you can build the environment by running (from the command line)
R -e 'packrat::restore()'
This is one way in which you could set up the same packrat environment on a different machine from the Git repository.
Using papermill with Jupyter
Papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks.
This means that you can use papermill to externally run, manipulate, and test Jupyter notebooks. This allows you to use Jupyter notebooks as components of an automated data analysis pipeline or for procedurally testing variations.
- To use with Jupyter notebooks running in the IRkernel install the R bindings for papermill (papermillR)
A toy example of how to use papermill is demonstrated in the example Jupyter notebook.
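As a rough illustration of "procedurally testing variations", the sketch below runs one notebook several times through papermill's Python API, once per parameter set, and writes each executed copy to its own file. It assumes that papermill is installed, that the IRkernel is registered under its default kernel name `ir`, and that the notebook has a cell tagged `parameters`. The helper names, the `output` directory, and the `n_samples` parameter mentioned afterwards are hypothetical and not taken from this repository.

```python
from pathlib import Path


def output_path_for(notebook, parameters, output_dir="output"):
    """Build a distinct output filename for one set of parameters (hypothetical helper)."""
    stem = Path(notebook).stem
    suffix = "_".join(f"{key}-{value}" for key, value in sorted(parameters.items()))
    name = f"{stem}_{suffix}.ipynb" if suffix else f"{stem}_executed.ipynb"
    return Path(output_dir) / name


def run_variations(notebook, parameter_sets, output_dir="output", kernel_name="ir"):
    """Execute the notebook once per parameter set and return the output paths."""
    # Imported here so that the filename helper above can be used
    # even on a machine where papermill is not installed yet.
    import papermill as pm

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    outputs = []
    for parameters in parameter_sets:
        output = output_path_for(notebook, parameters, output_dir)
        # papermill injects the parameters after the cell tagged "parameters"
        pm.execute_notebook(
            str(notebook),
            str(output),
            parameters=parameters,
            kernel_name=kernel_name,
        )
        outputs.append(output)
    return outputs
```

Calling `run_variations("R-in-Jupyter-Example.ipynb", [{"n_samples": 10}, {"n_samples": 100}])` would then leave two executed notebooks in `output/`, assuming the notebook actually defines a parameter called `n_samples`.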
Testing Jupyter notebooks with pytest
To provide testing for Jupyter notebooks we can use pytest in combination with papermill.
- Install pytest
- If you installed Jupyter with Conda then you can also install pytest with Conda
Once you have installed pytest and done some minimal reading of the docs then create a tests directory and write your test files in Python inside of it.
An example of some very simple tests using papermill is provided in tests/test_notebooks.py. Once you read through and understand what the testing file is doing execute the tests with pytest in the top level directory of the repo by running
pytest
To see the output that the individual testing functions would normally print to stdout run with the -s flag
pytest -s
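To give a feel for what such a test file could look like, here is a minimal sketch. It is not the contents of tests/test_notebooks.py; it only shows the general pattern of executing a notebook with papermill and then asserting on what the executed copy contains. The expected text `42` is a made-up placeholder, and the sketch assumes the notebook writes its result to stdout (for example with `print()`) and that the IRkernel is registered as `ir`.

```python
import json
from pathlib import Path

# Hypothetical values for illustration; adjust them to your own notebook.
NOTEBOOK = Path("R-in-Jupyter-Example.ipynb")
EXPECTED_TEXT = "42"


def stdout_text(notebook):
    """Concatenate everything the code cells of a notebook (loaded as a dict) wrote to stdout."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream" and output.get("name") == "stdout":
                text = output.get("text", "")
                # Stream text may be stored as one string or as a list of strings
                chunks.append(text if isinstance(text, str) else "".join(text))
    return "".join(chunks)


def test_notebook_output(tmp_path):
    import pytest

    # Skip instead of failing when the prerequisites are not available
    pm = pytest.importorskip("papermill")
    if not NOTEBOOK.exists():
        pytest.skip(f"{NOTEBOOK} not found")

    executed = tmp_path / "executed.ipynb"
    pm.execute_notebook(str(NOTEBOOK), str(executed), kernel_name="ir")

    notebook = json.loads(executed.read_text(encoding="utf-8"))
    assert EXPECTED_TEXT in stdout_text(notebook)
```

Keeping the parsing in a plain function such as `stdout_text` means it can be checked on a small hand-written dictionary without starting a kernel, while the test itself exercises the full notebook.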
There are numerous reasons to test your code, but as a scientist an obvious one is ensured reproducibility of experimental results. If your analysis code has unit tests and the analysis itself exists in an automatically testable pipeline then you and your colleagues should have more confidence in your analysis. Additionally, your analysis becomes (by necessity) a well documented and transparent process.
Want to learn more? Check out the Test and Code podcast hosted by Brian Okken.
pytest is the most comprehensive and scalable testing framework that I know of. I am biased, but I continue to be impressed with how nimble, powerful, and easy it is to work with. It makes me want to write tests. For the purposes of this demo repository it is also important as it allows for writing tests that use papermill (papermill's execute_notebook is only accessible through the Python API).
There are testing frameworks in R, most notably testthat, which I assume are good. So I would encourage you to explore those as well.
Assuming that you're using Git to develop your analysis code then you can have a continuous integration service (such as Travis CI or CircleCI) automatically test your code in a fresh environment every time you push to GitHub. Testing with CI is a great way to know that your analysis code is working exactly as expected in a reproducible environment from installation all the way through execution as you develop, revise, and improve it. To see the output of the build/install and testing of this repo in Travis click on the build status badge at the top of the README.
To start off with I would recommend using Travis CI (it is the easiest to get up and running).
- Getting started with Travis CI
- Example .travis.yml in this repo
- Travis CI docs on writing YAML CI files for R
There may be instances where you want to have your Git repository be private until work is complete or other information is made publicly available, and you still want to be able to use CI services.
Travis CI (currently) only works with GitHub and is free only for public repositories. CircleCI works with any Git web hosting service (e.g., GitHub, GitLab, Bitbucket) and allows for free use with public and private repositories up to a monthly use time budget. Additionally, GitLab offers their own CI service that is integrated into the GitLab platform. If your organization self-hosts an instance of GitLab (GitLab is open core) then you can use those CI tools with your private GitLab hosted repositories. If your organization has access to the enterprise version of GitLab then you can even run GitLab CI on GitHub hosted repositories.
Setting up a Binder environment
Binder turns your GitHub repository into a fully interactive computational environment (as you hopefully have already seen from the demo notebook). It then allows people to run any code that exists in the repository from their web browser without having to install any code and is a great tool for collaboration and sharing results.
The Binder team has done amazing work to make "Binderizing" a GitHub repository as simple as possible. In the case of getting an R computing environment many times all that you need (in addition to a DESCRIPTION file and maybe an install.R) is a runtime.txt file that dictates which daily snapshot of MRAN to use. See the binder directory for an example of what is needed to get this repository to run in Binder.
You'll note that the "launch Binder" badge at the top of the README automatically launches into the R-in-Jupyter-Example.ipynb notebook. This was configured to do so, but the default Binder behavior is to launch the Jupyter server and then show the directory structure of the repository.
To see that behavior launch Binder from here:
Once the server loads click on any file to open it in an editor or as a Jupyter notebook.
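The difference between the two launch behaviors comes down to the URL behind the badge. The sketch below builds both forms for mybinder.org, assuming the badge points at a file through the `filepath` query parameter. Whether this repository's badge is built exactly this way is an assumption, and the Binder documentation should be checked for the currently recommended parameter. The user and repository names in the usage note are placeholders.

```python
from urllib.parse import quote, urlencode


def binder_url(user, repo, ref="master", filepath=None):
    """Build a mybinder.org launch URL for a GitHub repository."""
    # Encode "/" in the ref so that a branch such as "feature/x" stays one path segment
    base = f"https://mybinder.org/v2/gh/{quote(user)}/{quote(repo)}/{quote(ref, safe='')}"
    if filepath is None:
        # Default behavior: the server starts and shows the repository's directory listing
        return base
    # With a file path the server opens that file directly
    return f"{base}?{urlencode({'filepath': filepath})}"
```

For example, `binder_url("some-user", "some-repo")` gives the directory-listing link, while `binder_url("some-user", "some-repo", filepath="R-in-Jupyter-Example.ipynb")` gives a link that opens the notebook straight away.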
Preservation and DOI with Zenodo
To further make your analysis code more robust you can preserve it and make it citable by getting a DOI for the project repository with Zenodo. Activating version tracking on your GitHub repository with Zenodo will allow it to automatically freeze a version of the repository with each new version tag and then archive it. Additionally, Zenodo will create a DOI for your project and versioned DOIs for the project releases which can be added as a DOI badge. This makes it trivial for others to cite your work and allows you to indicate what version of your code was used in any publications.
R Markdown in Jupyter with jupytext
R Markdown is a very popular way to present beautifully rendered R alongside Markdown in different forms of documents. However, it is source only and not dynamically interactive, as the R and Markdown need to be rendered together with Pandoc (Pandoc is awesome).
jupytext is a utility to open and run R Markdown notebooks in Jupyter and save Jupyter notebooks as R Markdown.
Once you have installed jupytext create a Jupyter config with
jupyter notebook --generate-config
which creates the config file at
~/.jupyter/jupyter_notebook_config.py
Add the following line to the Jupyter config
c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"
If you now launch a Jupyter notebook server and open a .Rmd file the R Markdown should be rendered in the interactive environment of Jupyter!
To get R Markdown working in Binder simply create a requirements.txt file in the binder directory and add jupytext to it. Binder should take care of the rest!
- Here's a minimal example using the Example_Rmd.Rmd file from this repository:
- Jupyter And R Markdown: Notebooks With R, by Karlijn Willems
- Rocker's R configurations for Docker repo
- Noam Ross, "Docker for the UseR", New York Open Statistical Programming Meetup (nyhackr) (July 11th, 2018)
- rOpenSci
- The Jupyter and Binder team for making amazing open source software
- Marc Wouts for creating jupytext
- Achintya Rao for insightful feedback, thoughtful discussion, and excellent ideas