# Data Science Pet Containers
M. Edward (Ed) Borasky <[email protected]>, `r Sys.Date()`

## Overview
Data Science Pet Containers is a collection of open-source software for all phases of the data science workflow, from ingestion of raw data through visualization, exploration, analysis and reporting. We provide the following tools:

* PostgreSQL / PostGIS / pgRouting: an industrial strength relational database management system with geographic information systems (GIS) extensions,
* Anaconda Python tools, including a Jupyter notebook server, and
* R language tools, including RStudio® Server.

As the name implies, the software is distributed via Docker. The user simply clones a Git repository and uses the command `docker-compose up` to bring up the services.

Why do it this way?

* Provide a standardized common working environment for data scientists and DevOps engineers at Hack Oregon. We want to build using the same tools we'll use for deployment as much as possible.
* Deliver advanced open source technologies to Windows and MacOS desktops and laptops. While there are "native" installers for most of these tools, some are only readily available, and only heavily tested, on Linux.
* Isolation: for the most part, software running in containers is contained. It interacts with the desktop / laptop user through well-defined mechanisms, often as a web server.

## Quick start
1. Clone this repository and `cd data-science-pet-containers/containers`.
2. Copy `sample.env` to `.env`. Edit `.env` and change the `POSTGRES_PASSWORD`. You don't need to change the other values.
3. Copy any PostgreSQL database backups you want restored to `data-science-pet-containers/containers/Backups`. Copy any raw data files you want on the image to `data-science-pet-containers/containers/Raw`.
4. `docker-compose -f postgis.yml up -d --build`. The first time you run this, it will take some time. Once the image is built and the databases restored, it will be faster.

When it's done you'll see

```
Successfully tagged postgis:latest
Creating containers_postgis_1 ... done
```
5. Type `docker logs -f containers_postgis_1` to verify that the restores worked and the service is listening.
```
PostgreSQL init process complete; ready for start up.
LOG: database system was shut down at 2018-03-18 05:19:22 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
```
Type `Ctrl-C` to stop following the container log.
6. Connect to the container from the host: user name is `postgres`, host is `localhost`, port is the value of `HOST_POSTGRES_PORT`, usually 5439, and password is the value of `POSTGRES_PASSWORD`. You can connect with any client that uses the PostgreSQL protocol, including pgAdmin and QGIS; see the `psql` example below.
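If you have the `psql` command-line client on the host, a quick connection test (assuming you kept the default `HOST_POSTGRES_PORT` of 5439) looks like this; `psql` will prompt you for the value of `POSTGRES_PASSWORD`:

```
psql -h localhost -p 5439 -U postgres
```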
To stop the service, type `docker-compose -f postgis.yml stop`. To start it back up again, `docker-compose -f postgis.yml start`.
The container and its filesystem will persist across host reboots. To destroy them, type `docker-compose -f postgis.yml down`.
## Setting up
1. Clone this repository and `cd data-science-pet-containers/containers`.
2. Copy `sample.env` to `.env`.
* Edit `.env`. The variables you need to define are
* `HOST_POSTGRES_PORT`: If you have PostgreSQL installed on your host, it's probably listening on port 5432. The `postgis` service listens on port 5432 inside the Docker network, so you'll need to map its port 5432 to another port. Set `HOST_POSTGRES_PORT` to the value you want; 5439 is what I use.
* `POSTGRES_PASSWORD`: To connect to the `postgis` service, you need a user name and a password. The user name is the default, the database superuser `postgres`. Docker will set the password for the `postgres` user in the `postgis` service to the value of `POSTGRES_PASSWORD`.
    * `DB_USERS_TO_CREATE`: When the `postgis` service first comes up, the users in this list are created in the database. If you're working on the 2018 Hack Oregon projects, there's no reason to change this.
Here's `sample.env`:
```
# postgis container
HOST_POSTGRES_PORT=5439
POSTGRES_PASSWORD=some.string.you.can.remember.that.nobody.else.can.guess
DB_USERS_TO_CREATE=disaster-resilience housing-affordability local-elections transportation-systems urban-development
```
## Starting the services
1. Choose your version:
* `postgis.yml`: PostGIS only. If you're doing all the analysis on the host and just want the PostGIS service, choose this. If you're an experienced Linux command-line user, this image has a comprehensive collection of extract-transform-load (ETL) and GIS tools.
* `jupyter.yml`: PostGIS and Jupyter. Choose this if you want to run a Jupyter notebook server inside the Docker network.
* `rstats.yml`: PostGIS and RStudio Server. Choose this if you want an RStudio Server inside the Docker network.
* `amazon.yml`: PostGIS and an Amazon Linux 2 server running PostgreSQL. This is a specialized configuration for testing database backups for AWS server readiness. Most users won't need to use this.
2. Type `docker-compose -f <version> up -d --build`. Docker will build/rebuild the images and start the services.
Note that if you want to bring up ***all*** the services in one shot, just type `docker-compose up -d --build`. This takes quite a bit of time - from 45 minutes to an hour the first time, depending on download bandwidth and disk I/O speed.
## The PostGIS service
The `postgis` service is based on the official PostgreSQL image from the Docker Store: <https://store.docker.com/images/postgres>. It is running

* PostgreSQL 9.6,
* PostGIS 2.4,
* pgRouting 2.5, and
* all of the foreign data wrappers that are available in a Debian `jessie` PostgreSQL server.
All the images except `amazon` acquire PostgreSQL and its accomplices from the official PostgreSQL Global Development Group (PGDG) Debian repositories: <https://www.postgresql.org/download/linux/debian/>.
### Using the command line
I've tried to provide a comprehensive command line experience. `Git`, `curl`, `wget`, `lynx`, `nano` and `vim` are there, as is most of the command-line GIS stack (`gdal`, `proj`, `spatialite`, `rasterlite`, `geotiff`, `osm2pgsql` and `osm2pgrouting`), and of course `psql`.
I've also included `python3-csvkit` for Excel, CSV and other text files, `unixodbc` for ODBC connections and `mdbtools` for Microsoft Access files. If you want to extend this image further, it is based on Debian `jessie`.
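As a sketch of what command-line ETL with these tools can look like, here's how you might load a shapefile with GDAL's `ogr2ogr` and an Excel file with `csvkit`; the file and database names are hypothetical:

```
# load a shapefile into PostGIS with GDAL (hypothetical names throughout)
ogr2ogr -f PostgreSQL PG:"dbname=mydb user=dbsuper" parcels.shp

# convert an Excel sheet to CSV, then create and populate a table from it
in2csv survey.xlsx > survey.csv
csvsql --db postgresql:///mydb --insert survey.csv
```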
You can log in as the Linux superuser `root` with `docker exec -it -u root containers_postgis_1 /bin/bash`.
I've added a database superuser called `dbsuper`. This should be your preferred login, rather than using the system database superuser `postgres`. Log in with `docker exec -it -u dbsuper -w /home/dbsuper containers_postgis_1 /bin/bash`.
### Virtualenvwrapper
You can use the Python `virtualenvwrapper` utility. See <https://virtualenvwrapper.readthedocs.io/en/latest/#> for the documentation.
To activate, enter `source /usr/share/virtualenvwrapper/virtualenvwrapper.sh`.
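A minimal session, with a hypothetical environment name:

```
source /usr/share/virtualenvwrapper/virtualenvwrapper.sh

# create and activate an environment called "scratch"
mkvirtualenv scratch

# leave it, and return to it later
deactivate
workon scratch
```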
### Setting up `git`
1. Log in with `docker exec` as `dbsuper` as described above.
2. `cd /home/dbsuper`.
3. Edit `configure-git.bash`. You'll need to supply your email address and name.
4. Enter `./configure-git.bash`.
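I haven't reproduced the contents of `configure-git.bash` here, but the usual by-hand equivalent is the standard pair of Git commands (the name and email below are placeholders):

```
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
```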
In either case, once you've authenticated, `git` will cache your credentials.
Cloning this repository:
1. Log in with `docker exec` as `dbsuper` as described above.
2. `cd /home/dbsuper`.
3. Enter `./clone-me.bash`.
You will find the repository in `$HOME/Projects/data-science-pet-containers`.
### Using automatic database restores

When the `postgis` service first starts, it initializes the database. After that first run, the initialization steps, including the automatic restores described here, are skipped.
To use this feature:
1. For each database you want restored, create a backup file. For documentation / repeatability, do this with `pg_dump` on the command line or in a script.
```
pg_dump -Fp -v -C --if-exists -d <database> \
| gzip -c > <database>.sql.gz
```
`<database>` is the name of the database.
At restore time, a new database will be created (`-C`). This is done by dropping an existing one; the `--if-exists` keeps this drop from failing if the database doesn't exist.
The owner of the database in the source of the backup ***must*** exist in the destination server or the restore will not work!
2. Copy the database backup files to `data-science-pet-containers/containers/Backups`. Note that `.gitignore` is set to keep backup files out of version control.
3. Type `docker-compose -f postgis.yml build`.
Docker will copy the backup files into `/home/dbsuper/Backups` on the `postgis` image, and place a script `restore-all.sh` in `/docker-entrypoint-initdb.d/`. The first time the image runs, `restore-all.sh` will restore all the `.sql.gz` files it finds in `/home/dbsuper/Backups`.
`restore-all.sh` creates a new database with the same name as the file. For example, `passenger_census.sql.gz` will be restored to a freshly-created database called `passenger_census`. Ownership information in the backups will be ignored; the new databases will have the owner `postgres`.
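If you ever need to repeat a restore by hand, here's a minimal sketch using the `passenger_census` backup as an example; run it from a shell inside the container. It assumes the dump was made with the `pg_dump` flags shown above, so the SQL itself drops, re-creates and reconnects to the target database:

```
# connect to the maintenance database; the dump's -C/--if-exists SQL
# handles dropping and re-creating passenger_census
gunzip -c /home/dbsuper/Backups/passenger_census.sql.gz | psql -U postgres -d postgres
```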
### The `Raw` directory
If you want to load raw data onto the `postgis` image, copy the files to the `data-science-pet-containers/containers/Raw` directory. The next time the image is built they will be copied to `/home/dbsuper/Raw`. Like the backups, these files are not version-controlled.
## Jupyter
This service is based on the Anaconda, Inc. (formerly Continuum) `miniconda3` image: <https://hub.docker.com/r/continuumio/miniconda3/>. I've added a non-root user `jupyter` to avoid the security issues associated with running Jupyter notebooks as "root".
The `jupyter` user has a Conda environment, also called `jupyter`. The environment has

* geopandas,
* jupyter,
* matplotlib,
* pandas,
* psycopg2,
* requests,
* seaborn,
* statsmodels,
* cookiecutter, and
* osmnx.
By default the Jupyter notebook server starts when Docker brings up the service. Type `docker logs containers_jupyter_1`. You'll see something like this:
```
$ docker logs containers_jupyter_1
[I 08:00:22.931 NotebookApp] Writing notebook server cookie secret to /home/jupyter/.local/share/jupyter/runtime/notebook_cookie_secret
[I 08:00:23.238 NotebookApp] Serving notebooks from local directory: /home/jupyter
[I 08:00:23.238 NotebookApp] 0 active kernels
```

To install packages:

1. Log in with `docker exec -it -u jupyter containers_jupyter_1 /bin/bash`.
2. `cd /home/jupyter`.
3. Enter `source activate jupyter`.
4. Use `conda search` to find packages in the Conda ecosystem, then install them with `conda install`. You can also install packages with `pip` if they're not in the Conda repositories.
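For example, a hypothetical session (after step 3 above) that adds the `folium` mapping package from the `conda-forge` channel:

```
conda search -c conda-forge folium
conda install -c conda-forge folium
```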
### Connecting to the `postgis` service
To connect to the `postgis` service, use the user name and maintenance database name `postgres`. The host is `postgis`, the port is 5432 and the password is the value of `POSTGRES_PASSWORD`.
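Here's a minimal connectivity check, run from a shell inside the `jupyter` container; it assumes `psycopg2` imports from the `jupyter` environment, and `<POSTGRES_PASSWORD>` is a placeholder for your actual password:

```
source activate jupyter
python -c "import psycopg2; print(psycopg2.connect(host='postgis', port=5432, dbname='postgres', user='postgres', password='<POSTGRES_PASSWORD>').server_version)"
```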
### Creating a Cookiecutter data science project
Reference: <https://drivendata.github.io/cookiecutter-data-science/>
The script will install `cookiecutter` in the `jupyter` environment if necessary.
Follow the instructions to set up the project.
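If you'd rather skip the script and run Cookiecutter yourself, the invocation documented by the project referenced above is:

```
source activate jupyter
cookiecutter https://github.com/drivendata/cookiecutter-data-science
```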
## Rstats
This service is based on the `rocker/rstudio` image from Docker Hub: <https://hub.docker.com/r/rocker/rstudio/>. I've added header files so that the R packages `RPostgres`, `odbc`, `sf` and `devtools` will install from source, but there are no R packages on the image besides those that ship with `rocker/rstudio`.
Browse to `localhost:8787`. The user name and password are both `rstudio`. ***Note that if you're using Firefox, you may have to adjust a setting to use the terminal feature.***
* Go to `Tools -> Global Options -> Terminal`.
* For Firefox, uncheck the `Connect with WebSockets` option.
However, if you find an R package that won't install because of missing header or library files, you can usually fix that from a `root` console.
Most packages that have missing dependencies will list the name of the Debian packages you need to install. If that's the case, open a `root` console with `docker exec -it -u root containers_rstudio_1 /bin/bash`. Then type `apt install <package-name>`. After the Debian package is installed, you should be able to install the R package.
### Connecting to the `postgis` service
To connect to the `postgis` service, use the user name and maintenance database name `postgres`. The host is `postgis`, the port is 5432 and the password is the value of `POSTGRES_PASSWORD`.
## Amazon
This image is based on the Amazon Linux 2 "2-with-sources" Docker image at <https://hub.docker.com/_/amazonlinux/>. The main reason it's in this collection is to provide a means of restore-testing backup files before handing them off to the DevOps engineers for deployment on AWS.
1. Read the section on automatic restores and backup file preparation above ([Using automatic database restores]).
2. Copy the backup files into `data-science-pet-containers/containers/Backups`.
3. `docker-compose -f amazon.yml up -d --build`. The backup files will be copied to `/home/dbsuper/Backups` on both the `postgis` and `amazon` images.
4. When the services are up, type `docker logs -f containers_postgis_1`. The backup files should be automatically restored. If there are errors, you'll need to fix your backup files. When the restores are done, type `Ctrl-C` to stop following the log.
5. Log in to the `amazon` container - `docker exec -it -u dbsuper -w /home/dbsuper containers_amazon_1 /bin/bash`.
6. `cd Backups; ls`. You'll see the backup files. For example:
```
$ cd Backups; ls
odot_crash_data.sql.gz passenger_census.sql.gz restore-all.sh
```
Those are the same backup files you just successfully restored in the `postgis` image.
7. Type `./restore-all.sh`. This is the same script that did the automatic restores on `postgis`, and it should have the same result. If there are no errors in the automatic restore on `postgis` or in the restore you just did in `amazon`, the backup files are good.
To bring it up, type `docker-compose -f amazon.yml up -d --build`.
## About the name
This all started with an infamous "cattle, not pets" blog post. For some history, see <http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/>. In the Red Hat / Kubernetes / OpenShift universe, it's common for people to have a workstation that's essentially a Docker / Kubernetes host with all the actual work being done in containers. See <https://rhelblog.redhat.com/2016/06/08/in-defense-of-the-pet-container-part-1-prelude-the-only-constant-is-complexity/> and <https://www.projectatomic.io/blog/2018/02/fedora-atomic-workstation/>.