Docker Setup for Data Processing and Jupyter Notebooks

This repository contains the Docker setup for running Luigi pipelines, PySpark, machine learning/deep learning code, and Jupyter notebooks, along with a PostgreSQL database.

Services

  • dataenv: The main data processing environment. It starts with a long-running process so the container stays up and you can attach to it as an interactive shell at any time.
  • jupyter: Jupyter notebook server.
  • db: PostgreSQL database.
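
The exact wiring of these services lives in docker-compose.yml. A rough sketch of what it might look like (service names, ports, and the postgres_data volume follow the conventions used elsewhere in this README; the real file may differ):

    services:
      dataenv:
        build: ./dataenv               # assumed build context
        container_name: dataenv
        env_file: .env
      jupyter:
        build: ./jupyter               # assumed build context
        container_name: jupyter
        env_file: .env
        ports:
          - "8888:8888"                # Jupyter is reached at http://localhost:8888
      db:
        build: ./db                    # assumed build context
        container_name: db
        env_file: .env                 # supplies POSTGRES_USER/PASSWORD/DB
        volumes:
          - postgres_data:/var/lib/postgresql/data   # keeps data across restarts and rebuilds

    volumes:
      postgres_data: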

Usage

  1. Set up environment variables:

    Create a .env file in the root of your project with the following content:

    POSTGRES_USER=your_postgres_user
    POSTGRES_PASSWORD=your_postgres_password
    POSTGRES_DB=your_database_name
    DD_API_KEY=your_datadog_api_key
    DD_SITE=datadoghq.com
  2. Create GitHub Secrets:

    Ensure the following secrets are created in your GitHub repository:

    • DOCKER_USERNAME
    • DOCKER_PASSWORD
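
    The workflow uses these secrets to log in to Docker Hub before pushing images. A minimal sketch of such a step, assuming the widely used docker/login-action (the actual workflow in this repository may differ):

    steps:
      - name: Log in to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}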
  3. Set up the self-hosted runner (optional):

    Follow the GitHub documentation to set up a self-hosted runner.

    If you prefer to use GitHub-hosted runners, make the following changes to your workflow YAML:

    jobs:
      setup:
        runs-on: ubuntu-latest
        ...
      build_dataenv:
        runs-on: ubuntu-latest
        ...
      build_jupyter:
        runs-on: ubuntu-latest
        ...
      build_db:
        runs-on: ubuntu-latest
        ...
  4. Build and run the containers:

    docker-compose up --build
  5. Access the dataenv container:

    docker exec -it dataenv /bin/bash
  6. Access the Jupyter Notebook:

    Navigate to http://localhost:8888 in your web browser. Use the token provided in the Jupyter logs.
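
    If you do not have the token handy, it can be read from the container logs (assuming the service is named jupyter, as in the compose file):

    docker-compose logs jupyter | grep token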

  7. Set up the cron job to update containers hourly:

    a. Create the configuration file config/project_path.conf with the following content:

    PROJECT_PATH=/your/path/here
    IMAGES_LIST=dockerhub_username/dataenv:latest,dockerhub_username/jupyter:latest,dockerhub_username/db:latest

    b. Run the setup script to install the cron job:

    ./scripts/setup_cron.sh

    c. For macOS users, you might need to manually restart the cron service:

    sudo launchctl unload /System/Library/LaunchDaemons/com.apple.periodic-daily.plist
    sudo launchctl load /System/Library/LaunchDaemons/com.apple.periodic-daily.plist

    The cron job will run every hour to update the containers.
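
    For reference, the installed crontab entry looks roughly like the line below; the exact command is written by scripts/setup_cron.sh, and the update script name here is only inferred from the log file it writes, so check the script for the real entry:

    0 * * * * cd /your/path/here && ./scripts/update_containers.sh >> logs/cron.log 2>&1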

Database Setup

  • The database is initialized with the schema defined in db/init.sql.
  • The data is persisted using the postgres_data volume, ensuring it remains intact across container restarts and rebuilds.
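
To confirm the schema was applied (and that data survives restarts), open a psql shell inside the database container, assuming it is named db, using the credentials from your .env file:

    docker exec -it db psql -U your_postgres_user -d your_database_name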

Monitoring

  • Logs for the update script can be found in logs/update_containers.log.
  • Logs for the cron job execution can be found in logs/cron.log.
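
To watch the hourly updates as they happen, you can follow both files:

    tail -f logs/update_containers.log logs/cron.log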

Notes

  • The dataenv container starts with a long-running process, so it remains running and you can open a shell in it at any time with docker exec.
  • Ensure you have sufficient resources on your computer to run all the containers.
  • Ensure the .env file contains your Datadog API key and site information for monitoring.
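
A common way to implement such a long-running process is a no-op command in the compose file; this repository may use a different command, but the effect is the same: the container stays up so you can docker exec into it.

    services:
      dataenv:
        command: tail -f /dev/null   # does nothing forever, keeping the container alive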