This repository contains the Docker setup for running Luigi pipelines, PySpark, machine learning/deep learning code, and Jupyter notebooks, along with a PostgreSQL database.
- dataenv: The main data processing environment. Starts with a long-running process to act as an always-active shell.
- jupyter: Jupyter notebook server.
- db: PostgreSQL database.
- **Set up environment variables:**
  Create a `.env` file in the root of your project with the following content:

  ```
  POSTGRES_USER=your_postgres_user
  POSTGRES_PASSWORD=your_postgres_password
  POSTGRES_DB=your_database_name
  DD_API_KEY=your_datadog_api_key
  DD_SITE=datadoghq.com
  ```
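Docker Compose reads a root-level `.env` file automatically for variable substitution, and a service can also load it directly with `env_file`. A minimal sketch of how the `db` service might consume these variables (hypothetical fragment; the repo's actual `docker-compose.yml` may differ, including the base image):

  ```yaml
  # Hypothetical docker-compose.yml fragment -- adjust to match the real file.
  services:
    db:
      env_file: .env            # passes POSTGRES_USER/PASSWORD/DB into the container
      volumes:
        - postgres_data:/var/lib/postgresql/data

  volumes:
    postgres_data:
  ```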
- **Create GitHub secrets:**
  Ensure the following secrets are created in your GitHub repository:
  - `DOCKER_USERNAME`
  - `DOCKER_PASSWORD`
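These secrets are typically consumed by the workflow's registry login step. A hedged sketch, assuming the standard `docker/login-action` is used (the repo's actual workflow file was not shown and may differ):

  ```yaml
  # Hypothetical workflow step using the repository secrets above.
  - name: Log in to Docker Hub
    uses: docker/login-action@v3
    with:
      username: ${{ secrets.DOCKER_USERNAME }}
      password: ${{ secrets.DOCKER_PASSWORD }}
  ```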
- **Set up the self-hosted runner (optional):**
  Follow the GitHub documentation to set up a self-hosted runner. If you prefer to use GitHub-hosted runners, change each job's `runs-on` in your workflow YAML:

  ```yaml
  jobs:
    setup:
      runs-on: ubuntu-latest
      # ...
    build_dataenv:
      runs-on: ubuntu-latest
      # ...
    build_jupyter:
      runs-on: ubuntu-latest
      # ...
    build_db:
      runs-on: ubuntu-latest
      # ...
  ```
- **Build and run the containers:**

  ```bash
  docker-compose up --build
  ```
- **Access the dataenv container:**

  ```bash
  docker exec -it dataenv /bin/bash
  ```
- **Access the Jupyter notebook:**
  Navigate to http://localhost:8888 in your web browser. Use the token printed in the Jupyter logs.
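The token usually appears in the server's startup log as part of a login URL, so a common way to find it is `docker logs jupyter 2>&1 | grep token`. A minimal sketch of pulling the token out of such a line (the container name `jupyter` and the exact log format are assumptions):

  ```shell
  # Hypothetical startup log line; in practice it would come from:
  #   docker logs jupyter 2>&1 | grep token
  log_line="http://127.0.0.1:8888/?token=abc123def456"

  # Strip everything up to and including "token=" to isolate the token itself.
  token="${log_line##*token=}"
  echo "$token"   # prints: abc123def456
  ```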
- **Set up the cron job to update containers hourly:**

  a. Create the configuration file `config/project_path.conf` with the following content:

     ```
     PROJECT_PATH=/your/path/here
     IMAGES_LIST=dockerhub_username/dataenv:latest,dockerhub_username/jupyter:latest,dockerhub_username/db:latest
     ```

  b. Run the setup script to install the cron job:

     ```bash
     ./scripts/setup_cron.sh
     ```

  c. For macOS users, you might need to manually restart the cron service:

     ```bash
     sudo launchctl unload /System/Library/LaunchDaemons/com.apple.periodic-daily.plist
     sudo launchctl load /System/Library/LaunchDaemons/com.apple.periodic-daily.plist
     ```

  The cron job will run every hour to update the containers.
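The installed job presumably invokes an update script that pulls each image named in `IMAGES_LIST`. How such a script might consume the comma-separated list is sketched below (hypothetical; the real `scripts/update_containers.sh` was not shown and may differ):

  ```shell
  # Hypothetical sketch of consuming config/project_path.conf.
  IMAGES_LIST="dockerhub_username/dataenv:latest,dockerhub_username/jupyter:latest,dockerhub_username/db:latest"

  # Split the comma-separated list into a bash array.
  IFS=',' read -ra images <<< "$IMAGES_LIST"

  for image in "${images[@]}"; do
      echo "would update: $image"   # a real script would run: docker pull "$image"
  done
  ```

For reference, an hourly crontab entry generally takes the shape `0 * * * * <command>`, which fires at the top of every hour.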
- The database is initialized with the schema defined in `db/init.sql`.
- Data is persisted in the `postgres_data` volume, ensuring it remains intact across container restarts and rebuilds.
- Logs for the update script can be found in `logs/update_containers.log`.
- Logs for the cron job execution can be found in `logs/cron.log`.
- The `dataenv` container starts with a long-running process, allowing you to interact with it as an always-active shell.
- Ensure your machine has sufficient resources to run all the containers.
- Ensure the `.env` file contains your Datadog API key and site information for monitoring.