canthonissen/datalake-in-a-box

Background

This is a small demo project that sets up an Iceberg datalake and lets you run dbt-spark models on it. It is based on the work of Muhammed Irshad, with a few minor tweaks. The original hive-metastore/maria-db configuration was failing, so I used PostgreSQL and a different, relatively lightweight hive-metastore image (naushadh/hive-metastore). Docker and docker-compose were replaced with podman and podman-compose.
With a few tweaks you can quickly:

  • Reconfigure the stack to use Delta Lake instead of Iceberg.
  • Replace Minio with an AWS back-end if you want to use cloud instead of local storage.

My intention is to make this as light as possible so I can have my own little data-lake-in-a-box.

Instructions

Only podman and podman-compose need to be installed locally.
Everything else runs in containers (even python and dbt).
Podman can run containers rootless; if you run it that way, check which network backend it uses. I needed to change the network backend from CNI to netavark to get it to work rootless.
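To see which network backend your Podman installation is using, something like the following should work. It is a minimal sketch that assumes a Podman version recent enough to report `Host.NetworkBackend` in `podman info`; the `network_backend` function name is only illustrative and is not part of this project.

```shell
# Print the network backend Podman is using ("netavark" or "cni"),
# or "unknown" when podman is missing or cannot report the backend.
network_backend() {
    if command -v podman >/dev/null 2>&1; then
        podman info --format '{{.Host.NetworkBackend}}' 2>/dev/null || echo unknown
    else
        echo unknown
    fi
}

echo "network backend: $(network_backend)"
```

If this prints `cni`, the backend is normally switched by setting `network_backend = "netavark"` under `[network]` in `containers.conf`. The Podman documentation also asks for a `podman system reset` after changing the backend, which removes existing containers and images, so check that documentation before trying it.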
This should get you started:

  1. Change directory to the docker folder.
  2. Run the command podman-compose up -d.
  3. Change directory to the dbt_iceberg_sample folder.
  4. Run the commands:
    ./dbt-pod seed
    ./dbt-pod run
    or
    ./dbt-pod build
    to populate the data lake.
  5. Run the following command to start a beeline console, from where you can query the datalake in SQL:
    podman exec -it spark-thrift /usr/spark/bin/beeline -u "jdbc:hive2://spark-thrift:10000/test_schema;auth=noSasl" -n root
    Enter show tables; to see the tables.
    You can continue to execute SQL commands to interrogate the data.
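Steps 1 to 4 above can be wrapped in a small script. This is only a sketch: it assumes it is run from the repository root, that the `docker` and `dbt_iceberg_sample` folders sit directly below it, and that `podman-compose` is the compose command (as the Background section describes). The `run` and `start_datalake` helpers and the `DRY_RUN` variable are illustrative names, not part of the project. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to actually run the commands from the repository root.
DRY_RUN="${DRY_RUN:-1}"

# run DIR CMD...: execute CMD inside DIR, or print it in dry-run mode.
run() {
    dir="$1"; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "(cd $dir && $*)"
    else
        (cd "$dir" && "$@")
    fi
}

start_datalake() {
    run docker podman-compose up -d          # steps 1-2: start the containers
    run dbt_iceberg_sample ./dbt-pod build   # steps 3-4: populate the data lake
}

start_datalake
```

Running it with `DRY_RUN=0` executes the same two commands for real. The beeline console from step 5 is interactive, so it is left out of the script.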

Technology stack

Last tested with the following technologies:

Podman will pull the latest images when you build, so breaking changes may creep in as upstream images change.
You can update SparkThriftDockerfile and docker-compose.yml to lock the image versions to revert to a working configuration like the one above.
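Before locking versions, it helps to record which images a working setup actually pulled. The sketch below lists the local images together with their digests, assuming podman is installed; the `list_image_digests` name is illustrative only.

```shell
# List local images with their digests so a known-good set can be recorded.
# Prints a short notice instead when podman is not available.
list_image_digests() {
    if command -v podman >/dev/null 2>&1; then
        podman images --digests
    else
        echo "podman not found"
    fi
}

list_image_digests
```

A tag or digest from this listing can then be written into the image references in SparkThriftDockerfile and docker-compose.yml, for example as `name:tag` or `name@sha256:...`.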

Acknowledgements

Thanks to Muhammed Irshad for leading the way:
