This is a small demo project that sets up an Iceberg datalake and allows you to run dbt-spark models on it.
It was based on the work of Muhammed Irshad, with a few minor tweaks.
The original Hive Metastore/MariaDB configuration was failing, so I used PostgreSQL and a different, relatively lightweight Hive Metastore image (naushadh/hive-metastore); a sketch of the swapped-in services follows this paragraph.
Docker and docker-compose were replaced with podman and podman-compose.
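For illustration, the PostgreSQL + naushadh/hive-metastore pair can be expressed as two compose services like the sketch below. The DATABASE_* variables follow that image's documented settings (double-check against its README), and all service names and credentials are examples rather than this repo's actual docker-compose.yml:

```yaml
# Sketch only: a PostgreSQL-backed Hive Standalone Metastore.
services:
  postgres:
    image: docker.io/library/postgres:16.3
    environment:
      POSTGRES_DB: metastore
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive

  hive-metastore:
    image: docker.io/naushadh/hive-metastore
    environment:
      DATABASE_HOST: postgres
      DATABASE_DB: metastore
      DATABASE_USER: hive
      DATABASE_PASSWORD: hive
    ports:
      - "9083:9083"   # standard Thrift metastore port
    depends_on:
      - postgres
```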
With a few tweaks you can quickly:
- Reconfigure to a Delta Lake stack instead of Iceberg.
- Replace Minio with an AWS back end if you want to use cloud instead of local storage (both swaps are sketched below).
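For a sense of what those tweaks involve, both swaps are mostly Spark configuration. A hedged sketch using the standard Iceberg, Delta, and S3A property names, not copied from this repo's actual config files:

```properties
# Iceberg (the current stack): session extension + Hive-backed catalog
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive

# Delta Lake instead: swap the extension and the catalog implementation
# spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
# spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

# Minio (local) S3A endpoint; remove these two lines to point at real AWS S3
spark.hadoop.fs.s3a.endpoint=http://minio:9000
spark.hadoop.fs.s3a.path.style.access=true
```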
My intention is to make this as light as possible so I can have my own little data-lake-in-a-box.
Only podman and podman-compose need to be installed locally.
Everything else runs in containers (even python and dbt).
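Since dbt runs in a container and connects to the Spark Thrift server, a dbt-spark profile for a stack like this would use the thrift method. A minimal sketch (the profile name and host are illustrative, not necessarily the repo's actual profiles.yml; the schema matches the test_schema queried below):

```yaml
# Hypothetical profiles.yml entry for dbt-spark over Thrift
dbt_iceberg_sample:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: spark-thrift   # service name on the compose network; use localhost from the host
      port: 10000
      schema: test_schema
```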
Podman can run rootless containers, and you may want to verify which mode you are using. I needed to change the network backend from CNI to netavark to get it to work rootless; see the snippet below.
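If you hit the same issue, the backend is selected in containers.conf; this is a minimal sketch using the standard [network] table from the containers.conf format:

```ini
# ~/.config/containers/containers.conf (or /etc/containers/containers.conf)
[network]
network_backend = "netavark"
```

You can check the active backend with `podman info --format '{{.Host.NetworkBackend}}'`; note that switching backends may also require a `podman system reset`.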
This should get you started:
- Change directory to the docker folder.
- Run:
```
podman-compose up -d
```
- Change directory to the dbt_iceberg_sample folder.
- Run the following commands to populate the data lake:
```
./dbt-pod seed
./dbt-pod run
```
or:
```
./dbt-pod build
```
- Run the following command to start a beeline console, from where you can query the data lake in SQL:
```
podman exec -it spark-thrift /usr/spark/bin/beeline -u "jdbc:hive2://spark-thrift:10000/test_schema;auth=noSasl" -n root
```
- Enter `show tables;` to see the tables.
You can continue to execute SQL commands to interrogate the data.
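For example (the table name below is a placeholder; substitute one reported by `show tables;`):

```sql
-- List the tables dbt created in test_schema
show tables;

-- Inspect a table; replace my_model with a real table name
describe my_model;
select * from my_model limit 10;
```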
Last tested with the following technologies:
- dbt - 1.8.2
- dbt-spark - 1.8.0
- Apache Spark and Hadoop - Spark 3.5.1 & Hadoop 3.3
- Apache Iceberg - 1.5.2
- Hive Standalone Metastore - 3.0.0
- PostgreSQL - 16.3
- Minio - RELEASE.2024-06-29T01-20-47Z
- Hadoop AWS - 3.3.3
- AWS Java SDK - 1.12.754
- podman - 4.9.3
- podman-compose - 1.0.6
Podman will pull the latest images when you build, so breaking changes may creep in over time.
You can update SparkThriftDockerfile and docker-compose.yml to pin the image versions and get back to a known-good configuration like the one above; an example is sketched below.
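For example, pinning could look like the following (service names are illustrative; the tags match the versions listed above):

```yaml
# docker-compose.yml: pinned tags instead of :latest
services:
  minio:
    image: docker.io/minio/minio:RELEASE.2024-06-29T01-20-47Z
  postgres:
    image: docker.io/library/postgres:16.3
```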
Thanks to Muhammed Irshad for leading the way:
- Medium article: Integrating dbt-spark with Apache Iceberg
- Github repo: irshadgit/dbt-spark-iceberg