This repository will be the central location for the hands-on programming component of the course.
The goal of the course is to build an end-to-end data pipeline that processes Amazon reviews.
You will construct the pipeline week by week:
- Week 1 - Environment Setup - configure your environment to begin the programming coursework
- Week 2 - Spark SQL - write a Python Spark application to analyze local Amazon review data
- Week 3 - Write to Amazon S3 - extend the application to connect to Amazon S3 and write data to it
- Week 4 - Kafka + Bronze layer - read from Kafka instead of the local file, and use Spark Structured Streaming to write the output to Amazon S3, creating the Bronze layer
- Week 5 - Silver layer - transform and enrich data from the Bronze layer, creating the Silver layer
- Week 6 - Gold layer - define a schema for the Silver layer, stream the data from the Silver layer, transform it, and establish the Gold layer
- TODO: Week 7 BI
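
To make the Bronze, Silver, and Gold layers from Weeks 4 to 6 more concrete, here is a minimal plain-Python sketch of how the three layers relate. It is only an illustration, not the course solution: the real pipeline uses Spark, Kafka, and Amazon S3, while this sketch uses in-memory lists so it runs anywhere. The field names (`review_id`, `product_id`, `star_rating`), the sample messages, the function names, and the per-product average computed in the Gold step are all assumptions made for the example.

```python
import json
from collections import defaultdict

# Hypothetical raw messages, standing in for records read from Kafka.
RAW_MESSAGES = [
    '{"review_id": "r1", "product_id": "p1", "star_rating": "5"}',
    '{"review_id": "r2", "product_id": "p1", "star_rating": "3"}',
    '{"review_id": "r3", "product_id": "p2", "star_rating": "4"}',
    'not valid json',
]


def to_bronze(messages):
    """Bronze: store every raw message unchanged, with no parsing or filtering."""
    return [{"value": message} for message in messages]


def to_silver(bronze_rows):
    """Silver: parse each raw value, cast types, and drop malformed records."""
    silver_rows = []
    for row in bronze_rows:
        try:
            record = json.loads(row["value"])
            silver_rows.append({
                "review_id": record["review_id"],
                "product_id": record["product_id"],
                "star_rating": int(record["star_rating"]),
            })
        except (ValueError, KeyError, TypeError):
            # Skip records that are not valid JSON or lack the expected fields.
            continue
    return silver_rows


def to_gold(silver_rows):
    """Gold: aggregate into a reporting-friendly shape (assumed: stats per product)."""
    totals = defaultdict(lambda: [0, 0])  # product_id -> [rating_sum, review_count]
    for row in silver_rows:
        totals[row["product_id"]][0] += row["star_rating"]
        totals[row["product_id"]][1] += 1
    return {
        product_id: {"review_count": count, "avg_star_rating": rating_sum / count}
        for product_id, (rating_sum, count) in totals.items()
    }


bronze = to_bronze(RAW_MESSAGES)
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Note that the Bronze step keeps the malformed message and only the Silver step drops it. Keeping the raw data untouched in Bronze means later layers can be rebuilt if the parsing or enrichment logic changes.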