In this project, data is extracted from an AWS S3 bucket, processed with Spark to create fact and dimension tables, and the final output is loaded back into S3. The whole process runs in a Spark session (a minimal session-setup sketch follows the requirements below).
- Install Python 3
- Install the pyspark package (pyspark.sql ships with it; os is part of the Python standard library)
pip install pyspark
- Jupyter Notebook
- PyCharm
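A minimal sketch of how the Spark session for this ETL might be created; the app name and the hadoop-aws package coordinates are assumptions and should match the Hadoop build bundled with your Spark installation:

```python
from pyspark.sql import SparkSession

def create_spark_session():
    # hadoop-aws lets Spark read s3a:// paths; the package version and
    # the app name below are assumptions, adjust them to your setup.
    spark = (
        SparkSession.builder
        .appName("sparkify-etl")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
        .getOrCreate()
    )
    return spark

spark = create_spark_session()
```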
Read data from S3
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
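The two datasets above could be read like this (a sketch; the nested wildcard patterns are assumptions about how the JSON files are laid out under each prefix, and s3a:// is the scheme used by the hadoop-aws connector):

```python
# Read the raw JSON song and log datasets from S3.
song_data = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_data = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
```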
Transform the data using Spark
- Create five tables: the songplays fact table and four dimension tables (a transform sketch follows this list)
songplays - song play records from the log data Fields - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
users - users Fields - user_id, first_name, last_name, gender, level
songs - songs in the database Fields - song_id, title, artist_id, year, duration
artists - artists in the database Fields - artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units Fields - start_time, hour, day, week, month, year, weekday
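As an illustration of the transform step, here is a sketch of how the songs and artists tables could be built from the song data and written back to S3; the raw artist_* column names and the output bucket are assumptions:

```python
from pyspark.sql.functions import col

# songs dimension: columns as listed above, one row per song_id
songs_table = song_data.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])

# artists dimension: rename the raw artist_* columns (source names assumed)
artists_table = song_data.select(
    col("artist_id"),
    col("artist_name").alias("name"),
    col("artist_location").alias("location"),
    col("artist_latitude").alias("latitude"),
    col("artist_longitude").alias("longitude"),
).dropDuplicates(["artist_id"])

# Load step: write the tables back to S3 as parquet
# (s3a://your-output-bucket is a placeholder).
songs_table.write.mode("overwrite").partitionBy("year", "artist_id") \
    .parquet("s3a://your-output-bucket/songs/")
artists_table.write.mode("overwrite") \
    .parquet("s3a://your-output-bucket/artists/")
```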
- Populate the dwh.cfg config file with your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (see the example below)
- Set up ETL.py and run it
[KEY]
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
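A sketch of how ETL.py might read these credentials and export them as environment variables so the S3 connector can pick them up; the section name [KEY] follows the example above:

```python
import os
import configparser

# Load AWS credentials from dwh.cfg and expose them to the
# S3 filesystem connector via environment variables.
config = configparser.ConfigParser()
config.read("dwh.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["KEY"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["KEY"]["AWS_SECRET_ACCESS_KEY"]
```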