Performing batch ETL of ATM transaction data using Apache Sqoop and Apache PySpark, loading the transformed data into Amazon S3, and warehousing it with Amazon Redshift to analyze ATM withdrawal behaviour and optimally manage the refill frequency.
We have the ATM transaction data hosted on an RDS instance, in the table SRC_ATM_TRANS within the testdatabase database.
- RDS Connection String: <your_rds_connection_string>
- Username: username
- Password: password
- Database: testdatabase
- Table Name: SRC_ATM_TRANS
We have to ingest the data from RDS into HDFS on our EMR cluster and perform transformations using Spark. To do this, we need to create an EMR cluster with the following applications installed:
- Hadoop
- Sqoop
- Spark
- Hue
- Jupyter Notebook
- Livy
- Zeppelin
Note that I am using Spark 2.4 for this project.
To set up Sqoop to connect to RDS from the EMR cluster, run the following steps as the root user:
wget https://de-mysql-connector.s3.amazonaws.com/mysql-connector-java-8.0.25.tar.gz
tar -xvf mysql-connector-java-8.0.25.tar.gz
cd mysql-connector-java-8.0.25/
sudo cp mysql-connector-java-8.0.25.jar /usr/lib/sqoop/lib/
Run the following Sqoop command (as the hadoop user) to import the data into HDFS. Experiment with the number of mappers (-m, together with --split-by on a suitable key column) for optimization:
sqoop import \
--connect your_rds_connection_string/testdatabase \
--table SRC_ATM_TRANS \
--username username --password password \
--target-dir /user/livy/data \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
-m 1
- In the above run with a single mapper, the Sqoop job executes faster, but the resulting file is very large: 506 MB. Note that Sqoop creates a directory with the same name as the table under /user/hadoop if no --target-dir is specified.
- If the --target-dir lies under /user/hadoop/, Sqoop creates a new directory named after the table within it.
- Otherwise, it writes directly to the specified --target-dir.
To check whether the data was imported correctly, run the following command:
hadoop fs -ls /user/livy/data
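Listing the files only confirms that the import finished; as an extra sanity check (purely a sketch, assuming Spark is available on the cluster), you can count the imported rows and compare the number against SELECT COUNT(*) FROM SRC_ATM_TRANS run on the RDS source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("import-check").getOrCreate()

# Read the pipe-delimited files Sqoop wrote to HDFS and count the rows.
# Compare this count with SELECT COUNT(*) FROM SRC_ATM_TRANS on RDS.
imported = spark.read.option("delimiter", "|").csv("/user/livy/data")
print("Imported rows:", imported.count())
```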
- Run the SparkETLCode.ipynb notebook in Jupyter to create the fact and dimension tables and store them in separate folders in S3 (a rough PySpark sketch of this pattern follows below).
- Ensure that the EMR cluster has an IAM role that allows it to access the S3 objects.
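The actual transformations live in the notebook; the following is only a minimal sketch of the pattern, read the pipe-delimited Sqoop output from HDFS, derive one dimension and the fact table, and write each to its own S3 folder. The column names, schema, and bucket name here are illustrative placeholders, not the project's real SRC_ATM_TRANS schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.appName("ATM-ETL").getOrCreate()

# Read the pipe-delimited files that Sqoop wrote to HDFS.
# The column names (and their count) are placeholders and must be
# adjusted to match the actual columns of SRC_ATM_TRANS.
src = (spark.read
       .option("delimiter", "|")
       .option("inferSchema", "true")
       .csv("/user/livy/data")
       .toDF("atm_id", "atm_manufacturer", "atm_location",
             "year", "month", "day", "hour", "transaction_amount"))

# Example dimension: one row per distinct ATM, with a surrogate key.
dim_atm = (src.select("atm_id", "atm_manufacturer", "atm_location")
              .dropDuplicates()
              .withColumn("atm_key", monotonically_increasing_id()))

# Example fact: join back to the dimension to replace the natural key
# columns with the surrogate key.
fact_atm_trans = (src.join(dim_atm, ["atm_id", "atm_manufacturer", "atm_location"])
                     .select("atm_key", "year", "month", "day", "hour",
                             "transaction_amount"))

# Write each table to its own folder in S3 (bucket name is a placeholder).
dim_atm.write.mode("overwrite").csv("s3://<your-bucket>/dim_atm/", sep="|")
fact_atm_trans.write.mode("overwrite").csv("s3://<your-bucket>/fact_atm_trans/", sep="|")
```

On EMR the s3:// scheme is handled by EMRFS, so it is the cluster's IAM role that grants Spark write access to the bucket, which is why the role mentioned above is required.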
- Create a Redshift cluster with two dc2.large nodes.
- Create the schema and tables, then load data into these tables from the S3 bucket (a sketch of the load step follows after this list). Ensure that the IAM role you associate with the Redshift cluster has the permissions needed to read from S3.
- Follow the commands in model_creation.sql (the S3 object URIs and the region can be noted from each object's Properties tab).
- Now analyze the data using the queries from analysis.sql. Feel free to explore the data further to derive more insights.
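The DDL and COPY statements themselves are in model_creation.sql and are typically run from the Redshift query editor or a SQL client. Purely as an illustration of the load step, here is a minimal Python sketch using psycopg2; the endpoint, table names, S3 paths, IAM role ARN, and region are all placeholders. It loads the dimension (parent) table before the fact (child) table, which matches the ordering rule described below.

```python
import psycopg2

# All connection details, table names, S3 paths, the IAM role ARN,
# and the region below are placeholders.
conn = psycopg2.connect(
    host="<your-redshift-endpoint>",
    port=5439,
    dbname="testdatabase",
    user="username",
    password="password",
)
conn.autocommit = True
cur = conn.cursor()

# Redshift COPY template: load pipe-delimited files from S3 using the
# IAM role attached to the cluster.
copy_template = """
    COPY {table}
    FROM '{s3_path}'
    IAM_ROLE '<your-redshift-iam-role-arn>'
    DELIMITER '|'
    REGION 'us-east-1';
"""

# Load the parent (dimension) table first, then the child (fact) table,
# so the fact table's foreign-key references are already present.
for table, s3_path in [
    ("dim_atm", "s3://<your-bucket>/dim_atm/"),
    ("fact_atm_trans", "s3://<your-bucket>/fact_atm_trans/"),
]:
    cur.execute(copy_template.format(table=table, s3_path=s3_path))

cur.close()
conn.close()
```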
- Parent table: the table whose attribute is referenced by another table's foreign key.
- Child table: the table that contains the foreign key referencing the parent table.
Since the tables have parent-child relationships between them, it is important to remember that:
- Data should be loaded into the parent table first, then into the child table (in the loading sketch above, the dimension is copied before the fact table for this reason).
- When deleting, the child table should be dropped first, then the parent table.
To delete a table with dependencies, you can also use:
drop table <table-name> cascade;