
Databricks: Live Streaming with Kafka

image

To stream data from Alpha Vantage and store it in Confluent Kafka, we fetch data from Alpha Vantage continuously and send it to Kafka in real time. We use one PySpark notebook to simulate this. Then we create Databricks notebooks to consume data from the Kafka topic and load it into Delta tables.

Prerequisites

  1. Alpha Vantage API key
  2. Confluent Kafka cluster
  3. PySpark environment in Databricks

The API used in this project is https://www.alphavantage.co/query?function=TIME_SERIES_WEEKLY_ADJUSTED&symbol=IBM&apikey=demo

Based on the symbol we provide, we get the most recent 100 intraday OHLCV bars by default when the outputsize parameter is not set.
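For reference, a minimal Python sketch of calling this endpoint with the requests library. The ALPHA_VANTAGE_API_KEY environment variable name is an assumption; it falls back to the demo key from the URL above.

```python
# Minimal sketch: fetch the weekly adjusted series for one symbol.
# ALPHA_VANTAGE_API_KEY is a hypothetical env var name; "demo" is the
# public demo key, which only works for IBM.
import os
import requests

API_KEY = os.environ.get("ALPHA_VANTAGE_API_KEY", "demo")

def fetch_weekly_adjusted(symbol: str) -> dict:
    """Return the TIME_SERIES_WEEKLY_ADJUSTED payload as a dict."""
    resp = requests.get(
        "https://www.alphavantage.co/query",
        params={
            "function": "TIME_SERIES_WEEKLY_ADJUSTED",
            "symbol": symbol,
            "apikey": API_KEY,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

print(list(fetch_weekly_adjusted("IBM").keys()))
```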

image

To consume this API, we have to generate an API key by visiting https://www.alphavantage.co/support/#api-key . It is free of cost; the free plan allows only 5 API requests per minute.

We are using Unity Catalog to store data in a Medallion structure, so we set up ADLS storage to hold the metastore, catalog, schemas, external tables, checkpoints, and external volumes. The structure is shown in the image below.

The catalog and corresponding schemas are created in Databricks like this. All data points to the corresponding external storage locations defined in Azure ADLS, shown below.
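For reference, a minimal sketch of the catalog and schema DDL. The catalog name kafka, the bronze/silver/gold schemas, and the abfss:// URLs are assumptions; point them at the external locations defined for your ADLS account.

```python
# Sketch: create the catalog and Medallion schemas in Unity Catalog.
# <storage-account> is a placeholder for your ADLS account name.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS kafka
    MANAGED LOCATION 'abfss://catalog@<storage-account>.dfs.core.windows.net/kafka'
""")
for layer in ("bronze", "silver", "gold"):
    spark.sql(f"""
        CREATE SCHEMA IF NOT EXISTS kafka.{layer}
        MANAGED LOCATION 'abfss://{layer}@<storage-account>.dfs.core.windows.net/{layer}'
    """)
```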

image

External locations are defined in Databricks

image

Created a Kafka cluster in Confluent

image

image

image

We will use a topic to store messages coming from the API

image

The topic is created. At present there are no messages in it

Set up a new client

image

We need to create an API key next

image image

Please make sure to note down the details below; they are needed in the Databricks notebook to connect to Kafka

image

Instead of keeping a few secret details in Key Vault, I used environment variables on the cluster to hold the API key, Kafka server, and Kafka topic name. This is not advised in production.

I have created a Key Vault in Azure to store all the required values: API_ID, Kafka server, Kafka username, Kafka password, and topic name.

Note: make sure to grant the Azure Databricks service principal the "Key Vault Secrets User" role. This allows Databricks to read Key Vault records, as sketched below.
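With the role assigned and a Key Vault-backed secret scope created, reading the secrets from a notebook looks roughly like this. The scope name keyvault-scope and the secret names are assumptions; substitute the ones you created.

```python
# Sketch: read Kafka connection details from a Key Vault-backed secret
# scope. dbutils is available in Databricks notebooks by default.
kafka_server   = dbutils.secrets.get(scope="keyvault-scope", key="kafka-server")
kafka_username = dbutils.secrets.get(scope="keyvault-scope", key="kafka-username")
kafka_password = dbutils.secrets.get(scope="keyvault-scope", key="kafka-password")
kafka_topic    = dbutils.secrets.get(scope="keyvault-scope", key="kafka-topic")
api_key        = dbutils.secrets.get(scope="keyvault-scope", key="api-id")
```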

image

image

We will navigate to Databricks and start creating notebooks

To load data into the Kafka topic, I created a notebook that continuously streams simulated data from the API into the topic

The topic is ready. The appropriate configuration to connect to Kafka is included in the notebook; a sketch follows.
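A minimal sketch of the producer loop, assuming the confluent-kafka package is installed on the cluster and the kafka_* variables from the secret lookup above. The 60-second sleep is one way to stay under the free plan's 5-requests-per-minute limit.

```python
# Sketch: continuously fetch from Alpha Vantage and publish to Kafka.
# kafka_server / kafka_username / kafka_password / kafka_topic come from
# the secret-scope sketch above; the symbol is just an example.
import os
import time

import requests
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": kafka_server,
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": kafka_username,
    "sasl.password": kafka_password,
})

symbol = "IBM"
while True:
    resp = requests.get(
        "https://www.alphavantage.co/query",
        params={
            "function": "TIME_SERIES_WEEKLY_ADJUSTED",
            "symbol": symbol,
            "apikey": os.environ.get("ALPHA_VANTAGE_API_KEY", "demo"),
        },
        timeout=30,
    )
    resp.raise_for_status()
    producer.produce(kafka_topic, key=symbol, value=resp.text)  # raw JSON
    producer.flush()
    time.sleep(60)  # respect the free plan's rate limit
```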

image

image

Now you can see data loading into the topic from the API

image

Now check the topic

image

I changed the stock symbol to Google and loaded data. You can see more data arriving in the topic; the producer is still running.

image

Next is another notebook I created to read data from the Kafka topic and load it into the Bronze layer storage; a sketch follows.
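A rough sketch of the Bronze ingestion. The table name kafka.bronze.stock_raw and the checkpoint path are assumptions; adjust them to your catalog layout.

```python
# Sketch: stream raw Kafka records into a Bronze Delta table.
# The kafkashaded JAAS class name is the one Databricks clusters expect.
bronze_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", kafka_server)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        f'required username="{kafka_username}" password="{kafka_password}";',
    )
    .option("subscribe", kafka_topic)
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

(bronze_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/kafka/bronze/checkpoints/stocks")
    .toTable("kafka.bronze.stock_raw"))
```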

image

image

image image

image

Now we take the Bronze data, clean it up, and load it into the Silver layer, roughly as sketched below.
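A rough sketch of the Silver transform, assuming the Bronze table above and the shape of the Alpha Vantage weekly payload. The field names such as "1. open" come from the API response; verify them against your data.

```python
# Sketch: parse the raw JSON from Bronze, explode the weekly bars, and
# write a cleaned table with renamed, typed columns.
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, StructField, StructType

payload_schema = StructType([
    StructField("Weekly Adjusted Time Series",
                MapType(StringType(), MapType(StringType(), StringType()))),
])
bars = F.from_json("value", payload_schema)["Weekly Adjusted Time Series"]

silver_stream = (
    spark.readStream.table("kafka.bronze.stock_raw")
    .select(F.col("key").alias("symbol"), F.explode(bars).alias("week", "bar"))
    .select(
        "symbol",
        F.to_date("week").alias("week_ending"),
        F.col("bar")["1. open"].cast("double").alias("open"),
        F.col("bar")["2. high"].cast("double").alias("high"),
        F.col("bar")["3. low"].cast("double").alias("low"),
        F.col("bar")["4. close"].cast("double").alias("close"),
        F.col("bar")["6. volume"].cast("long").alias("volume"),
    )
)

(silver_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/kafka/silver/checkpoints/stocks")
    .toTable("kafka.silver.stock_weekly"))
```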

image

image

I just noticed the Kafka topic is still receiving data from the API

image

The Silver layer table is loading with the data below. It has been cleaned and the column names have been changed

image

Now we will focus on the Gold layer. Here we capture aggregated data based on the Silver table, as sketched below.
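A minimal sketch of one possible Gold aggregation: summary statistics per symbol. The metric choices and table names are assumptions, and it is written as a batch job over the Silver Delta table, since unbounded streaming aggregations would need a different output mode.

```python
# Sketch: aggregate the Silver table into a per-symbol Gold summary.
from pyspark.sql import functions as F

gold_df = (
    spark.read.table("kafka.silver.stock_weekly")
    .groupBy("symbol")
    .agg(
        F.avg("close").alias("avg_close"),
        F.max("high").alias("max_high"),
        F.min("low").alias("min_low"),
        F.sum("volume").alias("total_volume"),
    )
)

gold_df.write.format("delta").mode("overwrite").saveAsTable("kafka.gold.stock_summary")
```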

image

Meanwhile, keep reviewing the new tables created under the Kafka catalog

image

And you can see the table data being populated in the ADLS storage created in Azure too

image

image

image

Streaming jobs are still running in the Bronze layer

image

After the pending data loads, I will interrupt reading from Kafka and stop the streaming in the Bronze layer, as shown below.
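One way to stop the ingestion gracefully is to enumerate the active Structured Streaming queries on the cluster and stop them:

```python
# Sketch: stop every active Structured Streaming query on this cluster.
for query in spark.streams.active:
    print(f"Stopping stream: {query.name}")
    query.stop()
```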

The final step is to show the aggregated data in Power BI

Select Partner Connect from the page and choose Power BI

image

image

A report file will be downloaded to your system

image

Open this file in Power BI Desktop

You can see your tables

image

image

image

By dragging and dropping, you can create appropriate visualizations in Power BI and publish them to many destinations

image

Next, we will create a workflow (job) to execute these notebooks at a scheduled interval

image
