- Microsoft Certified: Azure Data Engineer Associate: https://learn.microsoft.com/en-us/credentials/certifications/azure-data-engineer/
- Plan to manage Azure costs: https://learn.microsoft.com/en-us/azure/cost-management-billing/understand/plan-manage-costs
- Data engineers work with multiple types of data to perform a variety of data operations using a range of tools and scripting languages.
- Types of data: Structured, Unstructured, Semi-structured.
- Data operations: Integration, Transformation, Consolidation
- Languages: SQL, Python, Others
- Important data engineering concepts:
- Operational and analytical data: operational data is transactional data used by applications; analytical data is optimized for analysis and reporting.
- Streaming data: Perpetual, real-time data feeds.
- Data pipeline: Orchestrated activities to transfer and transform data. Used to implement ETL or ELT operations.
- Data Lake: unstructured/analytical data stored as files in distributed storage for massive scalability.
- Data Warehouse: analytical data in an RDBMS, typically modeled as a star schema to optimize summary analysis.
- Apache Spark: open-source engine for distributed data processing.
- Data engineering in Azure
- Operational data > data ingestion/ETL > analytical data storage and processing > data modeling and visualization.
- Data Lake Storage Gen2
- Distributed cloud storage for data lakes
- Azure Data Lake Storage stores data in an HDFS-compatible file system that Hadoop, Spark, and other engines can use.
- Flexible security through folder- and file-level permissions.
- Built on Azure Storage: high performance and scalability, data redundancy through built-in replication.
- Enabling the hierarchical namespace on an Azure Storage account upgrades it to Azure Data Lake Storage Gen2.
- Azure Synapse Analytics is a cloud platform for data analytics
- Large-scale data warehousing
- Advanced analytics
- Data exploration and discovery
- Real-time analytics
- Data integration
- Integrated analytics
- Explore Azure Synapse Analytics:
- Pipelines move data from one store to another.
- A dedicated SQL pool hosts a relational data warehouse.
- Notebooks let you analyze data with Python code combined with text descriptions.
- One of the key tasks you can perform with Azure Synapse Analytics is to define pipelines that transfer (and if necessary, transform) data from a wide range of sources into your workspace for analysis.
- Concept:
- Relational metadata layer over files in a data lake; data is stored in files (CSV, JSON, Parquet).
- Query it using SQL (T-SQL in serverless SQL pools, or the Spark SQL API)
- How to create: database designer, table storage settings, field names
- How to use: using a serverless SQL pool, using an Apache Spark pool
- Modify and save dataframes
- Typical operations on a dataframe (filter rows, modify column values, derive new columns, or drop columns)
- Save the transformed data (Parquet format)
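A minimal PySpark sketch of these operations; the file path and column names (`Quantity`, `UnitPrice`, `Notes`) are hypothetical:

```python
from pyspark.sql.functions import col

# Load source data into a dataframe (path and columns are placeholders)
df = spark.read.load('/data/orders.csv', format='csv', header=True, inferSchema=True)

transformed_df = (df
    .filter(col('Quantity') > 0)                              # filter rows
    .withColumn('UnitPrice', col('UnitPrice').cast('float'))  # modify column values
    .withColumn('Total', col('Quantity') * col('UnitPrice'))  # derive a new column
    .drop('Notes'))                                           # drop a column

# Save the transformed data in Parquet format
transformed_df.write.mode('overwrite').parquet('/data/transformed_orders')
```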
- Partition data files
- Partitioning is an optimization technique that enables Spark to maximize performance across the worker nodes.
- Partition the output files (partitionBy method)
- Filter Parquet files by partition column in a query
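A sketch of partitioned writes and partition-pruned reads; the paths and the `OrderDate` column are assumptions:

```python
from pyspark.sql.functions import year, col

df = spark.read.parquet('/data/transformed_orders')  # data saved in the previous step
dated_df = df.withColumn('Year', year(col('OrderDate')))

# Writes one folder per year (Year=2021, Year=2022, ...)
dated_df.write.partitionBy('Year').mode('overwrite').parquet('/data/partitioned_orders')

# Filtering on the partition column lets Spark prune all other folders
orders_2023 = (spark.read.parquet('/data/partitioned_orders')
    .filter(col('Year') == 2023))
```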
- Transform data with SQL
- Define tables and views
- Use SQL queries to transform the data.
- Query the metastore, drop tables.
- External tables are loosely bound to the underlying files, so deleting the table does not delete the files.
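A sketch of defining an external table and transforming it with Spark SQL in a notebook; the table, column names, and paths are illustrative:

```python
# Save a dataframe as an external table: metadata in the metastore,
# files at the given (hypothetical) path
df = spark.read.parquet('/data/transformed_orders')
df.write.saveAsTable('sales_orders', format='parquet', mode='overwrite',
                     path='/data/sales_orders')

# Use SQL to transform the data and persist the result as a new table (CTAS)
spark.sql("""
    CREATE TABLE yearly_sales USING PARQUET AS
    SELECT YEAR(OrderDate) AS CalendarYear,
           SUM(Quantity * UnitPrice) AS GrossRevenue
    FROM sales_orders
    GROUP BY YEAR(OrderDate)
""")

# Query the metastore, then drop the external table; its files remain in the lake
spark.sql('SELECT * FROM yearly_sales ORDER BY CalendarYear').show()
spark.sql('DROP TABLE sales_orders')
```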
- Delta Lake
- Open-source storage layer that adds relational database semantics to Spark-based data lake processing
- Benefits: CRUD; supports ACID transactions (Atomicity, Consistency, Isolation, Durability)
- Data versioning and time travel
- Support for batch and streaming data
- Standard formats and interoperability.
- Create Delta Lake Tables
- Creating Delta Lake table from a dataframe
- Making conditional updates (update, delete, and merge operations)
- Querying a previous version of data.
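A sketch of these Delta Lake operations using the `delta.tables` API (available in Synapse Spark pools); the path and columns are placeholders:

```python
from delta.tables import DeltaTable

# Create a Delta Lake table from a dataframe
delta_path = '/delta/products'
df.write.format('delta').mode('overwrite').save(delta_path)

# Conditional update through the DeltaTable API
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.update(
    condition="Category == 'Accessories'",
    set={'Price': 'Price * 0.9'})  # apply a 10% discount to matching rows

# Time travel: query an earlier version of the table
original_df = (spark.read.format('delta')
    .option('versionAsOf', 0)
    .load(delta_path))
```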
- Create catalog tables
- External vs managed tables
- A managed table is defined without a specified location; dropping the table deletes the files.
- An external table is defined for a custom file location; dropping the table does not delete the files.
- Creating catalog tables
- From a dataframe (`df.write.format("delta").saveAsTable("mytable")`)
- Using SQL (`CREATE TABLE MyExternalTable USING DELTA LOCATION '/delta/mytable'`)
- Using the DeltaTableBuilder API
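Sketches of the three creation methods; table names and paths are illustrative:

```python
from delta.tables import DeltaTable

# 1. From a dataframe: managed (no path) vs external (explicit path)
df.write.format('delta').saveAsTable('ManagedSalesOrders')
df.write.format('delta').option('path', '/delta/external_orders') \
    .saveAsTable('ExternalSalesOrders')

# 2. Using SQL
spark.sql("CREATE TABLE MyExternalTable USING DELTA LOCATION '/delta/mytable'")

# 3. Using the DeltaTableBuilder API
DeltaTable.create(spark) \
    .tableName('default.ManagedProducts') \
    .addColumn('ProductID', 'INT') \
    .addColumn('ProductName', 'STRING') \
    .addColumn('Price', 'FLOAT') \
    .execute()
```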
- Use Delta Lake with Streaming Data
- Spark Structured streaming
- Streaming with Delta Lake tables
- As a streaming source
- As a streaming sink
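A minimal sketch of a Delta Lake table as a streaming sink (and, in the final comment, as a source); the JSON folder and schema are assumptions:

```python
from pyspark.sql.types import StructType, StringType

# Stream new JSON files from a (hypothetical) folder of device events
schema = StructType().add('device', StringType()).add('status', StringType())
stream_df = (spark.readStream
    .schema(schema)
    .option('maxFilesPerTrigger', 1)
    .json('/data/device_events'))

# Write the stream to a Delta Lake table (the sink)
query = (stream_df.writeStream
    .format('delta')
    .option('checkpointLocation', '/delta/devices_checkpoint')
    .start('/delta/devices'))

# A Delta table can also be the source:
# spark.readStream.format('delta').load('/delta/devices')
```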
- Use Delta Lake in a SQL pool
- Querying Delta-formatted files with OPENROWSET
- Querying catalog tables.
- Design
- Star Schema: Fact/Dimension tables
- Snowflake schema
- Dedicated SQL Pool - https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/massively-parallel-processing-mpp-architecture
- Create table
- Creating a dedicated SQL pool
- Considerations for creating tables
- Load tables
- COPY INTO and considerations
- Query a data warehouse
- Aggregate measures by dimension attributes.
- Joins across snowflake schema tables.
- Using ranking functions (ROW_NUMBER, RANK, DENSE_RANK, NTILE)
- Retrieving an approximate count.
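In a dedicated SQL pool these are written in T-SQL; the sketch below shows the same ranking pattern as Spark SQL so it stays runnable in a notebook (the `product_sales` table and its columns are hypothetical, and the same function names exist in T-SQL):

```python
# Ranking rows within each category by revenue
ranked = spark.sql("""
    SELECT ProductCategory, ProductName, Revenue,
           ROW_NUMBER() OVER (PARTITION BY ProductCategory ORDER BY Revenue DESC) AS RowNum,
           RANK()       OVER (PARTITION BY ProductCategory ORDER BY Revenue DESC) AS Ranking,
           NTILE(4)     OVER (PARTITION BY ProductCategory ORDER BY Revenue DESC) AS Quartile
    FROM product_sales
""")
ranked.show()

# Approximate distinct count (APPROX_COUNT_DISTINCT also exists in T-SQL)
spark.sql('SELECT APPROX_COUNT_DISTINCT(CustomerKey) FROM product_sales').show()
```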
- Load Staging tables
- Load dimension tables.
- Load time dimension tables
- Load slowly changing dimensions
- Load fact tables
- Perform post-load optimization (rebuild indexes, update statistics)
- What (Activities, Integration runtime, Linked Services, Datasets)
- How to create (Graphical design interface, Defining with JSON)
- Define data flows (Sources, Transformation)
- Run a pipeline (immediately, on a schedule, or triggered by an event; lineage: integrate Synapse and MS Purview)
- HTAP Patterns
- OLTP versus OLAP: OLTP (online transaction processing) helps you manage and process real-time transactions from a single source. OLAP (online analytical processing) uses historical and aggregated data from multiple sources.
- Azure Synapse Link for Cosmos DB: Azure Synapse Link for Azure Cosmos DB is a cloud-native hybrid transactional/analytical processing (HTAP) technology that enables you to run near-real-time analytics over operational data stored in Azure Cosmos DB from Azure Synapse Analytics.
- Cosmos DB container
- Synapse SQL and Apache Spark runtimes
- Azure Synapse Link for SQL: replicates table data to a dedicated SQL pool.
- Azure Synapse Link for Dataverse: replicates the data to Azure Data Lake Storage Gen2.
- https://aka.ms/mslearn-synapse-cosmos (Lab process below)
- Configure Synapse Link in Azure Cosmos DB
- Enable the Synapse Link feature in your Cosmos DB account
- Create an analytical store container
- Configure Synapse Link in Azure Synapse Analytics
- Query Azure Cosmos DB from Azure Synapse Analytics
- Query Azure Cosmos DB from a Spark pool (see the sketch below)
- Query Azure Cosmos DB from a serverless SQL pool
- Verify data modifications in Cosmos DB are reflected in Synapse
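A sketch of the "query from a Spark pool" step; `cosmos_linked_service` and `my-container` are placeholder names for the linked service and analytical store container configured in the lab:

```python
# Read the Cosmos DB analytical store from a Synapse Spark pool
df = (spark.read
    .format('cosmos.olap')
    .option('spark.synapse.linkedService', 'cosmos_linked_service')
    .option('spark.cosmos.container', 'my-container')
    .load())

df.show(10)
```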
- Data Streams
- Examples
- Real-time clickstream data to provide recommendations.
- Telemetry data to remotely monitor manufacturing equipment.
- Credit card transactions to detect fraud.
- Characteristics of Stream Processing Solutions.
- The source data stream is unbounded
- Support real-time automation or visualization.
- Examples
- Event Processing
- Azure Stream Analytics and Stream Analytics clusters
- Inputs (Azure Event Hubs, IoT Hub, Blob Storage, Data Lake Storage Gen2)
- Outputs (data lake, data warehouse, Power BI; generate filtered or summarized events)
- Queries (SELECT ... INTO ... FROM ... TIMESTAMP BY ... WHERE ...)
- Window functions (see the sketch below)
- Tumbling, Hopping, Sliding, Session, Snapshot
- https://aka.ms/mslearn-stream-lab
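Stream Analytics expresses windows in its SQL-like query language (e.g. `GROUP BY TumblingWindow(second, 60)`). As a runnable analog in these notes' Python register, here is the equivalent tumbling-window aggregation in Spark Structured Streaming, not Stream Analytics itself:

```python
from pyspark.sql.functions import window, col

# Demo source: Spark's built-in rate stream (generates timestamp/value rows)
stream_df = spark.readStream.format('rate').option('rowsPerSecond', 10).load()

# Tumbling window: fixed, non-overlapping 60-second buckets
counts = (stream_df
    .withWatermark('timestamp', '1 minute')
    .groupBy(window(col('timestamp'), '60 seconds'))
    .count())

# A hopping window would be window(col('timestamp'), '60 seconds', '30 seconds')
query = counts.writeStream.outputMode('complete').format('console').start()
```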
- Stream Ingestion Scenarios
- Configure inputs and outputs (streaming data inputs; Azure Synapse Analytics outputs; Azure Data Lake Storage Gen2 outputs)
- Define a query to select, filter, and aggregate data (selecting fields, filtering event data)
- Run a job to ingest data (query the data in a relational data warehouse, query the data in a data lake)
- Use a Power BI output in Azure Stream Analytics > output alias, group workspace, dataset/table name, authorize connection.
- Create a query for real-time visualization
- Create a real-time data visualization in Power BI > dashboard
- About
- Microsoft Purview is a unified data-governance service.
- Data Map (automate and manage metadata at scale)
- Data Catalog (data discovery)
- Data Estate Insights (assess data estate health)
- Data Policy (govern access to data)
- Handling
- Loading data into the Data Map (mapping/scanning data)
- Browse and search information (semantic search and browse, glossary & workflows)
- Data Lineage
- Lab: [Integrate Azure Synapse Analytics & MS Purview](https://aka.ms/mslearn-synapse-purview)
- Catalog Synapse data assets in Purview
- Connect Purview to a Synapse workspace
- Search the Purview catalog in Synapse Studio
- Track data lineage in pipelines
- Azure Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.
- A fully managed, cloud-based data analytics platform built on Apache Spark, with a web-based portal.
- Azure Databricks workloads:
- Data Science and Engineering
- Machine Learning
- SQL Warehouse
- Key concepts:
- Apache Spark clusters provide highly scalable compute for distributed data processing.
- Databricks File System (DBFS; mountable storage) provides distributed shared storage for data lakes.
- Notebooks provide an interactive environment for combining code, notes, and images.
- The Hive metastore provides a relational abstraction layer, enabling you to define tables based on data in files.
- Delta Lake builds on the metastore to enable common relational database capabilities.
- SQL Warehouses provide relational compute endpoints for querying data in tables.
- Lab: Explore Azure Databricks
- Get to know Spark
- Create a Spark cluster: cluster mode (Standard, High Concurrency, Single Node)
- Use Spark in notebooks.
- Use Spark to work with data files (exploring data with dataframes; filtering, grouping, and using SQL expressions in Spark); see the sketch below.
- Visualize data (built-in notebook charts, using graphics packages in code)
- Lab: Explore Azure Databricks
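A PySpark sketch of the notebook operations above; the file path and columns are placeholders, and `display()` is the Databricks notebook's built-in rendering function (which also drives the built-in charts):

```python
# Explore a data file with a dataframe (path and columns are hypothetical)
df = spark.read.load('/data/products.csv', format='csv', header=True)

# Filter and group with the dataframe API
counts_df = df.select('ProductID', 'Category').groupBy('Category').count()

# Or register a temporary view and use a SQL expression
df.createOrReplaceTempView('products')
spark.sql('SELECT Category, COUNT(*) AS n FROM products GROUP BY Category').show()

# Render results with Databricks' built-in charting
display(counts_df)
```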