From ed4d04e1eb39c7f57adf4dd83429c2cae9f88d8a Mon Sep 17 00:00:00 2001
From: andkret
Date: Wed, 11 Dec 2024 11:06:29 +0100
Subject: [PATCH] Added 81 questions

Added questions to ask for platform & pipeline design
---
 README.md                     |   3 +
 sections/03-AdvancedSkills.md | 129 ++++++++++++++++++++++++++++++++++
 sections/10-Updates.md        |   3 +
 3 files changed, 135 insertions(+)

diff --git a/README.md b/README.md
index fcd244a..7946ebd 100644
--- a/README.md
+++ b/README.md
@@ -138,6 +138,9 @@ Find the change log with all recent updates here: [SEE UPDATES](sections/10-Upda
 - [Scaling Up](sections/03-AdvancedSkills.md#scaling-up)
 - [Scaling Out](sections/03-AdvancedSkills.md#scaling-out)
 - [When not to Do Big Data](sections/03-AdvancedSkills.md#please-dont-go-big-data)
+- [Platform & Pipeline Design Basics](sections/03-AdvancedSkills.md#platform-and-pipeline-design-basics)
+  - [Data Source Questions](sections/03-AdvancedSkills.md#data-source-questions)
+  - [Goals and Destination Questions](sections/03-AdvancedSkills.md#goals-and-destination-questions)
 - [Connect](sections/03-AdvancedSkills.md#connect)
   - [REST APIs](sections/03-AdvancedSkills.md#rest-apis)
   - [API Design](sections/03-AdvancedSkills.md#api-design)
diff --git a/sections/03-AdvancedSkills.md b/sections/03-AdvancedSkills.md
index 55efc45..7ffab96 100644
--- a/sections/03-AdvancedSkills.md
+++ b/sections/03-AdvancedSkills.md
@@ -14,6 +14,9 @@ Advanced Data Engineering Skills
 - [Scaling Up](03-AdvancedSkills.md#scaling-up)
 - [Scaling Out](03-AdvancedSkills.md#scaling-out)
 - [When not to Do Big Data](03-AdvancedSkills.md#please-dont-go-big-data)
+- [Platform & Pipeline Design Basics](03-AdvancedSkills.md#platform-and-pipeline-design-basics)
+  - [Data Source Questions](03-AdvancedSkills.md#data-source-questions)
+  - [Goals and Destination Questions](03-AdvancedSkills.md#goals-and-destination-questions)
 - [Connect](03-AdvancedSkills.md#connect)
   - [REST APIs](03-AdvancedSkills.md#rest-apis)
   - [API Design](03-AdvancedSkills.md#api-design)
@@ -336,6 +339,132 @@ If you don't need it it's making absolutely no sense at all!
 On the other side: If you really need big data tools they will save your ass :)
 
 
+## Platform and Pipeline Design Basics
+Many people ask: "How do you select the platform and tools, and how do you design the pipelines?"
+The options seem infinite. Technology, however, should never dictate the decisions.
+
+Here are 81 questions that you should answer when starting a project:
+
+
+### Data Source Questions
+(Comprehensive Questions for Data Engineers)
+
+#### Data Origin and Structure
+- **What is the source?** Understand the "device" that produces the data.
+- **What is the format of the incoming data?** (e.g., JSON, CSV, Avro, Parquet)
+- **What’s the schema?**
+- **Is the data structured, semi-structured, or unstructured?**
+- **What is the data type?** Understand the content of the data.
+- **Is the schema well-defined, or is it dynamic?**
+- **How are changes in the data structure from the source (schema evolution) handled?**
+
+#### Data Volume & Velocity
+- **How much data is transmitted per transmission?**
+- **How fast is the data coming in?** (e.g., messages per minute)
+- **What is the maximum data volume expected per source per day?**
+- **What scaling of sources/data is expected?**
+- **Are there peaks for incoming data?**
+- **How much data is posted per day across all sources?**
+- **How does the data volume fluctuate?** (e.g., seasonal peaks, hourly/daily variations)
+- **How will the system handle bursts of data?** (e.g., throttling or buffering)
+
+#### Source Reliability & Redundancy
+- **Is there data arriving late?**
+- **Is there a risk of duplicate data from the source?** How will we handle de-duplication?
+- **How reliable are the sources?** What’s the expected failure rate?
+- **How do we handle data corruption or loss during transmission?**
+- **What happens if a source goes offline?** Is there a fallback or failover source?
+- **Do we need to retry failed transmissions or have fault-tolerance mechanisms in place?**
+
+#### Data Extraction & New Sources
+- **Do we need to extract the data from the sources?**
+- **How many sources are there?**
+- **Will new sources be implemented?**
+
+#### Data Source Connectivity & Authentication
+- **How is the data arriving?** (API, bucket, etc.)
+- **How is the authentication done?**
+- **What kind of connection is required for the data source?** (e.g., streaming, batch, API)
+- **What protocols are used for data ingestion?** (e.g., REST, WebSocket, FTP)
+- **Are there any rate limits or quotas imposed by the data source?**
+- **How do we handle credentials?** Is there an API?
+- **What is the retry strategy if data fails to be processed or transmitted?**
+
+#### Data Security & Compliance
+- **Does the data need to be encrypted at the source before being transmitted?**
+- **Are there any compliance frameworks (e.g., GDPR, HIPAA) that the source data must adhere to?**
+- **Is there a requirement for data masking or obfuscation at the source?**
+
+#### Metadata & Audit
+- **Is there metadata for the client transmission stored somewhere?**
+- **What metadata should be captured for each transmission?** (e.g., record counts, latency)
+- **How do we track and log data ingestion events for audit purposes?**
+- **Is there a need for tracking data lineage?** (i.e., source origin and changes over time)
+
+---
+
+### Goals and Destination Questions
+(Comprehensive Questions for Data Engineers)
+
+#### Use Case & Data Consumption
+- **What kind of use case is this?** (Analytics, BI, ML, Transactional processing, Visualization, User Interfaces, APIs)
+- **What are the typical use cases that require this data?** (e.g., predictive analytics, operational dashboards)
+- **What are the downstream systems or platforms that will consume this data?**
+- **How critical is real-time data versus historical data in this use case?**
+
+#### Data Query & Delivery
+- **How is the data visualized?** (raw data, aggregated data)
+- **How much raw data is processed at once?**
+- **How much data is cold data, and how often is cold data queried?**
+- **How fast do the results need to appear?**
+- **How much data is going to be queried at once?**
+- **How fresh does the data need to be?**
+- **How often is the data queried?** (frequency)
+- **What are the SLAs for delivering data to downstream systems or applications?**
+
+#### Aggregation & Modeling
+- **How is the data aggregated?** (by devices, topic, time)
+- **When does the aggregation happen?** (on query, on schedule, while streaming)
+- **What kind of data models are needed for this use case?** (e.g., star schema, snowflake schema)
+- **Is there a need for pre-aggregations to speed up queries?**
+- **Should partitioning or indexing strategies be implemented to optimize query performance?**
+
+#### Performance & Availability
+- **What is the processing time requirement?**
+- **What is the availability of analytics output?** (input vs output delay)
+- **How fresh does the data need to be?**
+- **What are the performance expectations for query speed?**
+- **What is the acceptable query response time for end-users?**
+- **How will the system handle an increase in concurrent queries from multiple users?**
+- **What is the expected lag between data ingestion and availability for querying?**
+- **Do we need horizontal scaling for query engines or databases?**
+
+#### Data Lifecycle & Retention
+- **What’s the data retention time?**
+- **How often is data archived or moved to lower-cost storage?**
+- **Will old data need to be transformed or reprocessed for new use cases?**
+- **What are the data retention policies?** (e.g., hot vs cold storage; see the sketch after this list)
+- **How will the use case evolve as the data grows?** Will this affect how data is consumed or visualized?
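+
+To make questions like the data retention and hot vs cold storage ones above concrete, here is a minimal, hypothetical sketch (Python, with made-up thresholds) of how the answers could be captured as a simple tiering rule:
+```python
+from datetime import date, timedelta
+
+# Hypothetical thresholds: assumptions for illustration, not recommendations
+HOT_DAYS = 30         # younger data stays in fast ("hot") storage
+RETENTION_DAYS = 365  # older data gets archived or deleted
+
+def storage_tier(partition_date: date, today: date) -> str:
+    """Classify a daily partition as hot, cold, or expired."""
+    age_days = (today - partition_date).days
+    if age_days <= HOT_DAYS:
+        return "hot"
+    if age_days <= RETENTION_DAYS:
+        return "cold"
+    return "expired"
+
+# Example: classify the last 400 daily partitions
+today = date(2024, 12, 11)
+counts = {}
+for d in range(400):
+    tier = storage_tier(today - timedelta(days=d), today)
+    counts[tier] = counts.get(tier, 0) + 1
+print(counts)  # {'hot': 31, 'cold': 335, 'expired': 34}
+```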
+
+#### Monitoring & Debugging
+- **How will data delivery to the destination be monitored?** (e.g., time-to-load, query failures)
+- **How will we monitor data pipeline health at the destination?** (e.g., throughput, latency)
+- **What tools or methods will be used for debugging data delivery failures or performance bottlenecks?**
+- **What metrics should be tracked to ensure data pipeline health?** (e.g., latency, throughput)
+- **How do we handle issues such as data corruption or incomplete data at the destination?**
+
+#### Data Access & Permissions
+- **Who is working with the platform, and who has access to query or visualize the data?**
+- **Which tools are used to query the data?**
+- **What kind of data export capabilities are required?** (e.g., CSV, API, direct database access)
+- **Is role-based access control (RBAC) needed to segment data views for different users?**
+- **How will access to sensitive data be managed?** (e.g., row-level security, encryption)
+
+#### Scaling & Future Requirements
+- **What are the scalability requirements for the data platform as data volume grows?**
+- **How will future business goals or scalability needs affect the design of data aggregation and retention strategies?**
+- **How will the system handle an increasing load as more users query data or as data volume grows?**
+
 
 ## Connect
 
diff --git a/sections/10-Updates.md b/sections/10-Updates.md
index ec1eba6..94e40ac 100644
--- a/sections/10-Updates.md
+++ b/sections/10-Updates.md
@@ -2,6 +2,9 @@ Updates
 ============
 What's new? Here you can find a list of all the updates with links to the sections
 
+- **2024-12-11**
+  - Prepared the most important questions for platform & pipeline design, specifically looking at the data source and the goals [click here](03-AdvancedSkills.md#platform-and-pipeline-design-basics)
+
 - **2024-11-28**
   - Prepared a GenAI RAG example project that you can run on your own computer without internet. It uses Ollama with Mistral model and Elasticsearch. Working on a way of creating embeddings from pdf files and inserting them into Elasticsearch for queries [click here](04-HandsOnCourse.md#genai-retrieval-augmented-generation-with-ollama-and-elasticsearch)