From ed4d04e1eb39c7f57adf4dd83429c2cae9f88d8a Mon Sep 17 00:00:00 2001
From: andkret
Date: Wed, 11 Dec 2024 11:06:29 +0100
Subject: [PATCH] Added 81 questions

Added questions to ask for platform & pipeline design
---
 README.md                     |   3 +
 sections/03-AdvancedSkills.md | 129 ++++++++++++++++++++++++++++++++++
 sections/10-Updates.md        |   3 +
 3 files changed, 135 insertions(+)

diff --git a/README.md b/README.md
index fcd244a..7946ebd 100644
--- a/README.md
+++ b/README.md
@@ -138,6 +138,9 @@ Find the change log with all recent updates here: [SEE UPDATES](sections/10-Upda
 - [Scaling Up](sections/03-AdvancedSkills.md#scaling-up)
 - [Scaling Out](sections/03-AdvancedSkills.md#scaling-out)
 - [When not to Do Big Data](sections/03-AdvancedSkills.md#please-dont-go-big-data)
+- [Platform & Pipeline Design Basics](sections/03-AdvancedSkills.md#platform-and-pipeline-design-basics)
+  - [Data Source Questions](sections/03-AdvancedSkills.md#data-source-questions)
+  - [Goals and Destination Questions](sections/03-AdvancedSkills.md#goals-and-destination-questions)
 - [Connect](sections/03-AdvancedSkills.md#connect)
   - [REST APIs](sections/03-AdvancedSkills.md#rest-apis)
   - [API Design](sections/03-AdvancedSkills.md#api-design)
diff --git a/sections/03-AdvancedSkills.md b/sections/03-AdvancedSkills.md
index 55efc45..7ffab96 100644
--- a/sections/03-AdvancedSkills.md
+++ b/sections/03-AdvancedSkills.md
@@ -14,6 +14,9 @@ Advanced Data Engineering Skills
 - [Scaling Up](03-AdvancedSkills.md#scaling-up)
 - [Scaling Out](03-AdvancedSkills.md#scaling-out)
 - [When not to Do Big Data](03-AdvancedSkills.md#please-dont-go-big-data)
+- [Platform & Pipeline Design Basics](03-AdvancedSkills.md#platform-and-pipeline-design-basics)
+  - [Data Source Questions](03-AdvancedSkills.md#data-source-questions)
+  - [Goals and Destination Questions](03-AdvancedSkills.md#goals-and-destination-questions)
 - [Connect](03-AdvancedSkills.md#connect)
   - [REST APIs](03-AdvancedSkills.md#rest-apis)
   - [API Design](03-AdvancedSkills.md#api-design)
@@ -336,6 +339,132 @@ If you don't need it it's making absolutely no sense at all!
 On the other side: If you really need big data tools they will save your ass :)
 
 
+## Platform and Pipeline Design Basics
+Many people ask: "How do you select the platform and tools, and how do you design the pipelines?"
+The options seem infinite. Technology, however, should never dictate the decisions.
+
+Here are 81 questions that you should answer when starting a project:
+
+
+### Data Source Questions
+(Comprehensive Questions for Data Engineers)
+
+#### Data Origin and Structure
+- **What is the source?** Understand the "device" that produces the data.
+- **What is the format of the incoming data?** (e.g., JSON, CSV, Avro, Parquet)
+- **What’s the schema?**
+- **Is the data structured, semi-structured, or unstructured?**
+- **What is the data type?** Understand the content of the data.
+- **Is the schema well-defined, or is it dynamic?**
+- **How are changes in the data structure from the source (schema evolution) handled?**
+
+#### Data Volume & Velocity
+- **How much data is transmitted per transmission?**
+- **How fast is the data coming in?** (e.g., messages per minute)
+- **What is the maximum data volume expected per source per day?**
+- **What scaling of sources/data is expected?**
+- **Are there peaks for incoming data?**
+- **How much data is posted per day across all sources?**
+- **How does the data volume fluctuate?** (e.g., seasonal peaks, hourly/daily variations)
+- **How will the system handle bursts of data?** (e.g., throttling or buffering)
+
+#### Source Reliability & Redundancy
+- **Is there data arriving late?**
+- **Is there a risk of duplicate data from the source?** How will we handle de-duplication?
+- **How reliable are the sources?** What’s the expected failure rate?
+- **How do we handle data corruption or loss during transmission?**
+- **What happens if a source goes offline?** Is there a fallback or failover source?
+- **Do we need to retry failed transmissions or have fault-tolerance mechanisms in place?**
+
+#### Data Extraction & New Sources
+- **Do we need to extract the data from the sources?**
+- **How many sources are there?**
+- **Will new sources be implemented?**
+
+#### Data Source Connectivity & Authentication
+- **How is the data arriving?** (API, bucket, etc.)
+- **How is the authentication done?**
+- **What kind of connection is required for the data source?** (e.g., streaming, batch, API)
+- **What protocols are used for data ingestion?** (e.g., REST, WebSocket, FTP)
+- **Are there any rate limits or quotas imposed by the data source?**
+- **How do we handle credentials?** Is there an API?
+- **What is the retry strategy if data fails to be processed or transmitted?**
+
+#### Data Security & Compliance
+- **Does the data need to be encrypted at the source before being transmitted?**
+- **Are there any compliance frameworks (e.g., GDPR, HIPAA) that the source data must adhere to?**
+- **Is there a requirement for data masking or obfuscation at the source?**
+
+#### Metadata & Audit
+- **Is there metadata for the client transmission stored somewhere?**
+- **What metadata should be captured for each transmission?** (e.g., record counts, latency)
+- **How do we track and log data ingestion events for audit purposes?**
+- **Is there a need for tracking data lineage?** (i.e., source origin and changes over time)
+
+---
+
+### Goals and Destination Questions
+(Comprehensive Questions for Data Engineers)
+
+#### Use Case & Data Consumption
+- **What kind of use case is this?** (Analytics, BI, ML, Transactional processing, Visualization, User Interfaces, APIs)
+- **What are the typical use cases that require this data?** (e.g., predictive analytics, operational dashboards)
+- **What are the downstream systems or platforms that will consume this data?**
+- **How critical is real-time data versus historical data in this use case?**
+
+#### Data Query & Delivery
+- **How is the data visualized?** (raw data, aggregated data)
+- **How much raw data is processed at once?**
+- **How much data is cold data, and how often is cold data queried?**
+- **How fast do the results need to appear?**
+- **How much data is going to be queried at once?**
+- **How fresh does the data need to be?**
+- **How often is the data queried?** (frequency)
+- **What are the SLAs for delivering data to downstream systems or applications?**
+
+#### Aggregation & Modeling
+- **How is the data aggregated?** (by devices, topic, time)
+- **When does the aggregation happen?** (on query, on schedule, while streaming)
+- **What kind of data models are needed for this use case?** (e.g., star schema, snowflake schema)
+- **Is there a need for pre-aggregations to speed up queries?**
+- **Should partitioning or indexing strategies be implemented to optimize query performance?**
+
+#### Performance & Availability
+- **What is the processing time requirement?**
+- **What is the availability of analytics output?** (input vs output delay)
+- **How fresh does the data need to be?**
+- **What are the performance expectations for query speed?**
+- **What is the acceptable query response time for end-users?**
+- **How will the system handle an increase in concurrent queries from multiple users?**
+- **What is the expected lag between data ingestion and availability for querying?**
+- **Do we need horizontal scaling for query engines or databases?**
+
+#### Data Lifecycle & Retention
+- **What’s the data retention time?**
+- **How often is data archived or moved to lower-cost storage?**
+- **Will old data need to be transformed or reprocessed for new use cases?**
+- **What are the data retention policies?** (e.g., hot vs cold storage; see the sketch after this list)
+- **How will the use case evolve as the data grows?** Will this affect how data is consumed or visualized?
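+
+To make questions like the data retention and hot vs cold storage ones above concrete, here is a minimal, hypothetical sketch (Python, with made-up thresholds) of how the answers could be captured as a simple tiering rule:
+```python
+from datetime import date, timedelta
+
+# Hypothetical thresholds: assumptions for illustration, not recommendations
+HOT_DAYS = 30         # younger data stays in fast ("hot") storage
+RETENTION_DAYS = 365  # older data gets archived or deleted
+
+def storage_tier(partition_date: date, today: date) -> str:
+    """Classify a daily partition as hot, cold, or expired."""
+    age_days = (today - partition_date).days
+    if age_days <= HOT_DAYS:
+        return "hot"
+    if age_days <= RETENTION_DAYS:
+        return "cold"
+    return "expired"
+
+# Example: classify the last 400 daily partitions
+today = date(2024, 12, 11)
+counts = {}
+for d in range(400):
+    tier = storage_tier(today - timedelta(days=d), today)
+    counts[tier] = counts.get(tier, 0) + 1
+print(counts)  # {'hot': 31, 'cold': 335, 'expired': 34}
+```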
+
+#### Monitoring & Debugging
+- **How will data delivery to the destination be monitored?** (e.g., time-to-load, query failures)
+- **How will we monitor data pipeline health at the destination?** (e.g., throughput, latency)
+- **What tools or methods will be used for debugging data delivery failures or performance bottlenecks?**
+- **What metrics should be tracked to ensure data pipeline health?** (e.g., latency, throughput)
+- **How do we handle issues such as data corruption or incomplete data at the destination?**
+
+#### Data Access & Permissions
+- **Who is working with the platform, and who has access to query or visualize the data?**
+- **Which tools are used to query the data?**
+- **What kind of data export capabilities are required?** (e.g., CSV, API, direct database access)
+- **Is role-based access control (RBAC) needed to segment data views for different users?**
+- **How will access to sensitive data be managed?** (e.g., row-level security, encryption)
+
+#### Scaling & Future Requirements
+- **What are the scalability requirements for the data platform as data volume grows?**
+- **How will future business goals or scalability needs affect the design of data aggregation and retention strategies?**
+- **How will the system handle an increasing load as more users query data or as data volume grows?**
+
 
 ## Connect
 
diff --git a/sections/10-Updates.md b/sections/10-Updates.md
index ec1eba6..94e40ac 100644
--- a/sections/10-Updates.md
+++ b/sections/10-Updates.md
@@ -2,6 +2,9 @@ Updates
 ============
 What's new? Here you can find a list of all the updates with links to the sections
 
+- **2024-12-11**
+  - Prepared the most important questions for platform & pipeline design, specifically looking at the data source and the goals [click here](03-AdvancedSkills.md#platform-and-pipeline-design-basics)
+
 - **2024-11-28**
   - Prepared a GenAI RAG example project that you can run on your own computer without internet. It uses Ollama with Mistral model and Elasticsearch. Working on a way of creating embeddings from pdf files and inserting them into Elasticsearch for queries [click here](04-HandsOnCourse.md#genai-retrieval-augmented-generation-with-ollama-and-elasticsearch)