diff --git a/docs/01-Introduction.md b/docs/01-Introduction.md index 203269b..9acfaef 100644 --- a/docs/01-Introduction.md +++ b/docs/01-Introduction.md @@ -1,9 +1,3 @@ ---- -sidebar_label: Introduction -title: ' ' ---- - - Introduction ============ @@ -11,11 +5,7 @@ Introduction ## Contents - [What is this Cookbook](01-Introduction.md#what-is-this-cookbook) -- [Data Engineer vs Data Scientist](01-Introduction.md#data-engineer-vs-data-scientist) - - [Data Engineer](01-Introduction.md#data-engineer) - - [Data Scientist](01-Introduction.md#data-scientist) - - [Machine Learning Workflow](01-Introduction.md#machine-learning-workflow) - - [Machine Learning Model and Data](01-Introduction.md#machine-learning-model-and-data) +- [Data Engineers](01-Introduction.md#data-engineers) - [My Data Science Platform Blueprint](01-Introduction.md#my-data-science-platform-blueprint) - [Connect](01-Introduction.md#connect) - [Buffer](01-Introduction.md#buffer) @@ -23,23 +13,33 @@ Introduction - [Store](01-Introduction.md#store) - [Visualize](01-Introduction.md#visualize) - [Who Companies Need](01-Introduction.md#who-companies-need) +- [How to Learn Data Engineering](01-Introduction.md#how-to-learn-data-engineering) + - [Andreas interview on the Super Data Science Podcast](01-Introduction.md#Interview-with-Andreas-on-the-Super-Data-Science-Podcast) + - [Building Blocks to Learn Data Engineering](01-Introduction.md#building-blocks-to-learn-data-engineering) + - [Roadmap for Data Analysts](01-Introduction.md#roadmap-for-data-analysts) + - [Roadmap for Data Scientists](01-Introduction.md#roadmap-for-data-scientists) + - [Roadmap for Software Engineers](01-Introduction.md#roadmap-for-software-engineers) +- [Data Engineers Skills Matrix](01-Introduction.md#data-engineers-skills-matrix) +- [How to Become a Senior Data Engineer](01-Introduction.md#how-to-become-a-senior-data-engineer) + + ## What is this Cookbook I get asked a lot: "What do you actually need to learn to become an awesome data engineer?" -Well, look no further. you'll find it here! +Well, look no further. You'll find it here! If you are looking for AI algorithms and such data scientist things, this book is not for you. **How to use this Cookbook:** -This book is intended to be a starting point for you. It is not a training! I want to help you to identify the topics to look into and becoming an awesome data engineer in the process. +This book is intended to be a starting point for you. It is not a training! I want to help you to identify the topics to look into to become an awesome data engineer in the process. -It hinges on my Data Science Platform Blueprint, check it out below. Once you understand it, you can find in the book tools that fit into each key area of a Data Science platform (Connect, Buffer, Processing Framework, Store, Visualize). +It hinges on my Data Science Platform Blueprint. Check it out below. Once you understand it, you can find in the book tools that fit into each key area of a Data Science platform (Connect, Buffer, Processing Framework, Store, Visualize). -Select a few tools you are interested in, research and work with them. +Select a few tools you are interested in, then research and work with them. Don't learn everything in this book! Focus. @@ -51,47 +51,39 @@ and case studies. **This book is a work in progress!** As you can see, this book is not finished. I'm constantly adding new -stuff and doing videos for the topics. But obviously, because I do this -as a hobby my time is limited. 
You can help making this book even +stuff and doing videos for the topics. But, obviously, because I do this +as a hobby, my time is limited. You can help make this book even better.

**Help make this book awesome!**

If you have some cool links or topics for the cookbook, please become a contributor on GitHub: . Fork the -repo, add them and create a pull request. Or join the discussion by +repo, add them, and create a pull request. Or join the discussion by opening Issues.

Tell me your thoughts, what you value, what you think should be included, or correct me where I am wrong.

You can also write me an email any time to plumbersofdatascience\@gmail.com anytime.

**This Cookbook is and will always be free!**

-I don't want to sell you this book, but please support what you like and -join my Patreon: . -Or send me a message and support through PayPal: -

-Check out this podcast episode where I talk in detail why I decided to -share all this information for free: [\#079 Trying to stay true to -myself and making the cookbook public on -GitHub](https://youtu.be/k1bS5aSPos8) -

## If You Like This Book & Need More Help:

-Check out my Data Engineering Academy and personal Coaching at LearnDataEngineering.com +Check out my Data Engineering Academy at LearnDataEngineering.com

**Visit learndataengineering.com:** [Click Here](https://learndataengineering.com)

-- New content every week!
-- Step by step course from researching job postings, creating and doing your project to job application tips
-- Full AWS Data Engineering example project (Azure in development)
-- 1+ hours Ultimate Introduction to Data Engineering course
-- Data Engineering Fundamentals course
-- Data Platform & Pipeline Design course
-- Apache Spark Fundamentals course
-- Choosing Data Stores Course
-- Private Member Slack Workspace (lifetime access)
-- Weekly Q&A live stream & Archive
-- Currently over 24 hours of videos
+- Huge step-by-step Data Engineering Academy with over 30 courses
+- Unlimited access including future courses during your subscription
+- Access to all courses and example projects in the Academy
+- Associate Data Engineer Certification
+- Data Engineering on AWS E-Commerce example project
+- Microsoft Azure example project
+- Document Streaming example project with Docker, FastAPI, Apache Kafka, Apache Spark, MongoDB and Streamlit
+- Time Series example project with InfluxDB and Grafana
+- Lifetime access to the private Discord Workspace
+- Course certificates
+- Currently over 54 hours of videos

## Support This Book For Free!

@@ -108,45 +100,37 @@ Please use the "Issues" function for comments.

-Data Engineer vs Data Scientist
+Data Engineers
-------------------------------

-| Podcast Episode: #050 Data Engineer, Scientist or Analyst - Which One Is For You?
-|-----------------------------------------------------------------------------------
-| In this podcast we talk about the differences between data scientists, analysts and engineers. Which are the three main data science jobs. All three are super important. This makes it easy to decide
-| [Watch on YouTube](https://youtu.be/64TYZETOEdQ) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/050-Data-Engineer-Scientist-or-Analyst-Which-One-Is-For-You-e45ibl)

- -### Data Engineer -

Data Engineers are the link between the management's data strategy -and the data scientists that need to work with data. +and the data scientists or analysts that need to work with data.
-What they do is building the platforms that enable data scientists to do +What they do is build the platforms that enable data scientists to do their magic. These platforms are usually used in five different ways: -- Data ingestion and storage of large amounts of data +- Data ingestion and storage of large amounts of data. -- Algorithm creation by data scientists +- Algorithm creation by data scientists. - Automation of the data scientist's machine learning models and - algorithms for production use + algorithms for production use. -- Data visualization for employees and customers +- Data visualization for employees and customers. - Most of the time these guys start as traditional solution architects for systems that involve SQL databases, web servers, SAP installations and other "standard" systems. -But to create big data platforms the engineer needs to be an expert in -specifying, setting up and maintaining big data technologies like: -Hadoop, Spark, HBase, Cassandra, MongoDB, Kafka, Redis and more. +But, to create big data platforms, the engineer needs to be an expert in +specifying, setting up, and maintaining big data technologies like: +Hadoop, Spark, HBase, Cassandra, MongoDB, Kafka, Redis, and more. What they also need is experience on how to deploy systems on cloud -infrastructure like at Amazon or Google or on-premise hardware. +infrastructure like at Amazon or Google, or on-premise hardware. | Podcast Episode: #048 From Wannabe Data Scientist To Engineer My Journey @@ -154,159 +138,6 @@ infrastructure like at Amazon or Google or on-premise hardware. |In this episode Kate Strachnyi interviews me for her humans of data science podcast. We talk about how I found out that I am more into the engineering part of data science. | [Watch on YouTube](https://youtu.be/pIZkTuN5AMM) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/048-From-Wannabe-Data-Scientist-To-Engineer-My-Journey-e45i2o)| -### Data Scientist - -Data scientists aren't like every other scientist. - -Data scientists do not wear white coats or work in high tech labs full -of science fiction movie equipment. They work in offices just like you -and me. - -What differs them from most of us is that they are math experts. They -use linear algebra and multivariable calculus to create new insight from -existing data. - -How exactly does this insight look? - -Here's an example: - -An industrial company produces a lot of products that need to be tested -before shipping. - -Usually such tests take a lot of time because there are hundreds of -things to be tested. All to make sure that your product is not broken. - -Wouldn't it be great to know early if a test fails ten steps down the -line? If you knew that you could skip the other tests and just trash the -product or repair it. - -That's exactly where a data scientist can help you, big-time. This field -is called predictive analytics and the technique of choice is machine -learning. - -Machine what? Learning? - -Yes, machine learning, it works like this: - -You feed an algorithm with measurement data. It generates a model and -optimises it based on the data you fed it with. That model basically -represents a pattern of how your data is looking. You show that model -new data and the model will tell you if the data still represents the -data you have trained it with. This technique can also be used for -predicting machine failure in advance with machine learning. Of course -the whole process is not that simple. 
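To make the training idea above concrete, here is a minimal sketch of the pattern just described: feed an algorithm historical measurement data, let it build a model, then ask the model about new data. It assumes scikit-learn and uses synthetic measurements and a random-forest classifier purely for illustration; none of these choices come from the text itself.

```python
# Minimal sketch: predict whether the final test will fail from earlier
# measurements. Data, features and model choice are made up for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))                  # three earlier measurements per product
y = (X[:, 0] + 0.5 * X[:, 2] > 1).astype(int)  # synthetic label: 1 = final test failed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)                    # the training phase

print("accuracy on held-out data:", model.score(X_test, y_test))
print("prediction for a new product:", model.predict([[0.2, -1.3, 0.7]]))
```

In practice, most of the effort goes into pre-processing the data before it reaches the algorithm and post-processing the model output, which the next paragraphs describe.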
- -The actual process of training and applying a model is not that hard. A -lot of work for the data scientist is to figure out how to pre-process -the data that gets fed to the algorithms. - -In order to train an algorithm you need useful data. If you use any data -for the training the produced model will be very unreliable. - -An unreliable model for predicting machine failure would tell you that -your machine is damaged even if it is not. Or even worse: It would tell -you the machine is ok even when there is a malfunction. - -Model outputs are very abstract. You also need to post-process the model -outputs to receive the outputs you desire - -![The Machine Learning Pipeline](/images/Machine-Learning-Pipeline.jpg) - - -### Machine Learning Workflow - -![The Machine Learning Workflow](/images/Machine-Learning-Workflow.jpg) - -Data Scientists and Data Engineers. How does that all fit together? - -You have to look at the data science process. How stuff is created and how data -science is done. How machine learning is -done. - -The machine learning process shows, that you start with a training phase. A phase where you are basically training the algorithms to create the right output. - -In the learning phase you are having the input parameters. Basically the configuration of the model and you have the input data. - -What you're doing is you are training the algorithm. While training the algorithm modifies the training -parameters. It also modifies the used data and then you are getting to an output. - -Once you get an output you are evaluating. Is that output okay, or is that output not the desired output? - -if the output is not what you were looking for? Then you are continuing with the training phase. - -You're trying to retrain the model hundreds, thousands, hundred thousands of times. Of course all this is being done automatically. - -Once you are satisfied with the output, you are putting the model into production. In production it is no longer fed with training -data it's fed with the live data. - -It's evaluating the input data live and putting out live results. - -So, you went from training to production and then what? - -What you do is monitoring the output. If the output keeps making sense, all good! - -If the output of the model changes and it's on longer what you have expected, it means the model doesn't work anymore. - -You need to trigger a retraining of the model. It basically gets to getting trained again. - -Once you are again satisfied with the output, you put it into production again. It replaces the one in production. - -This is the overall process how machine learning. It's how the learning part of data science is working. - - -### Machine Learning Model and Data - -![The Machine Learning Model](/images/Machine-Learning-Model.jpg) - -Now that's all very nice. - -When you look at it, you have two very important places where you have data. - -You have in the training phase two types of data: -Data that you use for the training. Data that basically configures the model, the hyper parameter configuration. - -Once you're in production you have the live data that is streaming in. Data that is coming in from from an app, from -a IoT device, logs, or whatever. - -A data catalog is also important. It explains which features are available and how different data sets are labeled. - -All different types of data. Now, here comes the engineering part. - -The Data Engineers part, is making this data available. Available to the data scientist and the machine learning process. 
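The workflow just described (train, evaluate, put into production, monitor, retrain) can be summarised as a small control loop. The sketch below only shows that structure; every function in it is a hypothetical placeholder, not a real library call.

```python
# Structural sketch of the machine learning workflow described above.
# train(), evaluate() and predict() are hypothetical placeholders.

QUALITY_THRESHOLD = 0.9

def train(config, data):
    return {"config": dict(config), "samples": len(data)}   # dummy "model"

def evaluate(model, validation_data):
    return 0.95                                              # pretend quality score

def predict(model, record):
    return "ok"                                              # pretend live result

config = {"depth": 5}                                        # hyper parameter configuration
training_data, validation_data = [1, 2, 3, 4], [5, 6]

# Training phase: retrain until the output is good enough
model = train(config, training_data)
while evaluate(model, validation_data) < QUALITY_THRESHOLD:
    config["depth"] += 1                                     # adjust parameters
    model = train(config, training_data)

# Production phase: score live data and monitor the output quality
for record in [7, 8, 9]:                                     # stand-in for live data
    print(predict(model, record))
    if evaluate(model, validation_data) < QUALITY_THRESHOLD:
        model = train(config, training_data + [record])      # trigger retraining
```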
- -So when you look at the model, on the left side you have your hyper parameter configuration. You need to store and manage these configurations somehow. - -Then you have the actual training data. - -There's a lot going on with the training data: - -Where does it come from? Who owns it? Which is basically data governance. - -What's the lineage? Have you modified this data? What did you do, what was the basis, the raw data? - -You need to access all this data somehow. In training and in production. - -In production you need to have access to the live data. - -All this is the data engineers job. Making the data available. - -First an architect needs to build the platform. This can also be a good data engineer. - -Then the data engineer needs to build the pipelines. How is the data coming in and how is the platform -connecting to other systems. - -How is that data then put into the storage. Is there a pre processing for the algorithms necessary? He'll do it. - -Once the data and the systems are available, it's time for the machine learning part. - -It is ready for processing. Basically ready for the data scientist. - -Once the analytics is done the data engineer needs to build pipelines to make it then accessible again. For instance for other analytics processes, for APIs, for front ends and so on. - -All in all, the data engineer's part is a computer science part. - -That's why I love it so much :) - ## My Data Science Platform Blueprint @@ -314,18 +145,14 @@ I have created a simple and modular big data platform blueprint. It is based on what I have seen in the field and read in tech blogs all over the internet. -Why do I believe it will be super useful to you? - -Because, unlike other blueprints it is not focused on technology. +Why do I believe it will be super useful to you? Because, unlike other blueprints, it is not focused on technology. Following my blueprint will allow you to create the big data platform that fits exactly your needs. Building the perfect platform will allow -data scientists to discover new insights. - -It will enable you to perfectly handle big data and allow you to make -data driven decisions. +data scientists to discover new insights. It will enable you to perfectly handle big data and allow you to make +data-driven decisions. -The blueprint is focused on the five key areas: Connect, Buffer, Processing Frameworks, Store and Visualize. +The blueprint is focused on the five key areas: Connect, Buffer, Processing Frameworks, Store, and Visualize. ![Data Science Platform Blueprint](/images/Data-Science-Blueprint-New.jpg) @@ -334,22 +161,22 @@ loosely coupled interfaces. Why is it so important to have a modular platform? -If you have a platform that is not modular you end up with something +If you have a platform that is not modular, you end up with something that is fixed or hard to modify. This means you can not adjust the platform to changing requirements of the company. -Because of modularity it is possible to specifically select tools for your use case. It also allows you to replace every component, if you need it. +Because of modularity, it is possible to specifically select tools for your use case. It also allows you to replace every component, if you need it. Now, lets talk more about each key area. ### Connect Ingestion is all about getting the data in from the source and making it -available to later stages. Sources can be everything from tweets, server -logs to IoT sensor data (e.g. from cars). +available to later stages. 
Sources can be everything from tweets to server +logs, to IoT sensor data (e.g. from cars). Sources send data to your API Services. The API is going to push the -data into a temporary storage. +data into temporary storage. The temporary storage allows other stages simple and fast access to incoming data. @@ -371,28 +198,28 @@ You put something in on one side and take it out on the other. The idea behind buffers is to have an intermediate system for the incoming data. -How this works is, for instance you're getting data in from from an API. +How this works is, for instance, you're getting data in from from an API. The API is publishing into the message queue. Data is buffered there until it is picked up by the processing. -If you don't have a buffer you can run into problems when writing directly into a store, or you're processing the data directly. You can always have peaks of incoming data that stall the systems. +If you don't have a buffer, you can run into problems when writing directly into a store or you're processing the data directly. You can always have peaks of incoming data that stall the systems. -Like, it's lunch break and people are working with your app way more then usually. -There's more data coming in very very fast. Faster than the analytics of the storage can handle. +Like, it's lunch break and people are working with your app way more than usual. +There's more data coming in very very fast, faster than the analytics of the storage can handle. -In this case you would run into problems, because the whole system would stall. It would therefore take long to process the data and your customers would be annoyed. +In this case, you would run into problems, because the whole system would stall. It would therefore take long to process the data, and your customers would be annoyed. -With a buffer you're buffering the incoming data. Processes for storage and analytics can take out only as much data as they can process. You are no longer in danger of overpowering systems. +With a buffer, you buffer the incoming data. Processes for storage and analytics can take out only as much data as they can process. You are no longer in danger of overpowering systems. Buffers are also really good for building pipelines. -You take data out of Kafka, you pre-process it and put it back into Kafka. -Then with another analytics process you take the processed data back out and put it into a store. +You take data out of Kafka, pre-process it, and put it back into Kafka. +Then, with another analytics process, you take the processed data back out and put it into a store. -Ta Da! A pipeline. +Ta-da! A pipeline. ### Processing Framework -The analyse stage is where the actual analytics is done. Analytics, in +The analyse stage is where the actual analytics is done in the form of stream and batch processing. Streaming data is taken from ingest and fed into analytics. Streaming @@ -405,18 +232,18 @@ big chunk of data and analyse it. This type of analysis is called batch processing. It will deliver you answers for the big questions. -For a short video about batch and stream processing and their use-cases, click on the link below: +For a short video about batch and stream processing and their use cases, click on the link below: [Adding Batch to a Streaming Pipeline](https://www.youtube.com/watch?v=o-aGi3FmdfU) -The analytics process, batch or streaming, is not a one way process. +The analytics process, batch or streaming, is not a one-way process. Analytics can also write data back to the big data storage. 
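The buffer pattern from the section above can be sketched in a few lines. This is a minimal illustration, assuming the kafka-python client, a broker on localhost:9092 and made-up topic names; it only shows the publish, buffer, consume and re-publish flow, not a production pipeline.

```python
# Minimal sketch of the buffer pattern: publish raw data into Kafka,
# consume it at your own pace, pre-process it and publish the result
# for the next stage (storage or further analytics).
import json
from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"   # assumed local broker

producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Ingestion side: an API service would publish incoming events into the buffer
producer.send("raw-events", {"user": 42, "action": "click"})
producer.flush()

# Processing side: take data out of the buffer only as fast as you can handle it
consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers=BROKER,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    event["processed"] = True                  # stand-in for real pre-processing
    producer.send("processed-events", event)   # hand over to the next stage
```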
-Often times writing data back to the storage makes sense. It allows you +Oftentimes, writing data back to the storage makes sense. It allows you to combine previous analytics outputs with the raw data. Analytics give insights when you combine -raw data. This combination will often times allow you to create even more +raw data. This combination will often allow you to create even more useful insights. A wide variety of analytics tools are available. Ranging from MapReduce @@ -424,11 +251,11 @@ or AWS Elastic MapReduce to Apache Spark and AWS lambda. ### Store -This is the typical big data storage where you just store everything. It +This is the typical big-data storage where you just store everything. It enables you to analyse the big picture. -Most of the data might seem useless for now, but it is of upmost -importance to keep it. Throwing data away is a big no no. +Most of the data might seem useless for now, but it is of utmost +importance to keep it. Throwing data away is a big no-no. Why not throw something away when it is useless? @@ -446,42 +273,279 @@ Check out my podcast how to decide between SQL and NoSQL: ### Visualize -Displaying data is as important as ingesting, storing and analysing it. -Visualizations enable business users to make data driven decisions. +Displaying data is as important as ingesting, storing, and analysing it. +Visualizations enable business users to make data-driven decisions. This is why it is important to have a good visual presentation of the data. Sometimes you have a lot of different use cases or projects using the platform. -It might not be possible for you to build the perfect UI that fits -everyone needs. What you should do in this case is enable others to build the +It might not be possible to build the perfect UI that fits +everyone's needs. What you should do in this case is enable others to build the perfect UI themselves. How to do that? By creating APIs to access the data and making them available to developers. -Either way, UI or API the trick is to give the display stage direct -access to the data in the big data cluster. This kind of access will +Either way, UI or API, the trick is to give the display stage direct +access to the data in the big-data cluster. This kind of access will allow the developers to use analytics results as well as raw data to build the perfect application. ## Who Companies Need -For a company, it is important to have well-trained data -engineers and data scientists. Think of the data scientist as a -professional race car driver. A fit athlete with talent and driving -skills like you have never seen before. +For a company, it is important to have well-trained data engineers. + +That's why companies are looking for people with experience of tools in every part of the above platform blueprint. One common theme I see is cloud platform experience on AWS, Azure or GCP. + +## How to Learn Data Engineering + +### Interview with Andreas on the Super Data Science Podcast + +#### Summary + +This interview with Andreas on Jon Krohn's Super Data Science podcast delves into the intricacies of data engineering, highlighting its critical role in the broader data science ecosystem. Andreas, calling from Northern Bavaria, Germany, shares his journey from a data analyst to becoming a renowned data engineering educator through his Learn Data Engineering Academy. The conversation touches upon the foundational importance of data engineering in ensuring data quality, scalability, and accessibility for data scientists and analysts. 
+ +Andreas emphasizes that the best data engineers often have a background in the companies domain/niche, which equips them with a deep understanding of the end user's needs. The discussion also explores the essential tools and skills required in the field, such as relational databases, APIs, ETL tools, data streaming with Kafka, and the significance of learning platforms like AWS, Azure, and GCP. Andreas highlights the evolving landscape of data engineering, with a nod towards the emergence of roles like analytics engineers and the increasing importance of automation and advanced data processing tools like Snowflake, Databricks, and DBT. + +The interview is not just a technical deep dive but also a personal journey of discovery and passion for data engineering, underscoring the perpetual learning and adaptation required in the fast-evolving field of data science. + +| Watch or listen to this interview -> 657: How to Learn Data Engineering — with Andreas Kretz +|------------------| +| Was super fun talking with Jon about Data Engineering on the podcast. Think this will be very helpful for you :) +| [Watch on YouTube](https://youtu.be/sbDFADS-zo8) / [Listen to the Podcast](https://www.superdatascience.com/podcast/how-to-learn-data-engineering)| + +#### Q&A Highlights + +**Q: What is data engineering, and why is it important?** A: Data engineering is the foundation of the data science process, focusing on collecting, cleaning, and managing data to make it accessible and usable for data scientists and analysts. It's crucial for automating data processes, ensuring data quality, and enabling scalable data analysis and machine learning models. + +**Q: How does one transition from data analysis to data engineering?** +A: The transition involves gaining a deep understanding of data pipelines, learning to work with various data processing and management tools, and developing skills in programming languages and technologies relevant to data engineering, such as SQL, Python, and cloud platforms like AWS or Azure. + +**Q: What are the key skills and tools for a data engineer?** +A: Essential skills include proficiency in SQL, experience with ETL tools, knowledge of programming languages like Python, and familiarity with cloud services and data processing frameworks like Apache Spark. Tools like Kafka for data streaming and platforms like Snowflake and Databricks are also becoming increasingly important. + +**Q: Can you elaborate on the emerging role of analytics engineers?** +A: Analytics engineers focus on bridging the gap between raw data management and data analysis, working closely with data warehouses and using tools like dbt to prepare and model data for easy analysis. This role is pivotal in making data more accessible and actionable for decision-making processes. -What he needs to win races is someone who will provide him the perfect -race car to drive. It is the data engineer/solution architect who will design and built the race car. +**Q: What advice would you give to someone aspiring to become a data engineer?** +A: Start by mastering the basics of SQL and Python, then explore and gain experience with various data engineering tools and technologies. It's also important to understand the data science lifecycle and how data engineering fits within it. Continuous learning and staying updated with industry trends are key to success in this field. -Like the driver and the race car engineer, the data scientist and the data engineer need to work closely together. 
They need to know the different big data tools inside out. +**Q: How does a data engineer's role evolve with experience?** +A: A data engineer's journey typically starts with focusing on specific tasks or segments of data pipelines, using a limited set of tools. As they gain experience, they broaden their skill set, manage entire data pipelines, and take on more complex projects. Senior data engineers often lead teams, design data architectures, and collaborate closely with data scientists and business stakeholders to drive data-driven decisions. + +**Q: What distinguishes data engineering from machine learning engineering?** +A: While both fields overlap, especially in the use of data, data engineering focuses on the infrastructure and processes for handling data, ensuring its quality and accessibility. Machine learning engineering, on the other hand, centers on deploying and maintaining machine learning models in production environments. A strong data engineering foundation is essential for effective machine learning engineering. + +**Q: Why might a data analyst transition to data engineering?** +A: Data analysts may transition to data engineering to work on more technical aspects of data handling, such as building and maintaining data pipelines, automating data processes, and ensuring data scalability. This transition allows them to have a more significant impact on the data lifecycle and contribute to more strategic data initiatives within an organization. + +**Q: Can you share a challenging project you worked on as a data engineer?** +A: One challenging project involved creating a scalable data pipeline for real-time processing of machine-generated data. The complexity lay in handling vast volumes of data, ensuring its quality, and integrating various data sources while maintaining high performance. This project highlighted the importance of selecting the right tools and technologies, such as Kafka for data streaming and Apache Spark for data processing, to meet the project's demands. + +**Q: How does the cloud influence data engineering?** +A: Cloud platforms like AWS, Azure, and GCP have transformed data engineering by providing scalable, flexible, and cost-effective solutions for data storage, processing, and analysis. They offer a wide range of services and tools that data engineers can leverage to build robust data pipelines and infrastructure, facilitating easier access to advanced data processing capabilities and enabling more innovative data solutions. + +**Q: What future trends do you see in data engineering?** +A: Future trends in data engineering include the increasing adoption of cloud-native services, the rise of real-time data processing and analytics, greater emphasis on data governance and security, and the continued growth of machine learning and AI-driven data processes. Additionally, tools and platforms that simplify data engineering tasks and enable more accessible data integration and analysis will become more prevalent, democratizing data across organizations. + +**Q: How does the background of a data analyst contribute to their success as a data engineer?** +A: Data analysts have a unique advantage when transitioning to data engineering due to their understanding of data's end-use. Their experience in analyzing data gives them insights into what makes data valuable and usable, enabling them to design more effective and user-centric data pipelines and storage solutions. 
+ +**Q: What role does automation play in data engineering?** +A: Automation is crucial in data engineering for scaling data processes, reducing manual errors, and ensuring consistency in data handling. Automated data pipelines allow for real-time data processing and integration, making data more readily available for analysis and decision-making. + +**Q: Can you discuss the significance of cloud platforms in data engineering?** +A: Cloud platforms like AWS, Azure, and GCP offer scalable, flexible, and cost-effective solutions for data storage, processing, and analysis. They provide data engineers with a suite of tools and services to build robust data pipelines, implement machine learning models, and manage large volumes of data efficiently. + +**Q: How does data engineering support data science and machine learning projects?** +A: Data engineering lays the groundwork for data science and machine learning by preparing and managing the data infrastructure. It ensures that high-quality, relevant data is available for model training and analysis, thereby enabling more accurate predictions and insights. + +**Q: What emerging technologies or trends should data engineers be aware of?** +A: Data engineers should keep an eye on the rise of machine learning operations (MLOps) for integrating machine learning models into production, the growing importance of real-time data processing and analytics, and the adoption of serverless computing for more efficient resource management. Additionally, technologies like containerization (e.g., Docker) and orchestration (e.g., Kubernetes) are becoming critical for deploying and managing scalable data applications. + +**Q: What challenges do data engineers face, and how can they be addressed?** +A: Data engineers often grapple with data quality issues, integrating disparate data sources, and scaling data infrastructure to meet growing data volumes. Addressing these challenges requires a solid understanding of data architecture principles, continuous monitoring and testing of data pipelines, and adopting best practices for data governance and management. + +**Q: How important is collaboration between data engineers and other data professionals?** +A: Collaboration is key in the data ecosystem. Data engineers need to work closely with data scientists, analysts, and business stakeholders to ensure that data pipelines are aligned with business needs and analytical goals. Effective communication and a shared understanding of data objectives are vital for the success of data-driven projects. + + +### Building Blocks to Learn Data Engineering + +The following Roadmaps all hinge on the courses in my Data Engineering Academy. They are designed to help students who come from many different professions and enable to build a customized curriculum. + +Here are all the courses currently available February 2024: + +**Colors:** Blue (The Basics), Green (Platform & Pipeline Fundamentals), Orange (Fundamental Tools), Red (Example Projects) + +![Building blocks of your curriculum](/images/All-Courses-at-Learn-Data-Engineering.jpg) + + +### Roadmap for Data Analysts + +![Building blocks of your curriculum](/images/Data-Engineering-Roadmap-for-Data-Analysts.jpg) + +I always advise my students to begin with familiar concepts and knowledge, then expand from there. As a data analyst working with data warehousing and report preparation, one might wonder where to start. Rather than diving into basics, it's beneficial to begin with understanding platform and pipeline fundamentals. 
This includes grasping the architecture of platforms and proceeding to more advanced topics. + +For instance, if you're familiar with data warehousing, you might explore BigQuery for warehousing or delve into the lakehouse concept on Snowflake or Google Cloud Platform (GCP). The ease of setup with these platforms allows for immediate application of existing knowledge through data uploading and manipulation. Additionally, our course offerings, such as one on Snowflake with DBT, enable further exploration into data transformation within these environments. + +The next step involves selecting appropriate data stores, understanding the differences between OLTP (Online Transaction Processing) databases and analytical data stores, including NoSQL and traditional relational databases. This understanding is crucial for effective data modeling, which we cover in our courses, offering insights into various database modeling techniques. + +Moreover, for those interested in non-relational databases, MongoDB presents a valuable opportunity to skip relational data modeling in favor of document stores, leveraging prior knowledge in warehousing and dimensional modeling. + +Python skills, even at a basic level, are essential for data engineers. Our Python for Data Engineers course aims to enhance understanding of data transformation tools and techniques. This knowledge, combined with insights into data warehousing, platform functionality, data modeling, and store selection, equips you with the skills to utilize Python in data engineering effectively. + +For analysts ready to explore further, projects focusing on fundamental tools and platforms become the next step. Whether it's building streaming data pipelines in Azure that integrate with NoSQL databases like Cosmos DB, or diving into relational data modeling on GCP, there's a wealth of paths to explore. Our courses also cover modern data warehouses and lakehouses on both AWS and GCP, providing comprehensive knowledge on data integration and management. + +Additionally, understanding Docker fundamentals opens up possibilities for containerization and machine learning projects, further enhancing your toolkit. From there, diving into Spark fundamentals, learning Kafka for data streaming, and mastering APIs can lead to developing end-to-end streaming projects with user interfaces. + +In summary, the journey from understanding basic platform and pipeline concepts to mastering advanced data engineering tools and techniques is a gradual process. By focusing on familiar areas and progressively expanding your skill set, you can achieve a solid foundation in data engineering. This approach, especially if documented well, can set you apart in the field, even at an entry-level or junior position. + +| Live Stream -> Roadmap: Data Engineering for Data Analysts! +|------------------| +|In this live stream I was showing step by step how to read this roadmap for Analysts, why I chose these tools and why I think this is the right way to do it. I also answered many questions from the audience. +| [Watch on YouTube](https://youtube.com/live/w2t6SL5tQG0)| + +### Roadmap for Data Scientists + +![Building blocks of your curriculum](/images/Data-Engineering-Roadmap-for-Data-Scientists.jpg) + +We’re going to tackle the data engineering roadmap for data scientists. It's a topic a lot of you have been curious about, especially after we explored the data analyst side of things. 
The goal here is to lay out a step-by-step path for those of you looking to make a pivot or deepen your understanding of data engineering. + +The first thing I did was sit down and list out all the courses available in my academy. It’s designed to be super flexible, catering to different job roles. For a data scientist, your journey usually starts with a strong grasp of data science fundamentals, right? You know your way around machine learning, how to preprocess data, and maybe even deploy models on a basic level. But then, the question arises: How do you set up an entire platform or pipeline that takes data from ingestion to a point where it’s usable for others? + +Here’s where it gets interesting. I thought about how we could structure this to really benefit data scientists. Starting with the basics, like platform and pipeline design, and then moving into choosing data storage solutions. We’re talking about understanding the differences between databases and when to use each type. + +But it doesn’t stop there. I’ve included some optional topics, like platform security, because it’s always handy to know, even if you’re not directly responsible for it. And since you’re already familiar with data, why not dive deeper into data modeling? It’s all about making your data work for you in the most efficient way possible. + +Now, let's talk about Docker. It's a game-changer for deploying your algorithms. And after that, mastering API fundamentals and streaming with Apache Kafka will open up new possibilities for your projects. + +Depending on your interests or where you see yourself in the future, you might want to explore cloud services like AWS, GCP, or Azure. Or maybe you’re more intrigued by the idea of document streaming and creating user interfaces with MongoDB and Streamlit. The roadmap I’ve laid out includes paths for all these directions. + +Monitoring and observability are crucial, too. You’ll want to keep an eye on your algorithms and the data flowing through your systems. Tools like Elasticsearch or InfluxDB paired with Grafana can give you those insights. + +And don’t forget about orchestration with Airflow. It’s all about keeping your workflows organized and efficient. + +So, this roadmap is more than just a list of topics. It’s about building a foundation that lets you, as a data scientist, expand into data engineering seamlessly. It’s about understanding the ecosystem around your data and how to leverage it to build robust, scalable solutions. + +| Live Stream -> Roadmap: Data Engineering for Data Scientists! +|------------------| +|In this live stream you'll find even more details how to read this roadmap for Data Scientists, why I chose these tools and why I think this is the right way to do it. I also answered many questions from the audience. +| [Watch on YouTube](https://youtube.com/live/fusLAtA1Eu4)| + +### Roadmap for Software Engineers + +![Building blocks of your curriculum](/images/Data-Engineering-Roadmap-for-Software-Engineers.jpg) + +if you're transitioning from a background in computer science or software engineering into data engineering, you're already equipped with a solid foundation. Your existing knowledge in coding, familiarity with SQL databases, understanding of computer networking, and experience with operating systems like Linux, provide you with a considerable advantage. These skills form the cornerstone of data engineering and can significantly streamline your learning curve as you embark on this new journey. 
+ +Here's a refined roadmap, incorporating your prior expertise, to help you excel in data engineering: + +- **Deepen Your Python Skills:** Python is crucial in data engineering for processing and handling various data formats, such as APIs, CSV, and JSON. Given your coding background, focusing on Python for data engineering will enhance your ability to manipulate and process data effectively. +- **Master Docker:** Docker is essential for deploying code and managing containers, streamlining the software distribution process. Your understanding of operating systems and networking will make mastering Docker more intuitive, as you'll appreciate the importance of containerization in today's development and deployment workflows. +- **Platform and Pipeline Design:** Leverage your knowledge of computer networking and operating systems to grasp the architecture of data platforms. Understanding how to design data pipelines, including considerations for stream and batch processing, and emphasizing security, will be key. Your background will provide a solid foundation for understanding how different components integrate within a data platform. +- **Choosing the Right Data Stores:** Dive into the specifics of data stores, understanding the nuances between transactional and analytical databases, and when to use relational vs. NoSQL vs. document stores vs. time-series databases. Your experience with SQL databases will serve as a valuable baseline for exploring these various data storage options. +- **Explore Cloud Platforms:** Get hands-on with cloud services such as AWS, GCP, and Azure. Projects or courses that offer practical experience with these platforms will be invaluable. Your tasks might include building pipelines to process data from APIs, using message queues, or delving into data warehousing and lakes, capitalizing on your foundational skills. +- **Optional Deep Dives:** For those interested in advanced data processing, exploring technologies like Spark or Kafka for stream processing can be enriching. Additionally, learning how to build APIs and work with MongoDB for document storage can open new avenues, especially through practical projects. +- **Log Analysis and Data Observability:** Familiarize yourself with tools like Elasticsearch, Grafana, and InfluxDB to monitor and analyze your data pipelines effectively. This area leverages your comprehensive understanding of how systems communicate and operate, enhancing your ability to maintain and optimize data flows. + +As you embark on this path, remember that your journey is unique. Your existing knowledge not only serves as a strong foundation but also as a catalyst for accelerating your growth in the realm of data engineering. Keep leveraging your strengths, explore areas of interest deeply, and continually adapt to the evolving landscape of data technology. + +| Live Stream -> Data Engineering Roadmap for Computer Scientists / Developers +|------------------| +|In this live stream you'll find even more details how to read this roadmap for Data Scientists, why I chose these tools and why I think this is the right way to do it. +| [Watch on YouTube](https://youtube.com/live/0e4WfIUixRw)| + + +## Data Engineers Skills Matrix + +![Data Engineer Skills Matrix](/images/Data-Engineer-Skills-Matrix.jpg) + +If you're diving into the world of data engineering or looking to climb the ladder within this field, you're in for a treat with this enlightening YouTube video. 
Andreas kicks things off by introducing us to a very handy tool they've developed: the Data Engineering Skills Matrix. This isn't just any chart; it's a roadmap designed to navigate the complex landscape of data engineering roles, ranging from a Junior Data Engineer to the lofty heights of a Data Architect and Machine Learning Engineer. + +| Live Stream -> Data Engineering Skills Matrix +|------------------| +|In this live stream you'll find even more details how to read this skills matrix for Data Engineers. +| [Watch on YouTube](https://youtube.com/live/5E0UiBy0Kwo)| + +Andreas takes us through the intricacies of this matrix, layer by layer. Starting with the basics, they discuss the minimum experience needed for each role. It's an eye-opener, especially when you see how experience requirements evolve from a beginner to senior levels. But it's not just about how many years you've spent in the field; it's about the skills you've honed during that time. + +### Challenges & Responsibilities + +As the conversation progresses, Andreas delves into the core responsibilities and main tasks associated with each role. You'll learn what sets a Junior Data Engineer apart from a Senior Data Engineer, the unique challenges a Data Architect faces, and the critical skills a Machine Learning Engineer must possess. This part of the video is golden for anyone trying to understand where they fit in the data engineering ecosystem or plotting their next career move. + +### SQL & Soft Skills + +Then there's the talk on SQL knowledge and its relevance across different roles. This segment sheds light on how foundational SQL is, irrespective of your position. But it's not just about technical skills; the video also emphasizes soft skills, like leadership and collaboration, painting a holistic picture of what it takes to succeed in data engineering. + +For those who love getting into the weeds, Andreas doesn't disappoint. They discuss software development skills, debugging, and even dive into how data engineers work with SQL and databases. This part is particularly insightful for understanding the technical depth required at various stages of your career. + +### Q&A + +And here's the cherry on top: Andreas encourages interaction, inviting viewers to share their experiences and questions. This makes the video not just a one-way learning experience but a dynamic conversation that enriches everyone involved. + +### Summary + +By the end of this video, you'll walk away with a clear understanding of the data engineering field's diverse roles. You'll know the skills needed to excel in each role and have a roadmap for your career progression. Whether you're a recent graduate looking to break into data engineering or a seasoned professional aiming for a senior position, Andreas's video is a must-watch. It's not just a lecture; it's a guide to navigating the exciting world of data engineering, tailored by someone who's taken the time to lay out the journey for you. + + + +## How to Become a Senior Data Engineer + +Becoming a senior data engineer is a goal many in the tech industry aspire to. It's a role that demands a deep understanding of data architecture, advanced programming skills, and the ability to lead and communicate effectively within an organization. In this live stream series, I dove into what it takes to climb the ladder to a senior data engineering position. Here are the key takeaways. You can find the links to the videos and the shown images below. 
+ +### Understanding the Role +The journey to becoming a senior data engineer starts with a clear understanding of what the role entails. Senior data engineers are responsible for designing, implementing, and maintaining an organization's data architecture. They ensure data accuracy, accessibility, and security, often taking the lead on complex projects that require advanced technical skills and strategic thinking. + +### Key Skills and Knowledge Areas +Based on insights from the live stream and consultations with industry experts, including GPT-3, here are the critical areas where aspiring senior data engineers should focus their development: + +- **Advanced Data Modeling and Architecture:** Mastery of data modeling techniques and architecture best practices is crucial. This includes understanding of dimensional and Data Vault modeling, as well as expertise in SQL and NoSQL databases. +- **Big Data Technologies:** Familiarity with distributed computing frameworks (like Apache Spark), streaming technologies (such as Apache Kafka), and cloud-based big data technologies is essential. +Advanced ETL Techniques: Skills in incremental loading, data merging, and transformation are vital for efficiently processing large datasets. +- **Data Warehousing and Data Lake Implementation:** Building and maintaining scalable and performant data warehouses and lakes are fundamental responsibilities. +- **Cloud Computing:** Proficiency in cloud services from AWS, Azure, or GCP, along with platforms like Snowflake and Databricks, is increasingly important. +- **Programming and Scripting:** Advanced coding skills in languages relevant to data engineering, such as Python, Scala, or Java, are non-negotiable. +- **Data Governance and Compliance:** Understanding data governance frameworks and compliance requirements is critical, especially in highly regulated industries. +- **Leadership and Communication:** Beyond technical skills, the ability to lead projects, communicate effectively with both technical and non-technical team members, and mentor junior engineers is what differentiates a senior engineer. + +### Learning Pathways +Becoming a senior data engineer requires continuous learning and real-world experience. Here are a few steps to guide your journey: + +- **Educational Foundation:** Start with a strong foundation in computer science or a related field. This can be through formal education or self-study courses. +- **Gain Practical Experience:** Apply your skills in real-world projects. This could be in a professional setting, contributions to open-source projects, or personal projects. +- **Specialize and Certify:** Consider specializing in areas particularly relevant to your interests or industry needs. Obtaining certifications in specific technologies or platforms can also bolster your credentials. +- **Develop Soft Skills:** Work on your communication, project management, and leadership skills. These are as critical as your technical abilities. +- **Seek Feedback and Mentorship:** Learn from the experiences of others. Seek out mentors who can provide guidance and feedback on your progress. 
+ +### Video 1 + +| Live Stream -> How to become a Senior Data Engineer - Part 1 +|------------------| +| In this part one I talked about Data Modeling, Big Data, ETL, Data Warehousing & Data Lakes as well as Cloud computing +| [Watch on YouTube](https://youtube.com/live/M-6xkTCKQQc)| + +![Watch on YouTube](/images/Becoming-a-Senior-Data-Engineer-Video-1.jpg) + +### Video 2 + +| Live Stream -> How to become a Senior Data Engineer - Part 2 +|------------------| +| In part two I talked about real time data processing, programming & scripting, data governance, compliance and data security +| [Watch on YouTube](https://youtube.com/live/po96pzpjxvA)| + +![Watch on YouTube](/images/Becoming-a-Senior-Data-Engineer-Video-2.jpg) + +### Video 3 + +| Live Stream -> How to become a Senior Data Engineer - Part 3 +|------------------| +| In part 3 I focused on everything regarding Leadership and Communication: team management, project management, collaboration, problem solving, strategic thinking, communication and leadership +| [Watch on YouTube](https://youtube.com/live/DMumpzSyRjI)| -That's why companies are looking for people with Spark experience. Spark is the common ground between the data engineer and the data scientist that drives innovation. +![Watch on YouTube](/images/Becoming-a-Senior-Data-Engineer-Video-3.jpg) -Spark gives data scientists the tools to do analytics and helps -engineers to bring the data scientist's algorithms into production. -After all, those two decide how good the data platform is, how good the -analytics insight is and how fast the whole system gets into a -production-ready state. +### Final Thoughts +The path to becoming a senior data engineer is both challenging and rewarding. It requires a blend of technical prowess, continuous learning, and the development of soft skills that enable you to lead and innovate. Whether you're just starting out or looking to advance your career, focusing on the key areas outlined above will set you on the right path. 
diff --git a/docs/02-BasicSkills.md b/docs/02-BasicSkills.md index 9d9e748..5e84829 100644 --- a/docs/02-BasicSkills.md +++ b/docs/02-BasicSkills.md @@ -1,104 +1,83 @@ ---- -sidebar_label: Basic Skills -title: ' ' ---- - - -Basic Data Engineering Skills +Basic Computer Science Skills ============================= ## Contents -- [Learn To Code](02-BasicSkills.md#learn-to-code) -- [Get Familiar With Git](02-BasicSkills.md#get-familiar-with-git) +- [Learn to Code](02-BasicSkills.md#learn-to-code) +- [Get Familiar with Git](02-BasicSkills.md#get-familiar-with-git) - [Agile Development](02-BasicSkills.md#agile-development) - - [Why is agile so important?](02-BasicSkills.md#Why-is-agile-so-important) - - [Agile rules I learned over the years](02-BasicSkills.md#agile-rules-i-learned-over-the-years) + - [Why Is Agile So Important?](02-BasicSkills.md#Why-is-agile-so-important) + - [Agile Rules I Learned Over the Years](02-BasicSkills.md#agile-rules-i-learned-over-the-years) - [Agile Frameworks](02-BasicSkills.md#agile-frameworks) - [Scrum](02-BasicSkills.md#scrum) - [OKR](02-BasicSkills.md#okr) - [Software Engineering Culture](02-BasicSkills.md#software-engineering-culture) -- [Learn how a Computer Works](02-BasicSkills.md#learn-how-a-computer-works) +- [Learn How a Computer Works](02-BasicSkills.md#learn-how-a-computer-works) - [Data Network Transmission](02-BasicSkills.md#data-network-transmission) - [Security and Privacy](02-BasicSkills.md#security-and-privacy) - [SSL Public and Private Key Certificates](02-BasicSkills.md#ssl-public-and-private-key-Certificates) - [JSON Web Tokens](02-BasicSkills.md#json-web-tokens) - - [GDPR regulations](02-BasicSkills.md#gdpr-regulations) + - [GDPR Regulations](02-BasicSkills.md#gdpr-regulations) - [Linux](02-BasicSkills.md#linux) - [OS Basics](02-BasicSkills.md#os-basics) - - [Shell scripting](02-BasicSkills.md#shell-scripting) + - [Shell Scripting](02-BasicSkills.md#shell-scripting) - [Cron Jobs](02-BasicSkills.md#cron-jobs) - [Packet Management](02-BasicSkills.md#packet-management) - [Docker](02-BasicSkills.md#docker) - [What is Docker and How it Works](02-BasicSkills.md#what-is-docker-and-what-do-you-use-it-for) - - [Don't Mess Up Your System](02-BasicSkills.md#dont-mess-up-your-system) - - [Preconfigured Images](02-BasicSkills.md#preconfigured-images) - - [Take it With You](02-BasicSkills.md#take-it-with-you) - - [Kubernetes Container Deployment](02-BasicSkills.md#kubernetes-container-deployment) - - [How to Create Start and Stop a Container](02-BasicSkills.md#how-to-create-start-stop-a-container) - - [Docker Micro Services](02-BasicSkills.md#docker-micro-services) - - [Kubernetes](02-BasicSkills.md#kubernetes) - - [Why and How To Do Docker Container Orchestration](02-BasicSkills.md#why-and-how-to-do-docker-container-orchestration) - - [Useful Docker Commands](02-BasicSkills.md#useful-docker-commands) + - [Kubernetes Container Deployment](02-BasicSkills.md#kubernetes-container-deployment) + - [Why and How To Do Docker Container Orchestration](02-BasicSkills.md#why-and-how-to-do-docker-container-orchestration) + - [Useful Docker Commands](02-BasicSkills.md#useful-docker-commands) - [The Cloud](02-BasicSkills.md#the-cloud) - - [IaaS vs PaaS vs SaaS](02-BasicSkills.md#iaas-vs-paas-vs-saas) - - [AWS Azure IBM Google IBM](02-BasicSkills.md#aws-azure-ibm-google) - - [Cloud vs On-Premises](02-BasicSkills.md#cloud-vs-on-premises) + - [IaaS vs. PaaS vs. 
SaaS](02-BasicSkills.md#iaas-vs-paas-vs-saas)
+ - [AWS Azure IBM Google](02-BasicSkills.md#aws-azure-ibm-google)
+ - [Cloud vs. On-Premises](02-BasicSkills.md#cloud-vs-on-premises)
- [Security](02-BasicSkills.md#security)
- [Hybrid Clouds](02-BasicSkills.md#hybrid-clouds)
-- [Security Zone Design](02-BasicSkills.md#security-zone-design)
- - [How to secure a multi layered application](02-BasicSkills.md#how-to-secure-a-multi-layered-application)
- - [Cluster security with Kerberos](02-BasicSkills.md#cluster-security-with-kerberos)
+- [Data Scientists and Machine Learning](02-BasicSkills.md#Data-Scientists-and-Machine-Learning)
+ - [Machine Learning Workflow](02-BasicSkills.md#machine-learning-workflow)
+ - [Machine Learning Model and Data](02-BasicSkills.md#machine-learning-model-and-data)

-Learn To Code
+Learn to Code
-------------

Why this is important: Without coding you cannot do much in data
-engineering. I cannot count the number of times I needed a quick Java
-hack.
+engineering. I cannot count the number of times I needed a quick hack to solve a problem.

The possibilities are endless:

-- Writing or quickly getting some data out of a SQL DB
+- Writing or quickly getting some data out of a SQL DB.

-- Testing to produce messages to a Kafka topic
+- Testing to produce messages to a Kafka topic.

-- Understanding the source code of a Java Webservice
+- Understanding the source code of a web service.

-- Reading counter statistics out of a HBase key value store
+- Reading counter statistics out of an HBase key-value store.

So, which language do I recommend then?

-I highly recommend Java. It's everywhere!
-
-When you are getting into data processing with Spark you should use
-Scala. But, after learning Java this is easy to do.
-
-Also Python is a great choice. It is super versatile.
-
-Personally however, I am not that big into Python. But I am going to
-look into it
-Where to Learn? There's a Java Course on Udemy you could look at:
-
+If you had asked me a few years ago, I would have said Java, 100%. Nowadays, though, the community has moved heavily to Python. I highly recommend starting with it.

-- OOP Object oriented programming
+When you are getting into data processing with Spark, you can use
+Scala, which is a JVM language, but Python is also very good here.

-- What are Unit tests to make sure what you code is working
+Python is a great choice. It is super versatile.

-- Functional Programming

-- How to use build management tools like Maven
+Where to Learn Python? There are free Python courses all over the internet.
+- I have a beginner one in my Data Engineering academy: [Introduction to Python course](https://learndataengineering.com/p/introduction-to-python)
+- I also have a Python for Data Engineers one in my Data Engineering academy: [Python for Data Engineers course](https://learndataengineering.com/p/python-for-data-engineers)

-- Resilient testing (?)
+Keep it practical: learning by doing! I talked about the importance of learning by doing in this podcast:

-Get Familiar With Git
+Get Familiar with Git
---------------------

Why this is important: One of the major problems with coding is to keep
@@ -109,19 +88,19 @@ Another problem is the topic of collaboration and documentation, which
is super important.

Let's say you work on a Spark application and your colleagues need to
-make changes while you are on holiday. Without some code management they
+make changes while you are on holiday. Without some code management, they
are in huge trouble: Where is the code?
What have you changed last? Where is the documentation? How do we mark
what we have changed?

-But if you put your code on GitHub your colleagues can find your code.
+But, if you put your code on GitHub, your colleagues can find your code.
They can understand it through your documentation (please also have
-in-line comments)
+in-line comments).

-Developers can pull your code, make a new branch and do the changes.
-After your holiday you can inspect what they have done and merge it with
-your original code and you end up having only one application.
+Developers can pull your code, make a new branch, and do the changes.
+After your holiday, you can inspect what they have done and merge it with
+your original code, and you end up having only one application.

Where to learn: Check out the GitHub Guides page where you can learn all
the basics:

@@ -139,60 +118,60 @@ Also look into:

- Forking

-GitHub uses markdown to write pages. A super simple language that is actually a lot of fun to write. Here's a markdown cheat cheatsheet:
+GitHub uses markdown to write pages, a super simple language that is actually a lot of fun to write. Here's a markdown cheatsheet:

-Pandoc is a great tool to convert any text file from and to markdown:
+Pandoc is a great tool to convert any text file to and from markdown:

Agile Development
-----------------

-Agility, the ability to adapt quickly to changing circumstances.
+Agility is the ability to adapt quickly to changing circumstances.

-These days everyone wants to be agile. Big or small company people are
-looking for the "startup mentality".
+These days, everyone wants to be agile. Big and small companies are
+looking for the "startup mentality."

-Many think it's the corporate culture. Others think it's the process how
+Many think it's the corporate culture. Others think it's the process of how
we create things that matters.

-In this article I am going to talk about agility and self-reliance.
-About how you can incorporate agility in your professional career.
+In this article, I am going to talk about agility and self-reliance,
+about how you can incorporate agility in your professional career.

-### Why is agile so important?
+### Why Is Agile So Important?

-Historically, development is practiced as an explicitly defined process. You
-think of something, specify it, have it developed and then built in mass
+Historically, development has been practiced as an explicitly defined process. You
+think of something, specify it, have it developed, and then build it in mass
production.

It's a bit of an arrogant process. You assume that you already know
-exactly what a customer wants. Or how a product has to look and how
+exactly what a customer wants, or how a product has to look and how
everything works out.

The problem is that the world does not work this way!

-Often times the circumstances change because of internal factors.
+Oftentimes the circumstances change because of internal factors.

Sometimes things just do not work out as planned or stuff is harder than
you think. You need to adapt.

-Other times you find out that you build something customers do not like
-and need to be changed.
+Other times you find out that you built something customers do not like
+and it needs to be changed.

You need to adapt.

-That's why people jump on the Scrum train. Because Scrum is the
+That's why people jump on the Scrum train -- because Scrum is the
definition of agile development, right?
-### Agile rules I learned over the years +### Agile Rules I Learned Over the Years -#### Is the method making a difference? +#### Is the Method Making a Difference? Yes, Scrum or Google's OKR can help to be more agile. The secret to -being agile however, is not only how you create. +being agile, however, is not only how you create. What makes me cringe is people trying to tell you that being agile starts in your head. So, the problem is you? @@ -202,9 +181,9 @@ No! The biggest lesson I have learned over the past years is this: Agility goes down the drain when you outsource work. -#### The problem with outsourcing +#### The Problem with Outsourcing -I know on paper outsourcing seems like a no-brainer: Development costs +I know on paper outsourcing seems like a no-brainer: development costs against the fixed costs. It is expensive to bind existing resources on a task. It is even more @@ -218,7 +197,7 @@ money. His agenda will be to spend as little time as possible on your work. That is why outsourcing requires contracts, detailed specifications, -timetables and delivery dates. +timetables, and delivery dates. He doesn't want to spend additional time on a project, only because you want changes in the middle. Every unplanned change costs him time and @@ -228,22 +207,22 @@ If so, you need to make another detailed specification and a contract change. He is not going to put his mind into improving the product while -developing. Firstly because he does not have the big picture. Secondly +developing. Firstly, because he does not have the big picture. Secondly, because he does not want to. He is doing as he is told. -Who can blame him? If I was the subcontractor I would do exactly the +Who can blame him? If I were the subcontractor, I would do exactly the same! Does this sound agile to you? -#### Knowledge is king: A lesson from Elon Musk +#### Knowledge Is King: A lesson from Elon Musk -Doing everything in house, that's why startups are so productive. No +Doing everything in house -- that's why startups are so productive. No time is wasted on waiting for someone else. -If something does not work, or needs to be changed, there is someone in +If something does not work or needs to be changed, there is someone on the team who can do it right away. One very prominent example who follows this strategy is Elon Musk. @@ -252,28 +231,28 @@ Tesla's Gigafactories are designed to get raw materials in on one side and spit out cars on the other. Why do you think Tesla is building Gigafactories that cost a lot of money? -Why is SpaceX building its one space engines? Clearly there are other, -older, companies who could do that for them. +Why is SpaceX building its own space engines? Clearly, there are other, +older companies who could do that for them. Why is Elon building tunnel boring machines at his new boring company? -At first glance this makes no sense! +At first glance, this makes no sense! -#### How you really can be agile +#### How You Really Can Be Agile -If you look closer it all comes down to control and knowledge. You, your +If you look closer, it all comes down to control and knowledge. You, your team, your company, needs to do as much as possible on your own. Self-reliance is king. -Build up your knowledge and therefore the teams knowledge. When you have +Build up your knowledge and therefore the team's knowledge. When you have the ability to do everything yourself, you are in full control. -You can build electric cars, rocket engines or bore tunnels. 
+You can build electric cars, build rocket engines, or bore tunnels. -Don't largely rely on others and be confident to just do stuff on your +Don't largely rely on others, and be confident to just do stuff on your own. -Dream big and JUST DO IT! +Dream big, and JUST DO IT! PS. Don't get me wrong. You can still outsource work. Just do it in a smart way by outsourcing small independent parts. @@ -282,37 +261,37 @@ smart way by outsourcing small independent parts. #### Scrum -There's an interesting Scrum Medium publication with a lot of details +There's an interesting Medium article with a lot of details about Scrum: -Also this scrum guide webpage has good infos about Scrum: +Also, this Scrum guide webpage has good info: #### OKR -I personally love OKR, and have been using it for years. Especially for smaller +I personally love OKR and have been using it for years. Especially for smaller teams, OKR is great. You don't have a lot of overhead and get work done. It helps you stay focused and look at the bigger picture. -I recommend to do a sync meeting every Monday. There you talk about what +I recommend doing a sync meeting every Monday. There you talk about what happened last week and what you are going to work on this week. -I talked about this in this Podcast: +I talked about this in this podcast: -This is also this awesome 1,5 hours startup guide from Google: - I really love this video, I rewatched it +There is also this awesome 1,5-hour startup guide from Google: + I really love this video; I rewatched it multiple times. ### Software Engineering Culture The software engineering and development culture is super important. How -does a company handle product development with hundreds of developers. +does a company handle product development with hundreds of developers? Check out this podcast: -| Podcast Episode: #070 Engineering Culture At Spotify +| Podcast episode: #070 Engineering Culture At Spotify |------------------ -|In this podcast we look at the engineering culture at Spotify, my favorite music streaming service. The process behind the development of Spotify is really awesome. +|In this podcast, we look at the engineering culture at Spotify, my favorite music streaming service. The process behind the development of Spotify is really awesome. |[Watch on YouTube](https://youtu.be/1asVrsUDbp0) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/070-The-Engineering-Culture-At-Spotify-e45ipa)| @@ -322,12 +301,12 @@ Check out this podcast: -Learn how a Computer Works +Learn How a Computer Works -------------------------- ### CPU,RAM,GPU,HDD -### Differences between PCs and Servers +### Differences Between PCs and Servers I talked about computer hardware and GPU processing in this podcast: @@ -337,35 +316,35 @@ Data Network Transmission ### OSI Model -The OSI Model describes how data is flowing through the network. It +The OSI Model describes how data flows through the network. It consists of layers starting from physical layers, basically how the data is transmitted over the line or optic fiber. -Check out this article for a deeper understanding of the layers and processes outline in a OSI model: +Check out this article for a deeper understanding of the layers and processes outlined in the OSI model: The Wikipedia page is also very good: -###### Which protocol lives on which layer? +###### Which Protocol Lives on Which Layer? -Check out this network protocol map. Unfortunately it is really hard to +Check out this network protocol map. 
Unfortunately, it is really hard to find it theses days: ### IP Subnetting -Check out this IP Address and Subnet guide from Cisco: +Check out this IP address and subnet guide from Cisco: -A calculator for Subnets: +A calculator for subnets: -### Switch, Layer 3 Switch +### Switch, Layer-3 Switch -For an introduction to how Ethernet went from broadcasts, to bridges, to -Ethernet MAC switching, to Ethernet & IP (layer 3) switching, to -software defined networking, and to programmable data plane that can +For an introduction to how ethernet went from broadcasts, to bridges, to +Ethernet MAC switching, to ethernet & IP (layer 3) switching, to +software-defined networking, and to programmable data planes that can switch on any packet field and perform complex packet processing, see this video: @@ -373,7 +352,7 @@ this video: ### Firewalls -I talked about Network Infrastructure and Techniques in this podcast: +I talked about network infrastructure and techniques in this podcast: Security and Privacy @@ -381,32 +360,38 @@ Security and Privacy ### SSL Public and Private Key Certificates -### What is a certificate authority + + + + + + + ### JSON Web Tokens Link to the Wiki page: -### GDPR regulations +### GDPR Regulations The EU created the GDPR \"General Data Protection Regulation\" to -protect your personal data like: Your name, age, where you live and so +protect your personal data like: name, age, address, and so on. It's huge and quite complicated. If you want to do online business in -the EU you need to apply these rules. The GDPR is applicable since May -25th 2018. So, if you haven't looked into it, now is the time. +the EU, you need to apply these rules. The GDPR is applicable since May +25th, 2018. So, if you haven't looked into it, now is the time. -The penalties can be crazy high if you do mistakes here. +The penalties can be crazy high if you make mistakes here. Check out the full GDPR regulation here: -By the way, if you do profiling or in general analyse big data, look -into it. There are some important regulations. Unfortunately. +By the way, if you do profiling or analyse big data in general, look +into it. There are some important regulations, unfortunately. I spend months with GDPR compliance. Super fun. Not! Hahaha -### Privacy by design +### Privacy by Design When should you look into privacy regulations and solutions? @@ -414,31 +399,31 @@ Creating the product or service first and then bolting on the privacy is a bad choice. The best way is to start implementing privacy right away in the engineering phase. -This is called privacy by design. Privacy as an integral part of your +This is called privacy by design. Privacy is an integral part of your business, not just something optional. -Check out the Wikipedia page to get a feeling of the important +Check out the Wikipedia page to get a feeling for the important principles: Linux ----- -Linux is very important to learn, at least the basics. Most Big Data -tools or NoSQL databases are running on Linux. +Linux is very important to learn, at least the basics. Most big-data +tools or NoSQL databases run on Linux. -From time to time you need to modify stuff through the operation system. -Especially if you run an infrastructure as a service solution like -Cloudera CDH, Hortonworks or a MapR Hadoop distribution. +From time to time, you need to modify stuff through the operating system, +especially if you run an infrastructure as a service solution like +Cloudera CDH, Hortonworks, or a MapR Hadoop distribution. 
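Before the OS basics below, here is a small taste of the kind of OS-level housekeeping this chapter keeps coming back to (renaming, moving, and compacting log files). It is only a minimal sketch in Python; a plain shell script or a cron job would do the same job, and the directory and age threshold are made up:

```python
# Minimal sketch: compress log files older than 7 days in a made-up directory.
# The path and the age threshold are placeholders, not recommendations.
import gzip
import os
import shutil
import time

LOG_DIR = "/var/log/myapp"          # assumed directory
MAX_AGE_SECONDS = 7 * 24 * 3600     # 7 days

for name in os.listdir(LOG_DIR):
    path = os.path.join(LOG_DIR, name)
    if not name.endswith(".log"):
        continue
    if time.time() - os.path.getmtime(path) < MAX_AGE_SECONDS:
        continue
    # Write a compressed copy next to the original, then remove the original.
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
```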
### OS Basics -Show all historic commands +Show all historic commands: history | grep docker ### Shell scripting -Ah, creating shell scripts in 2019? Believe it or not scripting in the +Ah, creating shell scripts in 2019? Believe it or not, scripting in the command line is still important. Start a process, automatically rename, move or do a quick compaction of @@ -447,73 +432,73 @@ log files. It still makes a lot of sense. Check out this cheat sheet to get started with scripting in Linux: -There's also this Medium article with a super simple example for +There's also this Medium article with a super-simple example for beginners: -### Cron jobs +### Cron Jobs Cron jobs are super important to automate simple processes or jobs in -Linux. You need this here and there I promise. Check out this three +Linux. You need this here and there, I promise. Check out these three guides: -And of course Wikipedia, which is surprisingly good: +And, of course, Wikipedia, which is surprisingly good: Pro tip: Don't forget to end your cron files with an empty line or a comment, otherwise it will not work. -### Packet management +### Packet Management -Linux Tips are the second part of this podcast: +Linux tips are the second part of this podcast: Docker ------ -### What is docker and what do you use it for +### What is Docker, and What Do You Use It for? Have you played around with Docker yet? If you're a data science learner -or a data scientist you need to check it out! +or a data scientist, you need to check it out! It's awesome because it simplifies the way you can set up development -environments for data science. If you want to set up a dev environment +environments for data science. If you want to set up a dev environment, you usually have to install a lot of packages and tools. #### Don't Mess Up Your System -What this does is you basically mess up your operating system. If you're +What this does is basically mess up your operating system. If you're just starting out, you don't know which packages you need to install. You don't know which tools you need to install. -If you want to for instance start with Jupyter notebooks you need to -install that on your PC somehow. Or you need to start installing tools +If you want to, for instance, start with Jupyter Notebooks, you need to +install that on your PC somehow. Or, you need to start installing tools like PyCharm or Anaconda. -All that gets added to your system and so you mess up your system more +All that gets added to your system, and so you mess up your system more and more and more. What Docker brings you, especially if you're on a Mac -or a Linux system is simplicity. +or a Linux system, is simplicity. #### Preconfigured Images -Because it is so easy to install on those systems. Another cool thing -about docker images is you can just search them in the Docker store, -download them and install them on your system. +Because it is so easy to install on those systems, another cool thing +about Docker images is you can just search them in the Docker store, +download them, and install them on your system. -Running them in a completely pre-configured environment. You don't need -to think about stuff, you go to the Docker library you search for Deep +Running them in a completely pre-configured environment, you don't need +to think about stuff. You go to the Docker library, and you search for Deep Learning, GPU and Python. You get a list of images you can download. You download one, start it -up, you go to the browser hit up the URL and just start coding. 
+up, go to the browser and hit up the URL, and just start coding.

Start doing the work. The only other thing you need to do is bind some
-drives to that instance so you can exchange files. And then that's it!
+drives to that instance so you can exchange files. And that's it!

There is no way that you can crash or mess up your system. It's all
encapsulated into Docker. Why this works is because Docker has native
@@ -522,40 +507,41 @@ access to your hardware.

#### Take It With You

It's not a completely virtualized environment like a VirtualBox. An
-image has the upside that you can take it wherever you want. So if
-you're on your PC at home use that there.
+image has the upside that you can take it wherever you want. So, if
+you're on your PC at home, use that there.

-Make a quick build, take the image and go somewhere else. Install the
-image which is usually quite fast and just use it like you're at home.
+Make a quick build, take the image, and go somewhere else. Install the
+image, which is usually quite fast, and just use it like you're at home.

It's that awesome!

### Kubernetes Container Deployment

-I am getting into Docker a lot more myself. For a bit different reasons.
+I am getting into Docker a lot more myself, for somewhat different reasons.

-What I'm looking for is using Docker with Kubernetes. With Kubernetes
+What I'm looking for is using Docker with Kubernetes. With Kubernetes,
you can automate the whole container deployment process.

-The idea with is that you have a cluster of machines. Lets say you have
-a 10 server cluster and you run Kubernetes on it.
+The idea is that you have a cluster of machines. Let's say you have
+a 10-server cluster and you run Kubernetes on it.

-Kubernetes lets you spin up Docker containers on-demand to execute
-tasks. You can set up how much resources like CPU, RAM, Network, Docker
-container can use.
+Kubernetes lets you spin up Docker containers on demand to execute
+tasks. You can set up how many resources, like CPU, RAM, and network, your
+Docker container can use.

-You can basically spin up containers, on the cluster on demand. When
-ever you need to do a analytics task.
+You can basically spin up containers on the cluster on demand, whenever
+you need to do an analytics task.

-Perfect for Data Science.
+That's perfect for data science.

-### How to create, start, stop a Container
-### Docker micro services?
+### How to Create, Start, Stop a Container
+
+### Docker Micro-Services?

### Kubernetes

-### Why and how to do Docker container orchestration
+### Why and How to Do Docker Container Orchestration

Podcast about how data science learners use Docker (for data
scientists):

@@ -575,16 +561,15 @@ Stop a running container:

    docker stop

-List all running containers
+List all running containers:

    docker ps

-List all containers including stopped ones
+List all containers including stopped ones:

    docker ps -a

-Inspect the container configuration. For instance network settings and
-so on:
+Inspect the container configuration (e.g.
network settings, etc.):

    docker inspect CONTAINER

@@ -596,15 +581,15 @@ Create a new network:

    docker network create NETWORK --driver bridge

-Connect a running container to a network
+Connect a running container to a network:

    docker network connect NETWORK CONTAINER

-Disconnect a running container from a network
+Disconnect a running container from a network:

    docker network disconnect NETWORK CONTAINER

-Remove a network
+Remove a network:

    docker network rm NETWORK

@@ -612,26 +597,29 @@ Remove a network

The Cloud
---------

-### IaaS vs PaaS vs SaaS
+### IaaS vs. PaaS vs. SaaS

-Check out this Podcast it will help you understand where's the
-difference and how to decide on what you are going to use.
+Check out this podcast. It will help you understand the
+difference and how to decide what to use.

-| Podcast Episode: #082 Reading Tweets With Apache Nifi & IaaS vs PaaS vs SaaS
+| Podcast episode: #082 Reading Tweets With Apache Nifi & IaaS vs PaaS vs SaaS
|------------------|
-|In this episode we are talking about the differences between infrastructure as a service, platform as a service and application as a service. Then we install the Nifi docker container and look into how we can extract the twitter data.
+|In this episode, we talk about the differences between infrastructure as a service, platform as a service, and application as a service. Then, we install the Nifi Docker container and look into how we can extract the Twitter data.
| [Watch on YouTube](https://youtu.be/pWuT4UAocUY) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/082-Reading-Tweets-With-Apache-Nifi--IaaS-vs-PaaS-vs-SaaS-e45j50)|

### AWS, Azure, IBM, Google

-Each of these have their own answer to IaaS, Paas and SaaS. Pricing and
-pricing models vary greatly between each provider. Likewise each
+Each of these has its own answer to IaaS, PaaS, and SaaS. Pricing and
+pricing models vary greatly between each provider. Likewise, each
provider's service may have limitations and strengths.

#### AWS

-[Full list of AWS services](https://www.amazonaws.cn/en/products/).
+Here is the [full list of AWS services](https://www.amazonaws.cn/en/products/).
Studying for the [AWS Certified Cloud Practitioner](https://aws.amazon.com/certification/certified-cloud-practitioner/?ch=cta&cta=header&p=2) and/or [AWS Certified Solutions Architect](https://aws.amazon.com/certification/certified-solutions-architect-associate/?ch=sec&sec=rmg&d=1) exams can be helpful to quickly gain an understanding of all these services.
+Here are links for free digital training for the [AWS Certified Cloud Practitioner](https://explore.skillbuilder.aws/learn/public/learning_plan/view/82/cloud-foundations-learning-plan) and [AWS Certified Solutions Architect Associate](https://explore.skillbuilder.aws/learn/public/learning_plan/view/78/architect-learning-plan).
+
+Here is a free 17-hour [Data Analytics Learning plan](https://explore.skillbuilder.aws/learn/public/learning_plan/view/97/data-analytics-learning-plan) for AWS's [Analytics](https://aws.amazon.com/big-data/datalakes-and-analytics/?nc2=h_ql_prod_an)/Data Engineering services.

#### Azure

[Full list of Azure services](https://azure.microsoft.com/en-us/services/).

@@ -641,23 +629,21 @@ provider's service may have limitations and strengths.

#### Google

-Google's offerings referred to as Google Cloud Platform provides wide
-variety of services that is ever evolving. [List of GCP services with
-brief
-description](https://github.com/gregsramblings/google-cloud-4-words). In
-recent years documentation and tutorials have com a long way to help
+Google Cloud Platform offers a wide, ever-evolving variety of services.
+[List of GCP services with brief description](https://github.com/gregsramblings/google-cloud-4-words). In
+recent years, documentation and tutorials have come a long way to help
[getting started with GCP](https://cloud.google.com/gcp/getting-started/). You can start with
-a free account but to use many of the services you will need to turn on
-billing. Once you do enable billing always remember to turn off services
+a free account, but to use many of the services, you will need to turn on
+billing. Once you do enable billing, always remember to turn off services
that you have spun up for learning purposes. It is also a good idea to
turn on billing limits and alerts.

-### Cloud vs On-Premises
+### Cloud vs. On-Premises

-| Podcast Episode: #076 Cloud vs On-Premise
+| Podcast episode: #076 Cloud vs. On-Premise
|------------------|
-|How do you choose between Cloud vs On-Premises, pros and cons and what you have to think about. Because there are good reasons to not go cloud. Also thoughts on how to choose between the cloud providers by just comparing instance prices. Otherwise the comparison will drive you insane. My suggestion: Basically use them as IaaS and something like Cloudera as PaaS. Then build your solution on top of that.
+|How to choose between cloud and on-premises, the pros and cons, and what you have to think about. There are good reasons to not go cloud. Also, thoughts on how to choose between the cloud providers by just comparing instance prices. Otherwise, the comparison will drive you insane. My suggestion: Basically use them as IaaS and something like Cloudera as PaaS. Then build your solution on top of that.
| [Watch on YouTube](https://youtu.be/BAzj0yGcrnE) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/076-Cloud-vs-On-Premise-How-To-Decide-e45ivk)|

@@ -673,15 +659,156 @@ interesting example for this is Google Anthos:

-Security Zone Design
--------------------

-### How to secure a multi layered application

+# Data Scientists and Machine Learning
+
+Data scientists aren't like other scientists.
+
+Data scientists do not wear white coats or work in high-tech labs full
+of science fiction movie equipment. They work in offices just like you
+and me.
+
+What sets them apart from most of us is that they are math experts. They
+use linear algebra and multivariable calculus to create new insight from
+existing data.
+
+What exactly does this insight look like?
+
+Here's an example:
+
+An industrial company produces a lot of products that need to be tested
+before shipping.
+
+Usually such tests take a lot of time because there are hundreds of
+things to be tested, all to make sure that your product is not broken.
+
+Wouldn't it be great to know early if a test fails ten steps down the
+line? If you knew that, you could skip the other tests and just trash the
+product or repair it.
+
+That's exactly where a data scientist can help you, big-time. This field
+is called predictive analytics, and the technique of choice is machine
+learning.
+
+Machine what? Learning?
+
+Yes, machine learning. It works like this:
+
+You feed an algorithm with measurement data. It generates a model and
+optimises it based on the data you fed it with. That model basically
+represents a pattern of what your data looks like. You show that model
+new data, and the model will tell you if the new data still matches the
+data you have trained it with.
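To make that a bit more concrete, here is a tiny, made-up sketch in Python with scikit-learn: train on historic test measurements, then ask the model whether a new product is likely to fail the final test. The features, numbers, and labels are all invented.

```python
# Toy illustration: predict whether a product will fail the final test,
# based on two made-up measurements (e.g. temperature, vibration).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historic measurements and the known outcome of the final test:
# 1 = failed, 0 = passed.
X_train = np.array([[70, 0.2], [85, 0.9], [66, 0.1], [90, 1.1], [72, 0.3], [88, 0.8]])
y_train = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

# A new product coming down the line: predict before running all tests.
new_measurement = np.array([[87, 0.95]])
print(model.predict(new_measurement))        # e.g. [1] -> likely to fail
print(model.predict_proba(new_measurement))  # probability estimate
```

The real work, as the next paragraphs point out, is rarely the fit and predict calls themselves but the pre-processing and post-processing around them.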
This technique can also be used for
+predicting machine failure in advance. Of course,
+the whole process is not that simple.
+
+The actual process of training and applying a model is not that hard. A
+lot of the data scientist's work is figuring out how to pre-process
+the data that gets fed to the algorithms.
+
+In order to train an algorithm, you need useful data. If you use just any data
+for the training, the resulting model will be very unreliable.
+
+An unreliable model for predicting machine failure would tell you that
+your machine is damaged even if it is not. Or even worse: it would tell
+you the machine is OK even when there is a malfunction.
+
+Model outputs are very abstract. You also need to post-process the model
+outputs to get the outputs you desire.
+
+![The Machine Learning Pipeline](/images/Machine-Learning-Pipeline.jpg)
+
+
+## Machine Learning Workflow
+
+![The Machine Learning Workflow](/images/Machine-Learning-Workflow.jpg)
+
+Data scientists and data engineers -- how does that all fit together?
+
+You have to look at the data science process: how things are created, how data
+science is done, and how machine learning is
+done.
+
+The machine learning process shows that you start with a training phase, a phase where you basically train the algorithm to create the right output.
+
+In the learning phase you have the input parameters, basically the configuration of the model, and you have the input data.
+
+What you're doing is training the algorithm. During training, the algorithm modifies the training
+parameters. It also modifies the used data, and then you get an output.
+
+Once you get an output, you evaluate it: is that output okay, or is it not the desired output?
+
+If the output is not what you were looking for, you continue with the training phase.
+
+You retrain the model hundreds, thousands, or hundreds of thousands of times. Of course, all this is done automatically.
+
+Once you are satisfied with the output, you put the model into production. In production it is no longer fed with training
+data; it's fed with live data.
+
+It's evaluating the input data live and putting out live results.
+
+So, you went from training to production -- and then what?
+
+What you do is monitor the output. If the output keeps making sense, all good!
+
+If the output of the model changes and it's no longer what you expected, it means the model doesn't work anymore.
+
+You need to trigger a retraining of the model. It basically gets trained again.
+
+Once you are again satisfied with the output, you put it into production again. It replaces the one in production.
+
+This is the overall process of how machine learning works. It's how the learning part of data science works. (A small sketch of this loop follows at the end of this section.)
+
+
+## Machine Learning Model and Data
+
+![The Machine Learning Model](/images/Machine-Learning-Model.jpg)
+
+Now that's all very nice.
+
+When you look at it, you have two very important places where you have data.
+
+In the training phase you have two types of data:
+data that you use for the training, and data that basically configures the model -- the hyperparameter configuration.
+
+Once you're in production, you have the live data that is streaming in. Data that is coming in from an app, from
+an IoT device, logs, or whatever.
+
+A data catalog is also important. It explains which features are available and how different data sets are labeled.
+
+All different types of data.
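To tie the workflow above together (training, evaluation, production, monitoring, retraining), here is the small sketch referenced earlier. Every function in it is a hypothetical placeholder, not a real library call; it only mirrors the control flow, under the assumption that "good enough" can be expressed as a single metric threshold.

```python
# Sketch of the training / production / monitoring loop described above.
# All functions are stand-ins; a real system would plug in actual code.

def train(config, training_data):
    return {"config": config, "trained_on": len(training_data)}  # stand-in for a model

def evaluate(model, validation_data):
    return 0.9  # stand-in metric; a real setup would compute accuracy, RMSE, ...

def predict(model, live_record):
    return "ok"  # stand-in prediction

config = {"learning_rate": 0.01}   # hyperparameter configuration
training_data = [1, 2, 3]          # training data (placeholder)

# Training phase: retrain until the output is good enough.
model = train(config, training_data)
while evaluate(model, training_data) < 0.8:
    model = train(config, training_data)

# Production phase: the model now sees live data instead of training data.
for live_record in [4, 5, 6]:
    result = predict(model, live_record)
    # Monitoring: if the outputs stop making sense, trigger a retraining.
    if result not in ("ok", "failure"):
        model = train(config, training_data)
```

Everything that feeds this loop -- configurations, training data, live data -- is what the engineering part described next is about.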
Now, here comes the engineering part. + +The Data Engineers part, is making this data available. Available to the data scientist and the machine learning process. + +So when you look at the model, on the left side you have your hyper parameter configuration. You need to store and manage these configurations somehow. + +Then you have the actual training data. + +There's a lot going on with the training data: + +Where does it come from? Who owns it? Which is basically data governance. + +What's the lineage? Have you modified this data? What did you do, what was the basis, the raw data? + +You need to access all this data somehow. In training and in production. + +In production you need to have access to the live data. + +All this is the data engineers job. Making the data available. + +First an architect needs to build the platform. This can also be a good data engineer. + +Then the data engineer needs to build the pipelines. How is the data coming in and how is the platform +connecting to other systems. + +How is that data then put into the storage. Is there a pre processing for the algorithms necessary? He'll do it. + +Once the data and the systems are available, it's time for the machine learning part. + +It is ready for processing. Basically ready for the data scientist. -(UI in different zone then SQL DB) +Once the analytics is done the data engineer needs to build pipelines to make it then accessible again. For instance for other analytics processes, for APIs, for front ends and so on. -### Cluster security with Kerberos +All in all, the data engineer's part is a computer science part. -I talked about security zone design and lambda architecture in this -podcast: - +That's why I love it so much :) diff --git a/docs/03-AdvancedSkills.md b/docs/03-AdvancedSkills.md index 0298ad0..55efc45 100644 --- a/docs/03-AdvancedSkills.md +++ b/docs/03-AdvancedSkills.md @@ -1,9 +1,3 @@ ---- -sidebar_label: Advanced Skills -title: ' ' ---- - - Advanced Data Engineering Skills ================================ @@ -20,13 +14,6 @@ Advanced Data Engineering Skills - [Scaling Up](03-AdvancedSkills.md#scaling-up) - [Scaling Out](03-AdvancedSkills.md#scaling-out) - [When not to Do Big Data](03-AdvancedSkills.md#please-dont-go-big-data) -- [Hadoop Platforms](03-AdvancedSkills.md#hadoop-platforms) - - [What is Hadoop](03-AdvancedSkills.md#what-is-hadoop) - - [What makes Hadoop so popular](03-AdvancedSkills.md#what-makes-hadoop-so-popular) - - [Hadoop Ecosystem Components](03-AdvancedSkills.md#hadoop-ecosystem-components) - - [Hadoop is Everywhere?](03-AdvancedSkills.md#hadoop-is-everywhere) - - [Should You Learn Hadoop?](03-AdvancedSkills.md#should-you-learn-hadoop) - - [How to Select Hadoop Cluster Hardware](03-AdvancedSkills.md#how-to-select-hadoop-cluster-hardware) - [Connect](03-AdvancedSkills.md#connect) - [REST APIs](03-AdvancedSkills.md#rest-apis) - [API Design](03-AdvancedSkills.md#api-design) @@ -85,27 +72,30 @@ Advanced Data Engineering Skills - [Apache Drill](03-AdvancedSkills.md#apache-drill) - [StreamSets](03-AdvancedSkills.md#streamsets) - [Store](03-AdvancedSkills.md#store) - - [Data Warehouse vs Data Lake](03-AdvancedSkills.md#data-warehouse-vs-data-lake) - - [SQL Databases](03-AdvancedSkills.md#sql-databases) - - [PostgreSQL DB](03-AdvancedSkills.md#postgresql-db) - - [Database Design](03-AdvancedSkills.md#database-design) - - [SQL Queries](03-AdvancedSkills.md#sql-queries) - - [Stored Procedures](03-AdvancedSkills.md#stored-procedures) - - [ODBC/JDBC Server 
Connections](03-AdvancedSkills.md#odbc-jdbc-server-connections) - - [NoSQL Stores](03-AdvancedSkills.md#nosql-stores) - - [HBase KeyValue Store](03-AdvancedSkills.md#keyvalue-stores-hbase) - - [HDFS Document Store](03-AdvancedSkills.md#document-stores-hdfs) - - [MongoDB Document Store](03-AdvancedSkills.md#document-stores-mongodb) - - [Elasticsearch Document Store](03-AdvancedSkills.md#Elasticsearch-search-engine-and-document-store) - - [Graph Databases (Neo4j)](03-AdvancedSkills.md#graph-db-neo4j) - - [Impala](03-AdvancedSkills.md#impala) - - [Kudu](03-AdvancedSkills.md#kudu) - - [Apache Druid](03-AdvancedSkills.md#apache-druid) - - [InfluxDB Time Series Database](03-AdvancedSkills.md#influxdb-time-series-database) - - [Greenplum MPP Database](03-AdvancedSkills.md#mpp-databases-greenplum) - - [NoSQL Data Warehouses](03-AdvancedSkills.md#nosql-data-warehouses) - - [Hive Warehouse](03-AdvancedSkills.md#hive-warehouse) - - [Impala](03-AdvancedSkills.md#impala) + - [Analytical Data Stores](03-AdvancedSkills.md#analytical-data-stores) + - [Data Warehouse vs Data Lake](03-AdvancedSkills.md#data-warehouse-vs-data-lake) + - [Snowflake and dbt](03-AdvancedSkills.md#snowflake-and-dbt) + - [Transactional Data Stores](03-AdvancedSkills.md#transactional-data-stores) + - [SQL Databases](03-AdvancedSkills.md#sql-databases) + - [PostgreSQL DB](03-AdvancedSkills.md#postgresql-db) + - [Database Design](03-AdvancedSkills.md#database-design) + - [SQL Queries](03-AdvancedSkills.md#sql-queries) + - [Stored Procedures](03-AdvancedSkills.md#stored-procedures) + - [ODBC/JDBC Server Connections](03-AdvancedSkills.md#odbc-jdbc-server-connections) + - [NoSQL Stores](03-AdvancedSkills.md#nosql-stores) + - [HBase KeyValue Store](03-AdvancedSkills.md#keyvalue-stores-hbase) + - [HDFS Document Store](03-AdvancedSkills.md#document-stores-hdfs) + - [MongoDB Document Store](03-AdvancedSkills.md#document-stores-mongodb) + - [Elasticsearch Document Store](03-AdvancedSkills.md#Elasticsearch-search-engine-and-document-store) + - [Graph Databases (Neo4j)](03-AdvancedSkills.md#graph-db-neo4j) + - [Impala](03-AdvancedSkills.md#impala) + - [Kudu](03-AdvancedSkills.md#kudu) + - [Apache Druid](03-AdvancedSkills.md#apache-druid) + - [InfluxDB Time Series Database](03-AdvancedSkills.md#influxdb-time-series-database) + - [Greenplum MPP Database](03-AdvancedSkills.md#mpp-databases-greenplum) + - [NoSQL Data Warehouses](03-AdvancedSkills.md#nosql-data-warehouses) + - [Hive Warehouse](03-AdvancedSkills.md#hive-warehouse) + - [Impala](03-AdvancedSkills.md#impala) - [Visualize](03-AdvancedSkills.md#visualize) - [Android and IOS](03-AdvancedSkills.md#android-and-ios) - [API Design for Mobile Apps](03-AdvancedSkills.md#how-to-design-apis-for-mobile-apps) @@ -346,150 +336,6 @@ If you don't need it it's making absolutely no sense at all! On the other side: If you really need big data tools they will save your ass :) -## Hadoop Platforms - -When people talk about big data, one of the first things come to mind is -Hadoop. Google's search for Hadoop returns about 28 million results. - -It seems like you need Hadoop to do big data. Today I am going to shed -light onto why Hadoop is so trendy. - -You will see that Hadoop has evolved from a platform into an ecosystem. -Its design allows a lot of Apache projects and 3rd party tools to -benefit from Hadoop. - -I will conclude with my opinion on, if you need to learn Hadoop and if -Hadoop is the right technology for everybody. 
- -### What is Hadoop - -Hadoop is a platform for distributed storing and analyzing of very large -data sets. - -Hadoop has four main modules: Hadoop common, HDFS, MapReduce and YARN. -The way these modules are woven together is what makes Hadoop so -successful. - -The Hadoop common libraries and functions are working in the background. -That's why I will not go further into them. They are mainly there to -support Hadoop's modules. - -| Podcast Episode: #060 What Is Hadoop And Is Hadoop Still Relevant In 2019? -|------------------| -|An introduction into Hadoop HDFS, YARN and MapReduce. Yes, Hadoop is still relevant in 2019 even if you look into serverless tools. -| [Watch on YouTube](https://youtu.be/8AWaht3YQgo) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/060-What-Is-Hadoop-And-Is-Hadoop-Still-Relevant-In-2019-e45ijp)| - - -### What makes Hadoop so popular? - -Storing and analyzing data as large as you want is nice. But what makes -Hadoop so popular? - -Hadoop's core functionality is the driver of Hadoop's adoption. Many -Apache side projects use it's core functions. - -Because of all those side projects Hadoop has turned more into an -ecosystem. An ecosystem for storing and processing big data. - -To better visualize this eco system I have drawn you the following -graphic. It shows some projects of the Hadoop ecosystem who are closely -connected with the Hadoop. - -It is not a complete list. There are many more tools that even I don't -know. Maybe I am drawing a complete map in the future. - -![Hadoop Ecosystem Components](/images/Hadoop-Ecosystem.jpg) - -### Hadoop Ecosystem Components - -Remember my big data platform blueprint? The blueprint has four stages: -Ingest, store, analyse and display. - -Because of the Hadoop ecosystem the different tools in these stages can -work together perfectly. - -Here's an example: - -![Connections between tools](/images/Hadoop-Ecosystem-Connections.jpg) - -You use Apache Kafka to ingest data, and store it in the HDFS. You do -the analytics with Apache Spark and as a backend for the display you -store data in Apache HBase. - -To have a working system you also need YARN for resource management. You -also need Zookeeper, a configuration management service to use Kafka and -HBase - -As you can see in the picture below each project is closely connected to -the other. - -Spark for instance, can directly access Kafka to consume messages. It is -able to access HDFS for storing or processing stored data. - -It also can write into HBase to push analytics results to the front end. - -The cool thing of such ecosystem is that it is easy to build in new -functions. - -Want to store data from Kafka directly into HDFS without using Spark? - -No problem, there is a project for that. Apache Flume has interfaces for -Kafka and HDFS. - -It can act as an agent to consume messages from Kafka and store them -into HDFS. You even do not have to worry about Flume resource -management. - -Flume can use Hadoop's YARN resource manager out of the box. - -![Flume Integration](/images/Hadoop-Ecosystem-Connections-Flume.jpg) - -### Hadoop Is Everywhere? - -Although Hadoop is so popular it is not the silver bullet. It isn't the -tool that you should use for everything. - -Often times it does not make sense to deploy a Hadoop cluster, because -it can be overkill. Hadoop does not run on a single server. - -You basically need at least five servers, better six to run a small -cluster. Because of that. the initial platform costs are quite high. 
- -One option you have is to use a specialized systems like Cassandra, -MongoDB or other NoSQL DB's for storage. Or you move to Amazon and use -Amazon's Simple Storage Service, or S3. - -Guess what the tech behind S3 is. Yes, HDFS. That's why AWS also has the -equivalent to MapReduce named Elastic MapReduce. - -The great thing about S3 is that you can start very small. When your -system grows you don't have to worry about S3's server scaling. - -### Should you learn Hadoop? - -Yes, I definitely recommend you to get to know how Hadoop works and how -to use it. As I have shown you in this article, the ecosystem is quite -large. - -Many big data projects use Hadoop or can interface with it. That's why -it is generally a good idea to know as many big data technologies as -possible. - -Not in depth, but to the point that you know how they work and how you -can use them. Your main goal should be to be able to hit the ground -running when you join a big data project. - -Plus, most of the technologies are open source. You can try them out for -free. - -### How does a Hadoop System architecture look like - -### What tools are usually in a with Hadoop Cluster - -Yarn Zookeeper HDFS Oozie Flume Hive - -### How to select Hadoop Cluster Hardware - ## Connect @@ -515,8 +361,6 @@ example how to build an API #### Payload compression attacks -Zip Bombs -https://bomb.codes/ How to defend your Server with zip Bombs https://www.sitepoint.com/how-to-defend-your-website-with-zip-bombs/ @@ -1229,21 +1073,21 @@ it a lot easier to configure the resource management. ### Samza -![Link to Apache Samza Homepage](http://samza.apache.org/) +[Link to Apache Samza Homepage](http://samza.apache.org/) ### AWS Lambda -![Link to AWS Lambda Homepage](https://aws.amazon.com/lambda/) +[Link to AWS Lambda Homepage](https://aws.amazon.com/lambda/) ### Apache Flink -![Link to Apache Flink Homepage](https://flink.apache.org/) +[Link to Apache Flink Homepage](https://flink.apache.org/) ### Elasticsearch -![Link to Elatsicsearch Homepage](https://www.elastic.co/products/elastic-stack) +[Link to Elatsicsearch Homepage](https://www.elastic.co/products/elastic-stack) ### Graph DB @@ -1292,12 +1136,12 @@ https://neo4j.com/use-cases/ ### Apache Solr -![Link to Solr Homepage](https://lucene.apache.org/solr/) +[Link to Solr Homepage](https://solr.apache.org) ### Apache Drill -![Link to Apache Drill Homepage](https://drill.apache.org/) +[Link to Apache Drill Homepage](https://drill.apache.org) ### Apache Storm @@ -1310,22 +1154,73 @@ https://storm.apache.org/ - ## Store -### Data Warehouse vs Data Lake +### Analytical Data Stores + +#### Data Warehouse vs Data Lake | Podcast Episode: #055 Data Warehouse vs Data Lake |------------------| |On this podcast we are going to talk about data warehouses and data lakes? When do people use which? What are the pros and cons of both? Architecture examples for both Does it make sense to completely move to a data lake? | [Watch on YouTube](https://youtu.be/8gNQTrUUwMk) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/055-Data-Warehouse-vs-Data-Lake-e45iem)| -### SQL Databases +#### Snowflake and dbt + +![Snowlfake thumb](/images/03/Snowflake-dbt-thumbnail.jpeg) + +In the rapidly evolving landscape of data engineering, staying ahead means continuously expanding your skill set with the latest tools and technologies. Among the myriad of options available, dbt (data build tool) and Snowflake have emerged as indispensable for modern data engineering workflows. 
Understanding and leveraging these tools can significantly enhance your ability to manage and transform data, making you a more effective and valuable data engineer. Let's dive into why dbt and Snowflake should be at the top of your learning list and explore how the "dbt for Data Engineers" and "Snowflake for Data Engineers" courses from the Learn Data Engineering Academy can help you achieve mastery in these tools. + +##### The Power of Snowflake in Data Engineering + +Snowflake has revolutionized the data warehousing space with its cloud-native architecture. It offers a scalable, flexible, and highly performant platform that simplifies data management and analytics. Here’s why Snowflake is a critical skill for data engineers: -#### PostgreSQL DB +1. **Cloud-Native Flexibility:** Snowflake’s architecture allows you to scale resources up or down based on your needs, ensuring optimal performance and cost-efficiency. +2. **Unified Data Platform:** It unifies data silos, enabling seamless data sharing and collaboration across the organization. +3. **Integration Capabilities:** Snowflake integrates with various data tools and platforms, enhancing its versatility in different data workflows. +4. **Advanced Analytics:** With its robust support for data querying, transformation, and integration, Snowflake is ideal for complex analytical workloads. + +The "Snowflake for Data Engineers" course in my Learn Data Engineering Academy provides comprehensive training on Snowflake. From the basics of setting up your Snowflake environment to advanced data automation with Snowpipes, the course equips you with practical skills to leverage Snowflake effectively in your data projects. + +Learn more about the course [here](https://learndataengineering.com/p/snowflake-for-data-engineers). + +![Snowlfake thumb](/images/03/Snowflake-ui.jpeg) + + +##### Why dbt is a Game-Changer for Data Engineers + +dbt is a powerful transformation tool that allows data engineers to transform, test, and document data directly within their data warehouse using simple SQL. Unlike traditional ETL tools, dbt operates on the principle of ELT (Extract, Load, Transform), which aligns perfectly with modern cloud data warehousing paradigms. Here are a few reasons why dbt is a must-have skill for data engineers: + +1. **SQL-First Approach:** dbt allows you to write transformations in SQL, the lingua franca of data manipulation, making it accessible to a broad range of data professionals. +2. **Collaboration:** Teams can collaborate seamlessly, creating trusted datasets for reporting, machine learning, and operational workflows. +3. **Ease of Use:** With dbt, you can transform, test, and document your data with ease, streamlining the data pipeline process. +4. **Integration:** dbt integrates effortlessly with your existing data warehouse, such as Snowflake, making it a versatile addition to your toolkit. + +In my Learn Data Engineering Academy you find the perfect starting point for mastering dbt with the course "dbt for Data Engineers". The course covers everything from the basics of ELT processes to advanced features like continuous integration and deployment (CI/CD) pipelines. With hands-on training, you'll learn to create data pipelines, configure dbt materializations, test dbt models, and much more. + +Learn more about the course [here](https://learndataengineering.com/p/dbt-for-data-engineers). 
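The dbt models themselves are plain SQL files, so they are not shown here. To make the idea a little more concrete, here is a hedged sketch of what one transformation looks like when you hand-roll it against Snowflake from Python with the official connector; dbt's job, as described above and below, is to manage many such SELECT-based transformations for you, including dependencies, tests, and documentation. All identifiers and credentials are placeholders, not anything from the courses.

```python
# Hand-rolling a single transformation against Snowflake from Python.
# pip install snowflake-connector-python
# Every identifier below (account, user, warehouse, database, tables) is a placeholder.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="my_password",    # placeholder
    warehouse="COMPUTE_WH",    # placeholder
    database="ANALYTICS",      # placeholder
    schema="MARTS",            # placeholder
)

try:
    conn.cursor().execute(
        """
        CREATE OR REPLACE TABLE daily_revenue AS
        SELECT order_date, SUM(amount) AS revenue
        FROM raw.orders              -- made-up source table
        GROUP BY order_date
        """
    )
finally:
    conn.close()
```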
+ +![Snowlfake thumb](/images/03/dbt-ui.jpeg) + +##### dbt and Snowflake: A Winning Combination + +When used together, dbt and Snowflake offer a powerful combination for data engineering. Here’s why: + +1. **Seamless Integration:** dbt’s SQL-first transformation capabilities integrate perfectly with Snowflake’s scalable data warehousing, creating a streamlined ELT workflow. +2. **Efficiency:** Together, they enhance the efficiency of data transformation and analytics, reducing the time and effort required to prepare data for analysis. +3. **Scalability:** The combined power of dbt’s model management and Snowflake’s dynamic scaling ensures that your data pipelines can handle large and complex datasets with ease. +4. **Collaboration and Documentation:** dbt’s ability to document and test data transformations directly within Snowflake ensures that data workflows are transparent, reliable, and collaborative. +Get right into it with our Academy! + +By integrating Snowflake and dbt into your skill set, you position yourself at the forefront of data engineering innovation. These tools not only simplify and enhance your data workflows but also open up new possibilities for data transformation and analysis. + +### Transactional Data Stores +#### SQL Databases + +##### PostgreSQL DB Homepage: @@ -1335,17 +1230,17 @@ PostgreSQL vs MongoDB: -#### Database Design +##### Database Design -#### SQL Queries +##### SQL Queries -#### Stored Procedures +##### Stored Procedures -#### ODBC/JDBC Server Connections +##### ODBC/JDBC Server Connections -### NoSQL Stores +#### NoSQL Stores -#### KeyValue Stores (HBase) +##### KeyValue Stores (HBase) | Podcast Episode: #056 NoSQL Key Value Stores Explained with HBase @@ -1354,7 +1249,7 @@ PostgreSQL vs MongoDB: | [Watch on YouTube](https://youtu.be/67hIkbpzFc8) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/056-NoSQL-Key-Value-Stores-Explained-With-HBase-e45ifb)| -#### Document Store HDFS +##### Document Store HDFS The Hadoop distributed file system, or HDFS, allows you to store files in Hadoop. The difference between HDFS and other file systems like NTFS @@ -1412,7 +1307,7 @@ This mechanic of splitting a large file in blocks and distributing them over the servers is great for processing. See the MapReduce section for an example. -#### Document Store MongoDB +##### Document Store MongoDB | Podcast Episode: #093 What is MongoDB @@ -1459,7 +1354,7 @@ MongoDB vs Cassandra: -#### Elasticsearch Search Engine and Document Store +##### Elasticsearch Search Engine and Document Store Elasticsearch is not a DB but firstly a search engine that indexes JSON documents. @@ -1488,9 +1383,6 @@ JSON:\ Indexing basics:\ -How to query data with DSL language:\ - - How to do searches with search API:\ @@ -1519,13 +1411,13 @@ Google Trends Grafana vs Kibana:\ -#### Apache Impala +##### Apache Impala -![Apache Impala Homepage](https://impala.apache.org/) +[Apache Impala Homepage](https://impala.apache.org/) -#### Kudu +##### Kudu -#### Apache Druid +##### Apache Druid | Podcast Episode: Druid NoSQL DB and Analytics DB Introduction |------------------| @@ -1533,7 +1425,11 @@ Google Trends Grafana vs Kibana:\ |[Watch on YouTube](https://youtu.be/EiEIeBXSWjM) -#### InfluxDB Time Series Database +##### InfluxDB Time Series Database + +What is time-series data? 
+ + Key concepts: @@ -1553,23 +1449,23 @@ Performance Dashboard Spark and InfluxDB: Other alternatives for time series databases are: DalmatinerDB, -InfluxDB, Prometheus, Riak TS, OpenTSDB, KairosDB +QuestDB, Prometheus, Riak TS, OpenTSDB, KairosDB -#### MPP Databases (Greenplum) +##### MPP Databases (Greenplum) -#### Azure Cosmos DB +##### Azure Cosmos DB https://azure.microsoft.com/en-us/services/cosmos-db/ -#### Azure Table-Storage +##### Azure Table-Storage https://azure.microsoft.com/en-us/services/storage/tables/ -### NoSQL Data warehouse +#### NoSQL Data warehouse -#### Hive Warehouse +##### Hive Warehouse -#### Impala +##### Impala ## Visualize diff --git a/docs/04-HandsOnCourse.md b/docs/04-HandsOnCourse.md index 88a73e5..43589d7 100644 --- a/docs/04-HandsOnCourse.md +++ b/docs/04-HandsOnCourse.md @@ -1,111 +1,102 @@ ---- -sidebar_label: Hands On Course -title: ' ' ---- - - Data Engineering Course: Building A Data Platform ================================================= ## Contents -- [What We Want To Do](04-HandsOnCourse.md#what-we-want-to-do) -- [Thoughts On Choosing A Development Environment](04-HandsOnCourse.md#thoughts-on-choosing-a-development-environment) -- [A Look Into the Twitter API](04-HandsOnCourse.md#a-look-into-the-twiiter-api) -- [Ingesting Tweets with Apache Nifi](04-HandsOnCourse.md#ingesting-tweets-with-apache-nifi) -- [Writing from Nifi to Apache Kafka](04-HandsOnCourse.md#writing-from-nifi-to-kafka) -- [Apache Zeppelin](04-HandsOnCourse.md#apache-zeppelin) - - [Install and Ingest Kafka Topic](04-HandsOnCourse.md#install-and-ingest-kafka-topic) - - [Processing Messages with Spark & SparkSQL](04-HandsOnCourse.md#processing-messages-with-spark-and-sparksql) - - [Visualizing Data](04-HandsOnCourse.md#visualizing-data) -- [Switch Processing from Zeppelin to Spark](04-HandsOnCourse.md#switch-processing-from-zeppelin-to-spark) - -What We Want To Do ------------------- - -- Twitter data to predict best time to post using the hashtag - datascience or ai - -- Find top tweets for the day - -- Top users - -- Analyze sentiment and keywords - -Thoughts On Choosing A Development Environment ----------------------------------------------- - -For a local environment you need a good PC. I thought a bit about a -budget build around 1.000 Dollars or Euros. +- [Free Data Engineering Course with AWS, TDengine, Docker and Grafana](04-HandsOnCourse.md#free-data-engineering-course-with-aws-tdengine-docker-and-grafana) +- [Monitor your data in dbt & detect quality issues with Elementary](04-HandsOnCourse.md#monitor-your-data-in-dbt-and-detect-quality-issues-with-elementary) +- [Solving Engineers 4 Biggest Airflow Problems](04-HandsOnCourse.md#solving-engineers-4-biggest-airflow-problems) +- [The best alternative to Airlfow? Mage.ai](04-HandsOnCourse.md#the-best-alternative-to-airlfow?-mage.ai) -| Podcast Episode: #068 How to Build a Budget Data Science PC -|------------------| -|In this podcast we look into configuring a sub 1000 dollar PC for data engineering and machine learning. -| [Watch on YouTube](https://youtu.be/00NWR-II6ek) \ [Listen on Anchor](https://anchor.fm/andreaskayy/episodes/068-A-Budget-Data-Science-PC-Build-e45inh)| -A Look Into the Twitter API ---------------------------- +## Free Data Engineering Course with AWS TDengine Docker and Grafana -| Podcast Episode: #081 Twitter API Research -|------------------| -|In this podcast we were looking into how the Twitter API works and how you get access to it. 
-| [Watch on YouTube](https://youtu.be/UnAXKxeIlyg) +**Free hands-on course:** [Watch on YouTube](https://youtu.be/eoj-CnrR9jA) +In this detailed tutorial video, Andreas guides viewers through creating an end-to-end data pipeline using time series data. The project focuses on fetching weather data from a Weather API, processing it on AWS, storing it in TDengine (a time series database), and visualizing the data with Grafana. Here's a concise summary of what the video covers: -Ingesting Tweets with Apache Nifi ---------------------------------- +1. **Introduction and Setup:** + - The project is introduced along with a GitHub repository containing all necessary resources and a step-by-step guide. + - The pipeline architecture includes an IoT weather station, a Weather API, AWS for processing, TDengine for data storage, and Grafana for visualization. +2. **Project Components:** + - **Weather API:** Utilizes weatherapi.com to fetch weather data. + - **AWS Lambda:** Processes the data fetched from the Weather API. + - **TDengine:** Serves as the time series database to store processed data. It's highlighted for its performance and simplicity, especially for time series data. + - **Grafana:** Used for creating dashboards to visualize the time series data. +3. **Development and Deployment:** + - The local development environment setup includes Python, Docker, and VS Code. + - The tutorial covers the creation of a Docker image for the project and deploying it to AWS Elastic Container Registry (ECR). + - AWS Lambda is then configured to use the Docker image from ECR. + - AWS EventBridge is used to schedule the Lambda function to run at specified intervals. +4. **Time Series Data:** + - The importance of time series data and the benefits of using a time series database like TDengine over traditional relational databases are discussed. + - TDengine's features such as speed, scaling, data retention, and built-in functions for time series data are highlighted. +5. **Building the Pipeline:** + - Detailed instructions are provided for setting up each component of the pipeline: + - Fetching weather data from the Weather API. + - Processing and sending the data to TDengine using an AWS Lambda function. + - Visualizing the data with Grafana. + - Each step includes code snippets and configurations needed to implement the pipeline. +6. **Conclusion:** + - The video concludes with a demonstration of the completed pipeline, showing weather data visualizations in Grafana. + - Viewers are encouraged to replicate the project using the resources provided in the GitHub repository linked in the video description. -| Podcast Episode: #082 Reading Tweets With Apache Nifi & IaaS vs PaaS vs SaaS -|------------------| -|In this podcast we are trying to read Twitter Data with Nifi. -| [Watch on YouTube](https://youtu.be/pWuT4UAocUY) +This video provides a comprehensive guide to building a data pipeline with a focus on time series data, demonstrating the integration of various technologies and platforms to achieve an end-to-end solution. -| Podcast Episode: #085 Trying to read Tweets with Nifi Part 2 -|------------------| -|We are looking into the Big Data landscape chart and we are trying to read Twitter Data with Nifi again. 
-| [Watch on YouTube](https://youtu.be/OLUwXr8-gAk) +## Monitor your data in dbt and detect quality issues with Elementary +**Free hands-on tutorial:** [Watch on YouTube](https://youtu.be/6fnU91Q2gq0) -Writing from Nifi to Apache Kafka ---------------------------------- +In this comprehensive tutorial, Andreas delves into the integration of dbt (data build tool) with Elementary to enhance data monitoring and quality detection within Snowflake databases. The tutorial is structured to guide viewers through a hands-on experience, starting with an introduction to a sample project setup and the common challenges faced in monitoring dbt jobs. It then transitions into how Elementary can be utilized to address these challenges effectively. -| Podcast Episode: #086 How to Write from Nifi to Kafka Part 1 -|------------------| -|I’ve been working a lot on the cookbook, because it’s so much fun. I gotta tell you what I added. Then we are trying to write the Tweets from Apache Nifi into Kafka. Also talk about Kafka basics. -| [Watch on YouTube](https://youtu.be/F7Y-ygnyJMg) +Key learning points and tutorial structure include: -| Podcast Episode: #088 How to Write from Nifi to Kafka Part 2 -|------------------| -|In this podcast we finally figure out how to write to Kafka from Nifi. The problem was the network configuration of the Docker containers. -| [Watch on YouTube](https://youtu.be/pJbRnBQmoCs) +1. **Introduction to the Sample Project:** Andreas showcases a project setup involving Snowflake as the data warehouse, dbt for data modeling and testing, and a visualization tool for data analysis. This setup serves as the basis for the tutorial. +2. **Challenges in Monitoring dbt Jobs:** Common issues in monitoring dbt jobs are discussed, highlighting the limitations of the dbt interface in providing comprehensive monitoring capabilities. +3. **Introduction to Elementary:** Elementary is introduced as a dbt-native data observability tool designed to enhance the monitoring and analysis of dbt jobs. It offers both open-source and cloud versions, with the tutorial focusing on the cloud version. +4. **Setup Requirements:** The tutorial covers the necessary setup on both the Snowflake and dbt sides, including schema creation, user and role configuration in Snowflake, and modifications to the dbt project for integrating with Elementary. +5. **Elementary's User Interface and Features:** A thorough walkthrough of Elementary's interface is provided, showcasing its dashboard, test results, model runs, data catalog, and data lineage features. The tool's ability to automatically run additional tests, like anomaly detection and schema change detection, is also highlighted. +6. **Advantages of Using Elementary:** The presenter outlines several benefits of using Elementary, such as easy implementation, native test integration, clean and straightforward UI, and enhanced privacy due to data being stored within the user's data warehouse. +7. **Potential Drawbacks:** Some potential drawbacks are discussed, including the additional load on dbt job execution due to more models being run and limitations in dashboard customization. +8. **Summary and Verdict:** The tutorial concludes with a summary of the key features and benefits of using Elementary with dbt, emphasizing its value in improving data quality monitoring and detection. 
+Overall, viewers are guided through setting up and utilizing Elementary for dbt data monitoring, gaining insights into its capabilities, setup process, and the practical benefits it offers for data quality assurance.
 
 
-Apache Zeppelin
---------------
 
-### Install and Ingest Kafka Topic
+## Solving Engineers' 4 Biggest Airflow Problems
 
-Start the container:
+**Free hands-on tutorial:** [Watch on YouTube](https://youtu.be/b9bMNEh8bes)
 
+In this informative video, Andreas discusses the four major challenges engineers face when working with Apache Airflow and introduces Astronomer, a managed Airflow service that addresses these issues effectively. Astronomer is highlighted as a solution that simplifies Airflow deployment and management, making it easier for engineers to develop, deploy, and monitor their data pipelines. Here's a summary of the key points discussed for each challenge and how Astronomer provides solutions:
 
-    docker run -d -p 8081:8080 --rm \
-    -v /Users/xxxx/Documents/DockerFiles/logs:/logs \
-    -v /Users/xxxx/Documents/DockerFiles/Notebooks:/notebook \
-    -e ZEPPELIN_LOG_DIR='/logs' \
-    -e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
-    --network app-tier --name zeppelin apache/zeppelin:0.7.3
+1. **Managing Airflow Deployments:**
+    - **Challenge:** Setting up and maintaining Airflow deployments is complex and time-consuming, involving configuring cloud instances, managing resources, scaling, and updating the Airflow system.
+    - **Solution with Astronomer:** Offers a straightforward deployment process where users can easily configure their deployments, choose cloud providers (GCP, AWS, Azure), and set up scaling with just a few clicks. Astronomer handles the complexity, making it easier to manage production and quality environments.
+2. **Development Environment and Deployment:**
+    - **Challenge:** Local installation of Airflow is complicated due to its dependency on multiple Docker containers and the need for extensive configuration.
+    - **Solution with Astronomer:** Provides a CLI tool for setting up a local development environment with a single command, simplifying the process of developing, testing, and deploying pipelines. The Astronomer CLI also helps in initializing project templates and deploying DAGs to the cloud effortlessly.
+3. **Source Code Management and CI/CD Pipelines:**
+    - **Challenge:** Collaborative development and continuous integration/deployment (CI/CD) are essential but challenging to implement effectively with Airflow alone.
+    - **Solution with Astronomer:** Facilitates easy integration with GitHub for source code management and GitHub Actions for CI/CD. This allows automatic testing and deployment of pipeline code, ensuring a smooth workflow for teams working on pipeline development.
+4. **Observing Pipelines and Alarms:**
+    - **Challenge:** Monitoring data pipelines and getting timely alerts when issues occur is crucial but often difficult to achieve.
+    - **Solution with Astronomer:** The Astronomer platform provides a user-friendly interface for monitoring pipeline status and performance. It also offers customizable alerts for failures or prolonged task durations, with notifications via email, PagerDuty, or Slack, ensuring immediate awareness and response to issues.
 
-### Processing Messages with Spark and SparkSQL
+Overall, the video shows Astronomer as a powerful and user-friendly platform that addresses the common challenges of using Airflow, from deployment and development to collaboration, CI/CD, and monitoring. 
It suggests that Astronomer can significantly improve the experience of engineers working with Airflow, making it easier to manage, develop, and monitor data pipelines.
 
-### Visualizing Data
 
-Switch Processing from Zeppelin to Spark
-----------------------------------------
+## The best alternative to Airflow? Mage.ai
 
-### Install Spark
+**Free hands-on tutorial:** [Watch on YouTube](https://youtu.be/3gXsFEC3aYA)
 
-### Ingest Messages from Kafka
+In this insightful video, Andreas introduces Mage, a promising alternative to Apache Airflow, focusing on its simplicity, user-friendliness, and scalability. The video provides a comprehensive walkthrough of Mage, highlighting its key features and advantages over Airflow. Here's a breakdown of what viewers can learn and expect from the video:
 
-### Writing from Spark to Kafka
+1. **Deployment Ease:** Mage offers a stark contrast to Airflow's complex setup process. It simplifies deployment to a single Docker image, making it straightforward to install and start on any machine, whether it's local or cloud-based on AWS, GCP, or Azure. This simplicity extends to scaling, which Mage handles horizontally, particularly beneficial in Kubernetes environments where performance scales with the number of pipelines.
+2. **User Interface (UI):** Mage shines with its UI, presenting a dark mode interface that's not only visually appealing but also simplifies navigation and pipeline management. The UI facilitates easy access to pipelines, scheduling, and monitoring of pipeline runs, offering a more intuitive experience compared to Airflow.
+3. **Pipeline Creation and Modification:** Mage streamlines the creation of ETL pipelines, allowing users to easily add data loaders, transformers, and exporters through its UI. It supports direct interaction with APIs for data loading and provides a visual representation of the data flow, enhancing the overall pipeline design experience (see the sketch at the end of this section).
+4. **Data Visualization and Exploration:** Beyond simple pipeline creation, Mage enables in-depth data exploration within the UI. Users can generate various charts, such as histograms and bar charts, to analyze the data directly, a feature that greatly enhances the tool's utility.
+5. **Testing and Scheduling:** Testing pipelines in Mage is straightforward, allowing for quick integration of tests to ensure data quality and pipeline reliability. Scheduling is also versatile, supporting standard time-based triggers, event-based triggers for real-time data ingestion, and API calls for on-demand pipeline execution.
+6. **Support for Streaming and ELT Processes:** Mage is not limited to ETL workflows but also supports streaming and ELT processes. It integrates seamlessly with dbt models for in-warehouse transformations and Spark for big data processing, showcasing its versatility and scalability.
+7. **Conclusion and Call to Action:** Andreas concludes by praising the direction in which the industry is moving, with tools like Mage simplifying data engineering processes. He encourages viewers to try Mage and engage with the content by liking, subscribing, and commenting on their current tools and the potential impact of Mage.
 
-### Move Zeppelin Code to Spark
+Overall, the video shows Mage as a highly user-friendly, scalable, and versatile tool for data pipeline creation and management, offering a compelling alternative to traditional tools like Airflow. 
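To make the block-based pipeline approach from point 3 a little more concrete, here is a minimal sketch of a single Mage data loader block. It assumes a standard Mage project created through the UI with the `mage_ai` package installed; the CSV endpoint below is just a placeholder:

```python
# Minimal sketch of a Mage data loader block (one cell/file inside a pipeline).
# Assumes a standard Mage project; the URL below is a placeholder endpoint.
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data_from_api(*args, **kwargs):
    """Fetch a CSV from an API and hand it to the next block as a DataFrame."""
    url = 'https://example.com/data.csv'  # placeholder, e.g. a weather or sales export
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))
```

Transformer and exporter blocks follow the same decorator pattern (`@transformer`, `@data_exporter`), which is why the visual pipeline view in the UI maps so directly onto the underlying code.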
diff --git a/docs/05-CaseStudies.md b/docs/05-CaseStudies.md index e53a629..03cacb2 100644 --- a/docs/05-CaseStudies.md +++ b/docs/05-CaseStudies.md @@ -1,9 +1,3 @@ ---- -sidebar_label: Case Studies -title: ' ' ---- - - Case Studies ============ @@ -82,7 +76,6 @@ Spark Streaming for logging events: ### Data Science at Amazon -\ ### Data Science at Baidu @@ -135,9 +128,6 @@ Confluent Platform: - - - @@ -242,8 +232,6 @@ Confluent Platform: **Slides:** - - diff --git a/docs/06-BestPracticesCloud.md b/docs/06-BestPracticesCloud.md index 05eecf8..60c4df6 100644 --- a/docs/06-BestPracticesCloud.md +++ b/docs/06-BestPracticesCloud.md @@ -1,9 +1,3 @@ ---- -sidebar_label: Best Practices Cloud -title: ' ' ---- - - Best Practices Cloud Platforms ============================= @@ -245,9 +239,6 @@ Best Practices for Operating Containers: https://cloud.google.com/solutions/best-practices-for-operating-containers -Preparing a Google Kubernetes Engine Environment for Production: - -https://cloud.google.com/solutions/prep-kubernetes-engine-for-prod Automating IoT Machine Learning: Bridging Cloud and Device Benefits with AI Platform: diff --git a/docs/07-DataSources.md b/docs/07-DataSources.md index 52bd3fd..bba587c 100644 --- a/docs/07-DataSources.md +++ b/docs/07-DataSources.md @@ -1,9 +1,3 @@ ---- -sidebar_label: Data Sources -title: ' ' ---- - - 100 Plus Data Sources Data Science =================================== @@ -47,7 +41,6 @@ You can find the articles on the bottom of this section to read more. They inclu - [FiveThirtyEight](http://fivethirtyeight.com/) - [Google Scholar](http://scholar.google.com/) - [Pew Research](http://www.pewresearch.org/) -- [Qlik DataMarket](http://www.qlik.com/products/qlik-data-market) - [The Upshot by New York Times](http://www.nytimes.com/section/upshot) - [UNData](http://data.un.org/) @@ -77,7 +70,6 @@ You can find the articles on the bottom of this section to read more. They inclu - [Education Data by the World Bank](http://data.worldbank.org/topic/education) - [Education Data by Unicef](http://data.unicef.org/education/overview.html) -- [Government Data About Education](https://www.data.gov/education/) - [National Center for Education Statistics](https://nces.ed.gov/) ## Entertainment @@ -97,7 +89,6 @@ You can find the articles on the bottom of this section to read more. They inclu - [Environmental Protection Agency](https://www.epa.gov/data) - [International Energy Agency Atlas](https://www.iea.org/data-and-statistics?country=WORLD&fuel=Energy%20supply&indicator=TPESbySource) - [National Center for Environmental Health](http://www.cdc.gov/nceh/data.htm) -- [National Centers for Environmental Information](https://www.ncdc.noaa.gov/wdcmet/data-access-search-viewer-tools/world-weather-records-wwr-clearinghouse) - [National Climatic Data Center](http://www.ncdc.noaa.gov/data-access/quick-links#loc-clim) - [National Weather Service](http://www.weather.gov/help-past-weather) - [Weather Underground](https://www.wunderground.com/) @@ -123,7 +114,6 @@ You can find the articles on the bottom of this section to read more. They inclu ## Government And World -- [Data Catalogs](http://opengovernmentdata.org/data/) - [Data.gov](http://www.data.gov/) - [European Union Open Data Portal](http://data.europa.eu/euodp/en/data/) - [Gapminder](https://www.gapminder.org/data/) @@ -134,7 +124,6 @@ You can find the articles on the bottom of this section to read more. 
They inclu - [The World Bank’s World Development Indicators](http://data.worldbank.org/data-catalog/world-development-indicators) - [U.S. Census Bureau](http://www.census.gov/) - [UNDP’s Human Development Index](http://hdr.undp.org/en/data) -- [Unicef](http://www.unicef.org/statistics/) ## Health @@ -146,16 +135,12 @@ You can find the articles on the bottom of this section to read more. They inclu - [Medicare Hospital Quality](https://data.medicare.gov/data/hospital-compare#) - [MedicinePlus](https://www.nlm.nih.gov/medlineplus/healthstatistics.html) - [National Center for Health Statistics](http://www.cdc.gov/nchs/) -- [Partners in Information Access for the Public Health Workforce](https://phpartners.org/health_stats.html) -- [President’s Council on Fitness, Sports & Nutrition](http://www.fitness.gov/resource-center/facts-and-statistics/) - [SEER Cancer Incidence](http://seer.cancer.gov/faststats/selections.php?series=cancer) -- [The BROAD Institute](http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi) - [World Health Organization](http://www.who.int/en/) ## Human Rights - [Amnesty International](https://www.amnesty.org/en/search/?q=&documentType=Annual+Report) -- [Harvard Law School](http://hls.harvard.edu/library/research/find-a-database/international-relations-human-rights-data/) - [Human Rights Data Analysis Group](https://hrdag.org/) - [The Armed Conflict Database by Uppsala University](http://www.pcr.uu.se/research/UCDP/) @@ -171,7 +156,6 @@ You can find the articles on the bottom of this section to read more. They inclu - [California Field Poll](http://dlab.berkeley.edu/data-resources/california-polls) - [Crowdpac](https://www.crowdpac.com/) - [Gallup](http://www.gallup.com/home.aspx) -- [Intro to Political Science Research by UC Berkeley](http://guides.lib.berkeley.edu/Intro-to-Political-Science-Research/Stats) - [Open Secrets](https://www.opensecrets.org/) - [Rand State Statistics](http://www.randstatestats.org/us/) - [Real Clear Politics](http://guides.lib.berkeley.edu/Intro-to-Political-Science-Research/Stats) diff --git a/docs/08-InterviewQuestions.md b/docs/08-InterviewQuestions.md index 0668b08..6933eaf 100644 --- a/docs/08-InterviewQuestions.md +++ b/docs/08-InterviewQuestions.md @@ -1,9 +1,3 @@ ---- -sidebar_label: Interview Questions -title: ' ' ---- - - 1001 Data Engineering Interview Questions ========================================= diff --git a/docs/09-BooksAndCourses.md b/docs/09-BooksAndCourses.md index ca1bc4f..4f92b93 100644 --- a/docs/09-BooksAndCourses.md +++ b/docs/09-BooksAndCourses.md @@ -1,40 +1,20 @@ ---- -sidebar_label: Books And Courses -title: ' ' ---- - - -Recommended Books And Courses +Recommended Books, Courses, and Podcasts ============================= ## Contents - [About Books and Courses](09-BooksAndCourses.md#about-books-and-courses) - [Books](09-BooksAndCourses.md#books) - [Languages](09-BooksAndCourses.md#books-languages) - - [Java](09-BooksAndCourses.md#java) - - [Python](09-BooksAndCourses.md#Python) - - [Scala](09-BooksAndCourses.md#Scala) - - [Swift](09-BooksAndCourses.md#Swift) - [Data Science Tools](09-BooksAndCourses.md#books-data-science-tools) - - [Apache Spark](09-BooksAndCourses.md#apache-spark) - - [Apache Kafka](09-BooksAndCourses.md#apache-Kafka) - - [Apache Hadoop](09-BooksAndCourses.md#apache-Hadoop) - - [Apache HBase](09-BooksAndCourses.md#apache-HBase) - [Business](09-BooksAndCourses.md#Books-Business) - - [The Lean Startup](09-BooksAndCourses.md#the-lean-startup) - - [Zero to One](09-BooksAndCourses.md#zero-to-one) - 
- [The Innovators Dilemma](09-BooksAndCourses.md#the-innovators-dilemma)
-    - [Crossing the Chasm](09-BooksAndCourses.md#crossing-the-chasm)
-    - [Crush It!](09-BooksAndCourses.md#crush-it!)
 - [Community Recommendations](09-BooksAndCourses.md#Community-Recommendations)
-    - [Designing Data-Intensive Applications](09-BooksAndCourses.md#designing-data-intensive-applications)
-- [Online Courses](BooksAndCourses#online-courses)
-  - [Machine Learning Stanford](09-BooksAndCourses.md#machine-learning-stanford)
-  - [Computer Networking](09-BooksAndCourses.md#computer-networking)
-  - [Spring Framework](09-BooksAndCourses.md#spring-framework)
-  - [IOS App Development Specialization](09-BooksAndCourses.md#ios-app-development-specialization)
+- [Online Courses](09-BooksAndCourses.md#online-courses)
+  - [Preparation courses](09-BooksAndCourses.md#preparation-courses)
+  - [Data engineering courses](09-BooksAndCourses.md#data-engineering-courses)
+  - [Podcasts](09-BooksAndCourses.md#podcasts)
 
-## About Books and Courses
+
+## About Books, Courses, and Podcasts
 
 This is a collection of books and courses I can recommend personally.
 They are great for every data engineering learner.
@@ -139,21 +119,124 @@ PS: Don't just get a book and expect to learn everything
 
 ## Online Courses
 
-### Computer Networking
-
-[The Bits and Bytes of Computer Networking (Coursera)](https://www.coursera.org/learn/computer-networking)
-
-
-### Spring Framework
-
-[Building Cloud Services with the Java Spring Framework (Coursera)](https://www.coursera.org/learn/cloud-services-java-spring-framework)
-
-
-### Machine Learning Stanford
-
-[Machine Learning (Coursera)](https://www.coursera.org/learn/machine-learning)
-
-
-### IOS App Development Specialization
-
-[iOS Development for Creative Entrepreneurs Specialization (Coursera)](https://www.coursera.org/specializations/ios-development)
+### Preparation courses
+
+| Course name | Course description | Course URL |
+|---|---|---|
+| The Bits and Bytes of Computer Networking | This course is designed to provide a full overview of computer networking. We’ll cover everything from the fundamentals of modern networking technologies and protocols to an overview of the cloud to practical applications and network troubleshooting. | https://www.coursera.org/learn/computer-networking |
+| Learn SQL \| Codecademy | In this SQL course, you'll learn how to manage large datasets and analyze real data using the standard data management language. | https://www.codecademy.com/learn/learn-sql |
+| Learn Python 3 \| Codecademy | Learn the basics of Python 3, one of the most powerful, versatile, and in-demand programming languages today. | https://www.codecademy.com/learn/learn-python-3 |
+
+### Data engineering courses
+
+| Course name | Course description | Course URL |
+|---|---|---|
+| **1. Data Engineering Basics** | | |
+| Introduction to Data Engineering | Introduction to Data Engineering with over 1 hour of videos including my journey here. | https://learndataengineering.com/p/introduction-to-data-engineering |
+| Computer Science Fundamentals | A complete guide to the topics and resources you should know as a Data Engineer. 
| https://learndataengineering.com/p/data-engineering-fundamentals |
+| Introduction to Python | Learn all the fundamentals of Python to start coding quickly | https://learndataengineering.com/p/introduction-to-python |
+| Python for Data Engineers | Learn all the Python topics a Data Engineer needs even if you don't have a coding background | https://learndataengineering.com/p/python-for-data-engineers |
+| Docker Fundamentals | Learn all the fundamental Docker concepts with hands-on examples | https://learndataengineering.com/p/docker-fundamentals |
+| Successful Job Application | Everything you need to get your dream job in Data Engineering. | https://learndataengineering.com/p/successful-job-application |
+| Data Preparation & Cleaning for ML | All you need for preparing data to enable Machine Learning. | https://learndataengineering.com/p/data-preparation-and-cleaning-for-ml |
+| **2. Platform & Pipeline Design Fundamentals** | | |
+| Data Platform And Pipeline Design | Learn how to build data pipelines with templates and examples for Azure, GCP and Hadoop. | https://learndataengineering.com/p/data-pipeline-design |
+| Platform & Pipelines Security | Learn the important security fundamentals for Data Engineering | https://learndataengineering.com/p/platform-pipeline-security |
+| Choosing Data Stores | Learn the different types of data stores and when to use which. | https://learndataengineering.com/p/choosing-data-stores |
+| Schema Design Data Stores | Learn how to design schemas for SQL, NoSQL and Data Warehouses. | https://learndataengineering.com/p/data-modeling |
+| **3. Fundamental Tools** | | |
+| Building APIs with FastAPI | Learn the fundamentals of designing, creating and deploying APIs with FastAPI and Docker | https://learndataengineering.com/p/apis-with-fastapi-course |
+| Apache Kafka Fundamentals | Learn the fundamentals of Apache Kafka | https://learndataengineering.com/p/apache-kafka-fundamentals |
+| Apache Spark Fundamentals | Apache Spark quick start course in Python with Jupyter notebooks, DataFrames, SparkSQL and RDDs. | https://learndataengineering.com/p/learning-apache-spark-fundamentals |
+| Data Engineering on Databricks | Everything you need to get started with Databricks. From setup to building ETL pipelines & warehousing. | https://learndataengineering.com/p/data-engineering-on-databricks |
+| MongoDB Fundamentals | Learn how to use MongoDB | https://learndataengineering.com/p/mongodb-fundamentals-course |
+| Log Analysis with Elasticsearch | Learn how to monitor and debug your data pipelines | https://learndataengineering.com/p/log-analysis-with-elasticsearch |
+| Airflow Workflow Orchestration | Learn how to orchestrate your data pipelines with Apache Airflow | https://learndataengineering.com/p/learn-apache-airflow |
+| Snowflake for Data Engineers | Everything you need to get started with Snowflake | https://learndataengineering.com/p/snowflake-for-data-engineers |
+| dbt for Data Engineers | Everything you need to work with dbt and Snowflake | https://learndataengineering.com/p/dbt-for-data-engineers |
+| **4. Full Hands-On Example Projects** | | |
+| Data Engineering on AWS | Full 5-hour course with a complete example project. Building stream and batch processing pipelines on AWS. | https://learndataengineering.com/p/data-engineering-on-aws |
+| Data Engineering on Azure | Ingest, Store, Process, Serve and Visualize Streams of Data by Building Streaming Data Pipelines in Azure. 
| https://learndataengineering.com/p/build-streaming-data-pipelines-in-azure |
+| Data Engineering on GCP | Everything you need to start with Google Cloud. | https://learndataengineering.com/p/data-engineering-on-gcp |
+| Modern Data Warehouses & Data Lakes | How to integrate a Data Lake with a Data Warehouse and query data directly from files | https://learndataengineering.com/p/modern-data-warehouses |
+| Machine Learning & Containerization On AWS | Build an app that analyzes the sentiment of tweets and visualizes them on a user interface hosted as a container | https://learndataengineering.com/p/ml-on-aws |
+| Contact Tracing with Elasticsearch | Track 100,000 users in San Francisco using Elasticsearch and an interactive Streamlit user interface | https://learndataengineering.com/p/contact-tracing-with-elasticsearch |
+| Document Streaming Project | Document Streaming with FastAPI, Kafka, Spark Streaming, MongoDB and Streamlit | https://learndataengineering.com/p/document-streaming |
+| Storing & Visualizing Time Series Data with InfluxDB and Grafana | Learn how to use InfluxDB to store time series data and visualize interactive dashboards with Grafana | https://learndataengineering.com/p/time-series-influxdb-grafana |
+| Data Engineering with Hadoop | Hadoop Project with HDFS, YARN, MapReduce, Hive and Sqoop! | https://learndataengineering.com/p/data-engineering-with-hadoop |
+| Dockerized ETL | Learn how to quickly set up a simple ETL script with AWS, TDengine & Grafana | https://learndataengineering.com/p/timeseries-etl-with-aws-tdengine-grafana |
+
+## Podcasts
+
+Top five podcasts by the number of episodes created.
+
+### Super Data Science
+
+[The latest machine learning, A.I., and data career topics from across both academia and industry are brought to you by host Dr. Jon Krohn on the Super Data Science Podcast.](https://podcasts.apple.com/us/podcast/super-data-science/id1163599059)
+
+### Data Skeptic
+
+[The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the veracity of claims and efficacy of approaches.](https://podcasts.apple.com/us/podcast/data-skeptic/id890348705)
+
+### Data Engineering Podcast
+
+[This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.](https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557?mt=2)
+
+### Roaring Elephant - Bite-Sized Big Tech
+
+[A weekly community podcast about Big Technology with a focus on Open Source, Advanced Analytics and other modern magic.](https://roaringelephant.org/)
+
+### SQL Data Partners Podcast
+
+[Hosted by Carlos L Chacon, the SQL Data Partners Podcast focuses on Microsoft data platform related topics mixed with a sprinkling of professional development. 
Carlos and guests discuss new and familiar features and ideas and how you might apply them in your environments.](https://podcasts.apple.com/us/podcast/sql-data-partners-podcast/id1027394388) + +### Complete list +| Host name | Podcast name | Access podcast | +|-------------------------|----------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Jon Krohn | Super Data Science | https://www.superdatascience.com/podcast | +| Kyle Polich | Data Skeptic | https://dataskeptic.com/ | +| Tobias Macey | Data Engineering Podcast | https://www.dataengineeringpodcast.com/ | +| Dave Russell | Roaring Elephant - Bite-Sized Big Tech | https://roaringelephant.org/ | +| Carlos L Chacon | SQL Data Partners Podcast | https://sqldatapartners.com/podcast/ | +| Jason Himmelstein | BIFocal - Clarifying Business Intelligence | https://bifocal.show/ | +| Scott Hirleman | Data Mesh Radio | https://daappod.com/data-mesh-radio/ | +| Jonathan Schwabish | PolicyViz | https://policyviz.com/podcast/ | +| Al Martin | Making Data Simple | https://www.ibm.com/blogs/journey-to-ai/2021/02/making-data-simple-this-week-we-continue-our-discussion-on-data-framework-and-what-is-meant-by-data-framework/ | +| John David Ariansen | How to Get an Analytics Job | https://www.silvertoneanalytics.com/how-to-get-an-analytics-job/ | +| Moritz Stefaner | Data Stories | https://datastori.es/ | +| Hilary Parker | Not So Standard Deviations | https://nssdeviations.com/ | +| Ben Lorica | The Data Exchange with Ben Lorica | https://thedataexchange.media/author/bglorica/ | +| Juan Sequeda | Catalog & Cocktails | https://data.world/resources/podcasts/ | +| Wayne Eckerson | Secrets of Data Analytics Leaders | https://www.eckerson.com/podcasts/secrets-of-data-analytics-leaders | +| Guy Glantser | SQL Server Radio | https://www.sqlserverradio.com/ | +| Eitan Blumin | SQL Server Radio | https://www.sqlserverradio.com/ | +| Jason Tan | The Analytics Show | https://ddalabs.ai/the-analytics-show/ | +| Hugo Bowne-Anderson | DataFramed | https://www.datacamp.com/podcast | +| Kostas Pardalis | The Data Stack Show | https://datastackshow.com/ | +| Eric Dodds | The Data Stack Show | https://datastackshow.com/ | +| Catherine King | The Business of Data Podcast | https://podcasts.apple.com/gb/podcast/the-business-of-data-podcast/id1528796448 | +| | The Business of Data | https://business-of-data.com/podcasts/ | +| James Le | Datacast | https://datacast.simplecast.com/ | +| Mike Delgado | DataTalk | https://podcasts.apple.com/us/podcast/datatalk/id1398548129 | +| Matt Housley | Monday Morning Data Chat | https://podcasts.apple.com/us/podcast/monday-morning-data-chat/id1565154727 | +| Francesco Gadaleta | Data Science at Home | https://datascienceathome.com/ | +| Alli Torban | Data Viz Today | https://dataviztoday.com/ | +| Steve Jones | Voice of the DBA | https://voiceofthedba.com/ | +| Lea Pica | The Present Beyond Measure Show: Data Storytelling, Presentation & Visualization | https://leapica.com/podcast/ | +| Samir Sharma | The Data Strategy Show | https://podcasts.apple.com/us/podcast/the-data-strategy-show/id1515194422 | +| Cindi Howson | The Data Chief | https://www.thoughtspot.com/data-chief/podcast | +| Cole Nussbaumer Knaflic | storytelling with data podcast | https://storytellingwithdata.libsyn.com/ | +| Margot Gerritsen | Women in Data Science | 
https://www.widsconference.org/podcast.html | +| Jonas Christensen | Leaders of Analytics | https://www.leadersofanalytics.com/episode/the-future-of-analytics-leadership-with-john-thompson | +| Matt Brady | ZUMA: Data For Good | https://www.youtube.com/@zuma-dataforgood | +| Julia Schottenstein | The Analytics Engineering Podcast | https://roundup.getdbt.com/s/the-analytics-engineering-podcast | +| | Data Unlocked | https://dataunlocked.buzzsprout.com/ | +| Boris Jabes | The Sequel Show | https://www.thesequelshow.com/ | +| | Data Radicals | https://www.alation.com/podcast/ | +| Nicola Askham | The Data Governance | https://www.nicolaaskham.com/podcast | +| Boaz Farkash | The Data Engineering Show | https://www.dataengineeringshow.com/ | +| Bob Haffner | The Engineering Side of Data | https://podcasts.apple.com/us/podcast/the-engineering-side-of-data/id1566999533 | +| Dan Linstedt | Data Vault Alliance | https://datavaultalliance.com/category/news/podcasts/ | +| Dustin Schimek | Data Ideas | https://podcasts.apple.com/us/podcast/data-ideas/id1650322207 | +| Alex Merced | The datanation | https://podcasts.apple.com/be/podcast/the-datanation-podcast-podcast-for-data-engineers/id1608638822 | +| Thomas Bustos | Let's Talk AI | https://www.youtube.com/@lets-talk-ai | +| Jahanvee Narang | Decoding Data Analytics | https://www.youtube.com/@decodingdataanalytics/videos |