README Update
inesmcm26 committed Jun 28, 2024
1 parent 85e181f commit f416016
Showing 3 changed files with 35 additions and 6 deletions.
6 changes: 3 additions & 3 deletions 01_spark_intro/README.md
@@ -7,15 +7,15 @@ This module contains a series of notebooks that introduce and explore key concepts

## Notebooks

-1. Introduction to Big Data and Spark:
+1. **Introduction to Big Data and Spark**

Overview of Big Data concepts and introduction of Apache Spark, highlighting its architecture, key features, and components.

-2. Introduction to Databricks Environment:
+2. **Introduction to Databricks Environment**

Introduction to the Databricks environment, including how to run shell commands, interact with the Databricks Filesystem, and execute SQL within Databricks cells.

-3. Pyspark RDDs:
+3. **PySpark RDDs**

Explores Resilient Distributed Datasets (RDDs) in PySpark, covering their creation, transformations, and actions, with hands-on examples to demonstrate these concepts.
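
As a taste of what that notebook covers, here is a minimal sketch of the RDD lifecycle: create, transform lazily, then trigger execution with an action. The data and app name are illustrative, not taken from the notebook itself.

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession named `spark` already exists;
# this builder call is only needed when running locally.
spark = SparkSession.builder.appName("rdd-intro").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local Python collection.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: nothing executes yet.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger evaluation of the whole lineage.
print(evens.collect())                     # [4, 16]
print(squares.reduce(lambda a, b: a + b))  # 55
```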

33 changes: 31 additions & 2 deletions 02_pyspark_dataframes/README.md
@@ -1,3 +1,32 @@
-# TODO
+# PySpark DataFrames Basics

-Explain import to DataBricks
+Welcome to the PySpark DataFrames Basics module! ✴️

This module introduces the basics of working with DataFrames in PySpark. DataFrames are a key abstraction in PySpark that provides a more user-friendly interface than RDDs for working with structured data.

In this module, you will learn how to create DataFrames, perform basic operations on them, and understand the underlying concepts that drive PySpark's DataFrame API.

To do so, you'll work with orders data from an e-commerce platform, using PySpark to load it into a DataFrame (sketched below) and perform various operations on it to answer business questions.
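
As a hedged illustration of the loading step (the file name, path, and options below are assumptions; the notebooks define the real dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders").getOrCreate()

# Hypothetical DBFS path; adjust to the dataset used in the notebooks.
orders = (
    spark.read
    .option("header", True)        # first row holds column names
    .option("inferSchema", True)   # let Spark infer column types
    .csv("/FileStore/tables/orders.csv")
)

orders.printSchema()
orders.show(5)
```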

## Notebooks

1. **PySpark DataFrames Part 1**

Introduction to PySpark DataFrames, covering their key features, advantages over RDDs, how to create DataFrames from different data sources, and basic operations such as selecting, filtering, and creating new columns.

Also explores many of the PySpark SQL functions that can be used to manipulate DataFrames (see the sketch after this list).

2. **PySpark DataFrames Part 2**

Explores more advanced operations on PySpark DataFrames, including grouping and aggregation, sorting, and joining, also covered in the sketch below.
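
The sketch referenced in both notebook summaries above might look like the following; the `orders` and `customers` DataFrames, their columns, and the 23% tax rate are invented for illustration.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("df-ops").getOrCreate()

# Illustrative data; the notebooks use a real e-commerce dataset.
orders = spark.createDataFrame(
    [(1, 101, 20.0), (2, 102, 35.5), (3, 101, 12.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [(101, "Alice"), (102, "Bob")], ["customer_id", "name"]
)

# Part 1: selecting, filtering, and deriving new columns with SQL functions.
big_orders = (
    orders
    .select("order_id", "customer_id", "amount")
    .filter(F.col("amount") > 15)
    .withColumn("amount_with_tax", F.round(F.col("amount") * 1.23, 2))
)

# Part 2: joining, grouping/aggregating, and sorting.
revenue_per_customer = (
    big_orders
    .join(customers, on="customer_id", how="inner")
    .groupBy("name")
    .agg(F.sum("amount_with_tax").alias("total_spent"))
    .orderBy(F.col("total_spent").desc())
)

revenue_per_customer.show()
```

Note that, like RDD transformations, these operations build a lazy plan that only runs when an action such as `show()` is called.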


## Running the Notebooks

All notebooks in this module are designed to be run in the **Databricks Community Edition**. Detailed steps to set up and configure your environment are provided in the first module.

If needed, go back to the `2-Databricks-Environment` notebook in module `01_spark_intro` and follow the instructions there to ensure you have the necessary setup to run these notebooks successfully.

---

Happy Learning!
2 changes: 1 addition & 1 deletion README.md
@@ -19,7 +19,7 @@ This collection of modules is designed to help you learn how to work with Big Data

Introduces advanced PySpark topics such as User-Defined Functions (UDFs), window functions, and working with complex data structures like arrays and structs (see the sketch after this list).

-4. Final Project
+4. **Final Project**

A final project that brings together the concepts covered in the previous modules. You will work on a real-world dataset, applying your knowledge of Spark to analyze and derive insights from the data.
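
To give a flavor of the advanced topics mentioned in item 3 above, here is a minimal, self-contained sketch of a UDF, a window function, and an array-typed column; all data and names are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("advanced").getOrCreate()

sales = spark.createDataFrame(
    [("north", "2024-01", 100), ("north", "2024-02", 150),
     ("south", "2024-01", 80), ("south", "2024-02", 60)],
    ["region", "month", "revenue"],
)

# A simple UDF; built-in functions are usually faster because UDFs
# are opaque to the Catalyst optimizer.
tier = F.udf(lambda r: "high" if r >= 100 else "low", StringType())

# Window function: running total of revenue within each region.
w = Window.partitionBy("region").orderBy("month")

result = (
    sales
    .withColumn("tier", tier(F.col("revenue")))
    .withColumn("running_total", F.sum("revenue").over(w))
)
result.show()

# A complex (array) column: collect each region's months into a list.
sales.groupBy("region").agg(
    F.collect_list("month").alias("months")
).show(truncate=False)
```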
