# PySpark DataFrames Basics

Welcome to the PySpark DataFrames Basics module! ✴️

This module introduces the basics of working with DataFrames in PySpark. DataFrames are a key abstraction in PySpark that provide a more user-friendly interface for working with structured data.

In this module, you will learn how to create DataFrames, perform basic operations on them, and understand the underlying concepts that drive PySpark's DataFrame API.

To do so, you'll work with orders data from an e-commerce platform, using PySpark to load it into a DataFrame and perform various operations on it to answer business questions.
## Notebooks

1. **PySpark DataFrames Part 1**

   Introduction to PySpark DataFrames, covering their key features, their advantages over RDDs, how to create DataFrames from different data sources, and basic operations such as selecting, filtering, and creating new columns.

   Also explores many of the PySpark SQL functions that can be used to manipulate DataFrames.

2. **PySpark DataFrames Part 2**

   Explores more advanced operations on PySpark DataFrames, including grouping and aggregation, sorting, and joining.
## Running the Notebooks

All notebooks in this module are designed to be run in the **Databricks Community Edition**. Detailed steps to set up and configure your environment are provided in the first module.

If needed, go back to the `2-Databricks-Environment` notebook in module `01_spark_intro` and follow the instructions there to ensure you have the necessary setup to run these notebooks successfully.

---

Happy Learning!