Update semester #56

Open · wants to merge 14 commits into base: `solution`
19 changes: 18 additions & 1 deletion README.md
@@ -22,4 +22,21 @@ The data pipeline you construct will look like below:
- TODO: Week 7 BI

# Important Course Resources
- - [Course videos on the Thinkific platform](https://hours-school-d024.thinkific.com/courses/hours-with-experts-cloud)
+ - [Course resources on the Thinkific platform](https://hours-school-d024.thinkific.com/courses/hours-with-experts-cloud)

# Continued Learning
Want to continue your learning in Data Engineering? Great -- check out these links:

* [STL Big Data - Innovation, Data Engineering, Analytics Group](https://www.meetup.com/st-louis-big-data-idea/)
A meetup for users of Big Data services and tools in the Saint Louis Area. We are interested in Innovation (new tools, techniques, and services), Data Engineering (architecture and design of data movement systems), and Analytics (converting information into meaning).
(with Kit Menke and Matt Harris)

* [Data Engineering Podcast](https://www.dataengineeringpodcast.com/)
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.


## Some Previous "STL Big Data - I.D.E.A" Meetups

[Apache Iceberg Presentation - August 2023](https://drive.google.com/file/d/1cM9SD8euuQCPoGUQukQpIGp7y6hgXknN/view?usp=share_link)

[LakeFS Presentation - June 2023](https://drive.google.com/file/d/1OHmEgfGuStoF7ZHhiMVpyVEkML1Hf37v/view?usp=share_link)
13 changes: 13 additions & 0 deletions week1_welcome/setup-windows.md
@@ -106,3 +106,16 @@ SUCCESS: The process with PID 10068 (child process of PID 19344) has been terminated.
SUCCESS: The process with PID 19344 (child process of PID 9988) has been terminated.
SUCCESS: The process with PID 9988 (child process of PID 12832) has been terminated.
```
## Troubleshooting Winutils

On Windows, you may encounter [errors](https://stackoverflow.com/questions/45947375/why-does-starting-a-streaming-query-lead-to-exitcodeexception-exitcode-1073741) like `Error writing stream metadata StreamMetadata` or other issues writing files (usually to temp directories).

First, validate that winutils is working correctly by navigating to the winutils bin directory and executing:

```
winutils.exe ls
```

If you get no output and the return code is non-zero, then you're probably missing [Visual Studio 2010 (VC++ 10.0) SP1](https://learn.microsoft.com/en-US/cpp/windows/latest-supported-vc-redist?view=msvc-170#visual-studio-2010-vc-100-sp1-no-longer-supported). Download and install it, then check again. [Direct link to download page](https://www.microsoft.com/en-us/download/details.aspx?id=26999).

Why? The winutils binaries are compiled using the Visual Studio 2010 Redistributable and need it to run.
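The manual check above can also be scripted. A minimal sketch in Python (the `winutils.exe` path shown in the comment is an assumption — adjust it to wherever you installed winutils):

```python
import subprocess

def tool_runs_ok(cmd):
    """Return True if the command runs and exits with code 0."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError:
        # Binary missing, or (on Windows) a missing DLL such as the
        # VC++ 2010 runtime that winutils depends on.
        return False
    return result.returncode == 0

# Hypothetical usage on Windows:
# tool_runs_ok([r"C:\hadoop\bin\winutils.exe", "ls", "."])
```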
2 changes: 1 addition & 1 deletion week3_python/validate_my_credentials.py
@@ -18,7 +18,7 @@


# Define the S3 bucket and file path
- bucket_name = 'hwe-fall-2023'
+ bucket_name = 'hwe-fall-2024'
file_key = f'{handle}/success_message'

# Download and display the contents of the S3 object
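The download step itself is elided in this diff; it presumably uses boto3. A minimal sketch of such a helper (the function name is hypothetical, and real usage requires boto3 and configured AWS credentials):

```python
def read_s3_text(s3_client, bucket, key):
    """Download an S3 object and return its contents decoded as UTF-8 text."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read().decode("utf-8")

# Hypothetical usage:
# import boto3
# print(read_s3_text(boto3.client("s3"), bucket_name, file_key))
```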
37 changes: 37 additions & 0 deletions week7_bi/README.md
@@ -0,0 +1,37 @@
# Hours with Experts - Week 7: Visualizations

## Introduction

This week we will use the data that was inserted into the gold layer in Delta format and build some visualizations on top of it.

You should already have run the pipeline in week 6 and created an Athena table over that data using the 'Delta Lake' connector.

We will be using the open source package [Apache Superset](https://superset.apache.org/). You will get instructions on how to login and the credentials in class.

## Assignment

1. Create an Apache Superset dataset over your gold layer data (prefix it with your userid, please).
* This can be against your raw table or using a SQL query.
2. Create at least three reports against your data.
    * A bar chart of total reviews by year.
    * A pie chart breaking down total reviews by gender.
    * A geography chart showing the average star rating by state.
      (HINT: To get the states to show up, you have to prefix them with 'US-')
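The state-prefix hint is just string concatenation — in Superset you can do it with a calculated column or directly in your dataset SQL (e.g. `'US-' || state`). The same transformation, sketched in Python:

```python
def iso_state(state_abbrev):
    """Prefix a two-letter US state code with 'US-' so it matches the
    ISO 3166-2 region codes that map charts typically expect."""
    return "US-" + state_abbrev.strip().upper()
```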

If you have not completed week 6 and still want to use Superset, you can use the data in the 'demo' tables to do this assignment:
* demo_fact
* demo_customer
* demo_product
* demo_date
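If you go the SQL-query route for your dataset, a star-schema join over the demo tables might look like the following. The join keys and dimension column names here are assumptions — check the actual table schemas in Athena:

```python
# Example SQL for a Superset dataset over the demo star schema.
# Column names (customer_key, product_key, date_key, gender, year)
# are hypothetical.
DEMO_DATASET_SQL = """
SELECT f.*, c.gender, p.product_name, d.year
FROM demo_fact f
JOIN demo_customer c ON f.customer_key = c.customer_key
JOIN demo_product  p ON f.product_key  = p.product_key
JOIN demo_date     d ON f.date_key     = d.date_key
"""
```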

## Extra Credit

Use Spark to create separate gold-layer dimension tables:

- Customer
- Product
- Purchase Date

Join these in Superset to create a custom dataset for your data.

You may have to redesign what you have done in week 6 to make this possible.
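Conceptually, building a dimension means deduplicating the descriptive attributes out of your fact records and assigning each distinct entity a surrogate key. A plain-Python sketch of that idea (in the assignment you would do this with Spark DataFrames; the column names are hypothetical):

```python
def build_dimension(rows, natural_key, attributes):
    """Derive a dimension table from fact records: one row per distinct
    natural key, each assigned an incrementing surrogate key.
    `rows` is an iterable of dicts."""
    seen = {}
    for row in rows:
        key = row[natural_key]
        if key not in seen:
            seen[key] = {attr: row[attr] for attr in attributes}
    return [
        {"surrogate_key": sk, natural_key: nk, **attrs}
        for sk, (nk, attrs) in enumerate(seen.items(), start=1)
    ]
```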