feat(python): adding example for s3, athena and glue (#988)
* feat(java/EKS): Adding EKS Fargate sample

* feat(python): adding example for s3, athena and glue

* Delete python/athena-s3-glue/source.bat

* chore: added comments to the code

---------

Co-authored-by: Paulo Pereira <[email protected]>
Co-authored-by: Michael Kaiser <[email protected]>
3 people authored Feb 18, 2024
1 parent a889d32 commit dd77351
Showing 17 changed files with 493 additions and 0 deletions.
126 changes: 126 additions & 0 deletions python/athena-s3-glue/README.md
@@ -0,0 +1,126 @@
<!--BEGIN STABILITY BANNER-->
---

![Stability: Stable](https://img.shields.io/badge/stability-Stable-success.svg?style=for-the-badge)

> **This is a stable example. It should successfully build out of the box**
>
> This example is built on Construct Libraries marked "Stable" and does not have any infrastructure prerequisites to
> build.
---
<!--END STABILITY BANNER-->

# Auditing logs with _S3_, _Athena_ and _Glue_

This is an example of a CDK program written in Python.\
**Use Case**: a customer wants to store and be able to audit their user logs using common SQL statements.

## Solution Description

To store the logs we will deploy an _Amazon S3_ bucket, and _Amazon Athena_ will provide the auditing capability.

_Athena_ will use the _S3_ bucket as its data source, so an audit can retrieve specific log entries with standard SQL
queries.

In addition, we will deploy **seven log samples** to the bucket, organized by business domain and date so that _Athena_
scans only the data relevant to each query, which improves performance and reduces cost. An _AWS Glue_ crawler will
create the Data Catalog used by _Athena_, and **three named queries** will be available for testing.
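
As a rough illustration of the layout described above, the sketch below builds an object key from a log record's
`domain` and `dateTime` fields. The Hive-style `date=YYYY-MM-DD` folder name and the `log_object_key` helper are
assumptions made for illustration. The saved queries filter on a `date` column, which is consistent with this naming,
but check the `log-samples` directory for the actual layout.

```python
from datetime import datetime


def log_object_key(domain: str, date_time: str, file_name: str) -> str:
    """Build an S3 key of the form <domain>/date=<YYYY-MM-DD>/<file_name>.

    Assumption: partitions use Hive-style `date=...` folders, which a Glue
    crawler can expose as a `date` partition column.
    """
    day = datetime.fromisoformat(date_time).date().isoformat()
    return f"{domain}/date={day}/{file_name}"


# Hypothetical file name, used only to show the resulting key.
key = log_object_key("products", "2024-01-19T13:40:09", "create-product.json")
print(key)
```

With this kind of layout, a query that filters on one date only needs to read the objects under the matching folder.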

## CDK Toolkit

The `cdk.json` file tells the CDK Toolkit how to execute your app.

This project is set up like a standard Python project. The initialization
process also creates a virtualenv within this project, stored under the `.venv`
directory. To create the virtualenv it assumes that there is a `python3`
(or `python` for Windows) executable in your path with access to the `venv`
package. If for any reason the automatic creation of the virtualenv fails,
you can create the virtualenv manually.

To manually create a virtualenv on MacOS and Linux:

```
$ python3 -m venv .venv
```

After the init process completes and the virtualenv is created, you can use the following
step to activate your virtualenv.

```
$ source .venv/bin/activate
```

If you are on a Windows platform, you would activate the virtualenv like this:

```
% .venv\Scripts\activate.bat
```

Once the virtualenv is activated, you can install the required dependencies.

```
$ pip install -r requirements.txt
```

At this point you can now synthesize the CloudFormation template for this code.

```
$ cdk synth
```

To add additional dependencies, for example other CDK libraries, just add
them to your `requirements.txt` file and rerun the `pip install -r requirements.txt`
command.


## Deploying the solution

To deploy the solution, ask the CDK Toolkit to deploy the stack:

```shell
$ cdk deploy --all
```

Now that we have the infrastructure created, you will need to populate the Glue Database. Do that by going to the AWS
console, _AWS Glue_, _Data Catalog_, _Crawlers_.

Select `logs-crawler` and hit the button *Run*. When it finishes, you will be ready to test the solution.
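
If you prefer to script this step instead of using the console, a minimal sketch could look like the one below. It
assumes a `boto3` Glue client is passed in; `run_crawler` and its `poll_seconds` parameter are hypothetical names used
for illustration, while `start_crawler` and `get_crawler` are the Glue API calls it relies on.

```python
import time


def run_crawler(glue_client, name: str = "logs-crawler", poll_seconds: float = 10.0) -> None:
    """Start a Glue crawler and block until it reports the READY state again."""
    glue_client.start_crawler(Name=name)
    while True:
        # The crawler state moves from RUNNING to STOPPING and back to READY.
        state = glue_client.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return
        time.sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials for the deployed account):
#   import boto3
#   run_crawler(boto3.client("glue"))
```

The client is injected rather than created inside the function so the polling logic can be exercised without AWS
access.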


## Testing the solution

1. Head to the _AWS_ console and then to _Amazon Athena_
2. On the left panel, go to **Query editor**
3. Change the **Workgroup** selection to `log-auditing`
4. On **Data source**, choose `AwsDataCatalog`
5. On **Database**, choose `log-database`
6. Two tables will be displayed on the **Tables** section. Expand both and their fields will be displayed
7. You can now write your queries on the right panel and click **Run** to execute them against the
database.
8. Optionally, you can go to **Saved queries** and select one to open it in the **Editor** panel as a starting point
for your own query.
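
The same steps can be sketched in Python for anyone who wants to run queries outside the console. This is only an
illustration: it assumes a `boto3` Athena client is passed in, and `run_athena_query` is a hypothetical helper. The
workgroup and database names come from the stack, and the query results land in the output location configured for the
workgroup.

```python
import time


def run_athena_query(athena_client, sql: str, database: str = "log-database",
                     work_group: str = "log-auditing", poll_seconds: float = 1.0) -> list:
    """Run a query in the given workgroup and return the raw result rows."""
    execution_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=work_group,
    )["QueryExecutionId"]

    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Query {execution_id} ended in state {state}")
        time.sleep(poll_seconds)

    # Only the first page of results is returned in this sketch.
    results = athena_client.get_query_results(QueryExecutionId=execution_id)
    return results["ResultSet"]["Rows"]


# Usage (requires boto3 and AWS credentials for the deployed account):
#   import boto3
#   rows = run_athena_query(
#       boto3.client("athena"),
#       'SELECT * FROM "log-database"."products" WHERE "date" = \'2024-01-19\'',
#   )
```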

> **Tip**: you can explore the `auditing-logs-<account-id>` bucket and check all the log files inside it. If you want
> to add other logs to perform more complex tests, follow the existing directory structure. If you add a new directory,
> make sure you run the _Glue Crawler_ again so that the partitions are updated.
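
To produce extra test logs, a small generator along these lines may help. It is a sketch that mirrors the field names
found in the bundled samples, with one JSON object per line. The `make_log_line` helper and the example values are
hypothetical.

```python
import json
import uuid
from datetime import datetime


def make_log_line(user_id: str, user_name: str, domain: str, action: str,
                  result: str, data: dict, when: datetime) -> str:
    """Serialize one audit event as a single JSON line, using the sample field names."""
    record = {
        "transactionId": str(uuid.uuid4()),
        "userId": user_id,
        "userName": user_name,
        "domain": domain,
        "dateTime": when.strftime("%Y-%m-%dT%H:%M:%S"),
        "action": action,
        "transactionResult": result,
        "data": data,
    }
    return json.dumps(record)


line = make_log_line("123", "test user", "products", "Create New Product", "Success",
                     {"productId": str(uuid.uuid4())}, datetime(2024, 1, 23, 10, 0, 0))
print(line)
```

Upload the resulting file under the directory that matches its domain and date, then rerun the crawler.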

## Destroying the deployment

To destroy the provisioned infrastructure, you can simply run the following command:

```shell
$ cdk destroy --all
```

## Running Unit Tests
To invoke the unit tests (from the root project folder):
```
pytest
```

If you want to invoke a specific unit test file, just pass the filename as a parameter. (wildcards also work, e.g. `pytest tests/unit/*_stack*`).
```
pytest tests/unit/<test_filename>
```
11 changes: 11 additions & 0 deletions python/athena-s3-glue/app.py
@@ -0,0 +1,11 @@
import aws_cdk as cdk

from athena_s3_glue.athena_s3_glue_stack import AthenaS3GlueStack

app = cdk.App()

demo_stack = AthenaS3GlueStack(app, "DemoAthenaS3GlueStack")

cdk.Tags.of(demo_stack).add(key='project', value='demo-athena-s3-glue')

app.synth()
Empty file.
110 changes: 110 additions & 0 deletions python/athena-s3-glue/athena_s3_glue/athena_s3_glue_stack.py
@@ -0,0 +1,110 @@
from aws_cdk import (
    Stack,
    RemovalPolicy,
    aws_s3 as s3,
    aws_s3_deployment as s3_deployment,
    aws_glue as glue,
    aws_iam as iam,
    aws_athena as athena
)
from constructs import Construct


class AthenaS3GlueStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # creating the bucket where the logs will be placed
        logs_bucket = s3.Bucket(self, 'logs-bucket',
                                bucket_name=f"auditing-logs-{self.account}",
                                removal_policy=RemovalPolicy.DESTROY,
                                auto_delete_objects=True
                                )

        # creating the bucket where the Athena queries output will be placed
        query_output_bucket = s3.Bucket(self, 'query-output-bucket',
                                        bucket_name=f"auditing-analysis-output-{self.account}",
                                        removal_policy=RemovalPolicy.DESTROY,
                                        auto_delete_objects=True
                                        )

        # uploading the log files to the bucket as examples
        s3_deployment.BucketDeployment(self, 'sample-files',
                                       destination_bucket=logs_bucket,
                                       sources=[s3_deployment.Source.asset('./log-samples')],
                                       content_type='application/json',
                                       retain_on_delete=False
                                       )

        # creating the Glue Database to serve as our Data Catalog
        glue_database = glue.CfnDatabase(self, 'log-database',
                                         catalog_id=self.account,
                                         database_input=glue.CfnDatabase.DatabaseInputProperty(
                                             name="log-database"
                                         ))

        # creating the permissions for the crawler to enrich our Data Catalog
        glue_crawler_role = iam.Role(self, 'glue-crawler-role',
                                     role_name='glue-crawler-role',
                                     assumed_by=iam.ServicePrincipal(service='glue.amazonaws.com'),
                                     managed_policies=[
                                         # Remember to apply the Least Privilege Principle and provide only the permissions needed to the crawler
                                         iam.ManagedPolicy.from_managed_policy_arn(self, 'AmazonS3FullAccess',
                                                                                   'arn:aws:iam::aws:policy/AmazonS3FullAccess'),
                                         iam.ManagedPolicy.from_managed_policy_arn(self, 'AWSGlueServiceRole',
                                                                                   'arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole')
                                     ])

        # creating the Glue Crawler that will automatically populate our Data Catalog. Don't forget to run the crawler
        # as soon as the deployment finishes, otherwise our Data Catalog will be empty. Check out the README for more instructions
        glue.CfnCrawler(self, 'logs-crawler',
                        name='logs-crawler',
                        database_name=glue_database.database_input.name,
                        role=glue_crawler_role.role_name,
                        targets={
                            "s3Targets": [
                                {"path": f's3://{logs_bucket.bucket_name}/products'},
                                {"path": f's3://{logs_bucket.bucket_name}/users'}
                            ]
                        })

        # creating the Athena Workgroup to store our queries
        work_group = athena.CfnWorkGroup(self, 'log-auditing-work-group',
                                         name='log-auditing',
                                         work_group_configuration=athena.CfnWorkGroup.WorkGroupConfigurationProperty(
                                             result_configuration=athena.CfnWorkGroup.ResultConfigurationProperty(
                                                 output_location=f"s3://{query_output_bucket.bucket_name}",
                                                 encryption_configuration=athena.CfnWorkGroup.EncryptionConfigurationProperty(
                                                     encryption_option="SSE_S3"
                                                 ))))

        # creating an example query to fetch all product events by date
        product_events_by_date_query = athena.CfnNamedQuery(self, 'product-events-by-date-query',
                                                            database=glue_database.database_input.name,
                                                            work_group=work_group.name,
                                                            name="product-events-by-date",
                                                            query_string="SELECT * FROM \"log-database\".\"products\" WHERE \"date\" = '2024-01-19'")

        # creating an example query to fetch all user events by date
        user_events_by_date_query = athena.CfnNamedQuery(self, 'user-events-by-date-query',
                                                         database=glue_database.database_input.name,
                                                         work_group=work_group.name,
                                                         name="user-events-by-date",
                                                         query_string="SELECT * FROM \"log-database\".\"users\" WHERE \"date\" = '2024-01-22'")

        # creating an example query to fetch all events by the user ID
        all_events_by_userid_query = athena.CfnNamedQuery(self, 'all-events-by-userId-query',
                                                          database=glue_database.database_input.name,
                                                          work_group=work_group.name,
                                                          name="all-events-by-userId",
                                                          query_string="SELECT * FROM (\n"
                                                                       "    SELECT transactionid, userid, username, domain, datetime, action FROM \"log-database\".\"products\" \n"
                                                                       "UNION \n"
                                                                       "    SELECT transactionid, userid, username, domain, datetime, action FROM \"log-database\".\"users\" \n"
                                                                       ") WHERE \"userid\" = '123'")

        # adjusting the resource creation order
        product_events_by_date_query.add_dependency(work_group)
        user_events_by_date_query.add_dependency(work_group)
        all_events_by_userid_query.add_dependency(work_group)
62 changes: 62 additions & 0 deletions python/athena-s3-glue/cdk.json
@@ -0,0 +1,62 @@
{
  "app": "python3 app.py",
  "watch": {
    "include": [
      "**"
    ],
    "exclude": [
      "README.md",
      "cdk*.json",
      "requirements*.txt",
      "source.bat",
      "**/__init__.py",
      "**/__pycache__",
      "tests"
    ]
  },
  "context": {
    "@aws-cdk/aws-lambda:recognizeLayerVersion": true,
    "@aws-cdk/core:checkSecretUsage": true,
    "@aws-cdk/core:target-partitions": [
      "aws",
      "aws-cn"
    ],
    "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true,
    "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true,
    "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true,
    "@aws-cdk/aws-iam:minimizePolicies": true,
    "@aws-cdk/core:validateSnapshotRemovalPolicy": true,
    "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true,
    "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true,
    "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true,
    "@aws-cdk/aws-apigateway:disableCloudWatchRole": true,
    "@aws-cdk/core:enablePartitionLiterals": true,
    "@aws-cdk/aws-events:eventsTargetQueueSameAccount": true,
    "@aws-cdk/aws-iam:standardizedServicePrincipals": true,
    "@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true,
    "@aws-cdk/aws-iam:importedRoleStackSafeDefaultPolicyName": true,
    "@aws-cdk/aws-s3:serverAccessLogsUseBucketPolicy": true,
    "@aws-cdk/aws-route53-patters:useCertificate": true,
    "@aws-cdk/customresources:installLatestAwsSdkDefault": false,
    "@aws-cdk/aws-rds:databaseProxyUniqueResourceName": true,
    "@aws-cdk/aws-codedeploy:removeAlarmsFromDeploymentGroup": true,
    "@aws-cdk/aws-apigateway:authorizerChangeDeploymentLogicalId": true,
    "@aws-cdk/aws-ec2:launchTemplateDefaultUserData": true,
    "@aws-cdk/aws-secretsmanager:useAttachedSecretResourcePolicyForSecretTargetAttachments": true,
    "@aws-cdk/aws-redshift:columnId": true,
    "@aws-cdk/aws-stepfunctions-tasks:enableEmrServicePolicyV2": true,
    "@aws-cdk/aws-ec2:restrictDefaultSecurityGroup": true,
    "@aws-cdk/aws-apigateway:requestValidatorUniqueId": true,
    "@aws-cdk/aws-kms:aliasNameRef": true,
    "@aws-cdk/aws-autoscaling:generateLaunchTemplateInsteadOfLaunchConfig": true,
    "@aws-cdk/core:includePrefixInUniqueNameGeneration": true,
    "@aws-cdk/aws-efs:denyAnonymousAccess": true,
    "@aws-cdk/aws-opensearchservice:enableOpensearchMultiAzWithStandby": true,
    "@aws-cdk/aws-lambda-nodejs:useLatestRuntimeVersion": true,
    "@aws-cdk/aws-efs:mountTargetOrderInsensitiveLogicalId": true,
    "@aws-cdk/aws-rds:auroraClusterChangeScopeOfInstanceParameterGroupWithEachParameters": true,
    "@aws-cdk/aws-appsync:useArnForSourceApiAssociationIdentifier": true,
    "@aws-cdk/aws-rds:preventRenderingDeprecatedCredentials": true,
    "@aws-cdk/aws-codepipeline-actions:useNewDefaultBranchForCodeCommitSource": true
  }
}
@@ -0,0 +1 @@
{"transactionId": "eb8c8930-f675-4ab2-912d-cb2970080dda","userId": "123","userName": "test user","domain": "products","dateTime": "2024-01-19T13:40:09","action": "Create New Product","transactionResult": "Success", "data": {"productId": "04dad1d4-d88f-4db7-81e1-22fec3f202c5"}}
@@ -0,0 +1 @@
{"transactionId": "425b59c1-0a4d-4776-a213-3d5853721fb2","userId": "123","userName": "test user","domain": "products","dateTime": "2024-01-19T13:45:09","action": "Change Product","transactionResult": "Success","data": {"productId": "04dad1d4-d88f-4db7-81e1-22fec3f202c5","changedFields": {"productName": "new product name","value": 50}}}
@@ -0,0 +1 @@
{"transactionId": "b8609277-d9cd-4dab-98ea-4219ad8a414d","userId": "456","userName": "test user 2","domain": "products","dateTime": "2024-01-19T13:51:09","action": "Delete Product","transactionResult": "Error", "data": {"productId": "04dad1d4-d88f-4db7-81e1-22fec3f202c5"}}
@@ -0,0 +1 @@
{"transactionId": "b8609277-d9cd-4dab-98ea-4219ad8a414d","userId": "456","userName": "test user 2","domain": "products","dateTime": "2024-01-20T13:51:09","action": "Delete Product","transactionResult": "Success", "data": {"productId": "04dad1d4-d88f-4db7-81e1-22fec3f202c5"}}
@@ -0,0 +1 @@
{"transactionId": "edd4916a-da30-4638-b0d4-ebcdbfee7042","userId": "789","userName": "test user 3","domain": "users","dateTime": "2024-01-20T08:13:33","action": "Add User","transactionResult": "Error", "data": {"newUser": {"userId": "000", "userName": "test user 4"}}}
@@ -0,0 +1,3 @@
{"transactionId": "5522f86a-120b-4b26-9e09-899874558b43","userId": "789","userName": "test user 3","domain": "users","dateTime": "2024-01-22T08:13:33","action": "Add User","transactionResult": "Success", "data": {"newUser": {"userId": "000", "userName": "test user 4"}}}
{"transactionId": "00f46121-6218-4e7f-93f7-e186cf7659ef","userId": "123","userName": "test user","domain": "users","dateTime": "2024-01-22T08:13:40","action": "Add User","transactionResult": "Error", "data": {"newUser": {"userId": "000", "userName": "test user 4"}}}
{"transactionId": "4be4b004-baaf-47c7-b0a6-86c4f81c9b4f","userId": "123","userName": "test user","domain": "users","dateTime": "2024-01-22T08:13:43","action": "Add User","transactionResult": "Error", "data": {"newUser": {"userId": "000", "userName": "test user 4"}}}
@@ -0,0 +1 @@
{"transactionId": "c413bd54-caef-482c-86a4-260b72742f52","userId": "000","userName": "test user 4","domain": "users","dateTime": "2024-01-22T09:13:33","action": "Change User","transactionResult": "Success", "data": {"changedFields": {"birthDate": "1970-01-01"}}}
1 change: 1 addition & 0 deletions python/athena-s3-glue/requirements-dev.txt
@@ -0,0 +1 @@
pytest==6.2.5
2 changes: 2 additions & 0 deletions python/athena-s3-glue/requirements.txt
@@ -0,0 +1,2 @@
aws-cdk-lib==2.115.0
constructs>=10.0.0,<11.0.0
Empty file.
Empty file.