This is an adapter forked from this repo and extended to better serve our custom needs. It supports the following additional capabilities:
- Runs on dbt-core version 1.1.x
- Supports a boto3 session that takes credentials from an AWS profile name
- Supports adding new columns on schema change (`on_schema_change: append_new_columns`)
- Supports a dedicated S3 data bucket for storing table data instead of writing to arbitrary locations under the staging directory
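The AWS profile referenced by `aws_profile_name` (see the profile configuration below) comes from your AWS shared credentials file. As a minimal sketch, reusing the example profile name from this README, such a profile can be created with the AWS CLI:

```sh
# Create or update a named profile in ~/.aws/credentials and ~/.aws/config;
# the adapter's boto3 session can then pick it up via aws_profile_name.
aws configure --profile dl-dev-process
```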
Within your Python environment, follow the steps below to initiate a first project. There will be some prompts during the initial steps; refer to the Configuring your profile section below to set it up properly.
```sh
pip install dbt-athena2
dbt init my_dbt_project
export DBT_PROFILES_DIR=`pwd`
cd my_dbt_project

dbt debug # to test the connection
dbt seed # to load your seed data
dbt run # to run all models
dbt run --select model_name # to run a specific model
#...and more...
```
Below is an example of how a model can be configured:
- There are 4 supported `materialized` modes: `view`, `table`, `incremental` and `ephemeral`. Details here.
- `incremental_strategy` supports `insert_overwrite` and `append`. If a partition is specified, it only overwrites the partitions present in the source data.
- `on_schema_change` supports `fail`, `ignore` and `append_new_columns` only, and only for the `incremental` materialization. Details here.
- There are some useful macros, such as `run_started_at`, which can be referred to from here to enhance the flexibility of the model.
```sql
{{ config(
    materialized="incremental",
    partitioned_by=["report_month"],
    incremental_strategy="insert_overwrite",
    on_schema_change="append_new_columns"
) }}

{% set run_date = run_started_at.astimezone(modules.pytz.timezone("Asia/Saigon")).strftime("%Y-%m-%d") %}

select cast(working_day as timestamp) working_day,
    sum(spend_balance) spend_balance,
    sum(td_balance) td_balance,
    sum(gs_balance) gs_balance,
    cast(date_format(date_parse('{{ run_date }}', '%Y-%m-%d') - interval '1' month, '%Y%m') as int) report_month
from {{ source('analytics', 'eod_balance') }}
where cast(working_day as date) >= date_trunc('month', cast(date_parse('{{ run_date }}', '%Y-%m-%d') as date) - interval '2' month)
    and cast(working_day as date) < date_trunc('month', cast(date_parse('{{ run_date }}', '%Y-%m-%d') as date) - interval '1' month)
group by working_day
order by working_day desc
```
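A note on usage: with `insert_overwrite`, a run only replaces the partitions it produces (here, the computed `report_month`). To rebuild the whole table instead, dbt's standard full-refresh flag applies:

```sh
# Drop and rebuild the incremental model from scratch
dbt run --select model_name --full-refresh
```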
Under the seeds folder, place the CSV seed file (e.g. `c_ecom_rate.csv`) and its YAML config (e.g. `c_ecom_rate.yml`) as in the example below, then run `dbt seed`:
```yaml
version: 2

seeds:
  - name: c_ecom_rate
    config:
      enabled: true
      quote_columns: true
      tags: accounting | report
```
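For illustration, the seed itself is a plain CSV whose header row supplies the column names. The columns below are hypothetical, not taken from the real `c_ecom_rate.csv`:

```csv
channel,rate,effective_month
web,0.012,2022-05
app,0.015,2022-05
```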
- If the workgroup is specified in `profiles.yml` without `s3_staging_dir`, the adapter extracts the default S3 output location attached to that `work_group` when "Override client-side settings" is enabled.
- The boto3 session is inherited from the development environment; once deployed, permissions should be obtained through a role such as an EC2 instance profile or a K8S service account role.
- Athena limits `ALTER TABLE ADD COLUMNS` for the `date` data type; it is recommended to cast such columns to `timestamp` or `string` when implementing the model (as done for `working_day` in the model example above). Details here.
- Athena does not accept comments like `/* ... */`; to suppress these auto-generated comments from `dbt`, place `query-comment: null` in the `dbt_project.yml` file, as sketched below.
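A minimal sketch of that last setting, at the top level of `dbt_project.yml`:

```yaml
# dbt_project.yml (excerpt): disable the auto-generated /* ... */ query
# comments that Athena rejects
query-comment: null
```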
A dbt profile can be configured to run against AWS Athena using the following configuration:
| Option | Description | Required? | Example |
|---|---|---|---|
| s3_staging_dir | S3 location to store Athena query results and metadata | Required | s3://athena-output-bucket/data_services/ |
| region_name | AWS region of your Athena instance | Required | ap-southeast-1 |
| schema | Specify the schema (Athena database) to build models into (lowercase only) | Required | dbt |
| database | Specify the database (Data catalog) to build models into (lowercase only) | Required | awsdatacatalog |
| poll_interval | Interval in seconds to use for polling the status of query results in Athena | Optional | 5 |
| aws_profile_name | Profile to use from your AWS shared credentials file | Optional | my-profile |
| work_group | Identifier of Athena workgroup | Optional | my-custom-workgroup |
| num_retries | Number of times to retry a failing query | Optional | 3 |
| s3_data_dir | Prefix for storing tables, if different from the connection's s3_staging_dir | Optional | s3://athena-data-bucket/{schema_name}/{table_name}/ |
Example `profiles.yml` entry:

```yaml
athena:
  target: dev
  outputs:
    dev:
      database: awsdatacatalog
      region_name: ap-southeast-1
      aws_profile_name: dl-dev-process
      s3_staging_dir: s3://athena-output-bucket/data_services/
      s3_data_dir: s3://athena-data-bucket/{schema_name}/{table_name}/
      schema: accounting
      type: athena
```
Additional information:
- `threads` is supported
- `database` and `catalog` can be used interchangeably
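For example (a sketch; the thread count here is arbitrary), `threads` sits alongside the other target options:

```yaml
# profiles.yml target excerpt: run up to 4 models in parallel
dev:
  type: athena
  threads: 4
```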
First, install the adapter and its dependencies using `make` (see the Makefile):

```sh
make install_deps
```
Next, configure the environment variables in dev.env to match your Athena development environment (a hypothetical sketch follows the test command). Finally, run the tests using `make`:

```sh
make run_tests
```
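The exact variables dev.env expects are defined by the Makefile; purely as a hypothetical sketch reusing values from this README (the variable names below are assumptions, check the Makefile for the real ones):

```sh
# dev.env — hypothetical example; variable names are assumptions
AWS_PROFILE=dl-dev-process
AWS_REGION=ap-southeast-1
S3_STAGING_DIR=s3://athena-output-bucket/data_services/
```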