DBT
Local Setup
To be able to run dbt locally:
- Make sure you have Python 3.8 installed (this version should work for everyone)
- Use pip to install dbt:
pip install dbt-redshift
You might lack some additional dependencies; just install them.
- Create a file at ~/.dbt/profiles.yml:
default:
  target: gf_dwh
  outputs:
    gf_dwh:
      type: redshift
      host: localhost
      user: dbt
      password: XXX
      port: 9203
      dbname: gf-dwh
      schema: dbt_dariusz # <- your subschema
      threads: 4
- Remember to open the bastion tunnels to RS (see the tunnel sketch after this list)
- Run (in your project directory):
dbt debug
It can suggest some additional changes, like running dbt deps
- Make sure you have access to RS by executing, for example:
dbt seed -s country_locations.csv
- Run a sample compile:
dbt compile -s all_jobs_by_description
You will see the compiled SQL in the target directory
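For the bastion tunnel, a sketch of the ssh port forward that matches the profile above (the bastion host and Redshift endpoint are placeholders here; use the real ones for our environment):

ssh -N -L 9203:<redshift-endpoint>:5439 <bastion-host>

This exposes Redshift on localhost:9203, the host and port configured in profiles.yml.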
References
https://docs.getdbt.com/docs/get-started/getting-started-dbt-core
https://docs.getdbt.com/docs/get-started/pip-install
jobs location
https://cloud.getdbt.com/next/deploy/64906/projects/103262/jobs
checking the logs
you ran (or are running) a job and you want to check in the run details what was created - search the logs with a regex like
create\s+(table|view)\s+"gf-dwh"\."gf_data_blocks"

drop all the whitespace first to have a more compact view of the logs
weird characters in dbt log
these are ANSI colour codes - use an extension like this to render them https://marketplace.visualstudio.com/items?itemName=iliazeus.vscode-ansi

running jobs
we have 2 types
Prod Jobs
https://cloud.getdbt.com/next/deploy/64906/projects/103262/jobs/139419
it runs everything to refresh the gf schema in RS
- currently, as some models have dependencies, we are running:
dbt seed
dbt build --select gf_clean
dbt build --select gf_intermediate
dbt build --select gf_data_blocks --threads 1
dbt_log_status tool
use it to check the logs and get insight
https://gist.github.com/mamatucci/6199e29597c723f987439c1cc2f5cac1
One OFF Job
https://cloud.getdbt.com/next/deploy/64906/projects/103262/jobs/132109
it runs only what you select, to refresh specific parts of the gf schema in RS

❗be careful in case Prod Jobs is currently running or about to be executed, to avoid 2 jobs updating the same models and causing RS issues:
- cancel it, if running
- pause the Prod Jobs scheduling
- run One OFF and wait for it to complete
- re-enable the schedule of Prod Jobs
consider taking some BKP before refreshing a data block, so you can compare the new and old data - example:
https://goodfit.height.app/T-6352#83fe3844-286e-47aa-935c-910d77262b91
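A minimal way to take such a backup (a sketch - the schema and table names here are illustrative, assuming the model lives in the gf schema):

create table dbt_dariusz.firmographics_bkp as
select * from gf.firmographics;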
dev
adding new code logic to models with some test data
use the dev macro to limit the data
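For reference, such a dev macro typically looks something like this (a sketch only - the real macro name and target check in gf-dwh may differ):

{% macro limit_in_dev(row_count=10000) %}
    {% if target.name != 'gf_dwh' %}
        limit {{ row_count }}
    {% endif %}
{% endmacro %}

and it gets appended to the model's final select, e.g. select * from final {{ limit_in_dev() }}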
ex:
- you need to change something on some model
- you create a full refresh of the model in your dev schema
- you use the scratchpad to quickly check that you have some data to work with
- code the changes
- test the data again before the PR
ex:
with new_logic as (
    -- new logic copied from the changed model here
),
curr_logic as (
    select * from {{ ref('firmographics') }}
)
-- compare them here on sample data
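For the comparison itself, a quick symmetric difference works on sample data (a sketch, assuming both CTEs have the same columns):

(select * from new_logic except select * from curr_logic)
union all
(select * from curr_logic except select * from new_logic)

If this returns zero rows, the old and new logic agree on that sample.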
re-run after a few merges to master
when a few data blocks were changed and you want to refresh prod data with the latest data and changes in https://github.com/GoodFit-io/gf-dwh/commits/master without waiting for the Prod Jobs schedule
https://goodfit-io.slack.com/archives/C039RLGFUR2/p1666958074891769
using tests
try this simple case to familiarize yourself with audit_helper.compare_queries
with primary_software_products_fnm_test as (
    SELECT NULL::text AS source
    UNION ALL
    SELECT 'g2'
),
test as (
    {% set expected_cases %}
    SELECT NULL::text AS source
    UNION ALL
    SELECT 'g2'
    {% endset %}
    {% set actual_cases %}
    SELECT NULL::text AS source
    UNION ALL
    SELECT 'gx'
    {% endset %}
    {{ audit_helper.compare_queries(a_query=expected_cases, b_query=actual_cases) }}
)
/* when expected_cases and actual_cases differ in any way, the test fails */
select *
from test
where in_a = false or in_b = false
it fails, of course
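To run just this test (assuming it is saved as a singular test under tests/ with this file name - the name here is illustrative):

dbt test -s primary_software_products_fnm_test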

let's print a and b, to see the data:
- hit compile, to get the compiled version
- keep only the a and b CTEs generated by the macro
- just add these at the end (closing the last kept CTE and selecting from it, with the rest of the compiled query commented out):
-- ),
)
select * from a
- then do the same with select * from b
a and b is just a dbt convention for the macro 😄

snippets to code faster
- we want to favour explicit tables used as CTEs (so you can query them one by one and troubleshoot)
use a regex to create the CTEs from a list:
([a-z_]+)\n
$1 as (
select * from {{ref('$1')}}
),


and use your editor's regex find & replace; for example:
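Given a list like this (model names taken from elsewhere on this page):

country_locations
firmographics

the find & replace produces:

country_locations as (
select * from {{ref('country_locations')}}
),
firmographics as (
select * from {{ref('firmographics')}}
),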

performance improvements
section with ideas and tests about what is possible to improve and how

- check https://docs.getdbt.com/reference/warehouse-setups/redshift-setup and the RS docs - I am sure that might increase performance if done right
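As a starting point, the dbt-redshift adapter lets you set distribution and sort keys per model (a sketch - the column choice here is illustrative, and the right keys depend on the query patterns):

{{ config(materialized='table', dist='company_id', sort='company_id') }}

select * from {{ ref('firmographics') }}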