
DBT

Local Setup

To be able to run dbt locally:

  1. Make sure you have Python 3.8 installed (this version should work for everyone)
  2. Use pip to install dbt: pip install dbt-redshift
  3. You might lack some additional dependencies; just install them as pip reports them
  4. Create a file at ~/.dbt/profiles.yml:
default:
  target: gf_dwh
  outputs:
    gf_dwh:
      type: redshift
      host: localhost
      user: dbt
      password: XXX
      port: 9203
      dbname: gf-dwh
      schema: dbt_dariusz # <- your subschema
      threads: 4
  5. Remember to open bastion tunnels to RS
  6. Run dbt debug (in your project directory). It can suggest some additional changes, like running dbt deps
  7. Make sure you have access to RS by executing, for example, dbt seed -s country_locations.csv
  8. Run a sample compile: dbt compile -s all_jobs_by_description. You will see the compiled SQL in the target directory. (The full sequence is sketched below.)
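
For reference, a minimal sketch of the whole verification sequence (the seed and model names are just the examples from the steps above; substitute your own):

# one-time, if dbt debug suggests it: pull in package dependencies
dbt deps

# check profiles.yml and the connection through the bastion tunnel
dbt debug

# confirm you can write to RS with a small seed
dbt seed -s country_locations.csv

# compile a single model and inspect the output under target/
dbt compile -s all_jobs_by_description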

References

https://docs.getdbt.com/docs/get-started/getting-started-dbt-core

https://docs.getdbt.com/docs/get-started/pip-install

jobs location

https://cloud.getdbt.com/next/deploy/64906/projects/103262/jobs

checking the logs

You ran (or are running) a job and you want to check in the details what was created. Search the logs with:

create\s+(table|view)[\n\r\s]+"gf-dwh"\."gf_data_blocks"

Drop all the whitespace to get a more compact view of the logs.
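
One way to do that from a shell, assuming you saved the run log to a file (the dbt.log path is illustrative):

# collapse every run of whitespace to a single space, then pull out the CREATE statements
tr -s '[:space:]' ' ' < dbt.log | grep -Eo 'create (table|view) "gf-dwh"\.[^ ]+'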

weird characters in dbt log

Use an extension like this one to render the ANSI escape codes: https://marketplace.visualstudio.com/items?itemName=iliazeus.vscode-ansi

running jobs

We have 2 types:

Prod Jobs

https://cloud.getdbt.com/next/deploy/64906/projects/103262/jobs/139419

It runs everything to refresh the gf schema in RS.

  • currently, as some models have dependencies, we are running it as:
dbt seed
dbt build --select gf_clean
dbt build --select gf_intermediate
dbt build --select gf_data_blocks --threads 1

dbt_log_status tools

Use it to check the logs and get insights:

https://gist.github.com/mamatucci/6199e29597c723f987439c1cc2f5cac1

One OFF Job

https://cloud.getdbt.com/next/deploy/64906/projects/103262/jobs/132109

It runs whatever you select, to refresh specific models in the gf schema in RS.

❗Be careful in case the Prod Jobs run is currently running or about to be executed, to avoid 2 jobs updating the same models and causing RS issues:

  • cancel it, if running
  • pause the Prod Jobs scheduling
  • run the One OFF job and wait for it to complete
  • re-enable the schedule of Prod Jobs

Consider taking a backup (BKP) of the data block before refreshing it, for example:

https://goodfit.height.app/T-6352#83fe3844-286e-47aa-935c-910d77262b91

so you can compare the new and old data
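
A minimal sketch of such a backup in Redshift (the schema and table names are illustrative; use the data block you are about to refresh):

-- hypothetical example: snapshot the current table before the refresh
create table gf.some_data_block_bkp as
select * from gf.some_data_block;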

dev

adding new code logic to models with some test data

use the dev macro to limit the data
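
A minimal sketch of what such a macro can look like (the name dev_limit and the target check are assumptions here; check the actual macro in the repo):

-- hypothetical sketch: only limit rows when not running against the prod target
{% macro dev_limit(row_count=10000) %}
  {% if target.name != 'gf_dwh' %}
    limit {{ row_count }}
  {% endif %}
{% endmacro %}

-- usage inside a model:
-- select * from {{ ref('firmographics') }} {{ dev_limit() }}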

Example: you need to change something on some model.

  • create a full refresh of the model in your dev schema
  • use a scratchpad to quickly check that you have some data to work with
  • code the changes
  • test the data again before the PR, for example:

with new_logic as (
    -- new logic copied from the changed model goes here
    select ...
),
curr_logic as (
    select * from {{ ref('firmographics') }}
)
-- compare them here on sample data, e.g. rows in one side but not the other:
select * from new_logic
except
select * from curr_logic

re-run of a few merges in master

When a few data blocks were changed and you want to refresh prod data with the latest changes in https://github.com/GoodFit-io/gf-dwh/commits/master without waiting for the Prod Jobs:

https://goodfit-io.slack.com/archives/C039RLGFUR2/p1666958074891769
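
The One OFF run then just selects the data blocks that changed (the model names below are placeholders):

dbt build --select changed_block_one changed_block_two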

using tests

Try this simple case to familiarize yourself with audit_helper.compare_queries:

with primary_software_products_fnm_test as (
    SELECT NULL::text AS source
    UNION ALL
    SELECT 'g2'
),
test as (
    {% set expected_cases %}
        SELECT NULL::text AS source
        UNION ALL
        SELECT 'g2'
    {% endset %}

    {% set actual_cases %}
        SELECT NULL::text AS source
        UNION ALL
        SELECT 'gx'
    {% endset %}

    {{ audit_helper.compare_queries(a_query=expected_cases, b_query=actual_cases) }}
)
/* when expected_cases and actual_cases differ in any way, the test fails */
select *
from test
where in_a = false or in_b = false

It fails, of course.

Let's print a and b to see the data:

  • hit Compile to get the compiled version
  • keep only the a and b CTEs generated by the macro

Then close the with clause right after the a CTE and select from it:

-- ),            <- comment out the line that ended the a cte with "),"
)                <- close the with clause instead
select * from a  <- then select from it

Do the same with select * from b for the other side. a and b is just the macro's convention for the two queries 😄

snippets to code faster

  • we want to favour explicit tables (so you can query them one by one and troubleshoot)

Use a regex to create the ctes from a list. Search for:

([a-z_]+)\n

and replace with:

$1 as (
    select * from {{ ref('$1') }}
),

using your editor's regex find-and-replace.
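
For example, a list of model names like

firmographics
all_jobs_by_description

expands to:

firmographics as (
    select * from {{ ref('firmographics') }}
),
all_jobs_by_description as (
    select * from {{ ref('all_jobs_by_description') }}
),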

performance improvements

Section with ideas and tests about what is possible to improve and how.

  • check https://docs.getdbt.com/reference/warehouse-setups/redshift-setup and the RS docs - I am sure that might increase performance if done right
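
For instance, dbt-redshift lets you set Redshift distribution and sort keys per model from the config block (the column names here are illustrative, not taken from our models):

-- hypothetical sketch: dist/sort keys on a table materialization
{{ config(
    materialized='table',
    dist='company_id',
    sort='updated_at'
) }}

select ...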