
DBT

Local Setup

To run dbt locally:

  1. Make sure you have Python 3.8 installed (this version should work for everyone).

  2. Use pip to install dbt: `pip install dbt-redshift`

  3. You might be missing some additional dependencies; install them as pip reports them.

  4. Create a profile config (`touch ~/.dbt/profiles.yml`) with the following content:

    # ~/.dbt/profiles.yml
    default:
      target: gf_dwh
      outputs:
        # DEV #
        gf_dwh:
          type: redshift
          host: localhost
          user: 'YOUR-REDSHIFT-USERNAME'
          password: 'YOUR-REDSHIFT-PASSWORD'
          port: 55101
          dbname: gfdwh
          # Your own schema name, e.g.
          schema: dbt_harry_potter
          threads: 4

        # STAG #
        # gf_dwh:
        #   type: redshift
        #   host: localhost
        #   user: 'YOUR-REDSHIFT-USERNAME'
        #   password: 'YOUR-REDSHIFT-PASSWORD'
        #   port: 55131
        #   dbname: gfdwh
        #   # Your own schema name, e.g.
        #   schema: dbt_harry_potter
        #   threads: 4

        # PROD #
        # gf_dwh:
        #   type: redshift
        #   host: localhost
        #   user: 'YOUR-REDSHIFT-USERNAME'
        #   password: 'YOUR-REDSHIFT-PASSWORD'
        #   port: 55141
        #   dbname: gf-dwh
        #   # Your own schema name, e.g.
        #   schema: dbt_harry_potter
        #   threads: 4
  5. Remember to open bastion tunnels to RS

    • You can use the gf-helpers script for this:
      # USAGE: gf_port_forward_db [env] redshift
      gf_port_forward_db dev redshift
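    • If you don't have gf-helpers handy, a plain SSH port-forward through the bastion should be roughly equivalent; this is a sketch where both host names are placeholders, not the real endpoints:
      # hypothetical manual tunnel; replace both hosts with the real ones
      ssh -N -L 55101:your-cluster.region.redshift.amazonaws.com:5439 you@bastion-host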
  6. Run `dbt debug` (in your project directory). It may suggest some additional changes, like running `dbt deps`.

  7. Make sure you have access to RS by executing, for example, `dbt seed -s country_locations.csv`

  8. Run a sample compile: `dbt compile -s all_jobs_by_description`. You will see the compiled SQL in the `target` directory.
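    For example, you can locate the generated SQL like this (assuming dbt's default target directory):

      find target/compiled -name 'all_jobs_by_description.sql'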

Reference Documentation

Jobs location

Checking the logs

You either ran or are running a job, and you want to check what was created; search the logs for:

create\s+(table|view)[\n\r\s]+"gf-dwh"\."gf_data_blocks"

Drop all the whitespace to get a more compact view of the logs.
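A hedged example of searching for that pattern in dbt's default log file, using GNU grep (`[[:space:]]` stands in for `\s` in POSIX ERE):

    grep -E 'create[[:space:]]+(table|view)' logs/dbt.log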

Weird characters in dbt log

These are usually ANSI colour escape codes; use an extension similar to this one to render them.

Running jobs

There are two types of jobs:

Prod Jobs

It runs everything, to refresh the gf schema in RS.

  • Currently, as some models have dependencies, we run them in this order:
dbt seed
dbt build --select gf_clean
dbt build --select gf_intermediate
dbt build --select gf_data_blocks --threads 1
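To sanity-check what a selector will pick up before building, `dbt ls` is handy:

    # list everything the gf_data_blocks selector resolves to
    dbt ls --select gf_data_blocks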

dbt_log_status tools

Use this to check the logs and get insights:

One-off Job

https://cloud.getdbt.com/next/deploy/64906/projects/103262/jobs/132109

It runs whatever you select, to refresh specific models in the gf schema in RS.

danger

Be careful in case Prod Jobs is currently running or about to be executed, to avoid two jobs updating the same models and causing RS issues:

  • cancel it, if running
  • pause the Prod Jobs schedule
  • run the One-off Job and wait for it to complete
  • re-enable the Prod Jobs schedule

Consider taking a backup before refreshing a data block, for example:

https://goodfit.height.app/T-6352#83fe3844-286e-47aa-935c-910d77262b91

so you can compare the new and old data.
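A minimal sketch of such a backup (the schema and table names are placeholders; adjust to the data block you are refreshing):

    -- hypothetical backup table, so old and new data can be compared
    create table dbt_harry_potter.firmographics_bkp as
    select * from gf_data_blocks.firmographics;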

dev

Adding new code logic to models with some test data

Use the dev macro to limit the data.
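The macro itself lives in the project; a common shape for this kind of dev-only limit is sketched below (the macro name and the target check are assumptions, not the project's actual code):

    -- hypothetical macros/limit_in_dev.sql
    {% macro limit_in_dev(n=1000) %}
      {% if target.name == 'dev' %}
        limit {{ n }}
      {% endif %}
    {% endmacro %}

    -- then, at the end of a model's final select:
    select * from some_cte
    {{ limit_in_dev(1000) }}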

Example workflow:

  1. You need to change something in some model.
  2. Create a full refresh of the model in your dev schema.
  3. Use a scratchpad to check that you have some data to work with.
  4. Code the changes.
  5. Test the data again before the PR.

Example:

    with new_logic as (
        -- new logic copied from the changed model here
    ),
    curr_logic as (
        select * from {{ref('firmographics')}}
    )
    -- compare them here on sample data
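A sketch of the comparison step, assuming the two CTEs above; `except` in both directions counts rows that exist on only one side (the column sets must match):

    select 'only_in_new' as side, count(*) as n
    from (select * from new_logic except select * from curr_logic) d
    union all
    select 'only_in_curr' as side, count(*) as n
    from (select * from curr_logic except select * from new_logic) d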

Re-run after a few merges to master

When a few data blocks were changed and you want to refresh prod data with the latest changes in https://github.com/GoodFit-io/gf-dwh/commits/master without waiting for the Prod Jobs:

https://goodfit-io.slack.com/archives/C039RLGFUR2/p1666958074891769

Using tests

Try this simple case to familiarize yourself with audit_helper.compare_queries:

WITH primary_software_products_fnm_test AS (
    SELECT NULL::text AS source
    UNION ALL
    SELECT 'g2'
),
test AS (
    {% set expected_cases %}
        SELECT NULL::text AS source
        UNION ALL
        SELECT 'g2'
    {% endset %}

    {% set actual_cases %}
        SELECT NULL::text AS source
        UNION ALL
        SELECT 'gx'
    {% endset %}

    {{ audit_helper.compare_queries(a_query=expected_cases, b_query=actual_cases) }}
)

-- when expected_cases and actual_cases differ in any way, the test fails
SELECT * FROM test
WHERE in_a = false OR in_b = false
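To iterate on just this case, you can run the single test by name (assuming it lives in tests/primary_software_products_fnm_test.sql):

    dbt test --select primary_software_products_fnm_test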

It fails, of course.

Let's print `a` and `b` to see the data:

  • hit compile to get the compiled version
  • keep only the `a` and `b` CTEs generated by the macro

then just add these lines (comment out the `),` that leads into the rest of the macro output, close the last kept CTE, select from it, and close the `test` CTE):

    -- ),
    )
    select * from a  -- or b
    )
    select * from test

`a` and `b` is just a dbt convention for the macro 😄

Snippets to code faster

  • We want to favour explicit CTEs for the tables used (so you can query them one by one and troubleshoot).

Use a regex find-and-replace to create the CTEs from a list of table names. Find:

    ([a-z_]+)\n

and replace with:

    $1 as (
    select * from {{ref('$1')}}
    ),

and use the generated CTEs in your model.
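For example, applied to this hypothetical list of model names:

    companies
    contacts

the replacement produces:

    companies as (
    select * from {{ref('companies')}}
    ),
    contacts as (
    select * from {{ref('contacts')}}
    ),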

performance improvements

info

This section contains ideas and tests around what can be improved and how

  • Check https://docs.getdbt.com/reference/warehouse-setups/redshift-setup and the RS docs; done right, this should improve performance.
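One concrete lever from those docs: dbt-redshift supports per-model sort and dist keys via config. A minimal sketch, with hypothetical column names:

    -- in a model file; the sort/dist columns are placeholders
    {{ config(
        materialized = 'table',
        sort = 'created_at',
        dist = 'company_id'
    ) }}

    select * from {{ ref('some_model') }}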