
DBT

Local Setup

To run dbt locally:

  1. Make sure you have Python 3.8 installed (this version should work for everyone).

  2. Use pip to install dbt: `pip install dbt-redshift`

  3. You might be missing some additional dependencies; install them as pip reports them.

  4. Create a profile config (`touch ~/.dbt/profiles.yml`) with the following content:

    # ~/.dbt/profiles.yml
    default:
      target: gf_dwh
      outputs:
        # DEV #
        gf_dwh:
          type: redshift
          host: localhost
          user: 'YOUR-REDSHIFT-USERNAME'
          password: 'YOUR-REDSHIFT-PASSWORD'
          port: 55101
          dbname: gfdwh
          # Your own schema name, e.g.
          schema: dbt_harry_potter
          threads: 4

        # STAG #
        # gf_dwh:
        #   type: redshift
        #   host: localhost
        #   user: 'YOUR-REDSHIFT-USERNAME'
        #   password: 'YOUR-REDSHIFT-PASSWORD'
        #   port: 55131
        #   dbname: gfdwh
        #   # Your own schema name, e.g.
        #   schema: dbt_harry_potter
        #   threads: 4

        # PROD #
        # gf_dwh:
        #   type: redshift
        #   host: localhost
        #   user: 'YOUR-REDSHIFT-USERNAME'
        #   password: 'YOUR-REDSHIFT-PASSWORD'
        #   port: 55141
        #   dbname: gf-dwh
        #   # Your own schema name, e.g.
        #   schema: dbt_harry_potter
        #   threads: 4
  5. Remember to open bastion tunnels to RS

    • You can use the gf-helpers script for this:
      # USAGE: gf_port_forward_db [env] redshift
      gf_port_forward_db dev redshift
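    • If you don't have gf-helpers handy, a plain SSH port-forward through the bastion should be roughly equivalent; this is a sketch where both host names are placeholders, not the real endpoints:
      # hypothetical manual tunnel; replace both hosts with the real ones
      ssh -N -L 55101:your-cluster.region.redshift.amazonaws.com:5439 you@bastion-host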
  6. Run `dbt debug` (in your project directory). It may suggest some additional changes, like running `dbt deps`.

  7. Make sure you have access to RS by executing, for example, `dbt seed -s country_locations.csv`

  8. Run a sample compile: `dbt compile -s all_jobs_by_description`. You will see the compiled SQL in the `target` directory.
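    For example, you can locate the generated SQL like this (assuming dbt's default target directory):

      find target/compiled -name 'all_jobs_by_description.sql'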

Reference Documentation

Jobs location

Checking the logs

You either ran or are running a job, and you want to check what was created; search the logs for:

create\s+(table|view)[\n\r\s]+"gf-dwh"\."gf_data_blocks"

Drop all the whitespace to get a more compact view of the logs.
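A hedged example of searching for that pattern in dbt's default log file, using GNU grep (`[[:space:]]` stands in for `\s` in POSIX ERE):

    grep -E 'create[[:space:]]+(table|view)' logs/dbt.log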

Weird characters in dbt log

These are usually ANSI colour escape codes; use an extension similar to this one to render them.

Running jobs

There are two types of jobs:

Prod Jobs

It runs everything, to refresh the gf schema in RS.

  • Currently, as some models have dependencies, we run them in this order:
dbt seed
dbt build --select gf_clean
dbt build --select gf_intermediate
dbt build --select gf_data_blocks --threads 1
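To sanity-check what a selector will pick up before building, `dbt ls` is handy:

    # list everything the gf_data_blocks selector resolves to
    dbt ls --select gf_data_blocks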

dbt_log_status tools

Use this to check the logs and get insights:

One-off Job

https://cloud.getdbt.com/next/deploy/64906/projects/103262/jobs/132109

It runs whatever you select, to refresh specific models in the gf schema in RS.

danger

Be careful in case Prod Jobs is currently running or about to be executed, to avoid two jobs updating the same models and causing RS issues:

  • cancel it, if running
  • pause the Prod Jobs schedule
  • run the One-off Job and wait for it to complete
  • re-enable the Prod Jobs schedule

Consider taking a backup before refreshing a data block, for example:

https://goodfit.height.app/T-6352#83fe3844-286e-47aa-935c-910d77262b91

so you can compare the new and old data.
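A minimal sketch of such a backup (the schema and table names are placeholders; adjust to the data block you are refreshing):

    -- hypothetical backup table, so old and new data can be compared
    create table dbt_harry_potter.firmographics_bkp as
    select * from gf_data_blocks.firmographics;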

dev

Adding new code logic to models with some test data

Use the dev macro to limit the data.
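The macro itself lives in the project; a common shape for this kind of dev-only limit is sketched below (the macro name and the target check are assumptions, not the project's actual code):

    -- hypothetical macros/limit_in_dev.sql
    {% macro limit_in_dev(n=1000) %}
      {% if target.name == 'dev' %}
        limit {{ n }}
      {% endif %}
    {% endmacro %}

    -- then, at the end of a model's final select:
    select * from some_cte
    {{ limit_in_dev(1000) }}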

Example workflow:

  1. You need to change something in some model.
  2. Create a full refresh of the model in your dev schema.
  3. Use a scratchpad to check that you have some data to work with.
  4. Code the changes.
  5. Test the data again before the PR.

Example:

    with new_logic as (
        -- new logic copied from the changed model here
    ),
    curr_logic as (
        select * from {{ref('firmographics')}}
    )
    -- compare them here on sample data
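A sketch of the comparison step, assuming the two CTEs above; `except` in both directions counts rows that exist on only one side (the column sets must match):

    select 'only_in_new' as side, count(*) as n
    from (select * from new_logic except select * from curr_logic) d
    union all
    select 'only_in_curr' as side, count(*) as n
    from (select * from curr_logic except select * from new_logic) d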

Re-run after a few merges to master

When a few data blocks were changed and you want to refresh prod data with the latest changes in https://github.com/GoodFit-io/gf-dwh/commits/master without waiting for the Prod Jobs:

https://goodfit-io.slack.com/archives/C039RLGFUR2/p1666958074891769

Using tests

Try this simple case to familiarize yourself with audit_helper.compare_queries:

WITH primary_software_products_fnm_test AS (
    SELECT NULL::text AS source
    UNION ALL
    SELECT 'g2'
),
test AS (
    {% set expected_cases %}
        SELECT NULL::text AS source
        UNION ALL
        SELECT 'g2'
    {% endset %}

    {% set actual_cases %}
        SELECT NULL::text AS source
        UNION ALL
        SELECT 'gx'
    {% endset %}

    {{ audit_helper.compare_queries(a_query=expected_cases, b_query=actual_cases) }}
)

-- when expected_cases and actual_cases differ in any way, the test fails
SELECT * FROM test
WHERE in_a = false OR in_b = false
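To iterate on just this case, you can run the single test by name (assuming it lives in tests/primary_software_products_fnm_test.sql):

    dbt test --select primary_software_products_fnm_test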

It fails, of course.

Let's print `a` and `b` to see the data:

  • hit compile to get the compiled version
  • keep only the `a` and `b` CTEs generated by the macro

then just add these lines (comment out the `),` that leads into the rest of the macro output, close the last kept CTE, select from it, and close the `test` CTE):

    -- ),
    )
    select * from a  -- or b
    )
    select * from test

`a` and `b` is just a dbt convention for the macro 😄

Snippets to code faster

  • We want to favour explicit CTEs for the tables used (so you can query them one by one and troubleshoot).

Use a regex find-and-replace to create the CTEs from a list of table names. Find:

    ([a-z_]+)\n

and replace with:

    $1 as (
    select * from {{ref('$1')}}
    ),

and use the generated CTEs in your model.
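For example, applied to this hypothetical list of model names:

    companies
    contacts

the replacement produces:

    companies as (
    select * from {{ref('companies')}}
    ),
    contacts as (
    select * from {{ref('contacts')}}
    ),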

performance improvements

info

This section contains ideas and tests around what can be improved and how

  • Check https://docs.getdbt.com/reference/warehouse-setups/redshift-setup and the RS docs; done right, this should improve performance.
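One concrete lever from those docs: dbt-redshift supports per-model sort and dist keys via config. A minimal sketch, with hypothetical column names:

    -- in a model file; the sort/dist columns are placeholders
    {{ config(
        materialized = 'table',
        sort = 'created_at',
        dist = 'company_id'
    ) }}

    select * from {{ ref('some_model') }}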