Understanding dbt fundamentals

I finally have a reason to learn dbt properly. Here is my summary of the dbt Fundamentals course.

In this post, I summarise what I've learned from the dbt Fundamentals course created and shared for free by dbt Labs.

But first, perhaps some context is in order.

I was seeking a move upwards in my career. I told my manager, who was incredibly supportive (thank you Natan!). I didn't know it then, but the timing was perfect, because a role was opening up for a new engineering manager.

But, you know, startups. One moment they need a new engineering manager, the next moment they need someone to lead a special project for ensuring we get our act together with our data warehouse so we can understand our business well.

So I found myself in that position: here's a big project, go figure it out, and you'll probably get a move upwards in your career.

This big project was all about data pipelines and SQL and dashboards, but I'd never led a data-related project of any scale before...

I knew, though, that I had a few colleagues – software engineers, not data engineers, surprisingly – with a lot of experience in data. I could always ask them for help.

Except... that's not my style. I think it's irresponsible to be a dilettante. You have to learn the fundamentals and try to piece things together for yourself before you start asking basic questions. Even Albert Einstein knew nothing about physics at one point.

And so, with a big project to deliver, I asked myself: what do I not understand? Here's the list:

  • Financial domain concepts
  • Risk domain concepts
  • Operational workflow
  • Our data pipeline, and where best to implement data transformations for the whole business – sales, marketing, finance, ops, risk, tech – to use

That last point about the data pipeline led me to dbt, which stands for data build tool.

I knew by this point in my career that dbt has become the de facto standard in the tech industry for codifying and building data models that are exposed to the rest of the business. I just didn't really know how to use it, beyond what I could intuit from my software engineering knowledge. I could only rely on that for so long before I realised I was asking the same old questions again and again (a reliable signal that a learning intervention is due).

So, here are my main learnings from completing the dbt Fundamentals course. I combine them with existing knowledge of data pipelines that I've gained elsewhere to make this post a richer reference.

Types of tables

I thought there were a lot of types of tables, but the course helped me realise that there are only two types that matter:

  1. source tables
  2. tables exposed in the data mart

Source tables contain raw data. You refer to them with the source macro, like so:

SELECT *
FROM {{ source('app_db', 'orders') }};
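
For that macro to resolve, the source itself needs to be declared in a YML file in your project. Here's a minimal sketch of what that declaration might look like – the app_db and orders names are carried over from the snippet above, and the description is made up:

version: 2

sources:
  - name: app_db
    description: Raw tables replicated from the application database
    tables:
      - name: orders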

Source tables are almost always normalised. That means data isn't repeated, and associations between tables are made using IDs. This also means that these tables are usually quite thin (not many columns) and long (loads of entries).

Tables exposed in the data mart contain transformed data, modelling business concepts like customers, orders, and so on. These are usually de-normalised, because we need them to have a lot of additional joined data for business users and data analysts to make use of immediately. This means that these tables are usually fat (loads of columns).

Typically, mart tables are prefixed with dim_ or fct_ or agg_.

  • dim = dimension, referring to users, organisations, shopping carts, and generally any entity that is usually described as "who" or "what"
  • fct = fact, referring to events, actions, or results from a process, like payments, emails sent, or user accounts created
  • agg = ... frankly, unsure, as this wasn't covered. But I've seen this before in production at a few companies. Likely some aggregation on top of dim_ or fct_ would be my guess

That's it. That's the whole thing with dimensional modelling.

The other kinds of prefixes that I'd come across are for modularisation and composability. Tables with prefixes like stg_ and int_ are just conventions used by the dbt community for incrementally grouping, filtering, and transforming source tables into exposed mart tables.
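
To make that concrete, here's a rough sketch of how the layers chain together – the model and column names are invented for illustration. A staging model tidies up a single source table, and a mart model then joins staging models together using the ref macro:

-- models/staging/stg_app_db__orders.sql
SELECT
  id AS order_id,
  customer_id,
  status,
  created_at
FROM {{ source('app_db', 'orders') }}

-- models/marts/fct_orders.sql
SELECT
  orders.order_id,
  customers.customer_name,
  orders.status,
  orders.created_at
FROM {{ ref('stg_app_db__orders') }} AS orders
LEFT JOIN {{ ref('stg_app_db__customers') }} AS customers
  ON orders.customer_id = customers.customer_id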

What's the data mart? Where does it live?

Speaking of mart... what is it?

It's a conceptual thing. A data mart is what dbt folks use to refer to the place you go to grab ready-to-use data about the business (my definition). It's any number of dim_ and fct_ and potentially agg_ tables, ready to use.

The folks at dbt give a great analogy with flour (source) and cupcakes (mart):

Just as eating raw flour isn’t that appetizing, neither is deriving insights from raw data since it rarely has a nice structure that makes it poised for analytics. There’s some considerable work that’s needed to organize data and make it usable for business users.
This is where dimensional modeling comes into play; it’s a method that can help data folks create meaningful entities (cupcakes and cookies) to live inside their data mart (your glass display) and eventually use for business intelligence purposes (eating said cookies).

And where does the data mart live? Likely as part of your data warehouse.

So the data warehouse has a mix of source tables and mart tables. Why? Because dbt operates on the ELT paradigm, not ETL.

ELT? Not ETL?

This is a helpful distinction to know, so let's talk about it.

The traditional paradigm for processing data until it becomes fit for consumption by business users and data analysts is ETL:

  • Extract = take data from sources (e.g. your application database tables, Salesforce, etc.)
  • Transform = change the data format, clean it up, filter out what you don't need, etc.
  • Load = put the transformed data into the warehouse for querying

There is one big problem with this paradigm that became more obvious when cloud warehouses with flexibly scalable computational power and cheap storage came into existence: the process results in rigidity.

With ETL, to have something exposed, you're going to need to get a data engineer in a meeting, explain your requirements, and then wait for them to change the code for the Transform step. Cumbersome, slow, painful, costly.

But since products like Google's BigQuery came about, which is effectively data warehouse software as a service, a new paradigm became possible. That's the ELT paradigm, built on cheap storage and pay-as-you-use and flexibly scalable computing power.

In ELT, the Transform step comes last. E and L are done first, which is effectively saying "we take stuff from 10 different sources and load it into the warehouse, raw." Products like Fivetran automate the E and L steps, freeing up data engineers' time (maybe you don't even need to hire them anymore... many startups don't).

What's left is the T, and dbt provides the toolset for making transformations pleasant and maintainable.

So, dbt – the "data build tool" – effectively covers the whole T step in the ELT paradigm.

(See this useful video from Fivetran's founder explaining ETL vs ELT. I went back to watch it a few times over the years to really understand the distinction.)

Most useful commands

There are two:

  1. dbt build
  2. dbt run + dbt test

The first one runs both dbt run and dbt test (i.e. option 2) but with one crucial difference: it interleaves building and testing in DAG order, so each model is tested as soon as it's built, before its downstream models run.

DAG = directed acyclic graph. It's a technically-accurate but convoluted way of referring to the flow of data from source to exposed tables.

dbt run will generate the models and fill them in with data from your sources.

dbt test will run all tests against the generated models.

If you run dbt run first, then run dbt test, what's effectively happening is this, in order:

  1. Generate models and fill in data
  2. Test against generated models and data

But because models are built on top of previous tables in a lineage graph, this can be problematic. If an upstream table was generated with issues, then downstream tables would, of course, inherit them.

What we usually want is to build confidence in our upstream tables before moving on to downstream ones. That's why dbt build is better. Instead of running in this order:

  1. generate 1st model
  2. generate 2nd model that depends on 1st model
  3. test 1st model
  4. test 2nd model

It runs in this order:

  1. generate 1st model
  2. test 1st model
  3. generate 2nd model that depends on 1st model
  4. test 2nd model

The benefit to us as developers is that we know exactly where the problem lies, as far upstream as possible. It's the difference between an obscure error message and a clear one with a well-defined stack trace.
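
In practice, this means dbt build is the command I'd reach for by default. A quick sketch of usage – the --select flag and the trailing + graph operator are standard dbt CLI features, and the model name is made up:

# the long way: build everything, then test everything
dbt run
dbt test

# the better way: build and test each model in DAG order
dbt build

# build and test one model plus everything downstream of it
dbt build --select stg_app_db__orders+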

Tests: Generic vs Singular

Generic tests are super simple to add to models in dbt. You just specify them in the model's YML file under the tests property:

version: 2
models:
  - name: stg_sheets_gmv_forecast__goals
    columns:
      - name: sales_channel
        description: Sales channel for the forecast
        tests:
          - not_null
          - accepted_values:
              values:
                - offline
                - ecommerce

With that, whenever you run dbt test, dbt will run a generic query against the table to ensure that the test assertion passes. It's applying software engineering to data.

Using the snippet above as an example, we're asserting that the sales_channel column in the stg_sheets_gmv_forecast__goals table should never be null, and should only ever contain the values offline or ecommerce.

What dbt will do under the hood is to write a simple – what some may call "generic" – SQL select statement to check for entries where the sales_channel value is null.
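
Something like this – a simplified sketch of the compiled query, with a made-up analytics schema, not dbt's exact SQL:

SELECT *
FROM analytics.stg_sheets_gmv_forecast__goals
WHERE sales_channel IS NULL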

If any row returns, the test fails. If no row is returned, the test passes. I love how simple this is.

There are 4 generic tests provided by dbt:

  1. not_null = should never be null
  2. unique = should be unique across whole table
  3. accepted_values = should only have value in predetermined list
  4. relationships = should have a corresponding entry in an associated table (see the example below)
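
The first three appear in the snippet further up; the relationships test is configured like this – fct_orders, dim_customers, and the column names are just for illustration:

version: 2
models:
  - name: fct_orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id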

That's it with generic tests. Powerful and simple. Beautiful too, because they live in the model's YML, describing its schema exactly where you'd want to read about it.

Then there's the second type of test – "singular" tests.

From what I understand, the main difference between singular and generic tests in dbt is that you write your own SQL instead of letting dbt write it for you under the hood.

Say you want to ensure that every order in the stg_payments table has a total payment amount of zero or more. You write this SQL as a test:

-- assumes stg_payments has one row per payment, with an order_id column
SELECT
  order_id,
  SUM(amount) AS total_amount
FROM {{ ref('stg_payments') }}
GROUP BY order_id
HAVING SUM(amount) < 0

Then dbt will execute this as part of dbt test and check that it returns no rows. Any .sql file saved in your project's tests/ directory is picked up as a singular test – no extra configuration needed.

Documenting data with dbt

One of the biggest problems with data in any organisation is the lack of shared meaning in column names and cell values.

What does "return_pending" mean in an dim_orders table?

What does the column customer_id mean in the same table? Why is there another similarly named column end_customer_id in the same table?

To solve part of this problem, you can codify the meaning behind each column and its possible values with a description property in the model's YML file.

Here is an example:

version: 2

models:
  - name: dim_customers
    description: One record per customer

    columns:
      - name: customer_id
        description: Primary key

You can do this for every single column you have in every single table that dbt will generate.

The cherry on top is that dbt has a built-in feature for generating a relatively non-technical-user-friendly documentation site. In that site, users of your data mart can see the description of each column clearly.
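
Generating and browsing that site takes two standard dbt commands:

# compile the documentation site from your models, sources, and YML descriptions
dbt docs generate

# serve it locally so you can click through models, columns, and lineage
dbt docs serve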

For situations where you want to, say, explain what each status value means, or where you need to be verbose, "doc blocks" allow you to describe a field with a markdown file. The markdown file looks like this:

{% docs orders_status %}

Orders can be one of the following statuses:

| status         | description                                                               |
|----------------|---------------------------------------------------------------------------|
| placed         | The order has been placed but has not yet left the warehouse              |
| shipped        | The order has been shipped to the customer and is currently in transit     |
| completed      | The order has been received by the customer                               |
| returned       | The order has been returned by the customer and received at the warehouse |


{% enddocs %}

models/docs.md

And to refer to it, you use curly braces and the doc(...) macro in the model's YML file:

version: 2

models:
  - name: fct_orders
    description: This table has basic information about orders, as well as some derived facts based on payments

    columns:
      - name: status
        description: '{{ doc("orders_status") }}'

Hopefully you get the point that dbt is very useful and that their own dbt Fundamentals course is well worth your time.

I feel much more prepared now to deliver the big data project!
