Image generated using Midjourney.

This post was originally written by my business “better half” Marco Baringer (go give him a follow). I’m only posting it on my Medium on his behalf.

This is an opinionated guide to building a solid data setup centred on the modelling layer: DOs and DON’Ts, along with the “whys” and the “what ifs”. It will get you off the ground and is straightforward to implement, assuming you’re not afraid to get your hands a little dirty.

1. Problem

You’re the first or the only data person in a company. Everyone around you has questions and turns to you for answers. You want a simple setup that will allow you to organise the data and communicate the information in a coherent and reproducible way.

You’ve done some googling, because you know you’re not the only one with this problem, but there are so many tools, opinions, and vendors competing for your attention that the choices are overwhelming.

2. Solution (TL;DR)

Use Airbyte or Fivetran to dump your raw data into one big pot in BigQuery.

Add four (or five) layers of dbt models to transform the raw data into usable business data. Use one git repo but split your models into several layers:

  • A bottom (“base”) layer where you do column renaming, lower casing, type conversion, etc.
  • A middle (“intermediate”) layer, or two, where you build up your business entities
  • A top (“mart”) layer where you expose copies of business data models
  • Optionally, a (dirty) ad-hoc layer on top for one-off, tool-specific models

Use the visualisation and ad hoc querying tools you or your stakeholders are most comfortable with (yes, spreadsheets do qualify as such) on a small curated subset of the top-layer models you generated in the previous step.

3. Ingestion: importing the raw data

Do this

Create a single GCP project for all your raw data.

  1. Use Airbyte running on a cheap VM or bare-metal server from Hetzner or OVH to ingest the data.
  2. Create a dataset for every source (“backend”, “salesforce”, or “spreadsheets”), and inside that a table for every table or entity in the source. (This is what Airbyte will do by default anyway.)
  3. Store your data as raw JSON, two columns per table: one for the imported-at timestamp and one for the JSON blob of data (see the sketch below).
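For illustration, a raw table under this scheme ends up looking something like the sketch below (BigQuery syntax; the project, dataset, and column names are placeholders, and Airbyte’s own column names will differ slightly):

```sql
-- One raw table per source entity: just a load timestamp and the untouched record.
create table if not exists `my_company_raw.airbyte_backend.orders` (
  _imported_at timestamp,  -- when the ingestion tool loaded this row
  data         json        -- the record exactly as the source returned it
);
```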

What if I need to do more with my marketing data than just cost and spend per campaign?

Use Improvado (or Funnel).

If you work extensively with marketing data, beyond just the tracking data you can pull out of GA4, look at Improvado or Funnel instead of Fivetran or Airbyte. While these services aren’t free, they are focused on marketing: they know what data AdWords, Facebook, TikTok, etc. have; they know what kinds of questions we usually ask of marketing data; and they deliver the data in a schema and with a level of quality that makes answering those questions easy.

For example, while the AdWords API might expose geo-information as location IDs, which then must be joined against a separate API to get postal codes or country names, marketing-specific services will generally provide the geo-information directly as human-readable text, saving you an extra data integration and an extra model or two in the warehouse.

What if I need to explore the raw data in the data warehouse itself?

Consider using your ingestion tool to create tables with proper columns and data types, but only if you don’t have query-level access to the raw source (for example, the app backend is behind a firewall or a VPN).

If, however, you can query the source directly, then keep your data as one JSON blob per row/object. This has two advantages over “split into columns” raw data:

  1. If your source schema changes, your warehouse schema doesn’t. This means you can decide when, or if, to adapt your code for the new schema, and it’s impossible (or at least much harder) for changes in the source, over which you have no control, to break your pipeline.
  2. You control how source data is converted to warehouse data types. Time zone parsing, string handling, JSON manipulation: all of these become explicit decisions you make in your code.

This style does require you to explicitly list, in the base layer, all the columns you want to pull out. But 99% of the time you’re listing those columns anyway; you very rarely need all the data in the source. The overhead is real, but it’s trivial given what we have to do with the data later on anyway.
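To make that overhead concrete, here is a minimal sketch of a base-layer model pulling columns out of a JSON blob (BigQuery syntax; the model and field names are hypothetical):

```sql
-- base_backend__orders.sql: explicitly pick out and type the fields we need.
select
    json_value(data, '$.id')                            as order_id,
    json_value(data, '$.customer_id')                   as customer_id,
    -- type conversions are explicit decisions made in our code, not by the tool
    cast(json_value(data, '$.total_amount') as numeric) as total_amount,
    timestamp(json_value(data, '$.created_at'))         as created_at
from {{ source('airbyte_backend', 'orders') }}
```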

What if I have data in sources Airbyte doesn’t support?

Use a single-node Airflow instance, running in development/standalone mode, on a small server somewhere. A bare-metal provider like Hetzner or OVH is great for this: the Airflow code lives in a git repo somewhere else, the data is only being moved from an external source to an external warehouse, and the load on the server is basically constant. You don’t need autoscaling or backups or anything fancy that the cloud providers give you, so don’t pay for it.

Having said all that, if you’re reading this guide you probably don’t have the time to be building your own data connectors, so just find a way to ignore this custom source or, since it’s probably something around your app’s backend, convince (or bribe) the engineering team to dump the data you need in BigQuery.

What about real-time data?

Don’t.

As nice as real-time data is (and we all love pounding the refresh key after a release and watching the numbers change), it comes with a lot of technical challenges and ongoing maintenance costs, and it is rarely worth the effort at this early stage of your company’s data setup.

Why one project for all the raw data?

Billing, auditing, not losing things, easy joining in the modelling layer.

Given what happens in the modelling layer, it’s easiest if all the raw data, for all the sources, lands in the same project (BigQuery) or database (Snowflake), and you use dataset/schema names, or table prefixes, to separate one source from another.

Why such long dataset names?

  • my_company_raw.airbyte_backend.orders -> a copy of the orders table from the backend via Airbyte
  • my_company_raw.fivetran_adwords.campaign -> AdWords data as imported by Fivetran
  • my_company_raw.airbyte_ga4.campaign -> GA4 data as imported by Airbyte

If you put the name of the tool in the schema name, then later on you can easily migrate to another tool:

  • There are no naming collisions: different tools importing the same source land in different datasets.
  • Every tool is different, and the schemas they produce are not the same. With this split you can deploy the migrations in smaller steps: new code can reference the new name and schema, and old code can reference the old name and schema.

4. Modelling: the base layer

Do this

This is where the first, simplest, and most mechanical cleanups and transformations of the raw data occur. The goal is to create a set of tables and columns that are more convenient to use in the intermediate layer (the place where we add our actual business logic). For example, your backend might have evolved over a few years and use names and labels that aren’t correct anymore but can’t reasonably be changed in the DB schema. Propagating these bad names into the intermediate layer makes everything else confusing, so fix that here.

Generally, this layer will have about as many tables, and columns, as the source data.

What kinds of things should I do in the base layer?

Other examples of things you would want to do in the base layer (a sketch of such a model follows the list):

  • Removing test or debugging data which nobody has yet removed from the CRM
  • Converting integer status codes or enums to readable labels
  • Converting to a single casing and naming conventions (so it’s all snake_case or camelCase)
  • Ignoring columns and tables that aren’t at all relevant to the warehouse so that autocompletion is more useful when writing intermediate layer code
  • Making strings and fields consistent: Removing spurious whitespace, converting things to all upper or all lower case as the case may be, coalescing boolean fields so NULLs become false, etc.
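As a sketch, a base model applying a few of these cleanups might look like this (the source, column names, and status codes are made up; if your raw data is stored as JSON blobs, combine this with the JSON extraction shown earlier):

```sql
-- base_salesforce__accounts.sql: mechanical cleanup only, no business logic.
select
    Id                          as account_id,       -- consistent snake_case names
    trim(lower(Email))          as email,            -- one casing, no stray whitespace
    coalesce(IsActive, false)   as is_active,        -- NULL booleans become false
    case StatusCode                                  -- readable labels instead of enums
        when 1 then 'prospect'
        when 2 then 'customer'
        when 3 then 'churned'
    end                         as status
from {{ source('airbyte_salesforce', 'accounts') }}
where Email not like '%@example.com'                 -- drop leftover test data
```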

What kinds of things should I NOT do in the base layer?

  • Don’t put business logic in here. You will absolutely have business logic in the next layer up (the intermediate layer); if you keep it all in one place, by not putting any of it here, you won’t have to go hunting through code and conditions when debugging things later on (or, worse, miss that some business- or process-specific data transformation is happening).
  • Don’t join across tables or aggregate rows. You’ll be doing enough joining and aggregating in the intermediate layer. Moreover, aggregations you make here can’t be unmade later on. It’s best to postpone any changes which destroy data until it’s obvious what they’re used for and you’re sure you don’t need the fine-grained information.

Shouldn’t this be called staging?

No, base is better.

Staging means something very specific, and different, to the dev and engineering teams, so we’re going with the unambiguous name “base”.

5. Modelling: the intermediate layer

This layer exists for two equally important reasons:

  1. To produce a limited set of very wide tables our stakeholders will use to answer their business questions.
  2. To allow us to organise, manage, and control the ever-increasing complexity of the business logic.

Do this

This layer depends completely on the specific business and use cases, so it’s not possible to give a prescriptive “do this then that”, but if you were working for a normal online store, here’s what it would look like (assume order and transaction data is in Shopify, customer contract and CRM data is in Salesforce, and payment data is in Stripe):

  • Join the base tables backend.customer and backend.customer_addresses to create int_backend_customer with a set of primary_address_... columns.
  • Join the CRM tables salesforce.account and salesforce.customer to create an int_sf_customer table
  • Join these two tables to produce int_customer_details
  • Perform similar steps for the Shopify order data, and the invoicing data from Stripe, to produce int_order
  • Produce an int_customer_day daily table by joining int_customer_details and int_order, grouping by day, and calculating num_orders, num_cancelled_orders, total_wallet_to_date, reorder_probability, etc. (sketched below)
  • Finally, aggregate and join int_customer_details and int_customer_day to produce int_customer, a very wide table with all the business data around our customers.
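For example, the int_customer_day model might look roughly like this (a sketch only; the columns and metrics are stand-ins, and trickier metrics like reorder_probability are omitted):

```sql
-- int_customer_day.sql: one row per customer per day, built from other intermediate models.
select
    c.customer_id,
    date(o.ordered_at)               as order_date,
    count(o.order_id)                as num_orders,
    countif(o.status = 'cancelled')  as num_cancelled_orders,
    sum(o.total_amount)              as total_order_value
from {{ ref('int_customer_details') }} c
join {{ ref('int_order') }} o
    on o.customer_id = c.customer_id
group by 1, 2
```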

How do I know how to split my models?

While the raw and base layers will basically organise themselves, as they follow directly from the sources and the ingestion tool, there’s too much freedom in organising the intermediate layer.

Aim to build little blocks of logic around bigger and bigger business concepts:

  • Select a business-level entity like order, invoice, or ticket. Join together all the backend tables to produce a single “order as it exists in the backend” table. Do this for the various sources and then use “order as it exists in the backend” along with “order as it exists in Stripe” and “order as it exists in Salesforce” to produce a single “order as it’s understood in our business” entity (see the sketch after this list).
  • When you see that the same calculation or aggregation is used in multiple places, move that to a model and reference that model in multiple places instead of repeating the code to do the calculation itself.
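Sketched in dbt terms (the model names int_order__backend, int_order__stripe, and int_order__salesforce are hypothetical stand-ins for the per-source order models), the “order as it’s understood in our business” step might look like this:

```sql
-- int_order.sql: one row per order, combining what each source knows about it.
select
    b.order_id,
    b.customer_id,
    b.ordered_at,
    s.payment_status,     -- what Stripe knows about the payment
    sf.contract_id        -- what Salesforce knows about the contract
from {{ ref('int_order__backend') }} b
left join {{ ref('int_order__stripe') }}     s  using (order_id)
left join {{ ref('int_order__salesforce') }} sf using (order_id)
```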

But then I’ll have lots and lots of really short SQL files

Yes, and that means you’ll be able to easily find what you’re looking for just by scanning the file names in the directory. It means you’ll always be able to fit the logic for a model on a single monitor screen. It means you can test the pipeline by querying intermediate steps in the modelling layer.

This doesn’t mean you should never use CTEs (quite the contrary), but for anything longer than 20 or 30 lines, or for anything which needs to be used in multiple places, an intermediate model is a good idea.

Also, these models are basically free. All tables in the intermediate layer are written exactly once and read a small number of times (once per downstream model) and then never touched again. It’s just not that important if they run in 7 seconds or 17.

One thing to decide is how to name these models; go with <LAYER>_<MODEL>__<PART>.sql (for example, int_order__backend.sql for the “order as it exists in the backend” model).

But this isn’t a star schema?

It is, but we’re dropping the words dim and fact and using only the words we need to understand where to find what.

But won’t this layer get very complicated?

Only as complicated as the business itself 🙂

How do I use this layer to make debugging and maintenance easier?

This layer is where you will go when people ask “where does this number come from?”, so keep that in mind while implementing it.

Keep the code simple. Find the right balance between “not too clever” and “not overly verbose”: you want to be able to quickly relearn what every model does, because in six months you’ll be back in the code trying to debug it or adapt it to changes in the business.

6. Modelling: the mart

This layer doesn’t have any real logic and exists for non-modelling reasons:

  • It’s a bucket where we put the models we want to expose to the business. This allows us to freely create intermediate models based on what’s most convenient for us and our development process, without worrying that our stakeholders will build reports on data we don’t want to maintain.
  • Since a layer maps to a dataset (BigQuery) or a schema (Snowflake), it provides a simple way to apply access controls and permissions, and to track billing.
  • It’s a basic form of documentation: unlike the raw, base, and intermediate layers, the mart layer is small. Simply scanning the tables and columns should give a nice overview of what kind of data and metrics are available in the data warehouse.

Do this

To keep things simple and easy to reuse, implement it as just a bunch of copies of intermediate tables. Literally just select * from {{ ref('int_...') }}, maybe dropping some debugging or logging info or doing final transformations which make life easier for the visualization and reporting tools.
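In practice a mart model is rarely more than this (a sketch using BigQuery’s select * except syntax; the model and the dropped columns are hypothetical):

```sql
-- mart_customer.sql: a thin, stable copy of the intermediate model for stakeholders.
select
    * except (debug_payload, _loaded_at)  -- drop internal/debugging columns
from {{ ref('int_customer') }}
```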

7. Modelling: the (dirty) ad-hoc layer

Do this

  • Invest as little time here as possible. You’re building these models for specific use cases based on the constraints of other tools or arbitrary requirements.
  • Example: your start-up’s investors want a pivot table over a bunch of unrelated data points.
  • Example: your visualization tool requires mixing rows of totally different kinds, such as single transactions and weekly targets, into the same table (I’m looking at you, Looker…); see the sketch below.
  • Optimize more for speed of development than reusability. Building one-off solutions here is OK. This layer is the end of your pipeline; there’s no benefit in being generic and reusable here.
  • Drop everything older than 6 months. If it’s still needed, find a way to move it cleanly into the mart; otherwise, delete it.
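For instance, the “mixing rows of different kinds” case above might be handled with a quick union like this (a throwaway sketch; the models and columns are hypothetical):

```sql
-- adhoc_transactions_and_targets.sql: one table mixing actuals and weekly targets,
-- purely because the visualization tool wants it that way.
select
    'transaction'     as row_type,
    date(ordered_at)  as day,
    total_amount      as value
from {{ ref('mart_order') }}

union all

select
    'weekly_target'   as row_type,
    week_start_date   as day,
    target_amount     as value
from {{ ref('mart_weekly_targets') }}
```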

8. What’s next?

Take a break. Once you’ve gotten here, take an evening and do whatever you do to relax and have fun. Getting to this point isn’t easy when you’re all by yourself. Well done.

Then consider adding two small improvements:

  1. Add a few tests, so you notice when things break (a starting point is sketched below).
  2. Set up an extra environment, or two, so you can test things without breaking production.
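If you’re wondering where to start with tests, a dbt “singular” test is just a SQL file in the tests/ directory that should return zero rows; for example (hypothetical model and column names):

```sql
-- tests/assert_no_duplicate_customer_ids.sql
-- Fails (returns rows) if any customer_id appears more than once in int_customer.
select
    customer_id,
    count(*) as num_rows
from {{ ref('int_customer') }}
group by customer_id
having count(*) > 1
```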
