TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Deploying dbt Projects at Scale on Google Cloud

Containerising and running dbt projects with Artifact Registry, Cloud Composer, GitHub Actions and dbt-airflow

Giorgos Myrianthous · Published in TDS Archive · 11 min read · Jul 29, 2024

Managing data models at scale is a common challenge for data teams using dbt (data build tool). Teams typically start with simple models that are easy to manage and deploy, but as data volumes grow and business needs evolve, the complexity of these models increases.

This progression often leads to a monolithic repository where all dependencies are intertwined, making it difficult for different teams to collaborate efficiently. To address this, data teams may find it beneficial to distribute their data models across multiple dbt projects. This approach not only promotes better organisation and modularity but also enhances the scalability and maintainability of the entire data infrastructure.

One significant complexity introduced by handling multiple dbt projects is the way they are executed and deployed. Managing library dependencies becomes a critical concern, especially when different projects require different versions of dbt. While dbt Cloud offers a robust solution for scheduling and executing multi-repo dbt projects, it requires a significant investment that not every organisation can afford or justify. A common alternative is to run dbt projects using Cloud Composer, Google Cloud’s managed Apache Airflow service.
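
To make the comparison concrete, here is a minimal sketch of what the direct approach often looks like, assuming dbt-core is installed in the Composer environment and the project is synced to the environment’s GCS bucket; the DAG id and project path below are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical path: assumes the dbt project is synced to the Composer
# environment's GCS bucket (mounted at /home/airflow/gcs) and that
# dbt-core is installed as a PyPI package in the environment.
DBT_PROJECT_DIR = "/home/airflow/gcs/dags/dbt/my_dbt_project"

with DAG(
    dag_id="dbt_run_on_composer_directly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=(
            f"dbt run --project-dir {DBT_PROJECT_DIR} "
            f"--profiles-dir {DBT_PROJECT_DIR}"
        ),
    )
```

The catch is the assumption baked into this sketch: that dbt can be installed in the Composer environment in the first place.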

Cloud Composer provides a managed environment with a substantial set of pre-installed dependencies. However, in my experience, this setup poses a significant challenge: installing any additional Python library without running into unresolved dependency conflicts is often difficult. When working with dbt-core, I found that installing a specific version of dbt within the Cloud Composer environment was nearly impossible due to conflicting version requirements. This experience highlighted the difficulty of running any dbt version on Cloud Composer directly.

Containerisation offers an effective solution. Instead of installing libraries within the Cloud Composer environment, you can containerise your dbt projects as Docker images and run them on Kubernetes via Cloud Composer. This approach keeps your Cloud Composer environment clean and free of dependency conflicts.
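
As a rough sketch of what this looks like in practice, a Composer DAG can launch the containerised dbt project with the KubernetesPodOperator. The image path, namespace and dbt arguments below are placeholders, and the full setup with Artifact Registry, GitHub Actions and dbt-airflow is what the rest of this article builds towards:

```python
from datetime import datetime

from airflow import DAG
# Import path may differ on older versions of the cncf-kubernetes provider.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Placeholder image path: assumes the dbt project has been built into a Docker
# image (with its own pinned dbt version) and pushed to Artifact Registry,
# e.g. by a GitHub Actions workflow.
DBT_IMAGE = "europe-west2-docker.pkg.dev/my-gcp-project/dbt/my-dbt-project:latest"

with DAG(
    dag_id="dbt_run_containerised",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = KubernetesPodOperator(
        task_id="dbt_run",
        name="dbt-run",
        # Namespace depends on your Composer/GKE setup; Composer 2 commonly
        # uses a dedicated namespace for user workloads.
        namespace="composer-user-workloads",
        image=DBT_IMAGE,
        cmds=["dbt"],
        arguments=["run", "--profiles-dir", "."],
        get_logs=True,
    )
```

Because each dbt project is baked into its own image, every project can pin its own dbt version without touching the packages installed in the Composer environment itself.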

