# Building dbt packages

## Introduction
Creating packages is an advanced use of dbt. If you're new to the tool, we recommend that you first use the product for your own analytics before attempting to create a package for others.
## Prerequisites

A strong understanding of:

- packages
- administering a repository on GitHub
- semantic versioning
## Assess whether a package is the right solution
Packages typically contain either:

- macros that solve a particular analytics engineering problem — for example, auditing the results of a query, generating code, or adding additional schema tests to a dbt project.
- models for a common dataset — for example, a dataset for software products like Mailchimp or Snowplow, or even models for metadata about your data stack, like Snowflake query spend and the artifacts produced by `dbt run`. In general, there should be a shared set of industry-standard metrics that you can model (e.g. email open rate).

Packages are not a good fit for sharing models that contain business-specific logic, such as marketing attribution or monthly recurring revenue. Instead, consider sharing a blog post and a link to a sample repo rather than bundling this code as a package (here's our blog post on marketing attribution as an example).
## Create your new project

We tend to use the command line interface for package development. The development workflow often involves installing a local copy of your package in another dbt project — at present, dbt Cloud is not designed for this workflow.

1. Use the `dbt init` command to create a new dbt project, which will be your package:

   ```shell
   $ dbt init [package_name]
   ```

2. Create a public GitHub¹ repo, named `dbt-<package-name>`, e.g. `dbt-mailchimp`. Follow the GitHub instructions to link this to the dbt project you just created.
3. Update the `name:` of the project in `dbt_project.yml` to your package name, e.g. `mailchimp`.
4. Define the allowed dbt versions by using the `require-dbt-version` config (see the sketch below).

¹Currently, our package registry only supports packages that are hosted in GitHub.
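As a minimal sketch of step 4, assuming your package is named `mailchimp` and supports any dbt 1.x release:

```yaml
# dbt_project.yml (in your package)
name: mailchimp
version: 0.1.0

require-dbt-version: [">=1.0.0", "<2.0.0"]
```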
## Develop your package

We recommend that first-time package authors develop their macros and models in their own dbt project first. Once your new package is created, you can move them across, implementing some additional package-specific design patterns along the way.

When working on your package, we often find it useful to install a local copy of the package in another dbt project — this workflow is described here.
### Follow best practices

*Modeling packages only*

Use our dbt coding conventions, our article on how we structure our dbt projects, and our best practices for all of our advice on how to build your dbt project. This is where having previously worked on your own dbt project comes in especially handy.
### Make the location of raw data configurable

*Modeling packages only*

Not every user of your package is going to store their Mailchimp data in a schema named `mailchimp`. As such, you'll need to make the location of raw data configurable.

We recommend using sources and variables to achieve this. Check out this package for an example — notably, the README includes instructions on how to override the default schema from a `dbt_project.yml` file.
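As a minimal sketch of this pattern (the `mailchimp` source and variable names are hypothetical), define a source whose schema is read from a variable with a sensible default:

```yaml
# models/src_mailchimp.yml (in your package)
version: 2

sources:
  - name: mailchimp
    # Users can override this variable from their own dbt_project.yml
    schema: "{{ var('mailchimp_schema', 'mailchimp') }}"
    tables:
      - name: campaigns
      - name: lists
```

A user whose raw data lives in a different schema can then set `mailchimp_schema` under `vars:` in their own `dbt_project.yml`.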
### Install upstream packages from hub.getdbt.com

If your package relies on another package (for example, you use some of the cross-database macros from dbt-utils), we recommend you install the upstream package from hub.getdbt.com, specifying a version range like so:

```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=0.6.5", "<0.7.0"]
```

When packages are installed from hub.getdbt.com, dbt is able to handle duplicate dependencies.
### Implement cross-database compatibility

Many SQL functions are specific to a particular database. For example, the function name and order of arguments to calculate the difference between two dates varies between Redshift, Snowflake and BigQuery, and no similar function exists on Postgres!

If you wish to support multiple warehouses, we have a number of tricks up our sleeve:

- We've written a number of macros that compile to valid SQL snippets on each of the original four adapters. Where possible, leverage these macros.
- If you need to implement cross-database compatibility for one of your macros, use the `adapter.dispatch` macro to achieve this. Check out the cross-database macros in dbt-utils for examples, or the sketch after this list.
- If you're working on a modeling package, you may notice that you need to write different models for each warehouse (for example, if the EL tool you are working with stores data differently on each warehouse). In this case, you can write different versions of each model and use the `enabled` config, in combination with `target.type`, to enable the correct models — check out this package as an example (a config sketch also follows this list).
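As a sketch of the `adapter.dispatch` pattern (the `mailchimp` namespace and the implementations are illustrative, loosely modeled on the dbt-utils macros):

```sql
{# macros/datediff.sql #}
{# Dispatching macro: dbt picks the implementation that matches the
   current adapter, falling back to default__datediff if none exists. #}
{% macro datediff(first_date, second_date, datepart) %}
    {{ return(adapter.dispatch('datediff', 'mailchimp')(first_date, second_date, datepart)) }}
{% endmacro %}

{# Works as-is on Redshift and Snowflake #}
{% macro default__datediff(first_date, second_date, datepart) %}
    datediff({{ datepart }}, {{ first_date }}, {{ second_date }})
{% endmacro %}

{# BigQuery uses a different function name and argument order #}
{% macro bigquery__datediff(first_date, second_date, datepart) %}
    datetime_diff(
        cast({{ second_date }} as datetime),
        cast({{ first_date }} as datetime),
        {{ datepart }}
    )
{% endmacro %}
```

And a sketch of the `enabled` plus `target.type` pattern for warehouse-specific models (the model and source names are hypothetical):

```sql
-- models/redshift/mailchimp_campaigns.sql
-- Only build this version of the model when running against Redshift
{{ config(enabled=(target.type == 'redshift')) }}

select * from {{ source('mailchimp', 'campaigns') }}
```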
If your package has only been written to work for one data warehouse, make sure you document this in your package README.
### Use specific model names

*Modeling packages only*

Many datasets have a concept of a "user", "account", or "session". To make sure things are unambiguous in dbt, prefix all of your models with `[package_name]_`. For example, `mailchimp_campaigns.sql` is a good name for a model, whereas `campaigns.sql` is not.
### Default to views

*Modeling packages only*

dbt makes it possible for users of your package to override your model materialization settings. In general, default to materializing models as `view`s instead of `table`s.
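A minimal sketch of setting this default in the package's own `dbt_project.yml`, assuming the package is named `mailchimp`:

```yaml
# dbt_project.yml (in your package)
models:
  mailchimp:
    +materialized: view
```

Because this is set at the project level, end users can still override it from their own `dbt_project.yml`.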
The major exception to this is when working with data sources that benefit from incremental modeling (for example, web page views). Implementing incremental logic on behalf of your end users is likely to be helpful in this case.
## Test and document your package

It's critical that you test your models and sources. This will give your end users confidence that your package actually works on top of their dataset as intended.

Further, adding documentation via descriptions will help communicate your package to end users and benefit the stakeholders who use the outputs of this package.
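For example, a sketch of a properties file that both tests and documents a model (the model and column names are hypothetical):

```yaml
# models/mailchimp.yml (in your package)
version: 2

models:
  - name: mailchimp_campaigns
    description: One record per campaign sent from Mailchimp.
    columns:
      - name: campaign_id
        description: Primary key of this table.
        tests:
          - unique
          - not_null
```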
### Include useful GitHub artifacts

Over time, we've developed a set of useful GitHub artifacts that make administering our packages easier for us. In particular, we ensure that we include:

- A useful README
- GitHub templates, including PR templates and issue templates (example)
### Add integration tests

*Optional*

We recommend that you implement integration tests to confirm that the package works as expected — this is an even more advanced step, so you may find that you build up to this.

This pattern can be seen in most packages, including the `audit-helper` and `snowplow` packages.
As a rough guide:

1. Create a subdirectory named `integration_tests`.
2. In this subdirectory, create a new dbt project — you can use the `dbt init` command to do this. However, our preferred method is to copy the files from an existing `integration_tests` project, like the ones here (removing the contents of the `macros`, `models` and `tests` folders, since they are project-specific).
3. Install the package in the `integration_tests` subdirectory by using the `local` syntax, and then running `dbt deps`:

   ```yaml
   packages:
     - local: ../ # this means "one directory above the current directory"
   ```

4. Add resources to the package (seeds, models, tests) so that you can successfully run your project, and compare the output with what you expect. The exact approach here will vary depending on your package. In general, you will find that you need to:
   - Add mock data via a seed with a few sample (anonymized) records. Configure the `integration_tests` project to point to the seeds instead of raw data tables.
   - Add more seeds that represent the expected output of your models, and use the `dbt_utils.equality` test to confirm that the output of your package matches the expected output (see the sketch after this list).
5. Confirm that you can run `dbt run` and `dbt test` from your command line successfully.
6. (Optional) Use a CI tool, like CircleCI or GitHub Actions, to automate running your dbt project when you open a new Pull Request. For inspiration, check out one of our CircleCI configs, which runs tests against our four main warehouses. Note: this is an advanced step — if you are going down this path, you may find it useful to say hi on dbt Slack.
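A sketch of the equality test from step 4, assuming hypothetical model and seed names:

```yaml
# integration_tests/models/mailchimp.yml
version: 2

models:
  - name: mailchimp_campaigns
    tests:
      - dbt_utils.equality:
          compare_model: ref('expected_mailchimp_campaigns')
```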
## Deploy the docs for your package

*Optional*

A dbt docs site can help a prospective user of your package understand the code you've written. As such, we recommend that you deploy the site generated by `dbt docs generate` and link to the deployed site from your package.

The easiest way we've found to do this is to use GitHub Pages:

1. On a new git branch, run `dbt docs generate`. If you have integration tests set up (above), use the integration-test project to do this.
2. Move the following files into a directory named `docs` (example): `catalog.json`, `index.html`, `manifest.json`, `run_results.json`.
3. Merge these changes into the main branch.
4. Enable GitHub Pages on the repo in the settings tab, and point it to the `docs` subdirectory.
5. GitHub should then deploy the docs at `<org-name>.github.io/<repo-name>`, like so: fivetran.github.io/dbt_ad_reporting
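A sketch of the first three steps from the command line (`dbt docs generate` writes these files into the `target` directory by default; the branch name is illustrative):

```shell
git checkout -b add-docs-site
dbt docs generate
mkdir -p docs
cp target/catalog.json target/index.html \
   target/manifest.json target/run_results.json docs/
git add docs
git commit -m "Add dbt docs site"
```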
## Release your package

Create a new release once you are ready for others to use your work! Be sure to use semantic versioning when naming your release.

In particular, if new changes will cause errors for users of earlier versions of the package, be sure to use at least a minor release (e.g. go from `0.1.1` to `0.2.0`).
The release notes should contain an overview of the changes introduced in the new version. Be sure to call out any changes that break the existing interface!
## Add the package to hub.getdbt.com

Our package registry, hub.getdbt.com, gets updated by the hubcap script. To add your package to hub.getdbt.com, create a PR on the hubcap repository to include it in the `hub.json` file.
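At the time of writing, `hub.json` maps each GitHub organization to a list of its package repositories, so the entry you add would look roughly like this (the organization and repository names are illustrative):

```json
{
    "your-org-name": [
        "dbt-mailchimp"
    ]
}
```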