{"id":1378,"date":"2025-01-23T07:03:51","date_gmt":"2025-01-23T07:03:51","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/23\/optimize-the-dbt-doc-function-with-a-ci-5e2c3e8428e7\/"},"modified":"2025-01-23T07:03:51","modified_gmt":"2025-01-23T07:03:51","slug":"optimize-the-dbt-doc-function-with-a-ci-5e2c3e8428e7","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/23\/optimize-the-dbt-doc-function-with-a-ci-5e2c3e8428e7\/","title":{"rendered":"Optimize the dbt Doc Function with a CI"},"content":{"rendered":"<p>    Optimize the dbt Doc Function with a CI<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>How to set an automated check to improve your dbt documentation<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AipPU8SjbSbqth6gsejDBGg.jpeg?ssl=1\"><figcaption>Image by the author (generated with\u00a0chatgpt)<\/figcaption><\/figure>\n<p>In large dbt projects, maintaining consistent and up-to-date documentation can be a challenge. Although dbt\u2019s {{ doc() }} function allows you to store and reuse descriptions for the columns of your models, ensuring its usage remains quite a manual process and prone to mistakes, and therefore can easily lead to incomplete or outdated documentation. In this article, we\u2019ll explore how to automate the validation of your dbt documentation using a custom Continuous Integration (CI) workflow in GitHub\u00a0Actions.<\/p>\n<p>I already wrote about the doc feature in dbt and how it helps create consistent and accurate documentation across the entire dbt project (see <a href=\"https:\/\/medium.com\/@massimocapobianco\/setting-up-a-dbt-project-a-short-guide-on-best-practices-and-lesser-known-features-8acb8148ed37\">this<\/a>). In short, you can store the description of the most common\/important columns used in the data models in your project by adding them in the <em>docs.md<\/em> files, which live in the docs folder of your dbt\u00a0project.<\/p>\n<p>A very simple example of a <em>orders.md<\/em> file that contains the description for the most common customer-related column\u00a0names:<\/p>\n<pre># Fields description<br><br>## order_id<br>{% docs orders__order_id %}<br><br>Unique alphanumeric identifier of the order, used to join all order dimension tables<br><br>{% enddocs %}<br><br>## order_country<br>{% docs orders__order_country %}<br><br>Country where the order was placed. Format is country ISO 3166 code.<br><br>{% enddocs %}<br><br>## order_value<br>{% docs orders__value %}<br><br>Total value of the order in local currency. <br><br>{% enddocs %}<br><br>## order_date<br>{% docs orders__date %}<br><br>Date of the order in local timezone<br><br>{% enddocs %}<\/pre>\n<p>And its usage in the\u00a0<em>.yml<\/em> file of a\u00a0model:<\/p>\n<pre>    columns:<br>      - name: order_id<br>        description: '{{ doc(\"orders__order_id\") }}'<\/pre>\n<p>When the dbt docs are generated the description of order_id will be always the same, as long as the <em>doc<\/em> function is used in the yml file of the model. The benefit of having this centralized documentation is clear and undeniable.<\/p>\n<h4>The challenge<\/h4>\n<p>However, especially with large projects and frequent changes (new models, or changes to existing ones), it is likely that the repository\u2019s contributors will either forget to use the <em>doc<\/em> function, or they are not aware that a specific column has been added to the <em>docs<\/em> folder. This has two consequences:<\/p>\n<ul>\n<li>someone must catch this during PR review and request a change\u200a\u2014\u200aassuming there\u2019s at least one reviewer who either knows all the documented columns by heart or always checks the docs folder\u00a0manually<\/li>\n<li>if it\u2019s easy to go unnoticed and relies on individuals, this setup defeats the purpose of having a centralized documentation.<\/li>\n<\/ul>\n<h4>The solution<\/h4>\n<p>The simple answer to this problem is a CI (continuous integration) check, that combines a GitHub workflow with a python script. This check fails\u00a0if:<\/p>\n<ul>\n<li>the changes in the PR are affecting a\u00a0.yml file that contains a column name present in the docs, but the <em>doc<\/em> function is not used for that\u00a0column<\/li>\n<li>the changes in the PR are affecting a\u00a0.yml file that contains a column name present in the docs, but that column has no description at\u00a0all<\/li>\n<\/ul>\n<p>Let\u2019s have a closer look at the necessary code and files to run this check, and to a couple of examples. As previously mentioned, there are two things to consider: a <strong><em>(1)<\/em><\/strong>\u00a0.yml file for the workflow and a <strong><em>(2)<\/em><\/strong> python file for the actual validation check.<\/p>\n<p><strong><em>(1)<\/em><\/strong> This is how the <em>validation_docs<\/em> file looks like. It is placed in the <em>github\/workflows<\/em> folder.<\/p>\n<pre>name: Validate Documentation<br><br>on:<br>  pull_request:<br>    types: [opened, synchronize, reopened]<br><br>jobs:<br>  validate_docs:<br>    runs-on: ubuntu-latest<br>    steps:<br>      - name: Check out repository code<br>        uses: actions\/checkout@v3<br>        with:<br>          fetch-depth: 0<br><br>      - name: Install PyYAML<br>        run: pip install pyyaml<br><br>      - name: Run validation script<br>        run: python validate_docs.py<\/pre>\n<p>The workflow will run whenever a pull request is open or re-open, and every time that a new commit is pushed to the remote branch. Then there are basically 3 steps: retrieving the repository\u2019s files for the current pull request, install the dependencies, and run the validation script.<\/p>\n<p><strong><em>(2)<\/em><\/strong>. Then the <em>validate_docs.py <\/em>script, placed in the root folder of your dbt project repository, that looks like\u00a0this<\/p>\n<pre>import os<br>import sys<br>import yaml<br>import re<br>from glob import glob<br>from pathlib import Path<br>import subprocess<br><br>def get_changed_files():<br>    diff_command = ['git', 'diff', '--name-only', 'origin\/main...']<br>    result = subprocess.run(diff_command, capture_output=True, text=True)<br>    changed_files = result.stdout.strip().split('n')<br>    return changed_files<br><br>def extract_doc_names():<br>    doc_names = set()<br>    md_files = glob('docs\/**\/*.md', recursive=True)<br>    doc_pattern = re.compile(r'{%s*docss+([^s%]+)s*%}')<br><br>    for md_file in md_files:<br>        with open(md_file, 'r') as f:<br>            content = f.read()<br>            matches = doc_pattern.findall(content)<br>            doc_names.update(matches)<br>    <br>    print(f\"Extracted doc names: {doc_names}\")<br>    return doc_names<br><br><br>def parse_yaml_file(yaml_path):<br>    with open(yaml_path, 'r') as f:<br>        try:<br>            return list(yaml.safe_load_all(f))<br>        except yaml.YAMLError as exc:<br>            print(f\"Error parsing YAML file {yaml_path}: {exc}\")<br>            return []<br><br>def validate_columns(columns, doc_names, errors, model_name):<br>    for column in columns:<br>        column_name = column.get('name')<br>        description = column.get('description', '')<br><br>        print(f\"Validating column '{column_name}' in model '{model_name}'\")<br>        print(f\"Description: '{description}'\")<br><br>        doc_usage = re.findall(r'{{s*doc([\"']([^\"']+)[\"'])s*}}', description)<br>        print(f\"Doc usage found: {doc_usage}\")<br><br>        if doc_usage:<br>            for doc_name in doc_usage:<br>                if doc_name not in doc_names:<br>                    errors.append(<br>                        f\"Column '{column_name}' in model '{model_name}' references undefined doc '{doc_name}'.\"<br>                    )<br>        else:<br>            matching_docs = [dn for dn in doc_names if dn.endswith(f\"__{column_name}\")]<br>            if matching_docs:<br>                suggested_doc = matching_docs[0]<br>                errors.append(<br>                    f\"Column '{column_name}' in model '{model_name}' should use '{{{{ doc(\"{suggested_doc}\") }}}}' in its description.\"<br>                )<br>            else:<br>                print(f\"No matching doc found for column '{column_name}'\")<br><br>def main():<br>    changed_files = get_changed_files()<br>    yaml_files = [f for f in changed_files if f.endswith('.yml') or f.endswith('.yaml')]<br>    doc_names = extract_doc_names()<br>    errors = []<br><br>    for yaml_file in yaml_files:<br>        if not os.path.exists(yaml_file):<br>            continue  <br>        yaml_content = parse_yaml_file(yaml_file)<br>        for item in yaml_content:<br>            if not isinstance(item, dict):<br>                continue<br>            models = item.get('models') or item.get('sources')<br>            if not models:<br>                continue<br>            for model in models:<br>                model_name = model.get('name')<br>                columns = model.get('columns', [])<br>                validate_columns(columns, doc_names, errors, model_name)<br><br>    if errors:<br>        print(\"Documentation validation failed with the following errors:\")<br>        for error in errors:<br>            print(f\"- {error}\")<br>        sys.exit(1)<br>    else:<br>        print(\"All documentation validations passed.\")<br><br>if __name__ == \"__main__\":<br>    main()<\/pre>\n<p>Let\u2019s summarise the steps in the\u00a0script:<\/p>\n<ol>\n<li>it lists all files that have been changed in the pull request compared to the origin\u00a0branch.<\/li>\n<li>it looks through all markdown (<em>.md<\/em>) files within the <em>docs<\/em> folder (including subdirectories) and it searches for special documentation block patterns using a regex. Each time it finds such a pattern, it extracts the doc_name part and adds it to a set of doc\u00a0names.<\/li>\n<li>for each changed\u00a0.yml file, the script opens and parses it using yaml.safe_load_all. This converts the\u00a0.yml content into Python dictionaries (or lists) for easy analysis.<\/li>\n<li>\n<em>validate_columns: <\/em>for each columns defined in the\u00a0.yml files, it checks the description field to see if it includes a <em>{{ doc() }}<\/em> reference. If references are found, it verifies that the referenced doc name actually exists in the set of doc names extracted earlier. If not, it reports an error. If no doc references are found, it attempts to see if there is a doc block that matches this column\u2019s name. Note that here we are using a naming convention like <em>doc_block__column_name<\/em>. If such a block exists, it suggests that the column description should reference this doc.<br \/>Any problems (missing doc references, non-existent referenced docs) are recorded as\u00a0errors.<\/li>\n<\/ol>\n<h4>Examples<\/h4>\n<p>Now, let\u2019s have a look at the CI in action. Given the <em>orders.md<\/em> file shared at the beginning of the article, we now push to remote this commit that contains the <em>ref_country_orders.yml<\/em> file:<\/p>\n<pre>version: 2<br><br>models:<br>  - name: ref_country_orders<br>    description: &gt;<br>      This model filters orders from the staging orders table to include only those with an order date on or after January 1, 2020. <br>      It includes information such as the order ID, order country, order value, and order date.<br>    columns:<br>      - name: order_id<br>        description: '{{ doc(\"orders__order_id\") }}'<br>      - name: order_country<br>        description: The country where the order was placed.<br>      - name: order_value<br>        description: The value of the order.<br>      - name: order_address<br>        description: The address where the order was placed.<br>      - name: order_date<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ArqAwqjoIA5tGoctSD46JWA.png?ssl=1\"><\/figure>\n<p>The CI has failed. Clicking on the Details will take us to the log of the CI, where we see\u00a0this:<\/p>\n<pre>Validating column 'order_id' in model 'ref_country_orders'<br>Description: '{{ doc(\"orders__order_id\") }}'<br>Doc usage found: ['orders__order_id']<br>Validating column 'order_country' in model 'ref_country_orders'<br>Description: 'The country where the order was placed.'<br>Doc usage found: []<br>Validating column 'order_value' in model 'ref_country_orders'<br>Description: 'The value of the order.'<br>Doc usage found: []<br>Validating column 'order_address' in model 'ref_country_orders'<br>Description: 'The address where the order was placed.'<br>Doc usage found: []<br>No matching doc found for column 'order_address'<br>Validating column 'order_date' in model 'ref_country_orders'<br>Description: ''<br>Doc usage found: []<\/pre>\n<p>Let\u2019s analyze the log:<br \/>&#8211; for the <strong><em>order_id<\/em><\/strong> column it found the doc usage in its description. <br \/>&#8211; the <strong><em>order_address<\/em><\/strong> column isn\u2019t found in the docs file, so it returns a <em>No matching doc found for column \u2018order_address\u2019<br \/><\/em>&#8211; for the <strong><em>order_value<\/em><\/strong> and <strong><em>order_country<\/em><\/strong>, it knows that they are listed in the docs but the doc usage is empty. Same for the <strong><em>order_date<\/em><\/strong>, and note that for this one we didn\u2019t even add a description line<\/p>\n<p>All good so far. But let\u2019s keep looking at the\u00a0log:<\/p>\n<pre>Documentation validation failed with the following errors:<br>- Column 'order_country' in model 'ref_country_orders' should use '{{ doc(\"orders__order_country\") }}' in its description.<br>- Column 'order_value' in model 'ref_country_orders' should use '{{ doc(\"orders__order_value\") }}' in its description.<br>- Column 'order_date' in model 'ref_country_orders' should use '{{ doc(\"orders__order_date\") }}' in its description.<br>Error: Process completed with exit code 1.<\/pre>\n<p>Since <strong><em>order_country<\/em><\/strong>, <strong><em>order_value<\/em><\/strong>, <strong><em>order_date<\/em><\/strong> are in the docs file, but the doc function isn\u2019t used, the CI raise an error. And it suggests the actual value to add in the description, which makes it extremely easy for the PR author to copy-paste the correct description value from the CI log and add it into the\u00a0.<em>yml<\/em>\u00a0file.<\/p>\n<p>After pushing the new changes the CI check was succesfull and the log now looks like\u00a0this:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AvTxCA0Ymn17JjQPTzk0gJw.png?ssl=1\"><\/figure>\n<pre>Validating column 'order_id' in model 'ref_country_orders'<br>Description: '{{ doc(\"orders__order_id\") }}'<br>Doc usage found: ['orders__order_id']<br>Validating column 'order_country' in model 'ref_country_orders'<br>Description: '{{ doc(\"orders__order_country\") }}'<br>Doc usage found: ['orders__order_country']<br>Validating column 'order_value' in model 'ref_country_orders'<br>Description: '{{ doc(\"orders__order_value\") }}'<br>Doc usage found: ['orders__order_value']<br>Validating column 'order_address' in model 'ref_country_orders'<br>Description: 'The address where the order was placed.'<br>Doc usage found: []<br>No matching doc found for column 'order_address'<br>Validating column 'order_date' in model 'ref_country_orders'<br>Description: '{{ doc(\"orders__order_date\") }}'<br>Doc usage found: ['orders__order_date']<br>All documentation validations passed.<\/pre>\n<p>For the <strong><em>order_address<\/em><\/strong> column, the log shows that no matching doc was found. However, that\u2019s fine and does not cause the CI to fail, since adding that column to the docs file is not our intention for this demonstration. Meanwhile, the rest of the columns are listed in the docs and are correctly using the <em>{{ doc() }}<\/em>\u00a0function<\/p>\n<p>In conclusion, by integrating this validation CI into your dbt repository, you can confidently maintain a centralised source of truth for column descriptions across your entire project and make the best out of the {{ doc() }} feature in dbt. This setup can save valuable review time, reduces human error, and upholds a higher standard of documentation quality. As your project grows, you\u2019ll find that this automated approach to documentation management is both scalable and maintainable, ultimately enabling your team to focus on analytics rather than debugging inconsistent docs.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/medium.com\/_\/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=5e2c3e8428e7\" width=\"1\" height=\"1\" alt=\"\"><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/optimize-the-dbt-doc-function-with-a-ci-5e2c3e8428e7\">Optimize the dbt Doc Function with a CI<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Massimo Capobianco<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Foptimize-the-dbt-doc-function-with-a-ci-5e2c3e8428e7\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Optimize the dbt Doc Function with a CI How to set an automated check to improve your dbt documentation Image by the author (generated with\u00a0chatgpt) In large dbt projects, maintaining consistent and up-to-date documentation can be a challenge. Although dbt\u2019s {{ doc() }} function allows you to store and reuse descriptions for the columns of [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,992,1423,243,401,1066],"tags":[994,1425,1424],"class_list":["post-1378","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-analytics-engineering","category-continous-integration","category-data-analytics","category-data-engineering","category-dbt","tag-dbt","tag-doc","tag-order"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1378"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1378"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1378\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1378"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1378"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1378"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}