{"id":921,"date":"2025-01-02T07:02:54","date_gmt":"2025-01-02T07:02:54","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/02\/scaling-statistics-incremental-standard-deviation-in-sql-with-dbt-2eb0505aad2b\/"},"modified":"2025-01-02T07:02:54","modified_gmt":"2025-01-02T07:02:54","slug":"scaling-statistics-incremental-standard-deviation-in-sql-with-dbt-2eb0505aad2b","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/02\/scaling-statistics-incremental-standard-deviation-in-sql-with-dbt-2eb0505aad2b\/","title":{"rendered":"Scaling Statistics: Incremental Standard Deviation in SQL with dbt"},"content":{"rendered":"<p>    Scaling Statistics: Incremental Standard Deviation in SQL with dbt<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Why scan yesterday\u2019s data when you can increment today\u2019s?<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/873\/1%2ADA3LZ-s3h-CUjupuOlbtVw.png?ssl=1\"><figcaption>Image by the\u00a0author<\/figcaption><\/figure>\n<p>SQL aggregation functions can be computationally expensive when applied to large datasets. As datasets grow, recalculating metrics over the entire dataset repeatedly becomes inefficient. To address this challenge, <strong>incremental aggregation<\/strong> is often employed\u200a\u2014\u200aa method that involves maintaining a previous state and updating it with new incoming data. While this approach is straightforward for aggregations like COUNT or SUM, the question arises: how can it be applied to more complex metrics like standard deviation?<\/p>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Standard_deviation\">Standard deviation<\/a> is a statistical metric that measures the extent of variation or dispersion in a variable\u2019s values relative to its mean. <br \/>It is derived by taking the square root of the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Variance\">variance<\/a>.<br \/>The formula for calculating the variance of a sample is as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/936\/1%2ADusZxO4bqEoqHnTKxUPkfA.png?ssl=1\"><figcaption>Sample variance\u00a0formula<\/figcaption><\/figure>\n<p>Calculating standard deviation can be complex, as it involves updating both the mean and the sum of squared differences across all data points. However, with algebraic manipulation, we can derive a formula for incremental computation\u200a\u2014\u200aenabling updates using an existing dataset and incorporating new data seamlessly. This approach avoids recalculating from scratch whenever new data is added, making the process much more efficient (A detailed derivation is available <a href=\"https:\/\/github.com\/YuvalGorch\/math_proofs\/blob\/main\/incremental_stddev_proof.pdf\">on my\u00a0GitHub<\/a>).<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AbKTSdpKAquhcwGDLxv2FTQ.png?ssl=1\"><figcaption>Derived sample variance\u00a0formula<\/figcaption><\/figure>\n<p>The formula was basically broken into 3 parts: <br \/>1. The existing\u2019s set weighted variance<br \/>2. The new set\u2019s weighted variance<br \/>3. The mean difference variance, accounting for between-group variance.<\/p>\n<p>This method enables incremental variance computation by retaining the COUNT (k), AVG (\u00b5k), and VAR (Sk) of the existing set, and combining them with the COUNT (n), AVG (\u00b5n), and VAR (Sn) of the new set. As a result, the updated standard deviation can be calculated efficiently without rescanning the entire\u00a0dataset.<\/p>\n<p>Now that we\u2019ve wrapped our heads around the math behind incremental standard deviation (or at least caught the gist of it), let\u2019s dive into the dbt SQL implementation. In the following example, we\u2019ll walk through how to set up an incremental model to calculate and update these statistics for a user\u2019s transaction data.<\/p>\n<p>Consider a transactions table named <strong>stg__transactions<\/strong>, which tracks user transactions (events). Our goal is to create a time-static table, <strong>int__user_tx_state<\/strong>, that aggregates the \u2018state\u2019 of user transactions. The column details for both tables are provided in the picture\u00a0below.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ApiKMzS88uHhzmO84pzPEqQ.jpeg?ssl=1\"><figcaption>Image by the\u00a0author<\/figcaption><\/figure>\n<p>To make the process efficient, we aim to update the state table incrementally by combining the new incoming transactions data with the existing aggregated data (i.e. the current user state). This approach allows us to calculate the updated user state without scanning through all historical data.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AGggy08MzUWgsQYQ6n9as1Q.jpeg?ssl=1\"><figcaption>Image by the\u00a0author<\/figcaption><\/figure>\n<blockquote><p>The code below assumes understanding of some dbt concepts, if you\u2019re unfamiliar with it, you may still be able to understand the code, although I strongly encourage going through <a href=\"https:\/\/docs.getdbt.com\/docs\/build\/incremental-models-overview\">dbt\u2019s incremental guide<\/a> or read <a href=\"https:\/\/towardsdatascience.com\/dbt-incremental-the-right-way-63f931263f4a\">this awesome\u00a0post<\/a>.<\/p><\/blockquote>\n<p>We\u2019ll construct a full dbt SQL step by step, aiming to calculate incremental aggregations efficiently without repeatedly scanning the entire table. The process begins by defining the model as incremental in dbt and using unique_key to update existing rows rather than inserting new\u00a0ones.<\/p>\n<pre>-- depends_on: {{ ref('stg__transactions') }}<br>{{ config(materialized='incremental', unique_key=['USER_ID'], incremental_strategy='merge') }}<\/pre>\n<p>Next, we fetch records from the <strong><em>stg__transactions <\/em><\/strong>table. <br \/>The is_incremental block filters transactions with timestamps later than the latest user update, effectively including &#8220;only new transactions&#8221;.<\/p>\n<pre>WITH NEW_USER_TX_DATA AS (<br>    SELECT<br>        USER_ID,<br>        TX_ID,<br>        TX_TIMESTAMP,<br>        TX_VALUE<br>    FROM {{ ref('stg__transactions') }}<br>    {% if is_incremental() %}<br>      WHERE TX_TIMESTAMP &gt; COALESCE((select max(UPDATED_AT) from {{ this }}), 0::TIMESTAMP_NTZ)<br>    {% endif %}<br>)<\/pre>\n<p>After retrieving the new transaction records, we aggregate them by user, allowing us to incrementally update each user\u2019s state in the following CTEs.<\/p>\n<pre>INCREMENTAL_USER_TX_DATA AS (<br>    SELECT<br>        USER_ID,<br>        MAX(TX_TIMESTAMP) AS UPDATED_AT,<br>        COUNT(TX_VALUE) AS INCREMENTAL_COUNT,<br>        AVG(TX_VALUE) AS INCREMENTAL_AVG,<br>        SUM(TX_VALUE) AS INCREMENTAL_SUM,<br>        COALESCE(STDDEV(TX_VALUE), 0) AS INCREMENTAL_STDDEV,<br>    FROM<br>        NEW_USER_TX_DATA<br>    GROUP BY<br>        USER_ID<br>)<\/pre>\n<p>Now we get to the heavy part where we need to actually calculate the aggregations. When we\u2019re not in incremental mode (i.e. we don\u2019t have any \u201cstate\u201d rows yet) we simply select the new aggregations<\/p>\n<pre>NEW_USER_CULMULATIVE_DATA AS (<br>    SELECT<br>        NEW_DATA.USER_ID,<br>        {% if not is_incremental() %}<br>            NEW_DATA.UPDATED_AT AS UPDATED_AT,<br>            NEW_DATA.INCREMENTAL_COUNT AS COUNT_TX,<br>            NEW_DATA.INCREMENTAL_AVG AS AVG_TX,<br>            NEW_DATA.INCREMENTAL_SUM AS SUM_TX,<br>            NEW_DATA.INCREMENTAL_STDDEV AS STDDEV_TX<br>        {% else %}<br>        ...<\/pre>\n<p>But when we\u2019re in incremental mode, we need to join past data and combine it with the new data we created in the INCREMENTAL_USER_TX_DATA CTE based on the formula described above.<br \/>We start by calculating the new SUM, COUNT and\u00a0AVG:<\/p>\n<pre>  ...<br>  {% else %}<br>      COALESCE(EXISTING_USER_DATA.COUNT_TX, 0) AS _n, -- this is n<br>      NEW_DATA.INCREMENTAL_COUNT AS _k,  -- this is k<br>      COALESCE(EXISTING_USER_DATA.SUM_TX, 0) + NEW_DATA.INCREMENTAL_SUM AS NEW_SUM_TX,  -- new sum<br>      COALESCE(EXISTING_USER_DATA.COUNT_TX, 0) + NEW_DATA.INCREMENTAL_COUNT AS NEW_COUNT_TX,  -- new count<br>      NEW_SUM_TX \/ NEW_COUNT_TX AS AVG_TX,  -- new avg<br>   ...<\/pre>\n<p>We then calculate the variance formula\u2019s three\u00a0parts<\/p>\n<p>1. The existing weighted variance, which is truncated to 0 if the previous set is composed of one or less\u00a0items:<\/p>\n<pre>    ...<br>      CASE<br>          WHEN _n &gt; 1 THEN (((_n - 1) \/ (NEW_COUNT_TX - 1)) * POWER(COALESCE(EXISTING_USER_DATA.STDDEV_TX, 0), 2))<br>          ELSE 0<br>      END AS EXISTING_WEIGHTED_VARIANCE,  -- existing weighted variance<br>    ...<\/pre>\n<p>2. The incremental weighted variance in the same\u00a0way:<\/p>\n<pre>    ...<br>      CASE<br>          WHEN _k &gt; 1 THEN (((_k - 1) \/ (NEW_COUNT_TX - 1)) * POWER(NEW_DATA.INCREMENTAL_STDDEV, 2))<br>          ELSE 0<br>      END AS INCREMENTAL_WEIGHTED_VARIANCE,  -- incremental weighted variance<br>    ...<\/pre>\n<p>3. The mean difference variance, as outlined earlier, along with SQL join terms to include past\u00a0data.<\/p>\n<pre>    ...<br>      POWER((COALESCE(EXISTING_USER_DATA.AVG_TX, 0) - NEW_DATA.INCREMENTAL_AVG), 2) AS MEAN_DIFF_SQUARED,<br>      CASE<br>          WHEN NEW_COUNT_TX = 1 THEN 0<br>          ELSE (_n * _k) \/ (NEW_COUNT_TX * (NEW_COUNT_TX - 1))<br>      END AS BETWEEN_GROUP_WEIGHT,  -- between group weight<br>      BETWEEN_GROUP_WEIGHT * MEAN_DIFF_SQUARED AS MEAN_DIFF_VARIANCE,  -- mean diff variance<br>      EXISTING_WEIGHTED_VARIANCE + INCREMENTAL_WEIGHTED_VARIANCE + MEAN_DIFF_VARIANCE AS VARIANCE_TX,<br>      CASE<br>          WHEN _n = 0 THEN NEW_DATA.INCREMENTAL_STDDEV -- no \"past\" data<br>          WHEN _k = 0 THEN EXISTING_USER_DATA.STDDEV_TX -- no \"new\" data<br>          ELSE SQRT(VARIANCE_TX)  -- stddev (which is the root of variance)<br>      END AS STDDEV_TX,<br>      NEW_DATA.UPDATED_AT AS UPDATED_AT,<br>      NEW_SUM_TX AS SUM_TX,<br>      NEW_COUNT_TX AS COUNT_TX<br>  {% endif %}<br>    FROM<br>        INCREMENTAL_USER_TX_DATA new_data<br>    {% if is_incremental() %}<br>    LEFT JOIN<br>        {{ this }} EXISTING_USER_DATA<br>    ON<br>        NEW_DATA.USER_ID = EXISTING_USER_DATA.USER_ID<br>    {% endif %}<br>)<\/pre>\n<p>Finally, we select the table\u2019s columns, accounting for both incremental and non-incremental cases:<\/p>\n<pre>SELECT<br>    USER_ID,<br>    UPDATED_AT,<br>    COUNT_TX,<br>    SUM_TX,<br>    AVG_TX,<br>    STDDEV_TX<br>FROM NEW_USER_CULMULATIVE_DATA<\/pre>\n<p>By combining all these steps, we arrive at the final SQL\u00a0model:<\/p>\n<pre>-- depends_on: {{ ref('stg__initial_table') }}<br>{{ config(materialized='incremental', unique_key=['USER_ID'], incremental_strategy='merge') }}<br>WITH NEW_USER_TX_DATA AS (<br>    SELECT<br>        USER_ID,<br>        TX_ID,<br>        TX_TIMESTAMP,<br>        TX_VALUE<br>    FROM {{ ref('stg__initial_table') }}<br>    {% if is_incremental() %}<br>      WHERE TX_TIMESTAMP &gt; COALESCE((select max(UPDATED_AT) from {{ this }}), 0::TIMESTAMP_NTZ)<br>    {% endif %}<br>),<br>INCREMENTAL_USER_TX_DATA AS (<br>    SELECT<br>        USER_ID,<br>        MAX(TX_TIMESTAMP) AS UPDATED_AT,<br>        COUNT(TX_VALUE) AS INCREMENTAL_COUNT,<br>        AVG(TX_VALUE) AS INCREMENTAL_AVG,<br>        SUM(TX_VALUE) AS INCREMENTAL_SUM,<br>        COALESCE(STDDEV(TX_VALUE), 0) AS INCREMENTAL_STDDEV,<br>    FROM<br>        NEW_USER_TX_DATA<br>    GROUP BY<br>        USER_ID<br>),<br><br>NEW_USER_CULMULATIVE_DATA AS (<br>    SELECT<br>        NEW_DATA.USER_ID,<br>        {% if not is_incremental() %}<br>            NEW_DATA.UPDATED_AT AS UPDATED_AT,<br>            NEW_DATA.INCREMENTAL_COUNT AS COUNT_TX,<br>            NEW_DATA.INCREMENTAL_AVG AS AVG_TX,<br>            NEW_DATA.INCREMENTAL_SUM AS SUM_TX,<br>            NEW_DATA.INCREMENTAL_STDDEV AS STDDEV_TX<br>        {% else %}<br>            COALESCE(EXISTING_USER_DATA.COUNT_TX, 0) AS _n, -- this is n<br>            NEW_DATA.INCREMENTAL_COUNT AS _k,  -- this is k<br>            COALESCE(EXISTING_USER_DATA.SUM_TX, 0) + NEW_DATA.INCREMENTAL_SUM AS NEW_SUM_TX,  -- new sum<br>            COALESCE(EXISTING_USER_DATA.COUNT_TX, 0) + NEW_DATA.INCREMENTAL_COUNT AS NEW_COUNT_TX,  -- new count<br>            NEW_SUM_TX \/ NEW_COUNT_TX AS AVG_TX,  -- new avg<br>            CASE<br>                WHEN _n &gt; 1 THEN (((_n - 1) \/ (NEW_COUNT_TX - 1)) * POWER(COALESCE(EXISTING_USER_DATA.STDDEV_TX, 0), 2))<br>                ELSE 0<br>            END AS EXISTING_WEIGHTED_VARIANCE,  -- existing weighted variance<br>            CASE<br>                WHEN _k &gt; 1 THEN (((_k - 1) \/ (NEW_COUNT_TX - 1)) * POWER(NEW_DATA.INCREMENTAL_STDDEV, 2))<br>                ELSE 0<br>            END AS INCREMENTAL_WEIGHTED_VARIANCE,  -- incremental weighted variance<br>            POWER((COALESCE(EXISTING_USER_DATA.AVG_TX, 0) - NEW_DATA.INCREMENTAL_AVG), 2) AS MEAN_DIFF_SQUARED,<br>            CASE<br>                WHEN NEW_COUNT_TX = 1 THEN 0<br>                ELSE (_n * _k) \/ (NEW_COUNT_TX * (NEW_COUNT_TX - 1))<br>            END AS BETWEEN_GROUP_WEIGHT,  -- between group weight<br>            BETWEEN_GROUP_WEIGHT * MEAN_DIFF_SQUARED AS MEAN_DIFF_VARIANCE,<br>            EXISTING_WEIGHTED_VARIANCE + INCREMENTAL_WEIGHTED_VARIANCE + MEAN_DIFF_VARIANCE AS VARIANCE_TX,<br>            CASE<br>                WHEN _n = 0 THEN NEW_DATA.INCREMENTAL_STDDEV -- no \"past\" data<br>                WHEN _k = 0 THEN EXISTING_USER_DATA.STDDEV_TX -- no \"new\" data<br>                ELSE SQRT(VARIANCE_TX)  -- stddev (which is the root of variance)<br>            END AS STDDEV_TX,<br>            NEW_DATA.UPDATED_AT AS UPDATED_AT,<br>            NEW_SUM_TX AS SUM_TX,<br>            NEW_COUNT_TX AS COUNT_TX<br>        {% endif %}<br>    FROM<br>        INCREMENTAL_USER_TX_DATA new_data<br>    {% if is_incremental() %}<br>    LEFT JOIN<br>        {{ this }} EXISTING_USER_DATA<br>    ON<br>        NEW_DATA.USER_ID = EXISTING_USER_DATA.USER_ID<br>    {% endif %}<br>)<br><br>SELECT<br>    USER_ID,<br>    UPDATED_AT,<br>    COUNT_TX,<br>    SUM_TX,<br>    AVG_TX,<br>    STDDEV_TX<br>FROM NEW_USER_CULMULATIVE_DATA<\/pre>\n<p>Throughout this process, we demonstrated how to handle both non-incremental and incremental modes effectively, leveraging mathematical techniques to update metrics like variance and standard deviation efficiently. By combining historical and new data seamlessly, we achieved an optimized, scalable approach for real-time data aggregation.<\/p>\n<p>In this article, we explored the mathematical technique for incrementally calculating standard deviation and how to implement it using dbt\u2019s incremental models. This approach proves to be highly efficient, enabling the processing of large datasets without the need to re-scan the entire dataset. In practice, this leads to faster, more scalable systems that can handle real-time updates efficiently. If you\u2019d like to discuss this further or share your thoughts, feel free to reach out\u200a\u2014\u200aI\u2019d love to hear your thoughts!<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/medium.com\/_\/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=2eb0505aad2b\" width=\"1\" height=\"1\" alt=\"\"><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/scaling-statistics-incremental-standard-deviation-in-sql-with-dbt-2eb0505aad2b\">Scaling Statistics: Incremental Standard Deviation in SQL with dbt<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Yuval Gorchover<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fscaling-statistics-incremental-standard-deviation-in-sql-with-dbt-2eb0505aad2b\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scaling Statistics: Incremental Standard Deviation in SQL with dbt Why scan yesterday\u2019s data when you can increment today\u2019s? Image by the\u00a0author SQL aggregation functions can be computationally expensive when applied to large datasets. As datasets grow, recalculating metrics over the entire dataset repeatedly becomes inefficient. To address this challenge, incremental aggregation is often employed\u200a\u2014\u200aa method [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1067,83,1066,49,404],"tags":[1070,1068,1069],"class_list":["post-921","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-big-data","category-data-science","category-dbt","category-engineering","category-sql","tag-deviation","tag-incremental","tag-standard"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/921"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=921"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/921\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=921"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=921"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=921"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}