{"id":2332,"date":"2025-03-11T07:02:55","date_gmt":"2025-03-11T07:02:55","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/03\/11\/linear-regression-in-time-series-sources-of-spurious-regression\/"},"modified":"2025-03-11T07:02:55","modified_gmt":"2025-03-11T07:02:55","slug":"linear-regression-in-time-series-sources-of-spurious-regression","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/03\/11\/linear-regression-in-time-series-sources-of-spurious-regression\/","title":{"rendered":"Linear Regression in Time Series: Sources of Spurious Regression"},"content":{"rendered":"<div>\n<h2 class=\"wp-block-heading\" id=\"8b9b\">1. Introduction<\/h2>\n<p class=\"wp-block-paragraph\" id=\"0a79\">It\u2019s pretty clear that most of our work will be automated by AI in the future. This will be possible because many researchers and professionals are working hard to make their work available online. These contributions not only help us understand fundamental concepts but also refine AI models, ultimately freeing up time to focus on other activities.<\/p>\n<p class=\"wp-block-paragraph\" id=\"118a\">However, there is one concept that remains misunderstood, even among experts. It is\u00a0<strong>spurious regression<\/strong>\u00a0in time series analysis. This issue arises when regression models suggest strong relationships between variables, even when none exist. 
It is typically observed in time series regression equations that\u00a0<strong>seem to have a high degree of fit<\/strong>, as indicated by a high\u00a0<strong>R\u00b2 (coefficient of determination)<\/strong>, but with an\u00a0<strong>extremely low Durbin-Watson statistic (d), signaling strong autocorrelation in the error terms<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"ff2b\">What is particularly surprising is that almost all econometric textbooks warn about the danger of autocorrelated errors, yet this issue persists in many published papers. Granger and Newbold (<a href=\"https:\/\/jumbong.github.io\/personal-website\/Others\/spurious_reg.html#ref-granger1974spurious\" rel=\"noreferrer noopener\" target=\"_blank\">1974<\/a>) identified several examples. For instance, they found published equations with\u00a0<strong>R\u00b2 = 0.997<\/strong>\u00a0and a Durbin-Watson statistic (d) equal to 0.53. The most extreme case they found is an equation with\u00a0<strong>R\u00b2 = 0.999<\/strong>\u00a0and\u00a0<strong>d = 0.093<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"639b\">It is especially problematic in economics and finance, where\u00a0<strong>many key variables exhibit autocorrelation\u00a0<\/strong>or<strong>\u00a0serial correlation between adjacent values<\/strong>, particularly if the sampling interval is small, such as a week or a month, leading to misleading conclusions if not handled correctly. For example, this quarter\u2019s GDP is strongly correlated with the GDP of the previous quarter. 
Our post provides a detailed explanation of the results from Granger and Newbold (<a href=\"https:\/\/jumbong.github.io\/personal-website\/Others\/spurious_reg.html#ref-granger1974spurious\" rel=\"noreferrer noopener\" target=\"_blank\">1974<\/a>) and a Python simulation (<strong>see section 7<\/strong>) replicating the key results presented in their article.<\/p>\n<p class=\"wp-block-paragraph\" id=\"7fc2\">Whether you\u2019re an economist, data scientist, or analyst working with time series data, understanding this issue is crucial to ensuring\u00a0<strong>your models produce meaningful results.<\/strong><\/p>\n<p class=\"wp-block-paragraph\" id=\"3c73\">To walk you through this post, the next section will introduce the random walk and the ARIMA(0,1,1) process. In section 3, we will explain how Granger and Newbold (<a href=\"https:\/\/jumbong.github.io\/personal-website\/Others\/spurious_reg.html#ref-granger1974spurious\" rel=\"noreferrer noopener\" target=\"_blank\">1974<\/a>) describe the emergence of nonsense regressions, with examples illustrated in section 4. Finally, section 5 shows how to avoid spurious regressions when working with time series data.<\/p>\n<h2 class=\"wp-block-heading\" id=\"95b6\">2. Simple presentation of a Random Walk and ARIMA(0,1,1) Process<\/h2>\n<h3 class=\"wp-block-heading\" id=\"5dda\">2.1 Random Walk<\/h3>\n<p class=\"wp-block-paragraph\" id=\"ad99\">Let \ud835\udc17\u209c be a time series. We say that \ud835\udc17\u209c follows a random walk if its representation is given by:<\/p>\n<p class=\"wp-block-paragraph\" id=\"d3ed\">\ud835\udc17\u209c = \ud835\udc17\u209c\u208b\u2081 + \ud835\udf16\u209c, (1)<\/p>\n<p class=\"wp-block-paragraph\" id=\"bb84\">where \ud835\udf16\u209c is white noise. Unrolling the recursion gives \ud835\udc17\u209c = \ud835\udc17\u2080 + \ud835\udf16\u2081 + \ud835\udf16\u2082 + \u22ef + \ud835\udf16\u209c, a cumulative sum of white noise, which is a useful form for simulation. 
It is a non-stationary time series because its variance grows with time t.<\/p>\n<h3 class=\"wp-block-heading\" id=\"e1bf\">2.2 ARIMA(0,1,1) Process<\/h3>\n<p class=\"wp-block-paragraph\" id=\"fe62\">The ARIMA(0,1,1) process is given by:<\/p>\n<p class=\"wp-block-paragraph\" id=\"58c3\">\ud835\udc17\u209c = \ud835\udc17\u209c\u208b\u2081 + \ud835\udf16\u209c \u2212 \ud835\udf03 \ud835\udf16\u209c\u208b\u2081, (2)<\/p>\n<p class=\"wp-block-paragraph\" id=\"f761\">where \ud835\udf16\u209c is white noise. The ARIMA(0,1,1) process is non-stationary. It can be written as the sum of a random walk and an independent white noise:<\/p>\n<p class=\"wp-block-paragraph\" id=\"8098\">\ud835\udc17\u209c = \ud835\udc17\u2080 + random walk + white noise. (3) This form is useful for simulation.<\/p>\n<p class=\"wp-block-paragraph\" id=\"ffe0\">These non-stationary series are often employed as benchmarks against which the forecasting performance of other models is judged.<\/p>\n<h2 class=\"wp-block-heading\" id=\"087e\">3. Random walk can lead to Nonsense Regression<\/h2>\n<p class=\"wp-block-paragraph\" id=\"548b\">First, let\u2019s recall the <a href=\"https:\/\/towardsdatascience.com\/tag\/linear-regression\/\" title=\"Linear Regression\">Linear Regression<\/a> model. The linear regression model is given by:<\/p>\n<p class=\"wp-block-paragraph\" id=\"4ef6\">\ud835\udc18 = \ud835\udc17\ud835\udefd + \ud835\udf16, (4)<\/p>\n<p class=\"wp-block-paragraph\" id=\"aaa6\">where \ud835\udc18 is a T \u00d7 1 vector of the dependent variable, \ud835\udefd is a K \u00d7 1 vector of the coefficients, \ud835\udc17 is a T \u00d7 K matrix of the independent variables containing a column of ones and (K\u22121) columns with T observations on each of the (K\u22121) independent variables, which are stochastic but distributed independently of the T \u00d7 1 vector of the errors \ud835\udf16. 
It is generally assumed that:<\/p>\n<p class=\"wp-block-paragraph\" id=\"2fb2\">\ud835\udc04(\ud835\udf16) = 0, (5)<\/p>\n<p class=\"wp-block-paragraph\" id=\"de28\">and<\/p>\n<p class=\"wp-block-paragraph\" id=\"3af1\">\ud835\udc04(\ud835\udf16\ud835\udf16\u2032) = \ud835\udf0e\u00b2\ud835\udc08, (6)<\/p>\n<p class=\"wp-block-paragraph\" id=\"0579\">where \ud835\udc08 is the identity matrix.<\/p>\n<p class=\"wp-block-paragraph\" id=\"848f\">The F-test assesses the joint contribution of the independent variables to the explanation of the dependent variable. Its null hypothesis is:<\/p>\n<p class=\"wp-block-paragraph\" id=\"c07b\">\ud835\udc07\u2080: \ud835\udefd\u2081 = \ud835\udefd\u2082 = \u22ef = \ud835\udefd\u2096\u208b\u2081 = 0, (7)<\/p>\n<p class=\"wp-block-paragraph\" id=\"917c\">and the test statistic is:<\/p>\n<p class=\"wp-block-paragraph\" id=\"e73b\">\ud835\udc05 = (\ud835\udc11\u00b2 \/ (\ud835\udc0a\u22121)) \/ ((1\u2212\ud835\udc11\u00b2) \/ (\ud835\udc13\u2212\ud835\udc0a)), (8)<\/p>\n<p class=\"wp-block-paragraph\" id=\"12a7\">where \ud835\udc11\u00b2 is the coefficient of determination.<\/p>\n<p class=\"wp-block-paragraph\" id=\"76b1\">To see why this test can fail, suppose that the null hypothesis is true and that one fits a regression of the form (Equation 4) to the levels of economic time series. Suppose next that these series are not stationary or are highly autocorrelated. In such a situation, the test procedure is invalid since \ud835\udc05 in (Equation 8) does not follow an F-distribution under the null hypothesis (Equation 7). Indeed, under the null hypothesis all slope coefficients are zero, so the errors from (Equation 4) reduce to:<\/p>\n<p class=\"wp-block-paragraph\" id=\"c22f\">\ud835\udf16\u209c = \ud835\udc18\u209c \u2212 \ud835\udefd\u2080 ; t = 1, 2, \u2026, T. 
(9)<\/p>\n<p class=\"wp-block-paragraph\" id=\"64b2\">They therefore have the same autocorrelation structure as the original series \ud835\udc18.<\/p>\n<p class=\"wp-block-paragraph\" id=\"1527\">Some idea of the distributional problem can be obtained from the case where:<\/p>\n<p class=\"wp-block-paragraph\" id=\"a1da\">\ud835\udc18\u209c = \ud835\udefd\u2080 + \ud835\udc17\u209c\ud835\udefd\u2081 + \ud835\udf16\u209c, (10)<\/p>\n<p class=\"wp-block-paragraph\" id=\"1ac0\">where \ud835\udc18\u209c and \ud835\udc17\u209c follow independent first-order autoregressive processes:<\/p>\n<p class=\"wp-block-paragraph\" id=\"4058\">\ud835\udc18\u209c = \ud835\udf0c \ud835\udc18\u209c\u208b\u2081 + \ud835\udf02\u209c, and \ud835\udc17\u209c = \ud835\udf0c* \ud835\udc17\u209c\u208b\u2081 + \ud835\udf08\u209c, (11)<\/p>\n<p class=\"wp-block-paragraph\" id=\"6c35\">where \ud835\udf02\u209c and \ud835\udf08\u209c are white noise.<\/p>\n<p class=\"wp-block-paragraph\" id=\"34c8\">We know that in this case, \ud835\udc11\u00b2 is the square of the correlation between \ud835\udc18\u209c and \ud835\udc17\u209c. Granger and Newbold use Kendall\u2019s result, as given in Knowles (<a href=\"https:\/\/jumbong.github.io\/personal-website\/Others\/spurious_reg.html#ref-knowles1954exercises\" rel=\"noreferrer noopener\" target=\"_blank\">1954<\/a>), which expresses the variance of \ud835\udc11:<\/p>\n<p class=\"wp-block-paragraph\" id=\"0b2d\">\ud835\udc15\ud835\udc1a\ud835\udc2b(\ud835\udc11) = (1\/T) \u00d7 (1 + \ud835\udf0c\ud835\udf0c*) \/ (1 \u2212 \ud835\udf0c\ud835\udf0c*). (12)<\/p>\n<p class=\"wp-block-paragraph\" id=\"883d\">Since \ud835\udc11 is constrained to lie between -1 and 1, if its variance is greater than 1\/3 (the variance of a uniform distribution on that interval), the distribution of \ud835\udc11 cannot have a single mode at 0. 
By the variance formula (Equation 12), this happens when \ud835\udf0c\ud835\udf0c* &gt; (T\u22123) \/ (T+3).<\/p>\n<p class=\"wp-block-paragraph\" id=\"b1b3\">Thus, for example, if T = 20 and \ud835\udf0c = \ud835\udf0c*, a distribution that is not unimodal at 0 will be obtained if \ud835\udf0c &gt; 0.86, and if \ud835\udf0c = 0.9, \ud835\udc15\ud835\udc1a\ud835\udc2b(\ud835\udc11) = 0.47. Since \ud835\udc11 has mean zero when the two series are independent, \ud835\udc04(\ud835\udc11\u00b2) will be close to this variance, 0.47.<\/p>\n<p class=\"wp-block-paragraph\" id=\"5ea2\">This shows that when \ud835\udf0c is close to 1, \ud835\udc11\u00b2 can be\u00a0<strong>very high<\/strong>, suggesting a strong relationship between \ud835\udc18\u209c and \ud835\udc17\u209c. However, in reality, the two series are completely independent. When \ud835\udf0c is near 1, both series behave like random walks or near-random walks. On top of that, both series are\u00a0<strong>highly autocorrelated<\/strong>, which causes the\u00a0<strong>residuals from the regression<\/strong>\u00a0to also be strongly autocorrelated. As a result, the\u00a0<strong>Durbin-Watson statistic<\/strong>\u00a0\ud835\udc1d will be\u00a0<strong>very low<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"6008\">This is why a high \ud835\udc11\u00b2 in this context should never be taken as evidence of a true relationship between the two series.<\/p>\n<p class=\"wp-block-paragraph\" id=\"99d0\">To explore the possibility of obtaining a spurious regression when regressing two independent random walks, a series of simulations proposed by Granger and Newbold (<a href=\"https:\/\/jumbong.github.io\/personal-website\/Others\/spurious_reg.html#ref-granger1974spurious\" rel=\"noreferrer noopener\" target=\"_blank\">1974<\/a>) is reproduced in the next section.<\/p>\n<h2 class=\"wp-block-heading\" id=\"daca\">4. 
Simulation results using Python<\/h2>\n<p class=\"wp-block-paragraph\" id=\"b82c\">In this section, we use simulations to show that regressing independent random walks on one another biases the coefficient estimates and invalidates the hypothesis tests on the coefficients. The <a href=\"https:\/\/towardsdatascience.com\/tag\/python\/\" title=\"Python\">Python<\/a> code that produces the simulation results is presented in section 7.<\/p>\n<p class=\"wp-block-paragraph\" id=\"756c\">A regression equation proposed by Granger and Newbold (<a href=\"https:\/\/jumbong.github.io\/personal-website\/Others\/spurious_reg.html#ref-granger1974spurious\" rel=\"noreferrer noopener\" target=\"_blank\">1974<\/a>) is given by:<\/p>\n<p class=\"wp-block-paragraph\" id=\"ff0f\">\ud835\udc18\u209c = \ud835\udefd\u2080 + \ud835\udc17\u209c\ud835\udefd\u2081 + \ud835\udf16\u209c<\/p>\n<p class=\"wp-block-paragraph\" id=\"a244\">where \ud835\udc18\u209c and \ud835\udc17\u209c were generated as independent random walks, each of length 50. 
The values \ud835\udc12 = |\ud835\udefd\u0302\u2081| \/ \ud835\udc12\ud835\udc04\u0302(\ud835\udefd\u0302\u2081), representing the statistic for testing the significance of \ud835\udefd\u2081, for 100 simulations are reported in the table below.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e3e3e3\" data-has-transparency=\"false\" style=\"--dominant-color: #e3e3e3;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"129\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-1-1024x129.png?resize=1024%2C129&#038;ssl=1\" alt=\"\" class=\"wp-image-599428 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-1-1024x129.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-1-300x38.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-1-768x97.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-1.png 1202w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><\/figure>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/miro.medium.com\/v2\/resize%3Afit%3A1400\/1%2AQvcYCvJoeXIV5WCgicZ0SA.png?ssl=1\" alt=\"\"><figcaption class=\"wp-element-caption\"><strong>Table 1: Regressing two independent random walks<\/strong><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"29f2\">The null hypothesis of no relationship between \ud835\udc18\u209c and \ud835\udc17\u209c is rejected at the 5% level if \ud835\udc12 &gt; 2. This table shows that the null hypothesis (\ud835\udefd\u2081 = 0) is wrongly rejected in about three quarters of all cases (71 times out of 100). This is troubling because the two variables are independent random walks, meaning there\u2019s no actual relationship. 
Let\u2019s break down why this happens.<\/p>\n<p class=\"wp-block-paragraph\" id=\"b6a8\">If \ud835\udefd\u0302\u2081 \/ \ud835\udc12\ud835\udc04\u0302 follows a \ud835\udc0d(0,1), the expected value of \ud835\udc12, its absolute value, should be \u221a(2\/\u03c0) \u2248 0.8 (\u221a(2\/\u03c0) is the mean of the absolute value of a standard normal variable). However, the simulation results show an average of 4.59, meaning the standard error of \ud835\udefd\u0302\u2081 is underestimated by a factor of:<\/p>\n<p class=\"wp-block-paragraph\" id=\"d30c\">4.59 \/ 0.8 \u2248 5.7<\/p>\n<p class=\"wp-block-paragraph\" id=\"c5d8\">In classical statistics, we usually use a t-test threshold of around 2 to check the significance of a coefficient. However, these results show that, in this case, you would need to use a threshold of 11.4 to properly test for significance:<\/p>\n<p class=\"wp-block-paragraph\" id=\"6ed7\">2 \u00d7 5.7 = 11.4<\/p>\n<p class=\"wp-block-paragraph\" id=\"3eb6\">Interpretation: We\u2019ve just shown that including variables that don\u2019t belong in the model, especially random walks, can lead to completely invalid significance tests for the coefficients.<\/p>\n<p class=\"wp-block-paragraph\" id=\"cc94\">To make their simulations even clearer, Granger and Newbold (<a href=\"https:\/\/jumbong.github.io\/personal-website\/Others\/spurious_reg.html#ref-granger1974spurious\" rel=\"noreferrer noopener\" target=\"_blank\">1974<\/a>) ran a series of regressions using variables that follow either a random walk or an ARIMA(0,1,1) process.<\/p>\n<p class=\"wp-block-paragraph\" id=\"edcb\">Here is how they set up their simulations:<\/p>\n<p class=\"wp-block-paragraph\" id=\"7f04\">They regressed a dependent series \ud835\udc18\u209c on m series \ud835\udc17\u2c7c,\u209c (with j = 1, 2, \u2026, m), varying m from 1 to 5. 
The dependent series \ud835\udc18\u209c and the independent series \ud835\udc17\u2c7c,\u209c follow the same types of processes, and they tested four cases:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Case 1 (Levels)<\/strong>: \ud835\udc18\u209c and \ud835\udc17\u2c7c,\u209c follow random walks.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Case 2 (Differences)<\/strong>: They use the first differences of the random walks, which are stationary.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Case 3 (Levels)<\/strong>: \ud835\udc18\u209c and \ud835\udc17\u2c7c,\u209c follow ARIMA(0,1,1).<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Case 4 (Differences)<\/strong>: They use the first differences of the previous ARIMA(0,1,1) processes, which are stationary.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"497d\">Each series has a length of 50 observations, and they ran 100 simulations for each case.<\/p>\n<p class=\"wp-block-paragraph\" id=\"38c3\">All error terms are distributed as \ud835\udc0d(0,1), and the ARIMA(0,1,1) series are derived as the sum of the random walk and independent white noise. 
The simulation results are summarized in the next table.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f2f2f2\" data-has-transparency=\"false\" style=\"--dominant-color: #f2f2f2;\" loading=\"lazy\" decoding=\"async\" width=\"998\" height=\"788\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-2.png?resize=998%2C788&#038;ssl=1\" alt=\"\" class=\"wp-image-599429 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-2.png 998w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-2-300x237.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-2-768x606.png 768w\" sizes=\"auto, (max-width: 998px) 100vw, 998px\"><\/figure>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/miro.medium.com\/v2\/resize%3Afit%3A1400\/1%2AUfWmRvXLbEb3KQnfpoLpOw.png?ssl=1\" alt=\"\"><figcaption class=\"wp-element-caption\"><strong>Table 2: Regressions of a series on m independent \u2018explanatory\u2019 series.<\/strong><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"6b47\">Interpretation of the results:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">When the regressions are made with random walk series (rw-levels), the probability of not rejecting the null hypothesis of no relationship between \ud835\udc18\u209c and \ud835\udc17\u2c7c,\u209c becomes very small once m \u2265 3. Both \ud835\udc11\u00b2 and the mean Durbin-Watson statistic increase as m grows. 
Similar results are obtained when the regressions are made with ARIMA(0,1,1) series (arima-levels).<\/li>\n<li class=\"wp-block-list-item\">When white noise series (rw-diffs) are used, classical regression analysis is valid since the error series will be white noise and least squares will be efficient.<\/li>\n<li class=\"wp-block-list-item\">However, when the regressions are made with the differences of ARIMA(0,1,1) series (arima-diffs), that is, with first-order moving average MA(1) series, the null hypothesis is rejected, on average:<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"de6c\">(10 + 16 + 5 + 6 + 6) \/ 5 = 8.6%<\/p>\n<p class=\"wp-block-paragraph\" id=\"8c1e\">of the time, which is greater than the nominal 5% level.<\/p>\n<p class=\"wp-block-paragraph\" id=\"0db8\">If your variables are random walks or close to them, and you include unnecessary variables in your regression, you will often get misleading results. High \ud835\udc11\u00b2 and low Durbin-Watson values do not confirm a true relationship but instead indicate a likely spurious one.<\/p>\n<h2 class=\"wp-block-heading\" id=\"04cd\">5.\u00a0How to avoid spurious regression in time series<\/h2>\n<p class=\"wp-block-paragraph\" id=\"3a95\">It\u2019s really hard to come up with a complete list of ways to avoid spurious regressions. However, there are a few good practices you can follow to\u00a0<strong>minimize the risk as much as possible<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"a6ec\">If one performs a regression analysis with time series data and finds that the residuals are strongly autocorrelated, there is a serious problem when it comes to interpreting the coefficients of the equation. 
To check for autocorrelation in the residuals, one can use the Durbin-Watson test or a portmanteau test.<\/p>\n<p class=\"wp-block-paragraph\" id=\"646c\">Based on the study above, we can conclude that if a regression analysis performed with economic variables produces strongly autocorrelated residuals, meaning a low Durbin-Watson statistic, then the results of the analysis are likely to be spurious, whatever the value of the coefficient of determination R\u00b2 observed.<\/p>\n<p class=\"wp-block-paragraph\" id=\"8c10\">In such cases, it is important to understand where the misspecification comes from. According to the literature, misspecification usually falls into three categories: (i) the omission of a relevant variable, (ii) the inclusion of an irrelevant variable, or (iii) autocorrelation of the errors. Most of the time, misspecification comes from a mix of these three sources.<\/p>\n<p class=\"wp-block-paragraph\" id=\"53b5\">To avoid spurious regression in a time series, several recommendations can be made:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The first recommendation is to select the right macroeconomic variables that are likely to explain the dependent variable. This can be done by reviewing the literature or consulting experts in the field.<\/li>\n<li class=\"wp-block-list-item\">The second recommendation is to stationarize the series by taking first differences. In most cases, the first differences of macroeconomic variables are stationary and still easy to interpret. For macroeconomic data, it\u2019s strongly recommended to difference the series once to reduce the autocorrelation of the residuals, especially when the sample size is small. Strong serial correlation is indeed sometimes observed in these variables. 
A simple calculation shows that the first differences will almost always have much smaller serial correlations than the original series.<\/li>\n<li class=\"wp-block-list-item\">The third recommendation is to use the Box-Jenkins methodology to model each macroeconomic variable individually and then search for relationships between the series by relating the residuals from each individual model. The idea here is that the Box-Jenkins process extracts the explained part of the series, leaving the residuals, which contain only what can\u2019t be explained by the series\u2019 own past behavior. This makes it easier to check whether these unexplained parts (residuals) are related across variables.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\" id=\"4b34\">6. Conclusion<\/h2>\n<p class=\"wp-block-paragraph\" id=\"7a3c\">Many econometrics textbooks warn about specification errors in regression models, but the problem still shows up in many published papers. Granger and Newbold (<a href=\"https:\/\/jumbong.github.io\/personal-website\/Others\/spurious_reg.html#ref-granger1974spurious\" rel=\"noreferrer noopener\" target=\"_blank\">1974<\/a>) highlighted the risk of spurious regressions, where you get a high R\u00b2 paired with a very low Durbin-Watson statistic.<\/p>\n<p class=\"wp-block-paragraph\" id=\"58d6\">Using Python simulations, we showed some of the main causes of these spurious regressions, especially including variables that don\u2019t belong in the model and are highly autocorrelated. We also demonstrated how these issues can completely distort hypothesis tests on the coefficients.<\/p>\n<p class=\"wp-block-paragraph\" id=\"8ece\">Hopefully, this post will help reduce the risk of spurious regressions in future econometric analyses.<\/p>\n<h2 class=\"wp-block-heading\" id=\"aa70\">7. 
Appendix: Python code for simulation<\/h2>\n<p class=\"wp-block-paragraph\" id=\"9652\">Simulation code for Table 1:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\n\nnp.random.seed(123)\nM = 100  # number of simulations\nn = 50   # length of each series\nS = np.zeros(M)\nfor i in range(M):\n    # Generate two independent random walks\n    epsilon_y = np.random.normal(0, 1, n)\n    epsilon_x = np.random.normal(0, 1, n)\n    Y = np.cumsum(epsilon_y)\n    X = np.cumsum(epsilon_x)\n\n    # Fit the OLS model of Y on X with an intercept\n    X = sm.add_constant(X)\n    model = sm.OLS(Y, X).fit()\n\n    # Compute the statistic S = |beta_1_hat| \/ SE(beta_1_hat)\n    S[i] = np.abs(model.params[1]) \/ model.bse[1]\n\n# Maximum value of S, rounded up\nS_max = int(np.ceil(max(S)))\n\n# Create bins of width 1\nbins = np.arange(0, S_max + 2, 1)\n\n# Compute the histogram\nfrequency, bin_edges = np.histogram(S, bins=bins)\n\n# Create a dataframe of interval frequencies\ndf = pd.DataFrame({\n    \"S Interval\": [f\"{int(bin_edges[i])}-{int(bin_edges[i+1])}\" for i in 
range(len(bin_edges)-1)],\n    \"Frequency\": frequency\n})\nprint(df)\nprint(np.mean(S))<\/code><\/pre>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f7f7f7\" data-has-transparency=\"false\" style=\"--dominant-color: #f7f7f7;\" loading=\"lazy\" decoding=\"async\" width=\"722\" height=\"1020\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-4.png?resize=722%2C1020&#038;ssl=1\" alt=\"\" class=\"wp-image-599430 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-4.png 722w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-4-212x300.png 212w\" sizes=\"auto, (max-width: 722px) 100vw, 722px\"><\/figure>\n<p class=\"wp-block-paragraph\" id=\"c000\">Simulation code for Table 2:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nfrom statsmodels.stats.stattools import durbin_watson\nfrom tabulate import tabulate\n\nnp.random.seed(1)  # Make the results reproducible\n\n#------------------------------------------------------\n# Definition of functions\n#------------------------------------------------------\n\ndef generate_random_walk(T):\n    \"\"\"\n    Generate a series of length T following a random walk:\n        Y_t = Y_{t-1} + e_t,\n    where e_t ~ N(0,1).\n    \"\"\"\n    e = np.random.normal(0, 1, size=T)\n    return np.cumsum(e)\n\ndef generate_arima_0_1_1(T):\n    \"\"\"\n    Generate an ARIMA(0,1,1) series following the method of Granger &amp; Newbold:\n    the series is obtained by adding a random walk and an independent white noise.\n    \"\"\"\n    rw = generate_random_walk(T)\n    wn = 
np.random.normal(0, 1, size=T)\n    return rw + wn\n\ndef difference(series):\n    \"\"\"\n    Compute the first difference of a one-dimensional series.\n    Returns a series of length T-1.\n    \"\"\"\n    return np.diff(series)\n\n#------------------------------------------------------\n# Parameters\n#------------------------------------------------------\n\nT = 50           # length of each series\nn_sims = 100     # number of Monte Carlo simulations\nalpha = 0.05     # significance level\n\n#------------------------------------------------------\n# Definition of function for simulation\n#------------------------------------------------------\n\ndef run_simulation_case(case_name, m_values=(1, 2, 3, 4, 5)):\n    \"\"\"\n    case_name : identifier for the type of data generation:\n        - 'rw-levels' : random walk (levels)\n        - 'rw-diffs'  : differences of RW (white noise)\n        - 'arima-levels' : ARIMA(0,1,1) in levels\n        - 'arima-diffs'  : differences of an ARIMA(0,1,1) =&gt; MA(1)\n    \n    m_values : numbers of regressors to try.\n    \n    Returns a DataFrame with, for each m:\n        - % of rejections of H0\n        - mean Durbin-Watson\n        - mean adjusted R^2\n        - % of adjusted R^2 &gt; 0.7\n    \"\"\"\n    results = []\n    \n    for m in m_values:\n        count_reject = 0\n        dw_list = []\n        r2_adjusted_list = []\n        \n        for _ in range(n_sims):\n            # 1) Generate independent Y_t and X_{j,t}\n            if case_name == 'rw-levels':\n                Y = generate_random_walk(T)\n                Xs = [generate_random_walk(T) for __ in range(m)]\n            \n            elif case_name == 'rw-diffs':\n                # Y and X are the differences of a RW, i.e. 
~ white noise\n                Y_rw = generate_random_walk(T)\n                Y = difference(Y_rw)\n                Xs = []\n                for __ in range(m):\n                    X_rw = generate_random_walk(T)\n                    Xs.append(difference(X_rw))\n                # Note: Y and Xs now have length T-1,\n                # so the regression uses T-1 observations\n            \n            elif case_name == 'arima-levels':\n                Y = generate_arima_0_1_1(T)\n                Xs = [generate_arima_0_1_1(T) for __ in range(m)]\n            \n            elif case_name == 'arima-diffs':\n                # Differences of an ARIMA(0,1,1) =&gt; MA(1)\n                Y_arima = generate_arima_0_1_1(T)\n                Y = difference(Y_arima)\n                Xs = []\n                for __ in range(m):\n                    X_arima = generate_arima_0_1_1(T)\n                    Xs.append(difference(X_arima))\n            \n            # 2) Prepare the data for the regression\n            #    (length T for levels, T-1 for differences)\n            Y_reg = Y\n            X_reg = np.column_stack(Xs) if m&gt;0 else np.array([])\n            \n            # 3) OLS regression\n            X_with_const = sm.add_constant(X_reg)  # Add the intercept\n            model = sm.OLS(Y_reg, X_with_const).fit()\n            \n            # 4) Global F-test: H0: all beta_j = 0\n            #    Reject when the p-value &lt; alpha\n            if model.f_pvalue is not None and model.f_pvalue &lt; alpha:\n                count_reject += 1\n            \n            # 5) R^2, Durbin-Watson\n       
     r2_adjusted_list.append(model.rsquared_adj)\n            dw_list.append(durbin_watson(model.resid))\n        \n        # Summary statistics over the n_sims replications\n        reject_percent = 100 * count_reject \/ n_sims\n        dw_mean = np.mean(dw_list)\n        r2_mean = np.mean(r2_adjusted_list)  # mean of the adjusted R^2\n        r2_above_0_7_percent = 100 * np.mean(np.array(r2_adjusted_list) &gt; 0.7)\n        \n        results.append({\n            'm': m,\n            'Reject %': reject_percent,\n            'Mean DW': dw_mean,\n            'Mean R^2': r2_mean,\n            '% R^2_adj&gt;0.7': r2_above_0_7_percent\n        })\n    \n    return pd.DataFrame(results)\n    \n#------------------------------------------------------\n# Application of the simulation\n#------------------------------------------------------\n\ncases = ['rw-levels', 'rw-diffs', 'arima-levels', 'arima-diffs']\nall_results = {}\n\nfor c in cases:\n    df_res = run_simulation_case(c, m_values=[1,2,3,4,5])\n    all_results[c] = df_res\n\n#------------------------------------------------------\n# Print the results as tables\n#------------------------------------------------------\n\nfor case, df_res in all_results.items():\n    print(f\"\\n\\n{case}\")\n    print(tabulate(df_res, headers='keys', tablefmt='fancy_grid'))<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"efefef\" data-has-transparency=\"false\" style=\"--dominant-color: #efefef;\" loading=\"lazy\" decoding=\"async\" width=\"999\" height=\"1024\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-5-999x1024.png?resize=999%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-599431 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-5-999x1024.png 999w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-5-293x300.png 293w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-5-768x787.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-5.png 1400w\" sizes=\"auto, (max-width: 999px) 100vw, 999px\"><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"efefef\" data-has-transparency=\"false\" style=\"--dominant-color: #efefef;\" loading=\"lazy\" decoding=\"async\" width=\"990\" height=\"1024\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-6-990x1024.png?resize=990%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-599432 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-6-990x1024.png 990w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-6-290x300.png 290w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-6-768x794.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Spurious-Regression-6.png 1400w\" sizes=\"auto, (max-width: 990px) 100vw, 990px\"><\/figure>\n<h2 class=\"wp-block-heading\" id=\"05ee\">References<\/h2>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Granger, Clive WJ, and Paul Newbold. 1974. \u201cSpurious Regressions in Econometrics.\u201d\u00a0<em>Journal of Econometrics<\/em>\u00a02 (2): 111\u201320.<\/li>\n<li class=\"wp-block-list-item\">Knowles, EAG. 1954. 
\u201cExercises in Theoretical Statistics.\u201d Oxford University Press.<\/li>\n<\/ul>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/linear-regression-in-time-series-sources-of-spurious-regression\/\">Linear Regression in Time Series: Sources of Spurious Regression<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Junior Jumbong<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/linear-regression-in-time-series-sources-of-spurious-regression\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Linear Regression in Time Series: Sources of Spurious Regression 1. Introduction It\u2019s pretty clear that most of our work will be automated by AI in the future. This will be possible because many researchers and professionals are working hard to make their work available online. 
These contributions not only help us understand fundamental concepts but [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,83,311,803,229,157,354],"tags":[336,325,15],"class_list":["post-2332","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-data-science","category-getting-started","category-linear-regression","category-math","category-python","category-time-series-analysis","tag-regression","tag-series","tag-time"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2332"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2332"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2332\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2332"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2332"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2332"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}