{"id":3692,"date":"2025-05-09T07:02:21","date_gmt":"2025-05-09T07:02:21","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/05\/09\/the-dangers-of-deceptive-data-part-2-base-proportions-and-bad-statistics\/"},"modified":"2025-05-09T07:02:21","modified_gmt":"2025-05-09T07:02:21","slug":"the-dangers-of-deceptive-data-part-2-base-proportions-and-bad-statistics","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/05\/09\/the-dangers-of-deceptive-data-part-2-base-proportions-and-bad-statistics\/","title":{"rendered":"The Dangers of Deceptive Data Part 2\u2013Base Proportions and Bad Statistics"},"content":{"rendered":"<p>    The Dangers of Deceptive Data Part 2\u2013Base Proportions and Bad Statistics<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<p class=\"wp-block-paragraph\"><mdspan datatext=\"el1746680040658\" class=\"mdspan-comment\">This is a follow<\/mdspan>-up to my earlier article: <a href=\"https:\/\/towardsdatascience.com\/the-dangers-of-deceptive-data-confusing-charts-and-misleading-headlines\/\">The Dangers of Deceptive Data\u2013Confusing Charts and Misleading Headlines<\/a>. My first article focused on how <em>visualizations<\/em> can be used to mislead, diving into a form of data presentation widely used in public matters.<\/p>\n<p class=\"wp-block-paragraph\">In this article, I go a bit deeper, looking at how a misunderstanding of statistical ideas is a breeding ground for being deceived by data. Specifically, I\u2019ll walk through how correlation, base proportions, summary statistics, and misinterpretation of uncertainty can lead people astray.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s get right into it.<\/p>\n<h2 class=\"wp-block-heading\">Correlation \u2260 Causation<\/h2>\n<p class=\"wp-block-paragraph\">Let\u2019s start with a classic to get in the right frame of mind for some more complex ideas. 
From the earliest statistics classes in grade school, we are all told that correlation does not imply causation.<\/p>\n<p class=\"wp-block-paragraph\">If you do a bit of Googling or reading, you can find \u201cstatistics\u201d that show a high positive correlation between a country\u2019s cigarette consumption and its average life expectancy [1]. Interesting. Well, does that mean we should all start smoking to live longer?<\/p>\n<p class=\"wp-block-paragraph\">Of course not. We\u2019re missing a confounding factor: buying cigarettes requires money, and countries with higher wealth understandably have higher life expectancies. Smoking is not what makes people in those countries live longer. I like this example because it is so blatantly misleading and highlights the point well. In general, it\u2019s important to be wary of any data that only shows a correlational link.<\/p>\n<p class=\"wp-block-paragraph\">From a scientific standpoint, a correlation can be identified via observation, but the most reliable way to establish causation is to conduct a randomized controlled trial that balances potential confounding factors across groups, which is a fairly involved process.<\/p>\n<p class=\"wp-block-paragraph\">I chose to start here because, while introductory, this concept also highlights a key idea that underpins understanding data effectively: <strong>The data only shows what it shows, and nothing else.<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Keep that in mind as we move forward.<\/p>\n<h2 class=\"wp-block-heading\">Remember Base Proportions<\/h2>\n<p class=\"wp-block-paragraph\">In 1978, Dr. 
Stephen Casscells and his team famously asked a group of 60 physicians, residents, and students at Harvard Medical School the following question:<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">\u201cIf a test to detect a disease whose prevalence is 1 in 1,000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person\u2019s symptoms or signs?\u201d<\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">Though presented in medical terms, this question is really about statistics. Accordingly, it also has connections to data science. Take a second to think about your own answer to this question before reading further.<\/p>\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/unsplash.com\/photos\/three-doctors-woman-in-hijab-and-two-men-in-medical-apparel-discussing-patients-x-ray-tomography-scan-walking-outside-on-the-background-of-modern-hospital-with-stairs-KrQAGZasfcU\"><img data-recalc-dims=\"1\" height=\"683\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/premium_photo-1661746503127-004c48f57496-1024x683.jpg?resize=1024%2C683&#038;ssl=1\" alt=\"\" class=\"wp-image-603627\"><\/a><figcaption class=\"wp-element-caption\">Photo by <a href=\"https:\/\/unsplash.com\/@gettyimages\">Getty Images<\/a> on Unsplash<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The answer is (approximately) 2%, assuming the test never misses a true case. Out of every 1,000 people tested, about 1 has the disease and tests positive, while about 50 of the 999 healthy people (5%) also test positive, so only about 1 in 51 positive results is genuine. Now, if you looked through this quickly (and aren\u2019t up to speed with your statistics), you may have guessed significantly higher.<\/p>\n<p class=\"wp-block-paragraph\">This was certainly the case with the medical school folks. 
Only 11\/60 people correctly answered the question, with 27\/60 going as high as 95% in their response (presumably just subtracting the false positive rate from 100).<\/p>\n<p class=\"wp-block-paragraph\">It is easy to assume that the actual value should be high due to the positive test result, but this assumption contains a crucial reasoning error: It fails to account for the extremely low prevalence of the disease in the population.<\/p>\n<p class=\"wp-block-paragraph\">Said another way, if only 1 in every 1,000 people has the disease, this needs to be taken into account when calculating the probability of a random person having the disease. The probability does not rely only on the positive test result. As soon as the test accuracy falls below 100%, the influence of the base rate comes into play quite significantly.<\/p>\n<p class=\"wp-block-paragraph\">Formally, this reasoning error is known as the <strong>base rate fallacy<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">To see this more clearly, imagine that only 1 in every 1,000,000 people had the disease, but the test still has a false positive rate of 5%. Would you still assume that a positive test result immediately indicates a 95% chance of having the disease? What if it were 1 in a billion?<\/p>\n<p class=\"wp-block-paragraph\">Base rates are extremely important. Remember that.<\/p>\n<h2 class=\"wp-block-heading\">Statistical Measures Are NOT Equivalent to the Data<\/h2>\n<p class=\"wp-block-paragraph\">Let\u2019s take a look at the following quantitative data sets (13 of them, to be precise), all of which are visualized as a scatter plot. One is even in the shape of a dinosaur.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/datasaurus_dozen.png?ssl=1\" alt=\"\" class=\"wp-image-603705\"><figcaption class=\"wp-element-caption\">Image By Author. 
Generated using code available under MIT license at <a href=\"https:\/\/jumpingrivers.github.io\/datasauRus\/\">https:\/\/jumpingrivers.github.io\/datasauRus\/<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Do you see anything interesting about these data sets?<\/p>\n<p class=\"wp-block-paragraph\">I\u2019ll point you in the right direction. Here is a set of summary statistics for the data:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<tbody>\n<tr>\n<td>X-Mean<\/td>\n<td>54.26<\/td>\n<\/tr>\n<tr>\n<td>Y-Mean<\/td>\n<td>47.83<\/td>\n<\/tr>\n<tr>\n<td>X-SD (Standard Deviation)<\/td>\n<td>16.76<\/td>\n<\/tr>\n<tr>\n<td>Y-SD<\/td>\n<td>26.93<\/td>\n<\/tr>\n<tr>\n<td>Correlation<\/td>\n<td>-0.06<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">If you\u2019re wondering why there is only one set of statistics, it\u2019s because they\u2019re all the same to two decimal places. Every single one of the 13 <a href=\"https:\/\/towardsdatascience.com\/tag\/charts\/\" title=\"Charts\">Charts<\/a> above has the same means, standard deviations, and correlation between variables.<\/p>\n<p class=\"wp-block-paragraph\">This famous collection of 13 data sets is known as the <em>Datasaurus Dozen<\/em> [5], and was published some years ago as a stark example of why summary statistics cannot be trusted on their own. It also highlights the value of visualization as a tool for data exploration. 
In the words of renowned statistician John Tukey,<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">\u201c<strong>The greatest value of a picture is when it forces us to notice what we never expected to see.<\/strong>\u201d<\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Understanding Uncertainty<\/h2>\n<p class=\"wp-block-paragraph\">To conclude, I want to talk about a slight variation on deceptive data, but one that is equally important: <strong>mistrusting data that is actually correct.<\/strong> In other words, seeing deception where there is none.<\/p>\n<p class=\"wp-block-paragraph\">The following chart is taken from a study analyzing the sentiment of headlines from left-leaning, right-leaning, and centrist news outlets [6]:<\/p>\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/journals.plos.org\/plosone\/article?id=10.1371\/journal.pone.0276367\"><img data-recalc-dims=\"1\" height=\"691\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/Average_yearly_sentiment_of_headlines_grouped_by_the_ideological_leanings_of_news_outlets-1024x691.png?resize=1024%2C691&#038;ssl=1\" alt=\"\" class=\"wp-image-603543\"><\/a><figcaption class=\"wp-element-caption\">\u201cAverage yearly sentiment of headlines grouped by the ideological leanings of news outlets\u201d by Authors of the study: David Rozado, Ruth Hughes, Jamin Halberstadt is licensed under CC BY 4.0. To view a copy of this license, visit https:\/\/creativecommons.org\/licenses\/by\/4.0\/?ref=openverse.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">There is quite a bit going on in the chart above, but there is one particular aspect I want to draw your attention to: the vertical lines extending from each plotted point. You may have seen these before. 
Formally, these are called <em>error bars<\/em>, and they are one way that scientists often depict uncertainty in the data.<\/p>\n<p class=\"wp-block-paragraph\">Let me say that again. In statistics and <a href=\"https:\/\/towardsdatascience.com\/tag\/data-science\/\" title=\"Data Science\">Data Science<\/a>, the \u201cerror\u201d in \u201cerror bars\u201d means \u201cuncertainty.\u201d Crucially, <strong>it does not mean something is wrong or incorrect about what is being shown<\/strong>. When a chart depicts uncertainty, it depicts a carefully calculated range of plausible values for an estimate (for example, a confidence interval or a standard error around it). Unfortunately, many people just take it to mean that whoever made the chart is essentially guessing.<\/p>\n<p class=\"wp-block-paragraph\">This is a serious error in reasoning, for the damage is twofold: Not only does the data at hand get misinterpreted, but the presence of this misconception also contributes to the dangerous societal belief that science is not to be trusted. Being upfront about the limitations of knowledge should actually increase our confidence in a claim\u2019s reliability, but mistaking that honesty for an admission of foul play leads to the opposite effect.<\/p>\n<p class=\"wp-block-paragraph\">Learning how to interpret uncertainty is challenging but incredibly important. At a minimum, a good place to start is realizing what the so-called \u201cerror\u201d is actually trying to convey.<\/p>\n<h2 class=\"wp-block-heading\">Recap and Final Thoughts<\/h2>\n<p class=\"wp-block-paragraph\">Here\u2019s a cheat sheet for being wary of deceptive data:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Correlation \u2260<\/strong> <strong>causation<\/strong>. 
Look for the confounding factor.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Remember base proportions.<\/strong> The probability of a phenomenon is <em>highly<\/em> influenced by its prevalence in the population, no matter how accurate your test is (with the exception of 100% accuracy, which is rare).<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Beware summary <a href=\"https:\/\/towardsdatascience.com\/tag\/statistics\/\" title=\"Statistics\">Statistics<\/a>.<\/strong> Means, standard deviations, and correlations will only take you so far; you need to explore your data.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Don\u2019t misunderstand uncertainty.<\/strong> It isn\u2019t an error; it\u2019s a carefully considered description of confidence levels.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Remember these, and you\u2019ll be well positioned to tackle the next data science problem that makes its way to you.<\/p>\n<p class=\"wp-block-paragraph\">Until next time.<\/p>\n<h2 class=\"wp-block-heading\">References<\/h2>\n<p class=\"wp-block-paragraph\">[1] <em>How Charts Lie<\/em>, Alberto Cairo<\/p>\n<p class=\"wp-block-paragraph\">[2] <a href=\"https:\/\/pmc.ncbi.nlm.nih.gov\/articles\/PMC4955674\">https:\/\/pmc.ncbi.nlm.nih.gov\/articles\/PMC4955674<\/a><\/p>\n<p class=\"wp-block-paragraph\">[3] <a href=\"https:\/\/data88s.org\/textbook\/content\/Chapter_02\/04_Use_and_Interpretation.html\">https:\/\/data88s.org\/textbook\/content\/Chapter_02\/04_Use_and_Interpretation.html<\/a><\/p>\n<p class=\"wp-block-paragraph\">[4] <a href=\"https:\/\/visualizing.jp\/the-datasaurus-dozen\">https:\/\/visualizing.jp\/the-datasaurus-dozen<\/a><\/p>\n<p class=\"wp-block-paragraph\">[5] <a 
href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3025453.3025912\">https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3025453.3025912<\/a><\/p>\n<p class=\"wp-block-paragraph\">[6] <a href=\"https:\/\/journals.plos.org\/plosone\/article?id=10.1371\/journal.pone.0276367\">https:\/\/journals.plos.org\/plosone\/article?id=10.1371\/journal.pone.0276367<\/a><\/p>\n<p class=\"wp-block-paragraph\">\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/the-dangers-of-deceptive-data-part-2-base-proportions-and-bad-statistics\/\">The Dangers of Deceptive Data Part 2\u2013Base Proportions and Bad Statistics<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Murtaza Ali<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/the-dangers-of-deceptive-data-part-2-base-proportions-and-bad-statistics\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Dangers of Deceptive Data Part 2\u2013Base Proportions and Bad Statistics This is a follow-up to my earlier article: The Dangers of Deceptive Data\u2013Confusing Charts and Misleading Headlines. My first article focused on how visualizations can be used to mislead, diving into a form of data presentation widely used in public matters. 
In this article, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1034,211,83,82,240,238],"tags":[1642,84,2530],"class_list":["post-3692","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-charts","category-data-analysis","category-data-science","category-data-visualization","category-editors-pick","category-statistics","tag-correlation","tag-data","tag-statistics"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3692"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3692"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3692\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3692"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3692"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3692"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}