{"id":3801,"date":"2025-05-14T07:03:04","date_gmt":"2025-05-14T07:03:04","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/05\/14\/non-parametric-density-estimation-theory-and-applications\/"},"modified":"2025-05-14T07:03:04","modified_gmt":"2025-05-14T07:03:04","slug":"non-parametric-density-estimation-theory-and-applications","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/05\/14\/non-parametric-density-estimation-theory-and-applications\/","title":{"rendered":"Non-Parametric Density Estimation: Theory and Applications"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">In this article, we\u2019ll talk about what <a href=\"https:\/\/towardsdatascience.com\/tag\/density-estimation\/\" title=\"Density Estimation\">Density Estimation<\/a> is and the role it plays in statistical analysis. We\u2019ll examine two popular density estimation methods, <strong>histograms<\/strong> and <strong>kernel density estimators<\/strong>, and analyze their theoretical properties as well as how they perform in practice. Finally, we\u2019ll look at how density estimation may be used as a tool for classification tasks. Hopefully after reading this article, you leave with an appreciation of density estimation as a fundamental statistical tool, and a solid intuition behind the density estimation approaches we discuss here. 
Ideally, this article will also spark an interest in learning more about density estimation and point you towards additional resources to help you dive deeper than what is discussed here!<\/p>\n<p class=\"has-heading-6-font-size wp-block-paragraph\"><strong>Contents:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"has-subtitle-1-font-size wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#background-concepts\">Background Concepts<\/a><\/strong><\/li>\n<li class=\"has-subtitle-1-font-size wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#what-is-density-estimation\">What is density estimation?<\/a><\/strong><\/li>\n<li class=\"has-subtitle-1-font-size wp-block-list-item\">\n<strong><a href=\"https:\/\/towardsdatascience.com\/#histograms\">Histograms<\/a><\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"has-subtitle-2-font-size wp-block-list-item\"><a href=\"https:\/\/towardsdatascience.com\/#histogram-overview\"><strong>Overview<\/strong><\/a><\/li>\n<li class=\"has-subtitle-2-font-size wp-block-list-item\"><a href=\"https:\/\/towardsdatascience.com\/#histogram-theory\"><strong>Theoretical Properties<\/strong><\/a><\/li>\n<li class=\"has-subtitle-2-font-size wp-block-list-item\"><a href=\"https:\/\/towardsdatascience.com\/#histogram-theory-demonstrated\"><strong>Demonstration of Theoretical Properties<\/strong><\/a><\/li>\n<\/ul>\n<\/li>\n<li class=\"wp-block-list-item\">\n<strong><a href=\"https:\/\/towardsdatascience.com\/#kde\">Kernel Density Estimators (KDE)<\/a><\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#naive-density-estimator\">Naive Density Estimator<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#kde-overview\">KDE: Overview<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a 
href=\"https:\/\/towardsdatascience.com\/#kernel-and-bandwidth\">Kernel and Bandwidth<\/a><\/strong><\/li>\n<\/ul>\n<\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#density-for-classification\">Density Estimation for Classification<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#wrap-up\">Wrap-up<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#sources\">Sources<\/a><\/strong><\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"background-concepts\">Background Concepts<\/h2>\n<p class=\"wp-block-paragraph\">Learning\/refreshing on the following concepts will be helpful to fully appreciate the rest of what is discussed in this article.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Bias%E2%80%93variance_tradeoff\" target=\"_blank\" rel=\"noreferrer noopener\">Bias and variance<\/a>: important concepts for discussing the accuracy of the density estimation approaches discussed.<\/li>\n<li class=\"wp-block-list-item\">The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Cumulative_distribution_function\" target=\"_blank\" rel=\"noreferrer noopener\">cumulative distribution function<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Probability_density_function\" target=\"_blank\" rel=\"noreferrer noopener\">probability density function<\/a>.<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Parametric_statistics\" target=\"_blank\" rel=\"noreferrer noopener\">Parametric<\/a> vs. 
<a href=\"https:\/\/en.wikipedia.org\/wiki\/Nonparametric_statistics\" target=\"_blank\" rel=\"noreferrer noopener\">non-parametric<\/a> statistics: knowing the distinction will help understand the relevance of the density estimation methods discussed.<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Big_O_notation\" target=\"_blank\" rel=\"noreferrer noopener\">O notation<\/a>: used to describe the asymptotic behavior of the bias\/variance of the density estimators.<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Kernel_(statistics)\" target=\"_blank\" rel=\"noreferrer noopener\"><mdspan datatext=\"el1747164591183\" class=\"mdspan-comment\">Kernel<\/mdspan><\/a>: kind of important for the kernel density estimator.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"what-is-density-estimation\">What is density estimation?<\/h2>\n<p class=\"wp-block-paragraph\">Density estimation is concerned with reconstructing the probability density function of a random variable, <em>X<\/em>, given a sample of random variates <em>X<sub>1<\/sub>, X<\/em><sub>2<\/sub><em>,\u2026, X<sub>n<\/sub><\/em>.<\/p>\n<p class=\"wp-block-paragraph\">Density estimation plays a crucial role in statistical analysis. It may be used as a standalone method for analyzing the properties of a random variable\u2019s distribution, such as modality, spread, and skew. 
Alternatively, density estimation may be used as a means for further statistical analysis, such as classification tasks, goodness-of-fit tests, and anomaly detection, to name a few.<\/p>\n<p class=\"wp-block-paragraph\">Some of you may recall that the probability distribution of a random variable <em>X<\/em> can be completely characterized by its cumulative distribution function (CDF), <em>F<\/em>(\u22c5).<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">If <em>X<\/em> is a discrete random variable, then we can derive its probability mass function (PMF), <em>p<\/em>(\u22c5), from its CDF via the following relationship:  <em>p<\/em>(X<sub>i<\/sub>)<i> = F(X<\/i><sub style=\"font-style: italic;\">i<\/sub><i>) \u2212 F(X<\/i><sub style=\"\"><i>i<\/i>-1<\/sub>), where <em><i>X<\/i><sub style=\"\"><i>i<\/i>-1<\/sub><\/em> denotes the largest value within the discrete distribution of <em>X<\/em> that is less than <i>X<\/i><sub style=\"\"><i>i<\/i><\/sub>.<\/li>\n<li class=\"wp-block-list-item\">If <em>X<\/em> is continuous, then its probability density function (PDF), <em>p<\/em>(\u22c5), may be derived by differentiating its CDF i.e. <em>F\u2032<\/em>(\u22c5)<em> = p<\/em>(\u22c5).<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Based on this, you may be wondering why we need methods to estimate the probability distribution of <em>X<\/em>, when we can just exploit the relationships stated above.<\/p>\n<p class=\"wp-block-paragraph\">Certainly, given a sample of data <em><em>X<sub>1<\/sub>,\u2026, X<sub>n<\/sub><\/em><\/em>, we may always construct an estimate of its CDF. If <em>X<\/em> is discrete, then constructing its PMF is straightforward, as it simply requires counting the frequency of observations for each distinct value that appears in our sample.<\/p>\n<p class=\"wp-block-paragraph\">However, if <em>X<\/em> is continuous, estimating its PDF is not so trivial. 
Notice that our estimate of the CDF, <em>F<\/em>(\u22c5), will necessarily follow a discrete distribution, since we have a finite amount of empirical data. Since <em>F<\/em>(\u22c5) is discrete, we cannot simply differentiate it to obtain an estimate of the PDF. Thus, this motivates the need for other methods of estimating <em>p<\/em>(\u22c5).<\/p>\n<p class=\"wp-block-paragraph\">To provide some additional motivation behind density estimation, the CDF may be suboptimal to use for analyzing the properties of the probability distribution of <em>X<\/em>. For example, consider the following display. <\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"731\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/cdf-vs-pdf-1-1024x731.png?resize=1024%2C731&#038;ssl=1\" alt=\"\" class=\"wp-image-603909\"><figcaption class=\"wp-element-caption\">PDF vs. CDF of data following a bimodal distribution.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Certain properties of the distribution of <em>X<\/em>, such as its bimodal nature, are immediately clear from analyzing its PDF. However, these properties are harder to notice from analyzing its CDF, due to the cumulative nature of the distribution. For many folks, the PDF likely provides a more intuitive display of the distribution of <em>X\u200a\u2014\u200a<\/em>it is larger at values of <em>X<\/em> that are more likely to \u201coccur\u201d and smaller for values of <em>X<\/em> that are less likely.<\/p>\n<p class=\"wp-block-paragraph\">Broadly speaking, density estimation approaches may be categorized as <em>parametric<\/em> or <em>non-parametric<\/em>.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<em>Parametric<\/em> density estimation assumes <em>X<\/em> follows some distribution that may be characterized by some parameters (ex: <em>X \u223c N<\/em>(<em>\u03bc,\u03c3<\/em>)). 
Density estimation in this case involves estimating the relevant parameters for the parametric distribution of <em>X<\/em>, and then plugging in these parameter estimates to the corresponding density function formula for <em>X<\/em>.<\/li>\n<li class=\"wp-block-list-item\">\n<em>Non-parametric<\/em> density estimation makes less rigid assumptions about the distribution of <em>X<\/em>, and estimates the shape of the density function directly from the empirical data. As a result, non-parametric density estimates will typically have lower bias and higher variance compared to parametric density estimates. Non-parametric methods may be desired when the underlying distribution of <em>X<\/em> is unknown and we\u2019re working with a large amount of empirical data.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">For the rest of this article, we\u2019ll focus on analyzing two popular non-parametric methods for density estimation: <strong><a href=\"https:\/\/towardsdatascience.com\/tag\/histograms\/\" title=\"Histograms\">Histograms<\/a><\/strong> and <strong>kernel density estimators<\/strong> (KDEs). We\u2019ll dig into how they work, the benefits and drawbacks of each approach, and how accurately they estimate the true density function of a random variable. Finally, we\u2019ll examine how density estimation can be applied to classification problems, and how the quality of the density estimator can impact classification performance.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"histograms\">Histograms<\/h2>\n<h3 class=\"wp-block-heading\" id=\"histogram-overview\">Overview<\/h3>\n<p class=\"wp-block-paragraph\">Histograms are a simple non-parametric approach for constructing a density estimate from a sample of data. Intuitively, this approach involves partitioning the range of our data into distinct equal length bins. 
Then, for any given point, assign its density to be equal to the proportion of points that reside within the same bin, normalized by the bin length.<\/p>\n<p class=\"wp-block-paragraph\">Formally, given a sample of <em>n<\/em> observations<\/p>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-121.png?ssl=1\" alt=\"\" class=\"wp-image-603920\"><\/figure>\n<p class=\"wp-block-paragraph\">partition the domain into <em>M<\/em> bins<\/p>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-122.png?ssl=1\" alt=\"\" class=\"wp-image-603921\"><\/figure>\n<p class=\"wp-block-paragraph\">such that<\/p>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-120.png?ssl=1\" alt=\"\" class=\"wp-image-603919\"><\/figure>\n<p class=\"wp-block-paragraph\">For a given point\u00a0<em>x<\/em> \u2208 <em>\u03b2<sub>l<\/sub><\/em>, where\u00a0<em>\u03b2<sub>l<\/sub><\/em>\u00a0denotes the\u00a0<em>l<\/em>th bin, the density estimate produced by the histogram will be<\/p>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-123.png?ssl=1\" alt=\"\" class=\"wp-image-603922\"><figcaption class=\"wp-element-caption\">Pointwise density estimate of the histogram.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Since the histogram density estimator assigns uniform density to all points within the same bin, the density estimate will be discontinuous at all of its breakpoints where the density estimates 
differ.<\/p>\n<figure class=\"wp-block-image aligncenter size-large\"><img data-recalc-dims=\"1\" height=\"731\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/std-gaussian-1-1024x731.png?resize=1024%2C731&#038;ssl=1\" alt=\"\" class=\"wp-image-603923\"><figcaption class=\"wp-element-caption\">Histogram density estimate for the standard Gaussian. Uniform densities are assigned to all points within the same bin.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Above, we have the histogram density estimate of the standard Gaussian distribution generated from a sample of 1000 data points. We see that <em>x <\/em>=<em> <\/em>0 and <em>x <\/em>=<em> <\/em>\u22120.5 lie within the same bin, and thus have identical density estimates.<\/p>\n<h3 class=\"wp-block-heading\" id=\"histogram-theory\">Theoretical Properties<\/h3>\n<p class=\"wp-block-paragraph\">Histograms are a simple and intuitive method for density estimation. They make no assumptions about the underlying distribution of the random variable. Histogram estimation simply requires tuning the bin width, <em>h<\/em>, and the point where the histogram bins originate from, <em>t<\/em><sub>0<\/sub>. 
However, we\u2019ll see very soon that the accuracy of the histogram estimator is highly dependent on tuning these parameters appropriately.<\/p>\n<p class=\"wp-block-paragraph\">As desired, the histogram estimator is a true density function.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">It is non-negative over its entire domain.<\/li>\n<li class=\"wp-block-list-item\">It integrates to 1.<\/li>\n<\/ul>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/11a9PfrN5OpWE93enFFZ5iA.png?ssl=1\" alt=\"\" class=\"wp-image-603931\"><figcaption class=\"wp-element-caption\">Integral of the histogram density estimator.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We can evaluate the accuracy of the histogram estimator for estimating the true density, <em>p<\/em>(\u22c5), by decomposing its mean squared error into its bias and variance terms.<\/p>\n<p class=\"wp-block-paragraph\">First, let\u2019s examine its bias at a given point <em>x<\/em> \u2208 (<em>b<sub>k-1<\/sub>, b<sub>k<\/sub><\/em>].<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/16knyxdHXfzFMvhOO9wvYGA.png?ssl=1\" alt=\"\" class=\"wp-image-603926\"><figcaption class=\"wp-element-caption\">Expected value of the pointwise histogram density estimate.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s take a bit of a leap here. 
Using the Taylor series expansion, the fact that the PDF is the derivative of the CDF, and |<em>x \u2212 b<sub>k-1<\/sub><\/em>| \u2264 <em>h<\/em>, we can derive the following.<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1ZNiyXD6w2qYWP5tBBbR78g.png?ssl=1\" alt=\"\" class=\"wp-image-603928\"><\/figure>\n<p class=\"wp-block-paragraph\">Thus, we have<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1BkSKnylcLXZuHFz8fRJoQA.png?ssl=1\" alt=\"\" class=\"wp-image-603927\"><\/figure>\n<p class=\"wp-block-paragraph\">which implies<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1UTsTbWvj8oUIDbOx3Bx7lw.png?ssl=1\" alt=\"\" class=\"wp-image-603929\"><figcaption class=\"wp-element-caption\">Asymptotic bias of the histogram density estimator.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Therefore, the histogram estimator is an asymptotically unbiased estimator of the true density, <em>p<\/em>(\u22c5): its bias vanishes as the bin width approaches 0.<\/p>\n<p class=\"wp-block-paragraph\">Now, let\u2019s analyze the variance of the histogram estimator.<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1ojjhfUvwq_519d_0LfMqtg.png?ssl=1\" alt=\"\" class=\"wp-image-603933\"><\/figure>\n<p class=\"wp-block-paragraph\">Notice that as <em>h<\/em> \u2192 0, we have<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" 
src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1Y3SD7Xqiwe7JSPqC3WEiaA.png?ssl=1\" alt=\"\" class=\"wp-image-603930\"><\/figure>\n<p class=\"wp-block-paragraph\">Therefore,<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1HSLVqco696DM1P9cXMil0Q.png?ssl=1\" alt=\"\" class=\"wp-image-603925\"><figcaption class=\"wp-element-caption\">Asymptotic variance of the histogram density estimator.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now, we\u2019re at a bit of an impasse; we see that as <em>h<\/em> \u2192 0, the bias of the histogram density estimate decreases, while its variance increases.<\/p>\n<p class=\"wp-block-paragraph\">We\u2019re typically concerned with the accuracy of the density estimate at large sample sizes (i.e. as <em>n<\/em> \u2192 \u221e). Therefore, to maximize the accuracy of the histogram density estimate, we\u2019ll want to tune <em>h<\/em> to achieve the following behavior:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Choose <em>h<\/em> to be small to minimize bias.<\/li>\n<li class=\"wp-block-list-item\">As <em>h <\/em>\u2192 0 and <em>n <\/em>\u2192 \u221e, we must have <em>nh <\/em>\u2192 \u221e to minimize variance. In other words, the large sample size should overpower the small bin width, asymptotically.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">This bias-variance trade-off is not unexpected:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Small bin widths may capture the density around a particular point with high precision. 
However, density estimates may be sensitive to small random variations across data sets, as fewer points will fall within the same bin.<\/li>\n<li class=\"wp-block-list-item\">Large bin widths include more data points when computing the density estimate at a given point, which means density estimates will be more robust to small random variations in the data.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Let\u2019s illustrate this trade-off with some examples.<\/p>\n<h3 class=\"wp-block-heading\" id=\"histogram-theory-demonstrated\">Demonstration of Theoretical Properties<\/h3>\n<p class=\"wp-block-paragraph\">First, we\u2019ll look at how small bin widths may lead to large variance in the histogram density estimator. For this example, we\u2019ll draw four samples of 50 random variates, where each sample is drawn from a standard Gaussian distribution. We\u2019ll set a relatively small bin width (<em>h <\/em>= 0.2).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">set.seed(25)\n\n# Standard Gaussian\nmu &lt;- 0\nsd &lt;- 1\n\n# Parameters for density estimate\nn &lt;- 50\nh &lt;- 0.2\n\n# Generate 4 samples of standard Gaussian\nsamples &lt;- replicate(4, rnorm(n, mean = mu, sd = sd), simplify = FALSE)\n\n# Setup 2x2 plot\npar(mfrow = c(2, 2), mar = c(4, 4, 3, 1))\n\n# Plot histograms\ntitles &lt;- paste(\"Sample\", 1:4)\ninvisible(mapply(plot_histogram, samples, title = titles,\n       MoreArgs = list(binwidth = h, origin = 0, line = 0)))<\/code><\/pre>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1E42lCU2s1yAhY2cQBpLRhg.png?ssl=1\" alt=\"\" class=\"wp-image-603937\"><figcaption class=\"wp-element-caption\">Histogram density estimates (h = 0.2) generated from 4 different samples of the standard Gaussian. 
Notice the high variability in density estimates across\u00a0samples.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">It\u2019s clear that the histogram density estimates vary quite a bit. For instance, we see that the pointwise density estimate at <em>x<\/em> = 0 ranges from approximately 0.2 in Sample 4 to approximately 0.6 in Sample 2. Additionally, the distribution of the density estimate produced in Sample 1 appears almost bimodal, with peaks around \u22121 and a little above 0.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s repeat this exercise to demonstrate how large bin widths may result in a density estimate with lower variance, but higher bias. For this example, let\u2019s draw four samples from a bimodal distribution consisting of a mixture of two Gaussian distributions, N(0, 1) and N(3, 1). We\u2019ll set a relatively large bin width (<em>h <\/em>= 2).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">set.seed(25)\n\n# Bimodal distribution parameters - mixture of N(0, 1) and N(3, 1)\nmu_1 &lt;- 0\nsd_1 &lt;- 1\nmu_2 &lt;- 3\nsd_2 &lt;- 1\n\n# Density estimation parameters\nn &lt;- 100\nh &lt;- 2\n\n# Generate 4 samples from bimodal distribution\nsamples &lt;- replicate(4, c(rnorm(n\/2, mean = mu_1, sd = sd_1), rnorm(n\/2, mean = mu_2, sd = sd_2)), simplify = FALSE)\n\n# Set up 2x2 plotting grid\npar(mfrow = c(2, 2), mar = c(4, 4, 3, 1))\n\n# Plot histograms\ntitles &lt;- paste(\"Sample\", 1:4)\ninvisible(mapply(plot_histogram, samples, title = titles,\n       MoreArgs = list(binwidth = h, origin = 0, line = 0)))<\/code><\/pre>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1vKvykXkFmdA4kMl7QpGxHQ.png?ssl=1\" alt=\"\" class=\"wp-image-603932\"><figcaption class=\"wp-element-caption\">Histogram density estimates (h = 2) generated from 4 different samples of a bimodal distribution. 
These histograms fail to capture the bimodal nature of the\u00a0data.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">There is still some variation in the density estimates across the four histograms, but they appear stable relative to the density estimates we saw above with smaller bin widths. For instance, it appears that the pointwise density estimate at <em>x<\/em> = 0 is approximately 0.15 across all the histograms. However, it\u2019s clear that these histogram estimators introduce a large amount of bias, as the bimodal distribution of the true density function is masked by the large bin widths.<\/p>\n<p class=\"wp-block-paragraph\">Additionally, we mentioned previously that the histogram estimator requires tuning the origin point, <em>t<\/em><sub>0<\/sub>. Let\u2019s look at an example that illustrates the impact that the choice of <em><em>t<\/em><sub>0<\/sub><\/em> can have on the histogram density estimate.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">set.seed(123)\n\n# Distribution and density estimation parameters\n# Bimodal distribution: mixture of N(0, 1) and N(5, 1)\nn &lt;- 50\ndata &lt;- c(rnorm(n\/2, mean = 0, sd = 1), rnorm(n\/2, mean = 5, sd = 1))\nh &lt;- 3\n\n# Set up plotting grid\npar(mfrow = c(1, 2), mar = c(4, 4, 3, 1))\n\n# Same bin width, different origins\nplot_histogram(data, binwidth = h, origin = 0, title = paste(\"Bin width = \", h, \", Origin = 0\"))\nplot_histogram(data, binwidth = h, origin = 1, title = paste(\"Bin width = \", h, \", Origin = 1\"))<\/code><\/pre>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1pqbrQYnNQlsKMW-Du4d9Vw.png?ssl=1\" alt=\"\" class=\"wp-image-603939\"><figcaption class=\"wp-element-caption\">Histogram density estimates of a bimodal distribution with different origin points. 
Notice the histogram on the right fails to capture the bimodal nature of the\u00a0data.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The two histogram density estimates above use the same bin width, but their origin points differ by 1. The impact of the different origin point on the resulting histogram density estimates is evident. The histogram on the left captures the fact that the distribution is bimodal with peaks around 0 and 5. In contrast, the histogram on the right gives the impression that the density of <em>X<\/em> follows a unimodal distribution with a single peak around 5.<\/p>\n<p class=\"wp-block-paragraph\">Histograms are a simple and intuitive approach to density estimation. However, histograms will always produce piecewise-constant density estimates that are discontinuous at the bin boundaries, and we\u2019ve seen that the resulting density estimate may be highly dependent on an arbitrary choice of the origin point. Next, we\u2019ll look at an alternative method for density estimation, <strong><a href=\"https:\/\/towardsdatascience.com\/tag\/kernel-density-estimation\/\" title=\"Kernel Density Estimation\">Kernel Density Estimation<\/a><\/strong>, that addresses these shortcomings.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"kde\">Kernel Density Estimators (KDE)<\/h2>\n<h3 class=\"wp-block-heading\" id=\"naive-density-estimator\">Naive Density Estimator<\/h3>\n<p class=\"wp-block-paragraph\">We\u2019ll first look at the most basic form of a kernel density estimator, the <strong>naive density estimator<\/strong>. 
This approach is also known as the \u201cmoving histogram\u201d; it is an extension of the traditional histogram density estimator that computes the density at a given point by examining the number of observations that fall within an interval that is centered around that point.<\/p>\n<p class=\"wp-block-paragraph\">Formally, the pointwise density estimate at <em>x<\/em> produced by the naive density estimator can be written as follows.<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1uzCAglC2Sq_bHXaPWbAU9Q.png?ssl=1\" alt=\"\" class=\"wp-image-603935\"><figcaption class=\"wp-element-caption\">Pointwise density estimate of the Naive Density Estimator.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Its corresponding <a href=\"https:\/\/en.wikipedia.org\/wiki\/Kernel_%28statistics%29\" target=\"_blank\" rel=\"noreferrer noopener\">kernel<\/a> is defined as follows.<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1G8vop3Cu2tXmekbUvJVnxQ.png?ssl=1\" alt=\"\" class=\"wp-image-603934\"><figcaption class=\"wp-element-caption\">Naive Density Estimator kernel function.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Unlike the traditional histogram density estimate, the density estimate produced by the moving histogram does not vary based on the choice of origin point. 
In fact, there is no concept of \u201corigin point\u201d in the moving histogram, as the density estimate at <em>x<\/em> only depends on the points that lie within the neighborhood (<em>x<\/em> \u2212 (<em>h<\/em>\/2), <em>x<\/em> + (<em>h<\/em>\/2)).<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s examine the density estimate produced by the naive density estimator for the same bimodal distribution as we used above for highlighting the histogram\u2019s dependency on origin point.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">set.seed(123)\n\n# Density estimate parameters (n must be defined before sampling)\nn &lt;- 50\nh &lt;- 1\n\n# Bimodal distribution - mixture of N(0, 1) and N(5, 1)\ndata &lt;- c(rnorm(n\/2, mean = 0, sd = 1), rnorm(n\/2, mean = 5, sd = 1))\n\n# Naive Density Estimator: KDE with rectangular kernel, bw set to half the bin width\n# Note: density() treats bw as the kernel's standard deviation, so the\n# rectangular window here is (x - sqrt(3) * bw, x + sqrt(3) * bw)\npdf_est &lt;- density(data, kernel = \"rectangular\", bw = h\/2)\n\n# Plot PDF\nplot(pdf_est, main = \"NDE: Bimodal Gaussian\", xlab = \"x\", ylab = \"Density\", col = \"blue\", lwd = 2)\nrug(data)\npolygon(pdf_est, col = rgb(0, 0, 1, 0.2), border = NA)\ngrid()<\/code><\/pre>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1fknX1I1mmtgYLr-EuRldKg.png?ssl=1\" alt=\"\" class=\"wp-image-603938\"><figcaption class=\"wp-element-caption\">Naive Density Estimate of a bimodal distribution containing a mixture of N(0, 1) and N(5,\u00a01).<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Clearly, the density estimate produced by the naive density estimator captures the bimodal distribution much more accurately than the traditional histogram. 
Additionally, the density at each point is captured with much finer granularity.<\/p>\n<p class=\"wp-block-paragraph\">That being said, the density estimate produced by the NDE is still quite \u201crough\u201d, i.e. the density estimate does not have smooth curvature. This is because each observation is weighted as \u201call or nothing\u201d when computing the pointwise density estimate, which is apparent from its kernel, <em>K<\/em><sub>0<\/sub>. Specifically, all points within the neighborhood (<em>x<\/em> \u2212 (<em>h<\/em>\/2), <em>x<\/em> + (<em>h<\/em>\/2)) contribute equally to the density estimate, while points outside the interval contribute nothing.<\/p>\n<p class=\"wp-block-paragraph\">Ideally, when computing the density estimate for <em>x<\/em>, we would like to <em>weight points according to their distance from x<\/em>, such that points closer to <em>x<\/em> have a higher impact on its density estimate and points farther from <em>x<\/em> have a lower impact.<\/p>\n<p class=\"wp-block-paragraph\">This is essentially what the KDE does: it generalizes the naive density estimator by replacing the uniform density function with an arbitrary density function, the <strong>kernel<\/strong>. 
Intuitively, you can think of the KDE as a smoothed histogram.<\/p>\n<h3 class=\"wp-block-heading\" id=\"kde-overview\">KDE: Overview<\/h3>\n<p class=\"wp-block-paragraph\">The kernel density estimator generated from a sample <em>X<sub>1<\/sub>,\u2026, X<sub>n<\/sub><\/em> can be defined as follows:<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1PrYlrAC_Gu8zkn4yBrEzqQ.png?ssl=1\" alt=\"\" class=\"wp-image-603936\"><figcaption class=\"wp-element-caption\">Pointwise density estimate of the\u00a0KDE.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Below are some popular choices for kernels used in density estimation.<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1zZdvlDp47aWY8evbcGxZRQ.png?ssl=1\" alt=\"\" class=\"wp-image-603940\"><\/figure>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1rSv0I0u1zmtRM0wKF67M2g.png?ssl=1\" alt=\"\" class=\"wp-image-603942\"><\/figure>\n<p class=\"wp-block-paragraph\">These are just a few of the more popular kernels used for density estimation. For more information about kernel functions, check out the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Kernel_%28statistics%29#Kernel_functions_in_common_use\" rel=\"noreferrer noopener\" target=\"_blank\">Wikipedia article<\/a>. 
If you\u2019re looking for some intuition behind what exactly a kernel function is (as I was), check out this <a href=\"https:\/\/www.quora.com\/What-is-the-intuitive-explanation-of-a-kernel-in-statistics\" rel=\"noreferrer noopener\" target=\"_blank\">Quora thread<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">We can see that the KDE is a genuine density function.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">It is always non-negative, since <em>K<\/em>(\u22c5) is a density function.<\/li>\n<li class=\"wp-block-list-item\">It integrates to 1.<\/li>\n<\/ul>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/14GkUkhL-nETVtUuOy1COUA.png?ssl=1\" alt=\"\" class=\"wp-image-603941\"><figcaption class=\"wp-element-caption\">Integral of the\u00a0KDE.<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\" id=\"kernel-and-bandwidth\">Kernel and Bandwidth<\/h3>\n<p class=\"wp-block-paragraph\">In practice, <em>K<\/em>(\u22c5) is chosen to be symmetric and unimodal around 0 (\u222b<em>u\u22c5K<\/em>(<em>u<\/em>)<em>du<\/em> = 0). Additionally, <em>K<\/em>(\u22c5) is typically scaled to have unit variance when used for density estimation (\u222b<em>u<\/em><sup>2<\/sup>\u22c5<em>K<\/em>(<em>u<\/em>)<em>du <\/em>= 1). This scaling essentially standardizes the impact that the choice of bandwidth, <em>h<\/em>, has on the KDE, regardless of the kernel being used.<\/p>\n<p class=\"wp-block-paragraph\">Since the KDE at a given point is a weighted sum over its neighboring points, where the weights are computed by <em>K<\/em>(\u22c5), the smoothness of the density estimate is inherited from the smoothness of the kernel function.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Smooth kernel functions will produce smooth KDEs. 
We can see that the Gaussian kernel depicted above is infinitely differentiable, so KDEs with the Gaussian kernel will produce density estimates with smooth curvature.<\/li>\n<li class=\"wp-block-list-item\">On the other hand, the other kernel functions (Epanechnikov, rectangular, triangular) are not differentiable everywhere (ex: \u00b11), and in the case of the rectangular and triangular kernels, do not have smooth curvature. Thus, KDEs using these kernels will produce rougher density estimates.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">However, in practice, we\u2019ll see that as long as the kernel function is continuous, the choice of the kernel has relatively little impact on the KDE compared to the choice of bandwidth.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">set.seed(123)\n\n# sample from standard Gaussian\nx &lt;- rnorm(50)\n\n# kernel\/bandwidths for KDEs\nkernels &lt;- c(\"gaussian\", \"epanechnikov\", \"rectangular\", \"triangular\")\nbandwidths &lt;- c(0.5, 1, 2)\n\ncolors_k &lt;- rainbow(length(kernels))\ncolors_b &lt;- rainbow(length(bandwidths))\n\nplot_kde_comparison &lt;- function(values, label, type = c(\"kernel\", \"bandwidth\")) {\n  type &lt;- match.arg(type)\n  plot(NULL, xlim = range(x) + c(-1, 1), ylim = c(0, 0.5),\n       xlab = \"x\", ylab = \"Density\", main = paste(\"KDE with Different\", label))\n\n  for (i in seq_along(values)) {\n    if (type == \"kernel\") {\n      d &lt;- density(x, kernel = values[i])\n      col &lt;- colors_k[i]\n    } else {\n      d &lt;- density(x, bw = values[i], kernel = \"gaussian\")\n      col &lt;- colors_b[i]\n    }\n    lines(d$x, d$y, col = col, lwd = 2)\n  }\n\n  curve(dnorm(x), add = TRUE, lty = 2, lwd = 2)\n  legend(\"topright\", legend = c(as.character(values), \"True Density\"),\n         col = c(if (type == \"kernel\") colors_k else colors_b, \"black\"),\n         lwd = 2, lty = c(rep(1, length(values)), 2), cex = 0.8)\n  
rug(x)\n}\n\nplot_kde_comparison(kernels, \"Kernels\", type = \"kernel\")\nplot_kde_comparison(bandwidths, \"Bandwidths\", type = \"bandwidth\")<\/code><\/pre>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1hXaCCMRtW1A1Xs-lzld27g.png?ssl=1\" alt=\"\" class=\"wp-image-603943\"><\/figure>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1Ym7gc54lmN7bxAC2oG6U2Q.png?ssl=1\" alt=\"\" class=\"wp-image-603944\"><\/figure>\n<p class=\"wp-block-paragraph\">We see that the KDEs for the standard Gaussian with various kernels are relatively similar, compared to the KDEs produced with various bandwidths.<\/p>\n<h3 class=\"wp-block-heading\">Accuracy of the\u00a0KDE<\/h3>\n<p class=\"wp-block-paragraph\">Let\u2019s examine how accurately the KDE estimates the true density, <em>p<\/em>(\u22c5). As we did with the histogram estimator, we can decompose its mean squared error into its bias and variance terms. 
For details on how to derive these bias and variance terms, check out lecture 6 of <a href=\"https:\/\/faculty.washington.edu\/yenchic\/18W_stat425.html\" target=\"_blank\" rel=\"noreferrer noopener\">these notes<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">The bias and variance of the KDE at <em>x<\/em> can be expressed as follows.<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1D8DIww0hGp7tULAwoQfGaQ.png?ssl=1\" alt=\"\" class=\"wp-image-603945\"><figcaption class=\"wp-element-caption\">Asymptotic bias and variance of the\u00a0KDE.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Intuitively, these results give us the following insights:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The effect of <em>K<\/em>(\u22c5) on the accuracy of the KDE is primarily captured via the term \u03c3<sup>2<\/sup><sub>K<\/sub> = \u222b<em>K<\/em>(<em>u<\/em>)<sup>2<\/sup><em>du<\/em>. Among kernels with a fixed variance, the Epanechnikov kernel minimizes this integral, so theoretically it should produce the optimal KDE. However, we\u2019ve seen that the choice of kernel has little practical impact on the KDE relative to its bandwidth. Additionally, the Epanechnikov kernel has a bounded support interval ([\u22121, 1]). As a result, it may produce rougher density estimates relative to kernels that are nonzero across the entire real line (ex: Gaussian). Thus, the Gaussian kernel is commonly used in practice.<\/li>\n<li class=\"wp-block-list-item\">Recall that the asymptotic bias and variance of the histogram estimator as <em>h<\/em> \u2192 0 were <em>O<\/em>(<em>h<\/em>) and <em>O<\/em>(1\/(<em>nh<\/em>)), respectively. Comparing these against the KDE, whose bias is <em>O<\/em>(<em>h<\/em><sup>2<\/sup>) and whose variance is likewise <em>O<\/em>(1\/(<em>nh<\/em>)), tells us that <em>the KDE improves upon the histogram density estimator primarily through decreased asymptotic bias<\/em>. 
This is expected: the kernel smoothly varies the weight of the neighboring points of <em>x<\/em> when computing the pointwise density at <em>x<\/em>, instead of assigning uniform density to arbitrary fixed intervals of the domain. In other words, the KDE imposes a less rigid structure on the density estimate compared to the histogram approach.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">For histograms and KDEs, we\u2019ve seen that the bandwidth <em>h<\/em> can have a significant impact on the accuracy of the density estimate. Ideally, we would select the <em>h<\/em> that minimizes the mean squared error of the density estimator. However, it turns out that this theoretically optimal <em>h<\/em> depends on the curvature of the true density <em>p(\u22c5)<\/em>, which is unknown in practice (otherwise we wouldn\u2019t need density estimation)!<\/p>\n<p class=\"wp-block-paragraph\">Some popular approaches for bandwidth selection include:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Assuming the true density resembles some reference distribution <em>p<sub>0<\/sub><\/em>(\u22c5) (ex: Gaussian), then plugging in the curvature of <em>p<sub>0<\/sub><\/em>(\u22c5) to derive the bandwidth. <a href=\"https:\/\/en.wikipedia.org\/wiki\/Kernel_density_estimation#A_rule-of-thumb_bandwidth_estimator\" target=\"_blank\" rel=\"noreferrer noopener\">This approach<\/a> is simple, but it assumes the distribution of the data, so it may be a poor choice if you\u2019re looking to build density estimates to <em>explore<\/em> your data.<\/li>\n<li class=\"wp-block-list-item\">Non-parametric approaches to bandwidth selection, such as cross-validation and plug-in methods. 
The <a href=\"https:\/\/academic.oup.com\/biomet\/article-abstract\/71\/2\/353\/233423?redirectedFrom=fulltext&amp;login=false\" target=\"_blank\" rel=\"noreferrer noopener\">unbiased cross-validation<\/a> and <a href=\"https:\/\/academic.oup.com\/jrsssb\/article\/53\/3\/683\/7028194?login=false\" target=\"_blank\" rel=\"noreferrer noopener\">Sheather-Jones<\/a> methods are popular bandwidth selectors and typically produce fairly accurate density estimates.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">For more information on the impact of bandwidth selection on the KDE, check out this <a href=\"https:\/\/aakinshin.net\/posts\/kde-bw\/\" rel=\"noreferrer noopener\" target=\"_blank\">blog post<\/a>.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">set.seed(42)\n\n# Simulate data: a bimodal distribution\nx &lt;- c(rnorm(150, mean = -2), rnorm(150, mean = 2))\n\n# Define true density\ntrue_density &lt;- function(x) {\n  0.5 * dnorm(x, mean = -2, sd = 1) + \n  0.5 * dnorm(x, mean = 2, sd = 1)\n}\n\n# Create plotting range\nx_grid &lt;- seq(min(x) - 1, max(x) + 1, length.out = 500)\nxlim &lt;- range(x_grid)\nylim &lt;- c(0, max(true_density(x_grid)) * 1.2)\n\n# Base plot\nplot(NULL, xlim = xlim, ylim = ylim,\n     main = \"KDE: Various Bandwidth Selection Methods\",\n     xlab = \"x\", ylab = \"Density\")\n\n# KDE with different bandwidths\nlines(density(x), col = \"red\", lwd = 2, lty = 4)\nh_scott &lt;- 1.06 * sd(x) * length(x)^(-1\/5)\nlines(density(x, bw = h_scott), col = \"blue\", lwd = 2, lty = 2)\nlines(density(x, bw = bw.ucv(x)), col = \"darkgreen\", lwd = 2, lty = 3)\nlines(density(x, bw = bw.SJ(x)), col = \"purple\", lwd = 2, lty = 4)\n\n# True density\nlines(x_grid, true_density(x_grid), col = \"black\", lwd = 2)\n\n# Add legend (lty values match the line types used in the lines() calls above)\nlegend(\"topright\",\n       legend = c(\"Silverman (Default)\", \"Scott's Rule\", \"Unbiased CV\",\n                  \"Sheather-Jones\", \"True Density\"),\n       col = c(\"red\", \"blue\", \"darkgreen\", \"purple\", \"black\"),\n       lty = c(4, 2, 3, 4, 1), lwd = 2, cex = 0.8)<\/code><\/pre>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1qTBOesPOeDPNOioL4iv0hw.png?ssl=1\" alt=\"\" class=\"wp-image-603950\"><figcaption class=\"wp-element-caption\">KDEs using various bandwidth selection methods where the underlying data follows a bimodal distribution. Notice the KDEs using the Sheather-Jones and Unbiased Cross-Validation methods produce density estimates closest to the true\u00a0density.<\/figcaption><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"density-for-classification\">Density Estimation for Classification<\/h2>\n<p class=\"wp-block-paragraph\">We\u2019ve covered a great deal of the underlying theory of histograms and KDEs, and we\u2019ve demonstrated how they perform at modeling the true density of some sample data. 
Now, we\u2019ll look at how we can apply what we learned about density estimation to a simple classification task.<\/p>\n<p class=\"wp-block-paragraph\">For instance, say we want to build a classifier from a sample of <em>n<\/em> observations (<em>x<\/em><sub>1<\/sub>, <em>y<sub>1<\/sub><\/em>),\u2026, (<em>x<\/em><sub>n<\/sub>, <em>y<\/em><sub>n<\/sub>), where each <em>x<sub>i<\/sub><\/em> comes from a <em>p<\/em>-dimensional feature space, <em>X<\/em>, and each <em>y<\/em><sub>i<\/sub> is a target label drawn from <em>Y<\/em> = {1,\u2026, <em>m<\/em>}.<\/p>\n<p class=\"wp-block-paragraph\">Intuitively, we want to build a classifier that assigns each observation the class label <em>k<\/em> satisfying the following.<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1M27v2hf9NJTb_3P4xHf49g.png?ssl=1\" alt=\"\" class=\"wp-image-603946\"><\/figure>\n<p class=\"wp-block-paragraph\">The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Bayes_classifier\" target=\"_blank\" rel=\"noreferrer noopener\">Bayes classifier<\/a> does precisely that, and computes the conditional probability above using the following equation.<\/p>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1uCr3fMQWKSjf_Jy7KfUTxw.png?ssl=1\" alt=\"\" class=\"wp-image-603947\"><figcaption class=\"wp-element-caption\">The Bayes Classifier<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">This classifier relies on the following:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\u03c0<sub>k<\/sub> = P(<em>Y<\/em> = <em>k<\/em>): the prior probability that an observation (<em>x<\/em><sub>i<\/sub>, <em>y<sub>i<\/sub><\/em>) belongs to the <em>k<\/em>th class 
(i.e. <em>y<sub>i<\/sub> <\/em>= <em>k<\/em>). This can be estimated by simply counting the proportion of points in each class from our sample data.<\/li>\n<li class=\"wp-block-list-item\">\n<em>f<sub>k<\/sub><\/em>(<em>x<\/em>) \u2261 P(<em>X<\/em> = <em>x<\/em> | <em>Y<\/em> = <em>k<\/em>): the <em>p<\/em>-dimensional density function of <em>X<\/em> for all observations in target class <em>k<\/em>. This is harder to estimate: for each of the <em>m<\/em> target classes, we must determine the shape of the distribution for each dimension of <em>X<\/em>, and also whether there are any associations between the different dimensions.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The Bayes classifier is <em>optimal<\/em> if the quantities above can be computed precisely. However, this is impossible to achieve in practice when working with a finite sample of data. For more detail behind why the Bayes classifier is optimal, check out <a href=\"https:\/\/mlweb.loria.fr\/book\/en\/bayesclassifier.html\" rel=\"noreferrer noopener\" target=\"_blank\">this site<\/a>.<\/p>\n<p class=\"wp-block-paragraph\"><strong><em>So the question becomes, how can we approximate the Bayes classifier?<\/em><\/strong><\/p>\n<p class=\"wp-block-paragraph\">One popular method is the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Naive_Bayes_classifier\" rel=\"noreferrer noopener\" target=\"_blank\">Naive Bayes classifier<\/a>. Naive Bayes assumes class-conditional independence, which means that for each target class, it reduces the <em>p<\/em>-dimensional density estimation problem into <em>p<\/em> separate univariate density estimation tasks. These univariate densities may be estimated parametrically or non-parametrically. 
A typical parametric approach would assume that each dimension of <em>X<\/em> follows a univariate Gaussian distribution with class-specific mean and variance (which amounts to a multivariate Gaussian with a diagonal covariance matrix for each class), whereas a non-parametric approach may model each dimension of <em>X<\/em> using a histogram or KDE.<\/p>\n<p class=\"wp-block-paragraph\">The parametric approach to univariate density estimation in Naive Bayes may be useful when we have a small amount of data relative to the size of the feature space, as the bias introduced by the Gaussian assumption may help reduce the variance of the classifier. However, the Gaussian assumption may not always be appropriate depending on the distribution of data that you\u2019re working with.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s examine how parametric vs. non-parametric density estimates can impact the decision boundary of the Naive Bayes classifier. We\u2019ll build two classifiers on the <a href=\"https:\/\/archive.ics.uci.edu\/dataset\/53\/iris\" rel=\"noreferrer noopener\" target=\"_blank\">Iris dataset<\/a>: one of them will assume each feature follows a Gaussian distribution, and the other will build kernel density estimates for each feature.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">library(naivebayes)  # naive_bayes()\nlibrary(ggplot2)     # decision boundary plots\n\n# Assumes iris2D (Sepal.Length, Petal.Length and Species from iris) and its\n# train\/test split were created beforehand; that setup is not shown here\n\n# Parametric Naive Bayes\nparam_nb &lt;- naive_bayes(Species ~ ., data = train)\n\n# Nonparametric Naive Bayes\n# KDE with Gaussian kernel and Sheather-Jones bandwidth\nnonparam_nb &lt;- naive_bayes(Species ~ ., data = train, \n                           usekernel = TRUE, \n                           kernel=\"gaussian\",\n                           bw=\"sj\") # play with bandwidth to see how it affects the classification boundaries!\n\n# Create grid for plotting decision boundaries\nx_seq &lt;- seq(min(iris2D$Sepal.Length), max(iris2D$Sepal.Length), length.out = 200)\ny_seq &lt;- seq(min(iris2D$Petal.Length), max(iris2D$Petal.Length), length.out = 200)\ngrid &lt;- expand.grid(Sepal.Length = x_seq, Petal.Length = y_seq)\n\n# 
Predict class for each point on grid\ngrid$param_pred &lt;- predict(param_nb, grid)\ngrid$nonparam_pred &lt;- predict(nonparam_nb, grid)\n\n# Plot decision boundaries\nnb_parametric &lt;- ggplot() +\n  geom_tile(data = grid, aes(x = Sepal.Length, y = Petal.Length, fill = param_pred), alpha = 0.3) +\n  geom_point(data = train, aes(x = Sepal.Length, y = Petal.Length, color = Species), size = 2) +\n  ggtitle(\"Parametric Naive Bayes Decision Boundary\") +\n  theme_minimal()\n\nnb_nonparametric &lt;- ggplot() +\n  geom_tile(data = grid, aes(x = Sepal.Length, y = Petal.Length, fill = nonparam_pred), alpha = 0.3) +\n  geom_point(data = train, aes(x = Sepal.Length, y = Petal.Length, color = Species), size = 2) +\n  ggtitle(\"Nonparametric Naive Bayes Decision Boundary\") +\n  theme_minimal()\n\nnb_parametric\nnb_nonparametric<\/code><\/pre>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1IP5AUfh_BFoMfeVuHgYceA.png?ssl=1\" alt=\"\" class=\"wp-image-603951\"><figcaption class=\"wp-element-caption\">Decision boundaries produced by the parametric Naive Bayes classifier.<\/figcaption><\/figure>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1wF8RRWcWeznXZBpNG2kUpQ.png?ssl=1\" alt=\"\" class=\"wp-image-603948\"><figcaption class=\"wp-element-caption\">Decision boundaries produced by the non-parametric Naive Bayes classifier. 
Notice the rough decision boundaries relative to those of its parametric counterpart.<\/figcaption><\/figure>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">library(caret)  # confusionMatrix()\n\n# Parametric Naive Bayes prediction on test data\nparam_pred &lt;- predict(param_nb, newdata = test)\n\n# Non-parametric Naive Bayes prediction on test data\nnonparam_pred &lt;- predict(nonparam_nb, newdata = test)\n\n# Create confusion matrices\nparam_cm &lt;- confusionMatrix(param_pred, test$Species)\nnonparam_cm &lt;- confusionMatrix(nonparam_pred, test$Species)\n\noutput &lt;- capture.output({\n  # Print confusion matrices\n  cat(\"\\n=== Parametric Naive Bayes Metrics ===\\n\")\n  print(param_cm$table)\n  cat(\"Parametric Naive Bayes Accuracy: \", param_cm$overall['Accuracy'], \"\\n\\n\")\n  \n  cat(\"=== Non-parametric Naive Bayes Metrics ===\\n\")\n  print(nonparam_cm$table)\n  cat(\"Nonparametric Naive Bayes Accuracy: \", nonparam_cm$overall['Accuracy'], \"\\n\")\n})\ncat(paste(output, collapse = \"\\n\"))<\/code><\/pre>\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1rehl4HqaQJk1nP7ET3IOmg.png?ssl=1\" alt=\"\" class=\"wp-image-603952\"><figcaption class=\"wp-element-caption\">Classification performance for both Naive Bayes models. Non-parametric Naive Bayes achieved slightly better performance on our\u00a0data.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We see that the non-parametric Naive Bayes classifier achieves slightly better accuracy than its parametric counterpart. This is because the non-parametric density estimates produce a classifier with a more flexible decision boundary. 
As a result, several of the \u201cvirginica\u201d observations that were incorrectly classified as \u201cversicolor\u201d by the parametric classifier ended up being classified correctly by the non-parametric model.<\/p>\n<p class=\"wp-block-paragraph\">That being said, the decision boundaries produced by non-parametric Naive Bayes appear to be rough and disconnected. Thus, there are some regions of the feature space where the classification boundary may be questionable and may fail to generalize well to new data. In contrast, the parametric Naive Bayes classifier produces smooth, connected decision boundaries that appear to accurately capture the general pattern of the feature distributions for each species.<\/p>\n<p class=\"wp-block-paragraph\">This distinction brings up the important point that \u201cmore flexible density estimation\u201d does not equate to \u201cbetter density estimation\u201d, especially when applied to classification. After all, there\u2019s a reason why Naive Bayes classification is popular. Although making fewer assumptions about the distribution of your data may seem desirable for producing unbiased density estimates, simplifying assumptions may be effective when there is insufficient empirical data to produce high quality estimates, or if the parametric assumptions are believed to be mostly accurate. In the latter case, parametric estimation will introduce little to no bias to the estimator, whereas non-parametric approaches may introduce large amounts of variance.<\/p>\n<p class=\"wp-block-paragraph\">Indeed, looking at the feature distributions below, the Gaussian assumption of parametric Naive Bayes doesn\u2019t seem inappropriate. 
For the most part, the class distributions for petal and sepal length appear to be unimodal and symmetric.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-r\">library(tidyr)    # pivot_longer()\nlibrary(ggplot2)  # density plots\n\niris_long &lt;- pivot_longer(iris, cols = c(Sepal.Length, Petal.Length), names_to = \"Feature\", values_to = \"Value\")\n\nggplot(iris_long, aes(x = Value, fill = Species)) +\n  geom_density(alpha = 0.5, bw=\"sj\") +\n  facet_wrap(~ Feature, scales = \"free\") +\n  labs(title = \"Distribution of Sepal and Petal Lengths by Species\", x = \"Length (cm)\", y = \"Density\") +\n  theme_minimal()<\/code><\/pre>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/1pDoiUI53Eaj7CrPI6ynQgA.png?ssl=1\" alt=\"\" class=\"wp-image-603949\"><figcaption class=\"wp-element-caption\">Density distributions for Petal and Sepal length. The univariate densities appear to be unimodal and symmetric across all species for both features.<\/figcaption><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"wrap-up\">Wrap-up<\/h2>\n<p class=\"wp-block-paragraph\">Thanks for reading! We dove into the theory behind the histogram and kernel density estimators and how to apply them in context.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s briefly summarize what we discussed:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Density estimation is a fundamental tool in <a href=\"https:\/\/towardsdatascience.com\/tag\/statistical-analysis\/\" title=\"Statistical Analysis\">Statistical Analysis<\/a> for analyzing the distribution of a variable or as an intermediate tool for deeper statistical analysis. 
Density estimation approaches may be broadly categorized as parametric or non-parametric.<\/li>\n<li class=\"wp-block-list-item\">Histograms and KDEs are two popular approaches for non-parametric density estimation. Histograms produce density estimates by computing the normalized frequency of points within each distinct bin of the data. KDEs are \u201csmoothed\u201d histograms that estimate the density at a given point by computing a weighted sum over its surrounding points, where closer neighbors receive larger weights.<\/li>\n<li class=\"wp-block-list-item\">Non-parametric density estimation can be applied to classification algorithms that require modeling the feature densities for each target class (Bayesian classification). Classifiers built using non-parametric density estimates may be able to define more flexible decision boundaries at the cost of higher variance.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Check out the sources below if you\u2019re interested in learning more!<\/p>\n<p class=\"wp-block-paragraph\"><em>The author has created all images in this article.<\/em><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"sources\">Sources<\/h2>\n<p class=\"wp-block-paragraph\">Learning Resources:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Garc\u00eda-Portugu\u00e9s, E. (2025).\u00a0<em>Notes for Nonparametric Statistics<\/em>. Version 6.12.0. ISBN 978-84-09-29537-1. Available at\u00a0<a href=\"https:\/\/bookdown.org\/egarpor\/NP-UC3M\/\">https:\/\/bookdown.org\/egarpor\/NP-UC3M\/<\/a>.<\/li>\n<li class=\"wp-block-list-item\">Garc\u00eda-Portugu\u00e9s, E. (2022).\u00a0<em>A Short Course on Nonparametric Curve Estimation<\/em>. Version 2.1.1. 
Available at\u00a0<a href=\"https:\/\/bookdown.org\/egarpor\/NP-EAFIT\/\">https:\/\/bookdown.org\/egarpor\/NP-EAFIT\/<\/a>.<\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/faculty.washington.edu\/yenchic\/18W_stat425.html\" target=\"_blank\" rel=\"noreferrer noopener\">UW Stat 425: Introduction to Nonparametric Statistics (Winter 2018)<\/a><\/li>\n<li class=\"wp-block-list-item\">James et al., <a href=\"https:\/\/www.statlearning.com\/\">An Introduction to Statistical Learning<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Hastie et al., <a href=\"https:\/\/hastie.su.domains\/ElemStatLearn\/\">The Elements of Statistical Learning<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Andrey Akinshin, <a href=\"https:\/\/aakinshin.net\/posts\/kde-bw\/\" target=\"_blank\" rel=\"noreferrer noopener\">The importance of kernel density estimation bandwidth<\/a>\n<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Datasets:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Fisher, R. (1936). Iris [Dataset]. UCI Machine Learning Repository. https:\/\/doi.org\/10.24432\/C56C76. (CC BY 4.0)<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/non-parametric-density-estimation-theory-and-applications\/\">Non-Parametric Density Estimation: Theory and Applications<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Jimin Kang<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/non-parametric-density-estimation-theory-and-applications\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Non-Parametric Density Estimation: Theory and Applications In this article, we\u2019ll talk about what Density Estimation is and the role it plays in statistical analysis. 
We\u2019ll analyze two popular density estimation methods, histograms and kernel density estimators, and analyze their theoretical properties as well as how they perform in practice. Finally, we\u2019ll look at how density [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,2664,240,2665,2666,968,238],"tags":[1502,374,41],"class_list":["post-3801","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-density-estimation","category-editors-pick","category-histograms","category-kernel-density-estimation","category-statistical-analysis","category-statistics","tag-density","tag-estimation","tag-what"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3801"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3801"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3801\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3801"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3801"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3801"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}