{"id":839,"date":"2024-12-27T07:02:33","date_gmt":"2024-12-27T07:02:33","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/27\/how-neural-networks-learn-a-probabilistic-viewpoint-0f6a78dc58e2\/"},"modified":"2024-12-27T07:02:33","modified_gmt":"2024-12-27T07:02:33","slug":"how-neural-networks-learn-a-probabilistic-viewpoint-0f6a78dc58e2","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/27\/how-neural-networks-learn-a-probabilistic-viewpoint-0f6a78dc58e2\/","title":{"rendered":"How Neural Networks Learn: A Probabilistic Viewpoint"},"content":{"rendered":"<p>    How Neural Networks Learn: A Probabilistic Viewpoint<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Understanding loss functions for training neural\u00a0networks<\/h4>\n<p>Machine learning is very hands-on, and everyone charts their own path. There isn\u2019t a standard set of courses to follow, as was traditionally the case. There\u2019s no \u2018Machine Learning 101,\u2019 so to speak. However, this sometimes leaves gaps in understanding. If you\u2019re like me, these gaps can feel uncomfortable. For instance, I used to be bothered by things we do casually, like the choice of a loss function. I admit that some practices are learned through heuristics and experience, but most concepts are rooted in solid mathematical foundations. Of course, not everyone has the time or motivation to dive deeply into those foundations\u200a\u2014\u200aunless you\u2019re a researcher.<\/p>\n<p>I have attempted to present some basic ideas on how to approach a machine learning problem. Understanding this background will help practitioners feel more confident in their design choices. The concepts I covered\u00a0include:<\/p>\n<ul>\n<li>Quantifying the difference in probability distributions using cross-entropy.<\/li>\n<li>A probabilistic view of neural network\u00a0models.<\/li>\n<li>Deriving and understanding the loss functions for different applications.<\/li>\n<\/ul>\n<h3>Entropy<\/h3>\n<p>In information theory, entropy is a measure of the uncertainty associated with the values of a random variable. In other words, it is used to quantify the spread of distribution. The narrower the distribution the lower the entropy and vice versa. Mathematically, entropy of distribution <strong><em>p(x)<\/em><\/strong> is defined\u00a0as;<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ad-PFEvxfWbxlxJMTl0q11Q.png?ssl=1\"><\/figure>\n<p>It is common to use log with the base 2 and in that case entropy is measured in bits. The figure below compares two distributions: the blue one with high entropy and the orange one with low\u00a0entropy.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/500\/1%2ADA0iEWJkLdAgtTGlfPB2aA.png?ssl=1\"><figcaption>Visualization examples of distributions having high and low entropy\u200a\u2014\u200acreated by the author using\u00a0Python.<\/figcaption><\/figure>\n<p>We can also measure entropy between two distributions. For example, consider the case where we have observed some data having the distribution <strong><em>p(x)<\/em><\/strong> and a distribution <strong><em>q(x)<\/em><\/strong> that could potentially serve as a model for the observed data. In that case we can compute cross-entropy <strong><em>Hpq\u200b(X)<\/em><\/strong> between data distribution <strong><em>p(x)<\/em><\/strong> and the model distribution <strong><em>q(x)<\/em><\/strong>. Mathematically cross-entropy is written as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A-F7o-cUKOs06JjLdoc6IBA.png?ssl=1\"><\/figure>\n<p>Using cross entropy we can compare different models and the one with lowest cross entropy is better fit to the data. This is depicted in the contrived example in the following figure. We have two candidate models and we want to decide which one is better model for the observed data. As we can see the model whose distribution exactly matches that of the data has lower cross entropy than the model that is slightly\u00a0off.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/800\/1%2AVlUMj6R5fvLPGngx19Mrhg.png?ssl=1\"><figcaption>Comparison of cross entropy of data distribution p(x) with two candidate models. (a) candidate model exactly matches data distribution and has low cross entropy. (b) candidate model does not match the data distribution hence it has high cross entropy\u200a\u2014\u200acreated by the author using\u00a0Python.<\/figcaption><\/figure>\n<p>There is another way to state the same thing. As the model distribution deviates from the data distribution cross entropy increases. While trying to fit a model to the data i.e. training a machine learning model, we are interested in minimizing this deviation. This increase in cross entropy due to deviation from the data distribution is defined as relative entropy commonly known as <strong>Kullback-Leibler Divergence<\/strong> of simply <strong>KL-Divergence.<\/strong><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AtG7x9Hopkn9IJPIcnYlNww.png?ssl=1\"><\/figure>\n<p>Hence, we can quantify the divergence between two probability distributions using cross-entropy or KL-Divergence. To train a model we can adjust the parameters of the model such that they minimize the cross-entropy or KL-Divergence. Note that minimizing cross-entropy or KL-Divergence achieves the same solution. KL-Divergence has a better interpretation as its minimum is zero, that will be the case when the model exactly matches the\u00a0data.<\/p>\n<p>Another important consideration is how do we pick the model distribution? This is dictated by two things: the problem we are trying to solve and our preferred approach to solving the problem. Let\u2019s take the example of a classification problem where we have <strong><em>(X, Y)<\/em><\/strong> pairs of data, with <strong><em>X<\/em><\/strong> representing the input features and <strong><em>Y<\/em><\/strong> representing the true class labels. We want to train a model to correctly classify the inputs. There are two ways we can approach this\u00a0problem.<\/p>\n<h3>Discriminative vs Generative<\/h3>\n<p>The generative approach refers to modeling the joint distribution <strong><em>p(X,Y)<\/em><\/strong> such that it learns the data-generating process, hence the name \u2018generative\u2019. In the example under discussion, the model learns the prior distribution of class labels <strong><em>p(Y)<\/em><\/strong> and for given class label <strong><em>Y<\/em><\/strong>, it learns to generate features <strong><em>X<\/em><\/strong> using\u00a0<strong><em>p(X|Y)<\/em><\/strong>.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ATdX5wvjj_bbJImN8uu97CA.png?ssl=1\"><\/figure>\n<p>It should be clear that the learned model is capable of generating new data <strong><em>(X,Y)<\/em><\/strong>. However, what might be less obvious is that it can also be used to classify the given features <strong><em>X<\/em><\/strong> using Bayes\u2019 Rule, though this may not always be feasible depending on the model\u2019s complexity. Suffice it to say that using this for a task like classification might not be a good idea, so we should instead take the direct approach.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A4OX-X97JbRoqqCRO30DnTg.png?ssl=1\"><figcaption>Discriminative vs generative approach of modelling\u200a\u2014\u200acreated by the author using\u00a0Python.<\/figcaption><\/figure>\n<p>Discriminative approach refers to modelling the relationship between input features <strong><em>X<\/em><\/strong> and output labels <strong><em>Y<\/em><\/strong> directly i.e. modelling the conditional distribution <strong><em>p(Y|X)<\/em><\/strong>. The model thus learnt need not capture the details of features <strong><em>X<\/em><\/strong> but only the class discriminatory aspects of it. As we saw earlier, it is possible to learn the parameters of the model by minimizing the cross-entropy between observed data and model distribution. The cross-entropy for a discriminative model can be written\u00a0as:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A_uMcEEKL5M8jDtQWQsN8PQ.png?ssl=1\"><\/figure>\n<p>Where the right most sum is the sample average and it approximates the expectation w.r.t data distribution. Since our learning rule is to minimize the cross-entropy, we can call it our general loss function.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ACQLRP6puQTQR9N80VeHHWg.png?ssl=1\"><\/figure>\n<p>Goal of learning (training the model) is to minimize this loss function. Mathematically, we can write the same statement as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AliWvXE1Qgy-5YYOXNKb3hw.png?ssl=1\"><\/figure>\n<p>Let\u2019s now consider specific examples of discriminative models and apply the general loss function to each\u00a0example.<\/p>\n<h3>Binary Classification<\/h3>\n<p>As the name suggests, the class label <strong><em>Y<\/em><\/strong> for this kind of problem is either <strong><em>0<\/em><\/strong> or <strong><em>1<\/em><\/strong>. That could be the case for a face detector, or a cat vs dog classifier or a model that predicts the presence or absence of a disease. How do we model a binary random variable? That\u2019s right\u200a\u2014\u200ait\u2019s a Bernoulli random variable. The probability distribution for a Bernoulli variable can be written as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A_5sGkYROdxFMQRz3FCvD4Q.png?ssl=1\"><\/figure>\n<p>where <strong><em>\u03c0<\/em><\/strong> is the probability of getting <strong><em>1<\/em><\/strong> i.e. <strong><em>p(Y=1) =\u00a0\u03c0<\/em><\/strong>.<\/p>\n<p>Since we want to model <strong><em>p(Y|X)<\/em><\/strong>, let\u2019s make <strong><em>\u03c0<\/em><\/strong> a function of <strong><em>X<\/em><\/strong> i.e. output of our model <strong><em>\u03c0(X) <\/em><\/strong>depends on input features <strong><em>X<\/em><\/strong>. In other words, our model takes in features <strong><em>X<\/em><\/strong> and predicts the probability of <strong><em>Y<\/em><\/strong>=1. Please note that in order to get a valid probability at the output of the model, it has to be constrained to be a number between <strong><em>0<\/em><\/strong> and <strong><em>1<\/em><\/strong>. This is achieved by applying a sigmoid non-linearity at the\u00a0output.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AuFY-AXCtCtMWzexW9lYEgg.png?ssl=1\"><\/figure>\n<p>To simplify, let\u2019s rewrite this explicitly in terms of true label and predicted label as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ApECYIPnobLfMrUZ8r0_LNQ.png?ssl=1\"><\/figure>\n<p>We can write the general loss function for this specific conditional distribution as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2APsdQCo2Fkgmi76Jsn_3Q2w.png?ssl=1\"><\/figure>\n<p>This is the commonly referred to as binary cross entropy (BCE)\u00a0loss.<\/p>\n<h3>Multi-class Classification<\/h3>\n<p>For a multi-class problem, the goal is to predict a category from <strong><em>C<\/em><\/strong> classes for each input feature <strong><em>X<\/em><\/strong>.<strong> <\/strong>In this case we can model the output <strong><em>Y<\/em><\/strong> as a categorical random variable, a random variable that takes on a state c out of all possible <strong><em>C<\/em><\/strong> states. As an example of categorical random variable, think of a six-faced die that can take on one of six possible states with each\u00a0roll.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ATKWhZBtYECC38iw0XmHnyw.png?ssl=1\"><\/figure>\n<p>We can see the above expression as easy extension of the case of binary random variable to a random variable having multiple categories. We can model the conditional distribution <strong><em>p(Y|X)<\/em><\/strong> by making <strong><em>\u03bb<\/em><\/strong>\u2019s as function of input features <strong><em>X<\/em><\/strong>. Based on this, let\u2019s we write the conditional categorical distribution of <strong><em>Y<\/em><\/strong> in terms of predicted probabilities as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AU2nB4L073Gqo_aDYntowWQ.png?ssl=1\"><\/figure>\n<p>Using this conditional model distribution we can write the loss function using the general loss function derived earlier in terms of cross-entropy as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A6hh5gPAocZUY0nlswNGJSg.png?ssl=1\"><\/figure>\n<p>This is referred to as Cross-Entropy loss in PyTorch. The thing to note here is that I have written this in terms of predicted probability of each class. In order to have a valid probability distribution over all <strong><em>C <\/em><\/strong>classes, a softmax non-linearity is applied at the output of the model. Softmax function is written as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AKYzACij6q0eZjEFdQPRNmQ.png?ssl=1\"><\/figure>\n<h3>Regression<\/h3>\n<p>Consider the case of data <strong><em>(X, Y)<\/em><\/strong> where <strong><em>X<\/em><\/strong> represents the input features and <strong><em>Y<\/em><\/strong> represents output that can take on any real number value. Since <strong><em>Y<\/em><\/strong> is real valued, we can model the its distribution using a Gaussian distribution.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AlyPl2MaqiBNKuvRT3hpqwA.png?ssl=1\"><\/figure>\n<p>Again, since we are interested in modelling the conditional distribution <strong><em>p(Y|X).<\/em><\/strong> We can capture the dependence on <strong><em>X<\/em><\/strong> by making the conditional mean of <strong><em>Y<\/em><\/strong> a function of <strong><em>X<\/em><\/strong>. For simplicity, we set variance equal to 1. The conditional distribution can be written as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A4wWmieiF7KjfkqID3ENNEg.png?ssl=1\"><\/figure>\n<p>We can now write our general loss function for this conditional model distribution as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AWOQISJP6WizaiuRZEJHepA.png?ssl=1\"><\/figure>\n<p>This is the famous MSE loss for training the regression model. Note that the constant factor is irrelevant here as we are only interest in finding the location of minima and can be\u00a0dropped.<\/p>\n<h3>Summary<\/h3>\n<p>In this short article, I introduced the concepts of entropy, cross-entropy, and KL-Divergence. These concepts are essential for computing similarities (or divergences) between distributions. By using these ideas, along with a probabilistic interpretation of the model, we can define the general loss function, also referred to as the objective function. Training the model, or \u2018learning,\u2019 then boils down to minimizing the loss with respect to the model\u2019s parameters. This optimization is typically carried out using gradient descent, which is mostly handled by deep learning frameworks like PyTorch. Hope this helps\u200a\u2014\u200ahappy learning!<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/medium.com\/_\/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=0f6a78dc58e2\" width=\"1\" height=\"1\" alt=\"\"><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/how-neural-networks-learn-a-probabilistic-viewpoint-0f6a78dc58e2\">How Neural Networks Learn: A Probabilistic Viewpoint<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Bilal Ahmed<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fhow-neural-networks-learn-a-probabilistic-viewpoint-0f6a78dc58e2\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How Neural Networks Learn: A Probabilistic Viewpoint Understanding loss functions for training neural\u00a0networks Machine learning is very hands-on, and everyone charts their own path. There isn\u2019t a standard set of courses to follow, as was traditionally the case. There\u2019s no \u2018Machine Learning 101,\u2019 so to speak. However, this sometimes leaves gaps in understanding. If you\u2019re [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,963,962,965,70,964],"tags":[582,966,118],"class_list":["post-839","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-binary-cross-entropy","category-cross-entropy-loss","category-kl-divergence","category-machine-learning","category-probabilistic-models","tag-distribution","tag-entropy","tag-neural"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/839"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=839"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/839\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=839"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=839"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=839"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}