{"id":970,"date":"2025-01-05T07:01:16","date_gmt":"2025-01-05T07:01:16","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/05\/mastering-the-basics-how-linear-regression-unlocks-the-secrets-of-complex-models-8aa33920c105\/"},"modified":"2025-01-05T07:01:16","modified_gmt":"2025-01-05T07:01:16","slug":"mastering-the-basics-how-linear-regression-unlocks-the-secrets-of-complex-models-8aa33920c105","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/05\/mastering-the-basics-how-linear-regression-unlocks-the-secrets-of-complex-models-8aa33920c105\/","title":{"rendered":"Mastering the Basics: How Linear Regression Unlocks the Secrets of Complex Models"},"content":{"rendered":"<p>    Mastering the Basics: How Linear Regression Unlocks the Secrets of Complex Models<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Full explanation on Linear Regression and how it\u00a0learns<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/600\/1%2AiFAapeElplSxnySU4jufkA.jpeg?ssl=1\"><figcaption>The Crane Stance. Public Domain image from <a href=\"https:\/\/openverse.org\/image\/dfa430e5-882e-4758-ba73-3248fcfe9464?q=karate+kid&amp;p=11\">Openverse<\/a><\/figcaption><\/figure>\n<p>Just like Mr. 
Miyagi taught young Daniel LaRusso karate through repetitive simple chores, which ultimately transformed him into the Karate Kid, mastering foundational algorithms like linear regression lays the groundwork for understanding the most complex AI architectures, such as Deep Neural Networks and\u00a0LLMs.<\/p>\n<p>Through this deep dive into the simple yet powerful linear regression, you will learn many of the fundamental parts that make up the most advanced models built today by billion-dollar companies.<\/p>\n<h3>What is Linear Regression?<\/h3>\n<p>Linear regression is a simple mathematical method used to understand the relationship between a dependent variable and one or more independent variables, and to make predictions. Given some data points, such as those below, linear regression attempts to draw the <strong>line of best fit<\/strong> through these points. It\u2019s the \u201cwax on, wax off\u201d of data\u00a0science.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"An image showing many points on a graph being modelled by linear regression by tracing the line of best fit through those points\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AmqCmcG0OxZFKSgKYk2ks4A.jpeg?ssl=1\"><figcaption>Example of linear regression model on a graph. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>Once this line is drawn, we have a model that we can use to predict new values. In the above example, given a new house size, we could attempt to predict its price with the linear regression model.<\/p>\n<h4>The Linear Regression Formula<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"The formula of linear regression\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A3XDLZ8hTEywEO8cMj3H3qg.jpeg?ssl=1\"><figcaption>Labelled Linear Regression Formula. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p><em>Y<\/em> is the dependent variable, that which you want to calculate\u200a\u2014\u200athe house price in the previous example. 
Its value depends on other variables, hence its\u00a0name.<\/p>\n<p><em>X<\/em> are the independent variables. These are the factors that influence the value of <em>Y<\/em>. When modelling, the independent variables are the input to the model, and what the model spits out is the prediction or\u00a0<em>\u0176<\/em>.<\/p>\n<p>\u03b2 are parameters. We give the name parameter to those values that the model adjusts (or learns) to capture the relationship between the independent variables <em>X<\/em> and the dependent variable <em>Y<\/em>. So, as the model is trained, the input of the model will remain the same, but the parameters will be adjusted to better predict the desired\u00a0output.<\/p>\n<h4>Parameter Learning<\/h4>\n<p>We require a few things to be able to adjust the parameters and achieve accurate predictions.<\/p>\n<ol>\n<li>Training data\u200a\u2014\u200athis data consists of input and output pairs. The inputs are fed into the model and, during training, the parameters are adjusted so that the output approaches the target\u00a0value.<\/li>\n<li>Cost function\u200a\u2014\u200aalso known as the loss function, this is a mathematical function that measures how well a model\u2019s prediction matches the target\u00a0value.<\/li>\n<li>Training algorithm\u200a\u2014\u200aa method used to adjust the parameters of the model to minimise the error as measured by the cost function.<\/li>\n<\/ol>\n<p>Let\u2019s go over a cost function and training algorithm that can be used in linear regression.<\/p>\n<h3>Cost Function: Mean Squared Error\u00a0(MSE)<\/h3>\n<p>MSE is a commonly used cost function in regression problems, where the goal is to predict a continuous value. This is different from classification tasks, such as predicting the next token from a vocabulary, as in Large Language Models. 
MSE focuses on numerical differences and is used in a variety of regression and neural network problems. This is how you calculate it:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"The formula of mean squared error (mse)\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AhIPaHDGmXa08ayV9fKJZQQ.jpeg?ssl=1\"><figcaption>Mean Squared Error (MSE) formula. Image captured by\u00a0Author<\/figcaption><\/figure>\n<ol>\n<li>Calculate the difference between the predicted value, <em>\u0176<\/em>, and the target value,\u00a0<em>Y<\/em>.<\/li>\n<li>Square this difference\u200a\u2014\u200aensuring all errors are positive and also penalising large errors more\u00a0heavily.<\/li>\n<li>Sum the squared differences for all data\u00a0samples.<\/li>\n<li>Divide the sum by the number of samples, <em>n<\/em>, to get the average squared\u00a0error.<\/li>\n<\/ol>\n<p>You will notice that as our prediction gets closer to the target value the MSE gets lower, and the further away they are the larger it grows. In both directions the change is quadratic, because the difference is\u00a0squared.<\/p>\n<h3>Training Algorithm: Gradient\u00a0Descent<\/h3>\n<p>The concept of gradient descent is that we can travel through the \u201ccost space\u201d in small steps, with the objective of arriving at the global minimum\u200a\u2014\u200athe lowest value in the space. The cost function evaluates how well the current model parameters predict the target by giving us the loss value. Randomly modifying the parameters does not guarantee any improvements. But, if we examine the gradient of the loss function with respect to each parameter, i.e. the direction and rate at which the loss changes as that parameter changes, we can adjust the parameters to move towards a lower loss, indicating that our predictions are getting closer to the target\u00a0values.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Labelled graph showing the key concepts of the gradient descent algorithm. 
The local and global minimum, the learning rate and how it makes the position advance towards a lower cost\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/777\/1%2Ap-2XJATmeJvVFDrIbsO3Xg.jpeg?ssl=1\"><figcaption>Labelled graph showing the key concepts of the gradient descent algorithm. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>The steps in gradient descent must be carefully sized to balance progress and precision. If the steps are too large, we risk overshooting the global minimum and missing it entirely. On the other hand, if the steps are too small, the updates will become inefficient and time-consuming, increasing the likelihood of getting stuck in a local minimum instead of reaching the desired global\u00a0minimum.<\/p>\n<h4>Gradient Descent\u00a0Formula<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Labelled gradient descent formula\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AuuQ4gy28WLgV7mFby9ZAkA.jpeg?ssl=1\"><figcaption>Labelled Gradient Descent formula. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>In the context of linear regression, \u03b8 could be <em>\u03b20<\/em> or <em>\u03b21<\/em>. The gradient is the partial derivative of the cost function with respect to \u03b8, or in simpler terms, it is a measure of how much the cost function changes when the parameter \u03b8 is slightly adjusted.<\/p>\n<p>A large gradient indicates that the parameter has a significant effect on the cost function, while a small gradient suggests a minor effect. The sign of the gradient indicates the direction of change for the cost function. A negative gradient means the cost function will decrease as the parameter increases, while a positive gradient means it will increase.<\/p>\n<p>So, in the case of a large negative gradient, what happens to the parameter? 
Well, the negative sign in front of the learning rate will cancel with the negative sign of the gradient, resulting in an addition to the parameter. And since the gradient is large we will be adding a large number to it. So, the parameter is adjusted substantially, reflecting its greater influence on reducing the cost function.<\/p>\n<h3>Practical Example<\/h3>\n<p>Let\u2019s take a look at the prices of the sponges Karate Kid used to wash Mr. Miyagi\u2019s car. If we wanted to predict their price (dependent variable) based on their height and width (independent variables), we could model it using linear regression.<\/p>\n<p>We can start with these three training data\u00a0samples.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Training data for the linear regression example modelling prices of sponges\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ab7qGmrD7x7kyzOFNCwVN7w.jpeg?ssl=1\"><figcaption>Training data for the linear regression example modelling prices of sponges. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>Now, let\u2019s use the Mean Squared Error (MSE) as our cost function <em>J<\/em>, and linear regression as our\u00a0model.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Formula for the cost function derived from MSE and linear regression\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AKZVhncpjY-Dy1-U0_6tiMg.jpeg?ssl=1\"><figcaption>Formula for the cost function derived from MSE and linear regression. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>The linear regression formula uses X1 and X2 for width and height respectively. Notice there are no more independent variables, since our training data doesn\u2019t include any others. That is the assumption we make in this example: that the width and height of the sponge are enough to predict its\u00a0price.<\/p>\n<p>Now, the first step is to initialise the parameters, in this case to 0. 
We can then feed the independent variables into the model to get our predictions, <em>\u0176<\/em>, and check how far these are from our target\u00a0<em>Y.<\/em><\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Step 0 in gradient descent algorithm and the calculation of the mean squared error\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2APRIzcVQmBva4Q6tN9T-7Wg.jpeg?ssl=1\"><figcaption>Step 0 in gradient descent algorithm and the calculation of the mean squared error. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>Right now, as you can imagine, the parameters are not very helpful. But we are now prepared to use the Gradient Descent algorithm to update the parameters into more useful ones. First, we need to calculate the partial derivatives of the cost function with respect to each parameter, which will require some calculus, but luckily we only need to do this once in the whole\u00a0process.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Working out of the partial derivatives of the linear regression parameters.\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AksYVkBfghXzS3EF6X_a-wQ.jpeg?ssl=1\"><figcaption>Working out of the partial derivatives of the linear regression parameters. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>With the partial derivatives, we can substitute in the values from our errors to calculate the gradient for each parameter.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Calculation of parameter gradients\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A1j9dhkJDC-lURNWbRwNtDQ.jpeg?ssl=1\"><figcaption>Calculation of parameter gradients. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>Notice there wasn\u2019t any need to calculate the MSE, as it\u2019s not directly used in the process of updating parameters, only its derivative is. 
It\u2019s also immediately apparent that all gradients are negative, meaning that all parameters can be increased to reduce the cost function. The next step is to update the parameters with a learning rate, which is a hyper-parameter, i.e. a configuration setting in a machine learning model that is specified before the training process begins. Unlike model parameters, which are learned during training, hyper-parameters are set manually and control aspects of the learning process. Here we arbitrarily use\u00a00.01.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Parameter updating in the first iteration of gradient descent\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AQXVxdiF8X7qN5q3BPRcggg.jpeg?ssl=1\"><figcaption>Parameter updating in the first iteration of gradient descent. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>This has been the final step of our first iteration in the process of gradient descent. We can use these new parameter values to make new predictions and recalculate the MSE of our\u00a0model.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Last step in the first iteration of gradient descent, and recalculation of MSE after parameter updates\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AdnmdIQj9yZcC7c6DXQTF-w.jpeg?ssl=1\"><figcaption>Last step in the first iteration of gradient descent, and recalculation of MSE after parameter updates. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>The new parameters bring the predictions closer to the true sponge prices, and have yielded a much lower MSE, but there is a lot more training left to do. If we iterate through the gradient descent algorithm 50 times, this time using Python instead of doing it by hand\u200a\u2014\u200asince Mr. 
Miyagi never said anything about coding\u200a\u2014\u200awe will reach the following values.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Results of some iterations of the gradient descent algorithm, and a graph showing the MSE over the gradient descent steps\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AAq7Nh6lwD37aJZ1WwiPmfA.jpeg?ssl=1\"><figcaption>Results of some iterations of the gradient descent algorithm, and a graph showing the MSE over the gradient descent steps. Image captured by\u00a0Author<\/figcaption><\/figure>\n<p>Eventually we arrived at a pretty good model. The true values I used to generate those numbers were [1, 2, 3] and after only 50 iterations, the model\u2019s parameters came impressively close. Extending the training to 200 steps (the number of steps is another hyper-parameter) with the same learning rate allowed the linear regression model to converge almost perfectly to the true parameters, demonstrating the power of gradient\u00a0descent.<\/p>\n<h3>Conclusions<\/h3>\n<p>Many of the fundamental concepts that make up the complicated martial art of artificial intelligence, like cost functions and gradient descent, can be thoroughly understood just by studying the simple \u201cwax on, wax off\u201d tool that linear regression is.<\/p>\n<p>Artificial intelligence is a vast and complex field, built upon many ideas and methods. While there\u2019s much more to explore, mastering these fundamentals is a significant first step. 
Hopefully, this article has brought you closer to that goal, one \u201cwax on, wax off\u201d at a\u00a0time.<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/mastering-the-basics-how-linear-regression-unlocks-the-secrets-of-complex-models-8aa33920c105\">Mastering the Basics: How Linear Regression Unlocks the Secrets of Complex Models<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Miguel Cardona Polo<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fmastering-the-basics-how-linear-regression-unlocks-the-secrets-of-complex-models-8aa33920c105\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Mastering the Basics: How Linear Regression Unlocks the Secrets of Complex Models Full explanation on Linear Regression and how it\u00a0learns The Crane Stance. Public Domain image from Openverse Just like Mr. 
Miyagi taught young Daniel LaRusso karate through repetitive simple chores, which ultimately transformed him into the Karate Kid, mastering foundational algorithms like linear regression [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,83,1117,803,70],"tags":[496,103,336],"class_list":["post-970","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-data-science","category-gradient-descent","category-linear-regression","category-machine-learning","tag-linear","tag-model","tag-regression"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/970"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=970"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/970\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=970"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=970"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=970"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}