{"id":1637,"date":"2025-02-04T07:02:58","date_gmt":"2025-02-04T07:02:58","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/04\/neural-networks-intuitively-and-exhaustively-explained-0153f85c1007\/"},"modified":"2025-02-04T07:02:58","modified_gmt":"2025-02-04T07:02:58","slug":"neural-networks-intuitively-and-exhaustively-explained-0153f85c1007","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/04\/neural-networks-intuitively-and-exhaustively-explained-0153f85c1007\/","title":{"rendered":"Neural Networks \u2013 Intuitively and Exhaustively Explained"},"content":{"rendered":"<div>\n<h3 class=\"wp-block-heading\">An in-depth exploration of the most fundamental architecture in modern AI<\/h3>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"b5af96\" data-has-transparency=\"false\" style=\"--dominant-color: #b5af96;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"1024\" class=\"wp-image-597166 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1B7C0TLxpFO6fKhqXGRe2bg.png?resize=1024%2C1024&#038;ssl=1\" alt='\"The Thinking Part\" by Daniel Warfield using MidJourney. All images by the author unless otherwise specified. Article originally made available on Intuitively and Exhaustively Explained.' 
srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1B7C0TLxpFO6fKhqXGRe2bg.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1B7C0TLxpFO6fKhqXGRe2bg-300x300.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1B7C0TLxpFO6fKhqXGRe2bg-150x150.png 150w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1B7C0TLxpFO6fKhqXGRe2bg-768x768.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">&#8220;The Thinking Part&#8221; by Daniel Warfield using MidJourney. All images by the author unless otherwise specified. Article originally made available on <a href=\"https:\/\/iaee.substack.com\/\">Intuitively and Exhaustively Explained<\/a>.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In this article we\u2019ll form a thorough understanding of the neural network, a cornerstone technology underpinning virtually all cutting edge <a href=\"https:\/\/towardsdatascience.com\/tag\/ai\/\" title=\"AI\">AI<\/a> systems. We\u2019ll first explore neurons in the human brain, and then explore how they formed the fundamental inspiration for neural networks in AI. We\u2019ll then explore back-propagation, the algorithm used to train neural networks to do cool stuff. 
Finally, after forging a thorough conceptual understanding, we\u2019ll implement a Neural Network ourselves from scratch and train it to solve a toy problem.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\">\n<p class=\"wp-block-paragraph\"><strong>Who is this useful for?<\/strong> Anyone who wants to form a complete understanding of the state of the art of AI.<\/p>\n<p class=\"wp-block-paragraph\"><strong>How advanced is this post?<\/strong> This article is designed to be accessible to beginners, and also contains thorough information which may serve as a useful refresher for more experienced readers.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Pre-requisites:<\/strong> None<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\">\n<h2 class=\"wp-block-heading\">Inspiration From the Brain<\/h2>\n<p class=\"wp-block-paragraph\">Neural networks take direct inspiration from the human brain, which is made up of billions of incredibly complex cells called neurons.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f3f2f3\" data-has-transparency=\"false\" style=\"--dominant-color: #f3f2f3;\" loading=\"lazy\" decoding=\"async\" width=\"2500\" height=\"1612\" class=\"wp-image-597167 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0H6bVaEAMKw2WlsXB.png?resize=2500%2C1612&#038;ssl=1\" alt=\"The Neuron, source\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0H6bVaEAMKw2WlsXB.png 2500w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0H6bVaEAMKw2WlsXB-300x193.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0H6bVaEAMKw2WlsXB-1024x660.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0H6bVaEAMKw2WlsXB-768x495.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0H6bVaEAMKw2WlsXB-1536x990.png 1536w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0H6bVaEAMKw2WlsXB-2048x1321.png 2048w\" sizes=\"auto, (max-width: 2500px) 100vw, 2500px\"><figcaption class=\"wp-element-caption\">The Neuron, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Neuron#\/media\/File:Blausen_0657_MultipolarNeuron.png\">source<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The process of thinking within the human brain is the result of communication between neurons. You might receive stimulus in the form of something you saw, then that information is propagated to neurons in the brain via electrochemical signals.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f2f1f1\" data-has-transparency=\"true\" style=\"--dominant-color: #f2f1f1;\" loading=\"lazy\" decoding=\"async\" width=\"1230\" height=\"320\" class=\"wp-image-597168 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xfX_DPDEwe3sUE7N1wIZvw.png?resize=1230%2C320&#038;ssl=1\" alt=\"eye image generated with Midjourney\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xfX_DPDEwe3sUE7N1wIZvw.png 1230w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xfX_DPDEwe3sUE7N1wIZvw-300x78.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xfX_DPDEwe3sUE7N1wIZvw-1024x266.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xfX_DPDEwe3sUE7N1wIZvw-768x200.png 768w\" sizes=\"auto, (max-width: 1230px) 100vw, 1230px\"><figcaption class=\"wp-element-caption\">eye image generated with Midjourney<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The first neurons in the brain receive that stimulus, then each neuron may choose whether or not to &#8220;fire&#8221; based on how much stimulus it received. 
&#8220;Firing&#8221;, in this case, is a neuron\u2019s decision to send signals to the neurons it\u2019s connected to.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"eae8e8\" data-has-transparency=\"true\" style=\"--dominant-color: #eae8e8;\" loading=\"lazy\" decoding=\"async\" width=\"1230\" height=\"320\" class=\"wp-image-597169 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11_-vnr8Ax_TqknfAYA2A-w.png?resize=1230%2C320&#038;ssl=1\" alt=\"Imagine the signal from the eye directly feeds into three neurons, and two decide to fire.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11_-vnr8Ax_TqknfAYA2A-w.png 1230w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11_-vnr8Ax_TqknfAYA2A-w-300x78.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11_-vnr8Ax_TqknfAYA2A-w-1024x266.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11_-vnr8Ax_TqknfAYA2A-w-768x200.png 768w\" sizes=\"auto, (max-width: 1230px) 100vw, 1230px\"><figcaption class=\"wp-element-caption\">Imagine the signal from the eye directly feeds into three neurons, and two decide to fire.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Then the neurons which those neurons are connected to may or may not choose to fire.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e9dfdf\" data-has-transparency=\"true\" style=\"--dominant-color: #e9dfdf;\" loading=\"lazy\" decoding=\"async\" width=\"1230\" height=\"454\" class=\"wp-image-597170 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1fIUUxfy2BMODCRmi1D3SBw.png?resize=1230%2C454&#038;ssl=1\" alt=\"Neurons receive stimulus from previous neurons and then choose whether or not to fire based on the magnitude of the stimulus.\" 
srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1fIUUxfy2BMODCRmi1D3SBw.png 1230w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1fIUUxfy2BMODCRmi1D3SBw-300x111.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1fIUUxfy2BMODCRmi1D3SBw-1024x378.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1fIUUxfy2BMODCRmi1D3SBw-768x283.png 768w\" sizes=\"auto, (max-width: 1230px) 100vw, 1230px\"><figcaption class=\"wp-element-caption\">Neurons receive stimulus from previous neurons and then choose whether or not to fire based on the magnitude of the stimulus.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Thus, a &#8220;thought&#8221; can be conceptualized as a large number of neurons choosing to fire, or not to fire, based on the stimulus from other neurons.<\/p>\n<p class=\"wp-block-paragraph\">As one navigates throughout the world, one might have certain thoughts more than another person. A cellist might use some neurons more than a mathematician, for instance.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"eee5e1\" data-has-transparency=\"true\" style=\"--dominant-color: #eee5e1;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"676\" class=\"wp-image-597171 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1v_6qjOhyH3DT0rEj95_zRg.png?resize=1442%2C676&#038;ssl=1\" alt=\"Different tasks require the use of different neurons. 
Images generated with Midjourney\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1v_6qjOhyH3DT0rEj95_zRg.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1v_6qjOhyH3DT0rEj95_zRg-300x141.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1v_6qjOhyH3DT0rEj95_zRg-1024x480.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1v_6qjOhyH3DT0rEj95_zRg-768x360.png 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">Different tasks require the use of different neurons. Images generated with Midjourney<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">When we use certain neurons more frequently, their connections become stronger, increasing the intensity of those connections. When we don\u2019t use certain neurons, those connections weaken. This general rule has inspired the phrase &#8220;Neurons that fire together, wire together&#8221;, and is the high-level quality of the brain which is responsible for the learning process.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"ede2dd\" data-has-transparency=\"true\" style=\"--dominant-color: #ede2dd;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"736\" class=\"wp-image-597172 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/13opdx0rZVIavS3hd-FtJbw.png?resize=1442%2C736&#038;ssl=1\" alt=\"The process of using certain neurons strengthens their connections.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/13opdx0rZVIavS3hd-FtJbw.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/13opdx0rZVIavS3hd-FtJbw-300x153.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/13opdx0rZVIavS3hd-FtJbw-1024x523.png 1024w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/13opdx0rZVIavS3hd-FtJbw-768x392.png 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">The process of using certain neurons strengthens their connections.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">I\u2019m not a neurologist, so of course this is a tremendously simplified description of the brain. However, it\u2019s enough to understand the fundamental idea of a neural network.<\/p>\n<h2 class=\"wp-block-heading\">The Intuition of Neural Networks<\/h2>\n<p class=\"wp-block-paragraph\">Neural networks are, essentially, a mathematically convenient and simplified version of neurons within the brain. A neural network is made up of elements called &#8220;perceptrons&#8221;, which are directly inspired by neurons.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f8f7f8\" data-has-transparency=\"true\" style=\"--dominant-color: #f8f7f8;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"430\" class=\"wp-image-597173 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Vp3uVTyAcixfAlwcZWTTbQ.png?resize=1442%2C430&#038;ssl=1\" alt=\"A perceptron, on the left, vs a neuron, on the right. 
source 1, source 2\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Vp3uVTyAcixfAlwcZWTTbQ.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Vp3uVTyAcixfAlwcZWTTbQ-300x89.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Vp3uVTyAcixfAlwcZWTTbQ-1024x305.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Vp3uVTyAcixfAlwcZWTTbQ-768x229.png 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">A perceptron, on the left, vs a neuron, on the right. <a href=\"https:\/\/commons.wikimedia.org\/wiki\/File:ArtificialNeuronModel_english.png\">source 1<\/a>, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Neuron#\/media\/File:Blausen_0657_MultipolarNeuron.png\">source 2<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Perceptrons take in data, like a neuron does,<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f4f4f4\" data-has-transparency=\"true\" style=\"--dominant-color: #f4f4f4;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"324\" class=\"wp-image-597174 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Ixk7vi9afHRoRx5dG_E37A.png?resize=1442%2C324&#038;ssl=1\" alt=\"Perceptrons in AI work with numbers, while Neurons within the brain work with electrochemical signals.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Ixk7vi9afHRoRx5dG_E37A.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Ixk7vi9afHRoRx5dG_E37A-300x67.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Ixk7vi9afHRoRx5dG_E37A-1024x230.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Ixk7vi9afHRoRx5dG_E37A-768x173.png 768w\" sizes=\"auto, (max-width: 
1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">Perceptrons in AI work with numbers, while Neurons within the brain work with electrochemical signals.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">aggregate that data, like a neuron does,<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f2f2f2\" data-has-transparency=\"true\" style=\"--dominant-color: #f2f2f2;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"324\" class=\"wp-image-597175 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1M4Hg_rE7WmLFqpMjfg5P3Q.png?resize=1442%2C324&#038;ssl=1\" alt=\"Perceptrons aggregate numbers to come up with an output, while neurons aggregate electrochemical signals to come up with an output.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1M4Hg_rE7WmLFqpMjfg5P3Q.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1M4Hg_rE7WmLFqpMjfg5P3Q-300x67.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1M4Hg_rE7WmLFqpMjfg5P3Q-1024x230.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1M4Hg_rE7WmLFqpMjfg5P3Q-768x173.png 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">Perceptrons aggregate numbers to come up with an output, while neurons aggregate electrochemical signals to come up with an output.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">then output a signal based on the input, like a neuron does.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f0f0f0\" data-has-transparency=\"true\" style=\"--dominant-color: #f0f0f0;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"324\" class=\"wp-image-597176 has-transparency\" 
src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1DHnCzMRpcBASTtxFMeAuFg.png?resize=1442%2C324&#038;ssl=1\" alt=\"Perceptrons output numbers, while neurons output electrochemical signals.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1DHnCzMRpcBASTtxFMeAuFg.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1DHnCzMRpcBASTtxFMeAuFg-300x67.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1DHnCzMRpcBASTtxFMeAuFg-1024x230.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1DHnCzMRpcBASTtxFMeAuFg-768x173.png 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">Perceptrons output numbers, while neurons output electrochemical signals.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">A neural network can be conceptualized as a big network of these perceptrons, just like the brain is a big network of neurons.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"cdc5b3\" data-has-transparency=\"true\" style=\"--dominant-color: #cdc5b3;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"574\" class=\"wp-image-597177 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1-7K5ccbzd9_UO6UIRzxbOA.png?resize=1442%2C574&#038;ssl=1\" alt=\"A neural network (left) vs the brain (right). 
src1 src2\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1-7K5ccbzd9_UO6UIRzxbOA.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1-7K5ccbzd9_UO6UIRzxbOA-300x119.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1-7K5ccbzd9_UO6UIRzxbOA-1024x408.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1-7K5ccbzd9_UO6UIRzxbOA-768x306.png 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">A neural network (left) vs the brain (right). <a href=\"https:\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/e\/e4\/Artificial_neural_network.svg\/560px-Artificial_neural_network.svg.png\">src1<\/a> <a href=\"https:\/\/www.google.com\/url?sa=i&amp;url=https%3A%2F%2Fcommons.wikimedia.org%2Fwiki%2FNeuron&amp;psig=AOvVaw16zt0cwZKdlPi6mQqgRQLT&amp;ust=1736975287779000&amp;source=images&amp;cd=vfe&amp;opi=89978449&amp;ved=0CBQQjRxqFwoTCNi7sKyP9ooDFQAAAAAdAAAAABAE\">src2<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">When a neuron in the brain fires, it does so as a binary decision. Or, in other words, neurons either fire or they don\u2019t. 
Perceptrons, on the other hand, don\u2019t &#8220;fire&#8221; per se, but output a number from a continuous range based on the perceptron\u2019s input.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f1f1f1\" data-has-transparency=\"true\" style=\"--dominant-color: #f1f1f1;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"302\" class=\"wp-image-597178 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11snQ4GLqxc-6tHjXaF6TtA.png?resize=1442%2C302&#038;ssl=1\" alt=\"Perceptrons output a continuous range of numbers, while Neurons either fire or they don't.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11snQ4GLqxc-6tHjXaF6TtA.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11snQ4GLqxc-6tHjXaF6TtA-300x63.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11snQ4GLqxc-6tHjXaF6TtA-1024x214.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11snQ4GLqxc-6tHjXaF6TtA-768x161.png 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">Perceptrons output a continuous range of numbers, while Neurons either fire or they don\u2019t.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Neurons within the brain can get away with their relatively simple binary inputs and outputs because thoughts exist over time. Neurons essentially <a href=\"https:\/\/www.youtube.com\/watch?v=Nxa19uWC_oA\">pulse at different rates<\/a>, with slower and faster pulses communicating different information.<\/p>\n<p class=\"wp-block-paragraph\">So, neurons have simple inputs and outputs in the form of on or off pulses, but the rate at which they pulse can communicate complex information. Perceptrons only see an input once per pass through the network, but their input and output can be a continuous range of values. 
If you\u2019re familiar with electronics, you might reflect on how this is similar to the relationship between digital and analogue signals.<\/p>\n<p class=\"wp-block-paragraph\">The way the math for a perceptron actually shakes out is pretty simple. A standard neural network consists of a bunch of weights connecting the perceptrons of different layers together.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f3f3f3\" data-has-transparency=\"true\" style=\"--dominant-color: #f3f3f3;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"570\" class=\"wp-image-597179 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1cLcbZkQ4KCmyqsoIsjEQCg.png?resize=1442%2C570&#038;ssl=1\" alt=\"A neural network, with the weights leading into and out of a particular perceptron highlighted.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1cLcbZkQ4KCmyqsoIsjEQCg.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1cLcbZkQ4KCmyqsoIsjEQCg-300x119.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1cLcbZkQ4KCmyqsoIsjEQCg-1024x405.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1cLcbZkQ4KCmyqsoIsjEQCg-768x304.png 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">A neural network, with the weights leading into and out of a particular perceptron highlighted.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">You can calculate the value of a particular perceptron by multiplying each input by its respective weight and adding up the results.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f1f1f1\" data-has-transparency=\"true\" style=\"--dominant-color: #f1f1f1;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"590\" class=\"wp-image-597180 has-transparency\" 
src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Sd0t4WYtLZmhvF6w6VcQ6A.png?resize=1442%2C590&#038;ssl=1\" alt=\"An example of how the value of a perceptron might be calculated. (0.3\u00d70.3) + (0.7\u00d70.1) +(-0.5\u00d70.5)=-0.09\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Sd0t4WYtLZmhvF6w6VcQ6A.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Sd0t4WYtLZmhvF6w6VcQ6A-300x123.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Sd0t4WYtLZmhvF6w6VcQ6A-1024x419.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Sd0t4WYtLZmhvF6w6VcQ6A-768x314.png 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">An example of how the value of a perceptron might be calculated. (0.3\u00d70.3) + (0.7\u00d70.1) +(-0.5\u00d70.5)=-0.09<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Many Neural Networks also have a &#8220;bias&#8221; associated with each perceptron, which is added to the sum of the inputs to calculate the perceptron\u2019s value.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f1f1f1\" data-has-transparency=\"true\" style=\"--dominant-color: #f1f1f1;\" loading=\"lazy\" decoding=\"async\" width=\"1692\" height=\"712\" class=\"wp-image-597181 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1gTOvMB96VO5qDn-VI7Me_w.png?resize=1692%2C712&#038;ssl=1\" alt=\"An example of how the value of a perceptron might be calculated when a bias term is included in the model. 
(0.3\u00d70.3) + (0.7\u00d70.1) +(-0.5\u00d70.5) + 0.01 =-0.08\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1gTOvMB96VO5qDn-VI7Me_w.png 1692w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1gTOvMB96VO5qDn-VI7Me_w-300x126.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1gTOvMB96VO5qDn-VI7Me_w-1024x431.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1gTOvMB96VO5qDn-VI7Me_w-768x323.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1gTOvMB96VO5qDn-VI7Me_w-1536x646.png 1536w\" sizes=\"auto, (max-width: 1692px) 100vw, 1692px\"><figcaption class=\"wp-element-caption\">An example of how the value of a perceptron might be calculated when a bias term is included in the model. (0.3\u00d70.3) + (0.7\u00d70.1) +(-0.5\u00d70.5) + 0.01 =-0.08<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Calculating the output of a neural network, then, is just doing a bunch of addition and multiplication to calculate the value of all the perceptrons.<\/p>\n<p class=\"wp-block-paragraph\">Sometimes data scientists refer to this general operation as a &#8220;linear projection&#8221;, because we\u2019re mapping an input into an output via linear operations (addition and multiplication). One problem with this approach is, even if you daisy chain a billion of these layers together, the resulting model will still just be a linear relationship between the input and output because it\u2019s all just addition and multiplication.<\/p>\n<p class=\"wp-block-paragraph\">This is a serious problem because not all relationships between an input and output are linear. To get around this, data scientists employ something called an &#8220;activation function&#8221;. 
These are non-linear functions which can be injected throughout the model to, essentially, sprinkle in some non-linearity.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f8f8f8\" data-has-transparency=\"false\" style=\"--dominant-color: #f8f8f8;\" loading=\"lazy\" decoding=\"async\" width=\"1442\" height=\"740\" class=\"wp-image-597182 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1q-oox60ZsuYVYbZkp2SpEg.png?resize=1442%2C740&#038;ssl=1\" alt=\"Examples of a variety of functions which, given some input, produce some output. The top three are linear, while the bottom three are non-linear.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1q-oox60ZsuYVYbZkp2SpEg.png 1442w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1q-oox60ZsuYVYbZkp2SpEg-300x154.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1q-oox60ZsuYVYbZkp2SpEg-1024x525.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1q-oox60ZsuYVYbZkp2SpEg-768x394.png 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\"><figcaption class=\"wp-element-caption\">Examples of a variety of functions which, given some input, produce some output. 
The top three are linear, while the bottom three are non-linear.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">By interweaving non-linear activation functions between linear projections, neural networks are capable of learning very complex functions.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f3f2f2\" data-has-transparency=\"true\" style=\"--dominant-color: #f3f2f2;\" loading=\"lazy\" decoding=\"async\" width=\"1272\" height=\"550\" class=\"wp-image-597183 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1m_PdmmIOUeKQZDM59eehNw.png?resize=1272%2C550&#038;ssl=1\" alt=\"By placing non-linear activation functions within a neural network, neural networks are capable of modeling complex relationships.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1m_PdmmIOUeKQZDM59eehNw.png 1272w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1m_PdmmIOUeKQZDM59eehNw-300x130.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1m_PdmmIOUeKQZDM59eehNw-1024x443.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1m_PdmmIOUeKQZDM59eehNw-768x332.png 768w\" sizes=\"auto, (max-width: 1272px) 100vw, 1272px\"><figcaption class=\"wp-element-caption\">By placing non-linear activation functions within a neural network, neural networks are capable of modeling complex relationships.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In AI there are many activation functions, but the industry has largely converged on three popular ones: ReLU, Sigmoid, and Softmax, which are used in a variety of different applications. 
Out of all of them, ReLU is the most common due to its simplicity and ability to generalize to mimic almost any other function.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"fcfcfc\" data-has-transparency=\"false\" style=\"--dominant-color: #fcfcfc;\" loading=\"lazy\" decoding=\"async\" width=\"1332\" height=\"750\" class=\"wp-image-597184 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1U2_yx_M8XCD93RPOxvg9hQ.png?resize=1332%2C750&#038;ssl=1\" alt=\"The ReLU activation function, where the output is equal to zero if the input is less than zero, and the output is equal to the input if the input is greater than zero.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1U2_yx_M8XCD93RPOxvg9hQ.png 1332w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1U2_yx_M8XCD93RPOxvg9hQ-300x169.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1U2_yx_M8XCD93RPOxvg9hQ-1024x577.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1U2_yx_M8XCD93RPOxvg9hQ-768x432.png 768w\" sizes=\"auto, (max-width: 1332px) 100vw, 1332px\"><figcaption class=\"wp-element-caption\">The ReLU activation function, where the output is equal to zero if the input is less than zero, and the output is equal to the input if the input is greater than zero.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">So, that\u2019s the essence of how AI models make predictions. 
It\u2019s a bunch of addition and multiplication with some nonlinear functions sprinkled in between.<\/p>\n<p class=\"wp-block-paragraph\">Another defining characteristic of neural networks is that they can be trained to be better at solving a certain problem, which we\u2019ll explore in the next section.<\/p>\n<h2 class=\"wp-block-heading\">Back Propagation<\/h2>\n<p class=\"wp-block-paragraph\">One of the fundamental ideas of AI is that you can &#8220;train&#8221; a model. This is done by asking a neural network (which starts its life as a big pile of random data) to do some task. Then, you somehow update the model based on how the model\u2019s output compares to a known good answer.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"eeeeee\" data-has-transparency=\"true\" style=\"--dominant-color: #eeeeee;\" loading=\"lazy\" decoding=\"async\" width=\"1048\" height=\"480\" class=\"wp-image-597185 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1JYyCqZhkWAvTaZ7fPhnk9A.png?resize=1048%2C480&#038;ssl=1\" alt=\"The fundamental idea of training a neural network. You give it some data where you know what you want the output to be, compare the neural networks output with your desired result, then use how wrong the neural network was to update the parameters so it's less wrong.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1JYyCqZhkWAvTaZ7fPhnk9A.png 1048w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1JYyCqZhkWAvTaZ7fPhnk9A-300x137.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1JYyCqZhkWAvTaZ7fPhnk9A-1024x469.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1JYyCqZhkWAvTaZ7fPhnk9A-768x352.png 768w\" sizes=\"auto, (max-width: 1048px) 100vw, 1048px\"><figcaption class=\"wp-element-caption\">The fundamental idea of training a neural network. 
You give it some data where you know what you want the output to be, compare the neural network\u2019s output with your desired result, then use how wrong the neural network was to update the parameters so it\u2019s less wrong.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">For this section, let\u2019s imagine a neural network with an input layer, a hidden layer, and an output layer.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f1f1f1\" data-has-transparency=\"true\" style=\"--dominant-color: #f1f1f1;\" loading=\"lazy\" decoding=\"async\" width=\"1048\" height=\"464\" class=\"wp-image-597186 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1v2Fwu8wCaSoKiTBBpZknuQ.png?resize=1048%2C464&#038;ssl=1\" alt=\"A neural network with two inputs and a single output, with a hidden layer in-between allowing the model to make more complex predictions.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1v2Fwu8wCaSoKiTBBpZknuQ.png 1048w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1v2Fwu8wCaSoKiTBBpZknuQ-300x133.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1v2Fwu8wCaSoKiTBBpZknuQ-1024x453.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1v2Fwu8wCaSoKiTBBpZknuQ-768x340.png 768w\" sizes=\"auto, (max-width: 1048px) 100vw, 1048px\"><figcaption class=\"wp-element-caption\">A neural network with two inputs and a single output, with a hidden layer in-between allowing the model to make more complex predictions.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Each of these layers is connected to the next with, initially, completely random weights.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f4f1f2\" data-has-transparency=\"true\" style=\"--dominant-color: #f4f1f2;\" loading=\"lazy\" decoding=\"async\" width=\"1458\" 
height=\"740\" class=\"wp-image-597187 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1UD0Z1G4QBuVvud3Sf8gOWg.png?resize=1458%2C740&#038;ssl=1\" alt=\"The neural network, with random weights and biases defined.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1UD0Z1G4QBuVvud3Sf8gOWg.png 1458w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1UD0Z1G4QBuVvud3Sf8gOWg-300x152.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1UD0Z1G4QBuVvud3Sf8gOWg-1024x520.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1UD0Z1G4QBuVvud3Sf8gOWg-768x390.png 768w\" sizes=\"auto, (max-width: 1458px) 100vw, 1458px\"><figcaption class=\"wp-element-caption\">The neural network, with random weights and biases defined.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">And we\u2019ll use a ReLU activation function on our hidden layer.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f4f1f1\" data-has-transparency=\"true\" style=\"--dominant-color: #f4f1f1;\" loading=\"lazy\" decoding=\"async\" width=\"1640\" height=\"796\" class=\"wp-image-597188 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1pLo_h8SZrJbAjd42Qeyv_A.png?resize=1640%2C796&#038;ssl=1\" alt=\"We'll apply the ReLU activation function to the value of our hidden perceptrons.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1pLo_h8SZrJbAjd42Qeyv_A.png 1640w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1pLo_h8SZrJbAjd42Qeyv_A-300x146.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1pLo_h8SZrJbAjd42Qeyv_A-1024x497.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1pLo_h8SZrJbAjd42Qeyv_A-768x373.png 768w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1pLo_h8SZrJbAjd42Qeyv_A-1536x746.png 1536w\" sizes=\"auto, (max-width: 1640px) 100vw, 1640px\"><figcaption class=\"wp-element-caption\">We\u2019ll apply the ReLU activation function to the value of our hidden perceptrons.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s say we have some training data, in which the desired output is the average value of the input.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f7f7f7\" data-has-transparency=\"true\" style=\"--dominant-color: #f7f7f7;\" loading=\"lazy\" decoding=\"async\" width=\"1182\" height=\"368\" class=\"wp-image-597189 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NI1jil4RRhvW8m9yPrWTnQ.png?resize=1182%2C368&#038;ssl=1\" alt=\"An example of the data that we'll be training off of.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NI1jil4RRhvW8m9yPrWTnQ.png 1182w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NI1jil4RRhvW8m9yPrWTnQ-300x93.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NI1jil4RRhvW8m9yPrWTnQ-1024x319.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NI1jil4RRhvW8m9yPrWTnQ-768x239.png 768w\" sizes=\"auto, (max-width: 1182px) 100vw, 1182px\"><figcaption class=\"wp-element-caption\">An example of the data that we\u2019ll be training off of.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">And we pass an example of our training data through the model, generating a prediction.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f4f2f2\" data-has-transparency=\"true\" style=\"--dominant-color: #f4f2f2;\" loading=\"lazy\" decoding=\"async\" width=\"1538\" height=\"690\" class=\"wp-image-597190 has-transparency\" 
src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15fKc61r8UC33OQqn5v0Gnw.png?resize=1538%2C690&#038;ssl=1\" alt=\"Calculating the value of the hidden layer and output based on the input, including all major intermediary steps.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15fKc61r8UC33OQqn5v0Gnw.png 1538w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15fKc61r8UC33OQqn5v0Gnw-300x135.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15fKc61r8UC33OQqn5v0Gnw-1024x459.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15fKc61r8UC33OQqn5v0Gnw-768x345.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15fKc61r8UC33OQqn5v0Gnw-1536x689.png 1536w\" sizes=\"auto, (max-width: 1538px) 100vw, 1538px\"><figcaption class=\"wp-element-caption\">Calculating the value of the hidden layer and output based on the input, including all major intermediary steps.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">To make our neural network better at the task of calculating the average of the input, we first compare the predicted output to what our desired output is.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f2f0f0\" data-has-transparency=\"true\" style=\"--dominant-color: #f2f0f0;\" loading=\"lazy\" decoding=\"async\" width=\"1538\" height=\"690\" class=\"wp-image-597191 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NNXP11WMJkbRzxtcEI7eZQ.png?resize=1538%2C690&#038;ssl=1\" alt=\"The training data has an input of 0.1 and 0.3, and the desired output (the average of the input) is 0.2. The prediction from the model was -0.1. 
Thus, the difference between the output and the desired output is 0.3.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NNXP11WMJkbRzxtcEI7eZQ.png 1538w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NNXP11WMJkbRzxtcEI7eZQ-300x135.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NNXP11WMJkbRzxtcEI7eZQ-1024x459.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NNXP11WMJkbRzxtcEI7eZQ-768x345.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NNXP11WMJkbRzxtcEI7eZQ-1536x689.png 1536w\" sizes=\"auto, (max-width: 1538px) 100vw, 1538px\"><figcaption class=\"wp-element-caption\">The training data has an input of 0.1 and 0.3, and the desired output (the average of the input) is 0.2. The prediction from the model was -0.1. Thus, the difference between the output and the desired output is 0.3.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now that we know that the output should increase in size, we can look back through the model to calculate how our weights and biases might change to promote that change.<\/p>\n<p class=\"wp-block-paragraph\">First, let\u2019s look at the weights leading immediately into the output: w\u2087, w\u2088, w\u2089. 
Because the un-activated value of the third hidden perceptron was -0.46, its activation after ReLU was 0.00.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"eeeee9\" data-has-transparency=\"true\" style=\"--dominant-color: #eeeee9;\" loading=\"lazy\" decoding=\"async\" width=\"1498\" height=\"476\" class=\"wp-image-597192 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1dIbrD12PNayTGNhOD75nmg.png?resize=1498%2C476&#038;ssl=1\" alt=\"The ultimate, activated output of the third perceptron, 0.00\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1dIbrD12PNayTGNhOD75nmg.png 1498w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1dIbrD12PNayTGNhOD75nmg-300x95.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1dIbrD12PNayTGNhOD75nmg-1024x325.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1dIbrD12PNayTGNhOD75nmg-768x244.png 768w\" sizes=\"auto, (max-width: 1498px) 100vw, 1498px\"><figcaption class=\"wp-element-caption\">The ultimate, activated output of the third perceptron, 0.00<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">As a result, there\u2019s no change to w\u2089 that could result in us getting closer to our desired output, because every value of w\u2089 would produce the same output in this particular example.<\/p>\n<p class=\"wp-block-paragraph\">The second hidden neuron, however, does have an activated output which is greater than zero, and thus adjusting w\u2088 will have an impact on the output for this example.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"eeeee8\" data-has-transparency=\"true\" style=\"--dominant-color: #eeeee8;\" loading=\"lazy\" decoding=\"async\" width=\"1498\" height=\"476\" class=\"wp-image-597193 has-transparency\" 
src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/13Whf1AhkWz6clG06cztpCw.png?resize=1498%2C476&#038;ssl=1\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/13Whf1AhkWz6clG06cztpCw.png 1498w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/13Whf1AhkWz6clG06cztpCw-300x95.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/13Whf1AhkWz6clG06cztpCw-1024x325.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/13Whf1AhkWz6clG06cztpCw-768x244.png 768w\" sizes=\"auto, (max-width: 1498px) 100vw, 1498px\"><\/figure>\n<p class=\"wp-block-paragraph\">We calculate how much w\u2088 should change by multiplying how much the output should change by the input to w\u2088.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"faf9f9\" data-has-transparency=\"true\" style=\"--dominant-color: #faf9f9;\" loading=\"lazy\" decoding=\"async\" width=\"1622\" height=\"800\" class=\"wp-image-597194 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1aT1An9_PHK2UdYPg4P396g.png?resize=1622%2C800&#038;ssl=1\" alt='How we calculate how the weight should change. 
Here the symbol \u0394(delta) means \"change in\", so \u0394w\u2088 means the \"change in w\u2088\"' srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1aT1An9_PHK2UdYPg4P396g.png 1622w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1aT1An9_PHK2UdYPg4P396g-300x148.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1aT1An9_PHK2UdYPg4P396g-1024x505.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1aT1An9_PHK2UdYPg4P396g-768x379.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1aT1An9_PHK2UdYPg4P396g-1536x758.png 1536w\" sizes=\"auto, (max-width: 1622px) 100vw, 1622px\"><figcaption class=\"wp-element-caption\">How we calculate how the weight should change. Here the symbol \u0394(delta) means &#8220;change in&#8221;, so \u0394w\u2088 means the &#8220;change in w\u2088&#8221;<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The easiest explanation of why we do it this way is &#8220;because calculus&#8221;, but if we look at how all weights get updated in the last layer, we can form a fun intuition.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"faf9f8\" data-has-transparency=\"true\" style=\"--dominant-color: #faf9f8;\" loading=\"lazy\" decoding=\"async\" width=\"1622\" height=\"794\" class=\"wp-image-597195 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1kmHCi-J7Fg3IV6aEynIagQ.png?resize=1622%2C794&#038;ssl=1\" alt=\"Calculating how the weights leading into the output should change.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1kmHCi-J7Fg3IV6aEynIagQ.png 1622w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1kmHCi-J7Fg3IV6aEynIagQ-300x147.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1kmHCi-J7Fg3IV6aEynIagQ-1024x501.png 1024w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1kmHCi-J7Fg3IV6aEynIagQ-768x376.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1kmHCi-J7Fg3IV6aEynIagQ-1536x752.png 1536w\" sizes=\"auto, (max-width: 1622px) 100vw, 1622px\"><figcaption class=\"wp-element-caption\">Calculating how the weights leading into the output should change.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Notice how the two perceptrons that &#8220;fire&#8221; (have an output greater than zero) are updated together. Also, notice how the stronger a perceptron\u2019s output is, the more its corresponding weight is updated. This is somewhat similar to the idea that &#8220;Neurons that fire together, wire together&#8221; within the human brain.<\/p>\n<p class=\"wp-block-paragraph\">Calculating the change to the output bias is super easy. In fact, we\u2019ve already done it. Because the bias is added directly to a perceptron\u2019s output, the change in the bias is just the desired change in the output. 
So, \u0394b\u2084 = 0.3.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"fafafa\" data-has-transparency=\"true\" style=\"--dominant-color: #fafafa;\" loading=\"lazy\" decoding=\"async\" width=\"1588\" height=\"494\" class=\"wp-image-597196 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1yU1vGFGDcYF_bsYpi1EViQ.png?resize=1588%2C494&#038;ssl=1\" alt=\"How the bias of the output should be updated.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1yU1vGFGDcYF_bsYpi1EViQ.png 1588w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1yU1vGFGDcYF_bsYpi1EViQ-300x93.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1yU1vGFGDcYF_bsYpi1EViQ-1024x319.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1yU1vGFGDcYF_bsYpi1EViQ-768x239.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1yU1vGFGDcYF_bsYpi1EViQ-1536x478.png 1536w\" sizes=\"auto, (max-width: 1588px) 100vw, 1588px\"><figcaption class=\"wp-element-caption\">How the bias of the output should be updated.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now that we\u2019ve calculated how the weights and bias of the output perceptron should change, we can &#8220;back propagate&#8221; our desired change in output through the model. Let\u2019s start by back propagating so we can calculate how we should update w\u2081.<\/p>\n<p class=\"wp-block-paragraph\">First, we calculate how the activated output of the first hidden neuron should change. 
We do that by multiplying the change in output by w\u2087.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"fbfafa\" data-has-transparency=\"true\" style=\"--dominant-color: #fbfafa;\" loading=\"lazy\" decoding=\"async\" width=\"1608\" height=\"690\" class=\"wp-image-597197 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15bZbQeeglMZ8eNSAMBADJg.png?resize=1608%2C690&#038;ssl=1\" alt=\"Calculating how the activated output of the first hidden neuron should have changed by multiplying the desired change in the output by w\u2087.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15bZbQeeglMZ8eNSAMBADJg.png 1608w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15bZbQeeglMZ8eNSAMBADJg-300x129.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15bZbQeeglMZ8eNSAMBADJg-1024x439.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15bZbQeeglMZ8eNSAMBADJg-768x330.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15bZbQeeglMZ8eNSAMBADJg-1536x659.png 1536w\" sizes=\"auto, (max-width: 1608px) 100vw, 1608px\"><figcaption class=\"wp-element-caption\">Calculating how the activated output of the first hidden neuron should have changed by multiplying the desired change in the output by w\u2087.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">For values that are greater than zero, ReLU simply multiplies those values by 1. 
So, for this example, the change we want in the un-activated value of the first hidden neuron is equal to the desired change in the activated output.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"fafafa\" data-has-transparency=\"true\" style=\"--dominant-color: #fafafa;\" loading=\"lazy\" decoding=\"async\" width=\"1608\" height=\"690\" class=\"wp-image-597198 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1MG5zTdstAdlxDYriLA1dMw.png?resize=1608%2C690&#038;ssl=1\" alt=\"How much we want to change the un-activated value of the first hidden perceptron, based on back-propagating from the output.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1MG5zTdstAdlxDYriLA1dMw.png 1608w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1MG5zTdstAdlxDYriLA1dMw-300x129.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1MG5zTdstAdlxDYriLA1dMw-1024x439.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1MG5zTdstAdlxDYriLA1dMw-768x330.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1MG5zTdstAdlxDYriLA1dMw-1536x659.png 1536w\" sizes=\"auto, (max-width: 1608px) 100vw, 1608px\"><figcaption class=\"wp-element-caption\">How much we want to change the un-activated value of the first hidden perceptron, based on back-propagating from the output.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Recall that we calculated how to update w\u2087 by multiplying its input by the desired change in its output. 
We can do the same thing to calculate the change in w\u2081.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f9f9f8\" data-has-transparency=\"true\" style=\"--dominant-color: #f9f9f8;\" loading=\"lazy\" decoding=\"async\" width=\"1794\" height=\"898\" class=\"wp-image-597199 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1rb2kMoJK5HkgeBIvWeXYZA.png?resize=1794%2C898&#038;ssl=1\" alt=\"Now that we've calculated how the first hidden neuron should change, we can calculate how we should update w\u2081 the same way we calculated how w\u2087 should be updated previously.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1rb2kMoJK5HkgeBIvWeXYZA.png 1794w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1rb2kMoJK5HkgeBIvWeXYZA-300x150.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1rb2kMoJK5HkgeBIvWeXYZA-1024x513.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1rb2kMoJK5HkgeBIvWeXYZA-768x384.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1rb2kMoJK5HkgeBIvWeXYZA-1536x769.png 1536w\" sizes=\"auto, (max-width: 1794px) 100vw, 1794px\"><figcaption class=\"wp-element-caption\">Now that we\u2019ve calculated how the first hidden neuron should change, we can calculate how we should update w\u2081 the same way we calculated how w\u2087 should be updated previously.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">It\u2019s important to note, we\u2019re not actually updating any of the weights or biases throughout this process. 
Rather, we\u2019re taking a tally of how we should update each parameter, assuming no other parameters are updated.<\/p>\n<p class=\"wp-block-paragraph\">So, we can repeat those calculations to work out how every parameter should change.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f5f2f3\" data-has-transparency=\"true\" style=\"--dominant-color: #f5f2f3;\" loading=\"lazy\" decoding=\"async\" width=\"1758\" height=\"782\" class=\"wp-image-597200 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15zQm78JT5IjAvQSI0YdEdw.png?resize=1758%2C782&#038;ssl=1\" alt=\"By back propagating through the model, using a combination of values from the forward pass and desired changes from the backward pass at various points of the model, we can calculate how all parameters should change.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15zQm78JT5IjAvQSI0YdEdw.png 1758w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15zQm78JT5IjAvQSI0YdEdw-300x133.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15zQm78JT5IjAvQSI0YdEdw-1024x455.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15zQm78JT5IjAvQSI0YdEdw-768x342.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/15zQm78JT5IjAvQSI0YdEdw-1536x683.png 1536w\" sizes=\"auto, (max-width: 1758px) 100vw, 1758px\"><figcaption class=\"wp-element-caption\">By back propagating through the model, using a combination of values from the forward pass and desired changes from the backward pass at various points of the model, we can calculate how all parameters should change.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">A fundamental idea in training with back propagation is the &#8220;learning rate&#8221;, which concerns the size of the changes we make to neural networks based on a particular batch of data. 
To explain why this is important, I\u2019d like to use an analogy.<\/p>\n<p class=\"wp-block-paragraph\">Imagine you went outside one day, and everyone wearing a hat gave you a funny look. You probably don\u2019t want to jump to the conclusion that <code>wearing hat = funny look<\/code>, but you might be a bit wary of people wearing hats. After three, four, five days, a month, or even a year, if it seems like the vast majority of people wearing hats are giving you a funny look, you may start considering that a strong trend.<\/p>\n<p class=\"wp-block-paragraph\">Similarly, when we train a neural network, we don\u2019t want to completely change how the neural network thinks based on a single training example. Rather, we want each batch to only incrementally change how the model thinks. As we expose the model to many examples, we would hope that the model would learn important trends within the data.<\/p>\n<p class=\"wp-block-paragraph\">After we\u2019ve calculated how each parameter should change as if it were the only parameter being updated, we can multiply all those changes by a small number, like <code>0.001<\/code>, before applying those changes to the parameters. This small number is commonly referred to as the &#8220;learning rate&#8221;, and the exact value it should have depends on the model we\u2019re training. This effectively scales down our adjustments before applying them to the model.<\/p>\n<p class=\"wp-block-paragraph\">At this point we\u2019ve covered pretty much everything one would need to know to implement a neural network. 
Let\u2019s give it a shot!<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"d3a064\" data-has-transparency=\"false\" style=\"--dominant-color: #d3a064;\" loading=\"lazy\" decoding=\"async\" width=\"1800\" height=\"649\" class=\"wp-image-597201 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0SOOewKyrD86bJlSQ.jpeg?resize=1800%2C649&#038;ssl=1\" alt=\"Join IAEE\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0SOOewKyrD86bJlSQ.jpeg 1800w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0SOOewKyrD86bJlSQ-300x108.jpeg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0SOOewKyrD86bJlSQ-1024x369.jpeg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0SOOewKyrD86bJlSQ-768x277.jpeg 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0SOOewKyrD86bJlSQ-1536x554.jpeg 1536w\" sizes=\"auto, (max-width: 1800px) 100vw, 1800px\"><figcaption class=\"wp-element-caption\"><a href=\"https:\/\/iaee.substack.com\/\">Join IAEE<\/a><\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Implementing a Neural Network from Scratch<\/h2>\n<p class=\"wp-block-paragraph\">Typically, a data scientist would just use a library like <code>PyTorch<\/code> to implement a neural network in a few lines of code, but we\u2019ll be defining a neural network from the ground up using NumPy, a numerical computing library.<\/p>\n<p class=\"wp-block-paragraph\">First, let\u2019s start with a way to define the structure of the neural network.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">\"\"\"Blocking out the structure of the Neural Network\n\"\"\"\n\nimport numpy as np\n\nclass SimpleNN:\n    def __init__(self, architecture):\n        self.architecture = architecture\n        self.weights = []\n        self.biases = []\n\n        # Initialize weights and biases\n        
np.random.seed(99)\n        for i in range(len(architecture) - 1):\n            self.weights.append(np.random.uniform(\n                low=-1, high=1,\n                size=(architecture[i], architecture[i+1])\n            ))\n            self.biases.append(np.zeros((1, architecture[i+1])))\n\narchitecture = [2, 64, 64, 64, 1]  # Two inputs, three hidden layers, one output\nmodel = SimpleNN(architecture)\n\nprint('weight dimensions:')\nfor w in model.weights:\n    print(w.shape)\n\nprint('\\nbias dimensions:')\nfor b in model.biases:\n    print(b.shape)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"393939\" data-has-transparency=\"false\" style=\"--dominant-color: #393939;\" loading=\"lazy\" decoding=\"async\" width=\"2018\" height=\"404\" class=\"wp-image-597202 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Bn8-3lou0sBs8y2cf1sY8g.png?resize=2018%2C404&#038;ssl=1\" alt=\"The weight and bias matrices defined in a sample neural network.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Bn8-3lou0sBs8y2cf1sY8g.png 2018w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Bn8-3lou0sBs8y2cf1sY8g-300x60.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Bn8-3lou0sBs8y2cf1sY8g-1024x205.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Bn8-3lou0sBs8y2cf1sY8g-768x154.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1Bn8-3lou0sBs8y2cf1sY8g-1536x308.png 1536w\" sizes=\"auto, (max-width: 2018px) 100vw, 2018px\"><figcaption class=\"wp-element-caption\">The weight and bias matrices defined in a sample neural network.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">While we typically draw neural networks as a dense web, in reality we represent the weights between their connections as matrices. 
This is convenient because matrix multiplication, then, is equivalent to passing data through a neural network.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"eeeeef\" data-has-transparency=\"true\" style=\"--dominant-color: #eeeeef;\" loading=\"lazy\" decoding=\"async\" width=\"1260\" height=\"370\" class=\"wp-image-597203 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/02kqALjWcLyiGL9YQ.png?resize=1260%2C370&#038;ssl=1\" alt=\"Thinking of a dense network as weighted connections on the left, and as matrix multiplication on the right. On the right hand side diagram, the vector on the left would be the input, the matrix in the center would be the weight matrix, and the vector on the right would be the output. Only a portion of values are included for readability. From my article on LoRA.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/02kqALjWcLyiGL9YQ.png 1260w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/02kqALjWcLyiGL9YQ-300x88.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/02kqALjWcLyiGL9YQ-1024x301.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/02kqALjWcLyiGL9YQ-768x226.png 768w\" sizes=\"auto, (max-width: 1260px) 100vw, 1260px\"><figcaption class=\"wp-element-caption\">Thinking of a dense network as weighted connections on the left, and as matrix multiplication on the right. On the right hand side diagram, the vector on the left would be the input, the matrix in the center would be the weight matrix, and the vector on the right would be the output. Only a portion of values are included for readability. 
From my article on <a href=\"https:\/\/medium.com\/towards-data-science\/lora-intuitively-and-exhaustively-explained-e944a6bff46b\">LoRA<\/a>.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We can have our model make a prediction on some input by passing that input through each layer in turn.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">\"\"\"Implementing the Forward Pass\n\"\"\"\n\nimport numpy as np\n\nclass SimpleNN:\n    def __init__(self, architecture):\n        self.architecture = architecture\n        self.weights = []\n        self.biases = []\n\n        # Initialize weights and biases\n        np.random.seed(99)\n        for i in range(len(architecture) - 1):\n            self.weights.append(np.random.uniform(\n                low=-1, high=1,\n                size=(architecture[i], architecture[i+1])\n            ))\n            self.biases.append(np.zeros((1, architecture[i+1])))\n\n    @staticmethod\n    def relu(x):\n        #implementing the relu activation function\n        return np.maximum(0, x)\n\n    def forward(self, X):\n        #iterating through all layers\n        for W, b in zip(self.weights, self.biases):\n\n            #applying the weight and bias of the layer\n            X = np.dot(X, W) + b\n\n            #doing ReLU for all but the last layer\n            if W is not self.weights[-1]:\n                X = self.relu(X)\n\n        #returning the result\n        return X\n\n    def predict(self, X):\n        y = self.forward(X)\n        return y.flatten()\n\n#defining a model\narchitecture = [2, 64, 64, 64, 1]  # Two inputs, three hidden layers, one output\nmodel = SimpleNN(architecture)\n\n# Generate predictions\nprediction = model.predict(np.array([0.1,0.2]))\nprint(prediction)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"3a3a3a\" data-has-transparency=\"false\" style=\"--dominant-color: #3a3a3a;\" loading=\"lazy\" decoding=\"async\" 
width=\"2018\" height=\"48\" class=\"wp-image-597204 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/12gnRkuco9SpHeEpKaynhrQ.png?resize=2018%2C48&#038;ssl=1\" alt=\"the result of passing our data through the model. Our model is randomly defined, so this isn't a useful prediction, but it confirms that the model is working.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/12gnRkuco9SpHeEpKaynhrQ.png 2018w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/12gnRkuco9SpHeEpKaynhrQ-300x7.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/12gnRkuco9SpHeEpKaynhrQ-1024x24.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/12gnRkuco9SpHeEpKaynhrQ-768x18.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/12gnRkuco9SpHeEpKaynhrQ-1536x37.png 1536w\" sizes=\"auto, (max-width: 2018px) 100vw, 2018px\"><figcaption class=\"wp-element-caption\">the result of passing our data through the model. Our model is randomly defined, so this isn\u2019t a useful prediction, but it confirms that the model is working.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We need to be able to train this model, and to do that we\u2019ll first need a problem to train the model on. 
I defined a random function that takes in two inputs and results in an output:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">\"\"\"Defining what we want the model to learn\n\"\"\"\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n# Define a random function with two inputs\ndef random_function(x, y):\n    return (np.sin(x) + x * np.cos(y) + y + 3**(x\/3))\n\n# Generate a grid of x and y values\nx = np.linspace(-10, 10, 100)\ny = np.linspace(-10, 10, 100)\nX, Y = np.meshgrid(x, y)\n\n# Compute the output of the random function\nZ = random_function(X, Y)\n\n# Create a 2D plot\nplt.figure(figsize=(8, 6))\ncontour = plt.contourf(X, Y, Z, cmap='viridis')\nplt.colorbar(contour, label='Function Value')\nplt.title('2D Plot of Objective Function')\nplt.xlabel('X-axis')\nplt.ylabel('Y-axis')\nplt.show()<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"708da6\" data-has-transparency=\"true\" style=\"--dominant-color: #708da6;\" loading=\"lazy\" decoding=\"async\" width=\"1422\" height=\"1116\" class=\"wp-image-597205 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11f28YdNHVOI7goP3REmrfA.png?resize=1422%2C1116&#038;ssl=1\" alt=\"The modeling objective. Given two inputs (here plotted as x and y), the model needs to predict an output (here represented as color). 
This is a completely arbitrary function\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11f28YdNHVOI7goP3REmrfA.png 1422w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11f28YdNHVOI7goP3REmrfA-300x235.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11f28YdNHVOI7goP3REmrfA-1024x804.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11f28YdNHVOI7goP3REmrfA-768x603.png 768w\" sizes=\"auto, (max-width: 1422px) 100vw, 1422px\"><figcaption class=\"wp-element-caption\">The modeling objective. Given two inputs (here plotted as x and y), the model needs to predict an output (here represented as color). This is a completely arbitrary function<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In the real world we wouldn\u2019t know the underlying function. We can mimic that reality by creating a dataset consisting of random points:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\n# Define a random function with two inputs\ndef random_function(x, y):\n    return (np.sin(x) + x * np.cos(y) + y + 3**(x\/3))\n\n# Define the number of random samples to generate\nn_samples = 1000\n\n# Generate random X and Y values within a specified range\nx_min, x_max = -10, 10\ny_min, y_max = -10, 10\n\n# Generate random values for X and Y\nX_random = np.random.uniform(x_min, x_max, n_samples)\nY_random = np.random.uniform(y_min, y_max, n_samples)\n\n# Evaluate the random function at the generated X and Y values\nZ_random = random_function(X_random, Y_random)\n\n# Create a dataset\ndataset = pd.DataFrame({\n    'X': X_random,\n    'Y': Y_random,\n    'Z': Z_random\n})\n\n# Display the dataset\nprint(dataset.head())\n\n# Create a 2D scatter plot of the sampled data\nplt.figure(figsize=(8, 6))\nscatter = plt.scatter(dataset['X'], dataset['Y'], c=dataset['Z'], cmap='viridis', 
s=10)\nplt.colorbar(scatter, label='Function Value')\nplt.title('Scatter Plot of Randomly Sampled Data')\nplt.xlabel('X-axis')\nplt.ylabel('Y-axis')\nplt.show()<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"bec2c3\" data-has-transparency=\"true\" style=\"--dominant-color: #bec2c3;\" loading=\"lazy\" decoding=\"async\" width=\"1482\" height=\"1186\" class=\"wp-image-597206 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1nw8UuI7bSKM_Qth-LYCeMA.png?resize=1482%2C1186&#038;ssl=1\" alt=\"This is the data we'll be training on to try to learn our function.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1nw8UuI7bSKM_Qth-LYCeMA.png 1482w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1nw8UuI7bSKM_Qth-LYCeMA-300x240.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1nw8UuI7bSKM_Qth-LYCeMA-1024x819.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1nw8UuI7bSKM_Qth-LYCeMA-768x615.png 768w\" sizes=\"auto, (max-width: 1482px) 100vw, 1482px\"><figcaption class=\"wp-element-caption\">This is the data we\u2019ll be training on to try to learn our function.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Recall that the back propagation algorithm updates parameters based on what happens in a forward pass. 
So, before we implement backpropagation itself, let\u2019s keep track of a few important values in the forward pass: the inputs and outputs of each perceptron throughout the model.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np\n\nclass SimpleNN:\n    def __init__(self, architecture):\n        self.architecture = architecture\n        self.weights = []\n        self.biases = []\n\n        #keeping track of these values in this code block\n        #so we can observe them\n        self.perceptron_inputs = None\n        self.perceptron_outputs = None\n\n        # Initialize weights and biases\n        np.random.seed(99)\n        for i in range(len(architecture) - 1):\n            self.weights.append(np.random.uniform(\n                low=-1, high=1,\n                size=(architecture[i], architecture[i+1])\n            ))\n            self.biases.append(np.zeros((1, architecture[i+1])))\n\n    @staticmethod\n    def relu(x):\n        return np.maximum(0, x)\n\n    def forward(self, X):\n        self.perceptron_inputs = [X]\n        self.perceptron_outputs = []\n\n        for W, b in zip(self.weights, self.biases):\n            Z = np.dot(self.perceptron_inputs[-1], W) + b\n            self.perceptron_outputs.append(Z)\n\n            if W is self.weights[-1]:  # Last layer (output)\n                A = Z  # Linear output for regression\n            else:\n                A = self.relu(Z)\n            self.perceptron_inputs.append(A)\n\n        return self.perceptron_inputs, self.perceptron_outputs\n\n    def predict(self, X):\n        perceptron_inputs, _ = self.forward(X)\n        return perceptron_inputs[-1].flatten()\n\n#defining a model\narchitecture = [2, 64, 64, 64, 1]  # Two inputs, three hidden layers, one output\nmodel = SimpleNN(architecture)\n\n# Generate predictions\nprediction = model.predict(np.array([0.1,0.2]))\n\n#looking through critical optimization values\nfor i, (inpt, outpt) in 
enumerate(zip(model.perceptron_inputs, model.perceptron_outputs[:-1])):\n    print(f'layer {i}')\n    print(f'input: {inpt.shape}')\n    print(f'output: {outpt.shape}')\n    print('')\n\nprint('Final Output:')\nprint(model.perceptron_outputs[-1].shape)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"3a3a3a\" data-has-transparency=\"false\" style=\"--dominant-color: #3a3a3a;\" loading=\"lazy\" decoding=\"async\" width=\"1482\" height=\"454\" class=\"wp-image-597207 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/17hmH0UrcyzSjmfLAABusZA.png?resize=1482%2C454&#038;ssl=1\" alt=\"The values throughout various layers of the model as a result of the forward pass. This will allow us to compute the necessary changes to update the model.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/17hmH0UrcyzSjmfLAABusZA.png 1482w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/17hmH0UrcyzSjmfLAABusZA-300x92.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/17hmH0UrcyzSjmfLAABusZA-1024x314.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/17hmH0UrcyzSjmfLAABusZA-768x235.png 768w\" sizes=\"auto, (max-width: 1482px) 100vw, 1482px\"><figcaption class=\"wp-element-caption\">The values throughout various layers of the model as a result of the forward pass. 
This will allow us to compute the necessary changes to update the model.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now that we have a record of the critical intermediate values within the network, we can use those values, along with the error of the model for a particular prediction, to calculate the changes we should make to the model.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np\n\nclass SimpleNN:\n    def __init__(self, architecture):\n        self.architecture = architecture\n        self.weights = []\n        self.biases = []\n\n        # Initialize weights and biases\n        np.random.seed(99)\n        for i in range(len(architecture) - 1):\n            self.weights.append(np.random.uniform(\n                low=-1, high=1,\n                size=(architecture[i], architecture[i+1])\n            ))\n            self.biases.append(np.zeros((1, architecture[i+1])))\n\n    @staticmethod\n    def relu(x):\n        return np.maximum(0, x)\n\n    @staticmethod\n    def relu_as_weights(x):\n        return (x &gt; 0).astype(float)\n\n    def forward(self, X):\n        perceptron_inputs = [X]\n        perceptron_outputs = []\n\n        for W, b in zip(self.weights, self.biases):\n            Z = np.dot(perceptron_inputs[-1], W) + b\n            perceptron_outputs.append(Z)\n\n            if W is self.weights[-1]:  # Last layer (output)\n                A = Z  # Linear output for regression\n            else:\n                A = self.relu(Z)\n            perceptron_inputs.append(A)\n\n        return perceptron_inputs, perceptron_outputs\n\n    def backward(self, perceptron_inputs, perceptron_outputs, target):\n        weight_changes = []\n        bias_changes = []\n\n        m = len(target)\n        dA = perceptron_inputs[-1] - target.reshape(-1, 1)  # Output layer gradient\n\n        for i in reversed(range(len(self.weights))):\n            dZ = dA if i == len(self.weights) - 1 else dA * 
self.relu_as_weights(perceptron_outputs[i])\n            dW = np.dot(perceptron_inputs[i].T, dZ) \/ m\n            db = np.sum(dZ, axis=0, keepdims=True) \/ m\n            weight_changes.append(dW)\n            bias_changes.append(db)\n\n            if i &gt; 0:\n                dA = np.dot(dZ, self.weights[i].T)\n\n        return list(reversed(weight_changes)), list(reversed(bias_changes))\n\n    def predict(self, X):\n        perceptron_inputs, _ = self.forward(X)\n        return perceptron_inputs[-1].flatten()\n\n#defining a model\narchitecture = [2, 64, 64, 64, 1]  # Two inputs, three hidden layers, one output\nmodel = SimpleNN(architecture)\n\n#defining a sample input and target output\n#(named sample_input so we don't shadow Python's built-in input)\nsample_input = np.array([[0.1,0.2]])\ndesired_output = np.array([0.5])\n\n#doing forward and backward pass to calculate changes\nperceptron_inputs, perceptron_outputs = model.forward(sample_input)\nweight_changes, bias_changes = model.backward(perceptron_inputs, perceptron_outputs, desired_output)\n\n#smaller numbers for printing\nnp.set_printoptions(precision=2)\n\nfor i, (layer_weights, layer_biases, layer_weight_changes, layer_bias_changes) in enumerate(\n        zip(model.weights, model.biases, weight_changes, bias_changes)):\n    print(f'layer {i}')\n    print(f'weight matrix: {layer_weights.shape}')\n    print(f'weight matrix changes: {layer_weight_changes.shape}')\n    print(f'bias matrix: {layer_biases.shape}')\n    print(f'bias matrix changes: {layer_bias_changes.shape}')\n    print('')\n\nprint('The weight and weight change matrix of the second layer:')\nprint('weight matrix:')\nprint(model.weights[1])\nprint('change matrix:')\nprint(weight_changes[1])<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"3d3d3d\" data-has-transparency=\"false\" style=\"--dominant-color: #3d3d3d;\" loading=\"lazy\" decoding=\"async\" width=\"1738\" height=\"1298\" class=\"wp-image-597208 not-transparent\" 
src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1J7TFKgd0gBikRMuiMin2Rw.png?resize=1738%2C1298&#038;ssl=1\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1J7TFKgd0gBikRMuiMin2Rw.png 1738w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1J7TFKgd0gBikRMuiMin2Rw-300x224.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1J7TFKgd0gBikRMuiMin2Rw-1024x765.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1J7TFKgd0gBikRMuiMin2Rw-768x574.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1J7TFKgd0gBikRMuiMin2Rw-1536x1147.png 1536w\" sizes=\"auto, (max-width: 1738px) 100vw, 1738px\"><\/figure>\n<p class=\"wp-block-paragraph\">This is probably the most complex implementation step, so I want to take a moment to dig through some of the details. The fundamental idea is exactly as we described in previous sections. We\u2019re iterating over all layers, from back to front, and calculating what change to each weight and bias would result in a better output.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># calculating output error\ndA = perceptron_inputs[-1] - target.reshape(-1, 1)\n\n#a scaling factor for the batch size.\n#you want changes to be an average across all batches\n#so we divide by m once we've aggregated all changes.\nm = len(target)\n\nfor i in reversed(range(len(self.weights))):\n  dZ = dA #simplified for now\n\n  # calculating change to weights\n  dW = np.dot(perceptron_inputs[i].T, dZ) \/ m\n  # calculating change to bias\n  db = np.sum(dZ, axis=0, keepdims=True) \/ m\n\n  # keeping track of required changes\n  weight_changes.append(dW)\n  bias_changes.append(db)\n  ...<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Calculating the change to bias is pretty straight forward. 
If you look at how the output of a given neuron should have impacted all future neurons, you can add up all those values (which can be positive or negative) to get an idea of whether the neuron should be biased in a positive or negative direction. In the code, <code>np.sum(dZ, axis=0)<\/code> does this adding up across every sample in the batch, and dividing by <code>m<\/code> turns the sum into an average.<\/p>\n<p class=\"wp-block-paragraph\">The way we calculate the change to weights, by using matrix multiplication, is a bit more mathematically complex.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-ini\">dW = np.dot(perceptron_inputs[i].T, dZ) \/ m<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Basically, this line says that the change in a weight should be equal to the value going into that weight, times how much the output on the other end should have changed. If a perceptron had a big input, the change to its outgoing weights should be large in magnitude; if the perceptron had a small input, the change to its outgoing weights will be small. Also, if a weight points towards an output which should change a lot, the weight should change a lot.<\/p>\n<p class=\"wp-block-paragraph\">There is another line we should discuss in our back propagation implementation.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-lua\">dZ = dA if i == len(self.weights) - 1 else dA * self.relu_as_weights(perceptron_outputs[i])<\/code><\/pre>\n<p class=\"wp-block-paragraph\">In this particular network, there are activation functions throughout the network, following every layer except the final output. When we do back propagation, we need to back-propagate through these activation functions so that we can update the neurons which lie before them. We do this for all but the last layer, which doesn\u2019t have an activation function, which is why <code>dZ = dA if i == len(self.weights) - 1<\/code>.<\/p>\n<p class=\"wp-block-paragraph\">In fancy math speak we would call this a derivative, but because I don\u2019t want to get into calculus, I called the function <code>relu_as_weights<\/code>. 
Basically, we can treat each of our ReLU activations as something like a tiny neural network, whose weight is a function of the input. If the input of the ReLU activation function is zero or less, then that\u2019s like passing that input through a neural network with a weight of zero. If the input of ReLU is greater than zero, then that\u2019s like passing the input through a neural network with a weight of one.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"fcfcfc\" data-has-transparency=\"false\" style=\"--dominant-color: #fcfcfc;\" loading=\"lazy\" decoding=\"async\" width=\"1332\" height=\"750\" class=\"wp-image-597184 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1U2_yx_M8XCD93RPOxvg9hQ.png?resize=1332%2C750&#038;ssl=1\" alt=\"Recall the ReLU activation function.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1U2_yx_M8XCD93RPOxvg9hQ.png 1332w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1U2_yx_M8XCD93RPOxvg9hQ-300x169.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1U2_yx_M8XCD93RPOxvg9hQ-1024x577.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1U2_yx_M8XCD93RPOxvg9hQ-768x432.png 768w\" sizes=\"auto, (max-width: 1332px) 100vw, 1332px\"><figcaption class=\"wp-element-caption\">Recall the ReLU activation function.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">This is exactly what the <code>relu_as_weights<\/code> function does.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def relu_as_weights(x):\n    return (x &gt; 0).astype(float)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Using this logic we can treat back propagating through ReLU just like we back propagate through the rest of the neural network.<\/p>\n<p class=\"wp-block-paragraph\">Again, I\u2019ll be covering this concept from a more 
robust mathematical perspective soon, but that\u2019s the essential idea at a conceptual level.<\/p>\n<p class=\"wp-block-paragraph\">Now that we have the forward and backward pass implemented, we can implement training the model.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np\n\nclass SimpleNN:\n    def __init__(self, architecture):\n        self.architecture = architecture\n        self.weights = []\n        self.biases = []\n\n        # Initialize weights and biases\n        np.random.seed(99)\n        for i in range(len(architecture) - 1):\n            self.weights.append(np.random.uniform(\n                low=-1, high=1,\n                size=(architecture[i], architecture[i+1])\n            ))\n            self.biases.append(np.zeros((1, architecture[i+1])))\n\n    @staticmethod\n    def relu(x):\n        return np.maximum(0, x)\n\n    @staticmethod\n    def relu_as_weights(x):\n        return (x &gt; 0).astype(float)\n\n    def forward(self, X):\n        perceptron_inputs = [X]\n        perceptron_outputs = []\n\n        for W, b in zip(self.weights, self.biases):\n            Z = np.dot(perceptron_inputs[-1], W) + b\n            perceptron_outputs.append(Z)\n\n            if W is self.weights[-1]:  # Last layer (output)\n                A = Z  # Linear output for regression\n            else:\n                A = self.relu(Z)\n            perceptron_inputs.append(A)\n\n        return perceptron_inputs, perceptron_outputs\n\n    def backward(self, perceptron_inputs, perceptron_outputs, y_true):\n        weight_changes = []\n        bias_changes = []\n\n        m = len(y_true)\n        dA = perceptron_inputs[-1] - y_true.reshape(-1, 1)  # Output layer gradient\n\n        for i in reversed(range(len(self.weights))):\n            dZ = dA if i == len(self.weights) - 1 else dA * self.relu_as_weights(perceptron_outputs[i])\n            dW = np.dot(perceptron_inputs[i].T, dZ) \/ m\n            db = 
np.sum(dZ, axis=0, keepdims=True) \/ m\n            weight_changes.append(dW)\n            bias_changes.append(db)\n\n            if i &gt; 0:\n                dA = np.dot(dZ, self.weights[i].T)\n\n        return list(reversed(weight_changes)), list(reversed(bias_changes))\n\n    def update_weights(self, weight_changes, bias_changes, lr):\n        #stepping each parameter against its computed change, scaled by lr\n        for i in range(len(self.weights)):\n            self.weights[i] -= lr * weight_changes[i]\n            self.biases[i] -= lr * bias_changes[i]\n\n    def train(self, X, y, epochs, lr=0.01):\n        for epoch in range(epochs):\n            perceptron_inputs, perceptron_outputs = self.forward(X)\n            weight_changes, bias_changes = self.backward(perceptron_inputs, perceptron_outputs, y)\n            self.update_weights(weight_changes, bias_changes, lr)\n\n            if epoch % 20 == 0 or epoch == epochs - 1:\n                #loss of the predictions made before this epoch's update\n                loss = np.mean((perceptron_inputs[-1].flatten() - y) ** 2)  # MSE\n                print(f\"EPOCH {epoch}: Loss = {loss:.4f}\")\n\n    def predict(self, X):\n        perceptron_inputs, _ = self.forward(X)\n        return perceptron_inputs[-1].flatten()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The <code>train<\/code> function:<\/p>\n<ul class=\"wp-block-list\">\n<li>iterates through all the data some number of times (defined by <code>epochs<\/code>)<\/li>\n<li>passes the data through a forward pass<\/li>\n<li>calculates how the weights and biases should change<\/li>\n<li>updates the weights and biases by subtracting their changes, scaled by the learning rate (<code>lr<\/code>)<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">And thus we\u2019ve implemented a neural network! 
Let\u2019s train it.<\/p>\n<h2 class=\"wp-block-heading\">Training and Evaluating the Neural Network.<\/h2>\n<p class=\"wp-block-paragraph\">Recall that we defined an arbitrary 2D function we wanted to learn how to emulate,<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"708da6\" data-has-transparency=\"true\" style=\"--dominant-color: #708da6;\" loading=\"lazy\" decoding=\"async\" width=\"1422\" height=\"1116\" class=\"wp-image-597205 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11f28YdNHVOI7goP3REmrfA.png?resize=1422%2C1116&#038;ssl=1\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11f28YdNHVOI7goP3REmrfA.png 1422w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11f28YdNHVOI7goP3REmrfA-300x235.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11f28YdNHVOI7goP3REmrfA-1024x804.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/11f28YdNHVOI7goP3REmrfA-768x603.png 768w\" sizes=\"auto, (max-width: 1422px) 100vw, 1422px\"><\/figure>\n<p class=\"wp-block-paragraph\">and we sampled that space with some number of points, which we\u2019re using to train the model.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"e5e9eb\" data-has-transparency=\"true\" style=\"--dominant-color: #e5e9eb;\" loading=\"lazy\" decoding=\"async\" width=\"1284\" height=\"994\" class=\"wp-image-597209 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NESTsjN_9dOKVmWgP12ZGA.png?resize=1284%2C994&#038;ssl=1\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NESTsjN_9dOKVmWgP12ZGA.png 1284w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NESTsjN_9dOKVmWgP12ZGA-300x232.png 300w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NESTsjN_9dOKVmWgP12ZGA-1024x793.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1NESTsjN_9dOKVmWgP12ZGA-768x595.png 768w\" sizes=\"auto, (max-width: 1284px) 100vw, 1284px\"><\/figure>\n<p class=\"wp-block-paragraph\">Before feeding this data into our model, it\u2019s vital that we first &#8220;normalize&#8221; the data. Certain values of the dataset are very small or very large, which can make training a neural network very difficult. Values within the neural network can quickly grow to absurdly large values, or diminish to zero, which can inhibit training. Normalization rescales all of our inputs, and our desired outputs, into a more reasonable range by subtracting the mean and dividing by the standard deviation, so that each ends up with a mean of zero and a standard deviation of one. Note that this only shifts and rescales the data; it doesn\u2019t force it to follow a &#8220;normal&#8221; distribution.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-ini\"># Flatten the data\nX_flat = X.flatten()\nY_flat = Y.flatten()\nZ_flat = Z.flatten()\n\n# Stack X and Y as input features\ninputs = np.column_stack((X_flat, Y_flat))\noutputs = Z_flat\n\n# Normalize the inputs and outputs\ninputs_mean = np.mean(inputs, axis=0)\ninputs_std = np.std(inputs, axis=0)\noutputs_mean = np.mean(outputs)\noutputs_std = np.std(outputs)\n\ninputs = (inputs - inputs_mean) \/ inputs_std\noutputs = (outputs - outputs_mean) \/ outputs_std<\/code><\/pre>\n<p class=\"wp-block-paragraph\">If we want to get back predictions in the actual range of data from our original dataset, we can use these stored means and standard deviations to essentially &#8220;un-squash&#8221; the data: multiply by the standard deviation, then add the mean back.<\/p>\n<p class=\"wp-block-paragraph\">Once we\u2019ve done that, we can define and train our model.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-makefile\"># Define the architecture: [input_dim, hidden1, ..., output_dim]\narchitecture = [2, 64, 64, 64, 1]  # Two inputs, three hidden layers, one output\nmodel = SimpleNN(architecture)\n\n# Train the 
model\nmodel.train(inputs, outputs, epochs=2000, lr=0.001)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"404040\" data-has-transparency=\"false\" style=\"--dominant-color: #404040;\" loading=\"lazy\" decoding=\"async\" width=\"1284\" height=\"1008\" class=\"wp-image-597210 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1nXquA-yHCZrfdwSHSWwvzw.png?resize=1284%2C1008&#038;ssl=1\" alt=\"As can be seen, the value of loss is going down consistently, implying the model is improving.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1nXquA-yHCZrfdwSHSWwvzw.png 1284w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1nXquA-yHCZrfdwSHSWwvzw-300x236.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1nXquA-yHCZrfdwSHSWwvzw-1024x804.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1nXquA-yHCZrfdwSHSWwvzw-768x603.png 768w\" sizes=\"auto, (max-width: 1284px) 100vw, 1284px\"><figcaption class=\"wp-element-caption\">As can be seen, the value of loss is going down consistently, implying the model is improving.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Then we can visualize the output of the neural network\u2019s prediction vs the actual function.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import matplotlib.pyplot as plt\n\n# Reshape predictions to grid format for visualization\nZ_pred = model.predict(inputs) * outputs_std + outputs_mean\nZ_pred = Z_pred.reshape(X.shape)\n\n# Plot comparison of the true function and the model predictions\nfig, axes = plt.subplots(1, 2, figsize=(14, 6))\n\n# Plot the true function\naxes[0].contourf(X, Y, Z, cmap='viridis')\naxes[0].set_title(\"True Function\")\naxes[0].set_xlabel(\"X-axis\")\naxes[0].set_ylabel(\"Y-axis\")\naxes[0].colorbar = plt.colorbar(axes[0].contourf(X, Y, Z, 
cmap='viridis'), ax=axes[0], label=\"Function Value\")\n\n# Plot the predicted function\naxes[1].contourf(X, Y, Z_pred, cmap='plasma')\naxes[1].set_title(\"NN Predicted Function\")\naxes[1].set_xlabel(\"X-axis\")\naxes[1].set_ylabel(\"Y-axis\")\naxes[1].colorbar = plt.colorbar(axes[1].contourf(X, Y, Z_pred, cmap='plasma'), ax=axes[1], label=\"Function Value\")\n\nplt.tight_layout()\nplt.show()<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"9682af\" data-has-transparency=\"true\" style=\"--dominant-color: #9682af;\" loading=\"lazy\" decoding=\"async\" width=\"2498\" height=\"1078\" class=\"wp-image-597211 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1WD5ABUHvbEoC2Bg8Z-Ob6w.png?resize=2498%2C1078&#038;ssl=1\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1WD5ABUHvbEoC2Bg8Z-Ob6w.png 2498w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1WD5ABUHvbEoC2Bg8Z-Ob6w-300x129.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1WD5ABUHvbEoC2Bg8Z-Ob6w-1024x442.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1WD5ABUHvbEoC2Bg8Z-Ob6w-768x331.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1WD5ABUHvbEoC2Bg8Z-Ob6w-1536x663.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1WD5ABUHvbEoC2Bg8Z-Ob6w-2048x884.png 2048w\" sizes=\"auto, (max-width: 2498px) 100vw, 2498px\"><\/figure>\n<p class=\"wp-block-paragraph\">This did an ok job, but not as great as we might like. This is where a lot of data scientists spend their time, and there are a ton of approaches to making a neural network fit a certain problem better. 
Some obvious ones are:<\/p>\n<ul class=\"wp-block-list\">\n<li>use more data<\/li>\n<li>play around with the learning rate<\/li>\n<li>train for more epochs<\/li>\n<li>change the structure of the model<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">It\u2019s pretty easy for us to crank up the amount of data we\u2019re training on. Let\u2019s see where that leads us. Here I\u2019m sampling our dataset 10,000 times, which is 10x as many training samples as in our previous dataset.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"a0afc0\" data-has-transparency=\"true\" style=\"--dominant-color: #a0afc0;\" loading=\"lazy\" decoding=\"async\" width=\"1274\" height=\"1002\" class=\"wp-image-597212 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1V9NO7sBco4qsMgpzPU6e9w.png?resize=1274%2C1002&#038;ssl=1\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1V9NO7sBco4qsMgpzPU6e9w.png 1274w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1V9NO7sBco4qsMgpzPU6e9w-300x236.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1V9NO7sBco4qsMgpzPU6e9w-1024x805.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1V9NO7sBco4qsMgpzPU6e9w-768x604.png 768w\" sizes=\"auto, (max-width: 1274px) 100vw, 1274px\"><\/figure>\n<p class=\"wp-block-paragraph\">I then trained the model just like before, except this time it took a lot longer because each epoch now analyses 10,000 samples rather than 1,000.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-makefile\"># Define the architecture: [input_dim, hidden1, ..., output_dim]\narchitecture = [2, 64, 64, 64, 1]  # Two inputs, three hidden layers, one output\nmodel = SimpleNN(architecture)\n\n# Train the model\nmodel.train(inputs, outputs, epochs=2000, lr=0.001)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img 
data-recalc-dims=\"1\" data-dominant-color=\"3c3c3c\" data-has-transparency=\"false\" style=\"--dominant-color: #3c3c3c;\" loading=\"lazy\" decoding=\"async\" width=\"2494\" height=\"668\" class=\"wp-image-597213 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/14RK1Zw7_QvVMncehuB5aLg.png?resize=2494%2C668&#038;ssl=1\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/14RK1Zw7_QvVMncehuB5aLg.png 2494w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/14RK1Zw7_QvVMncehuB5aLg-300x80.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/14RK1Zw7_QvVMncehuB5aLg-1024x274.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/14RK1Zw7_QvVMncehuB5aLg-768x206.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/14RK1Zw7_QvVMncehuB5aLg-1536x411.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/14RK1Zw7_QvVMncehuB5aLg-2048x549.png 2048w\" sizes=\"auto, (max-width: 2494px) 100vw, 2494px\"><\/figure>\n<p class=\"wp-block-paragraph\">I then rendered the output of this model, the same way I did before, but it didn\u2019t really look like the output got much better.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"9682af\" data-has-transparency=\"true\" style=\"--dominant-color: #9682af;\" loading=\"lazy\" decoding=\"async\" width=\"2494\" height=\"1074\" class=\"wp-image-597214 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xi2uWp2nG_RW745IEOplPw.png?resize=2494%2C1074&#038;ssl=1\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xi2uWp2nG_RW745IEOplPw.png 2494w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xi2uWp2nG_RW745IEOplPw-300x129.png 300w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xi2uWp2nG_RW745IEOplPw-1024x441.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xi2uWp2nG_RW745IEOplPw-768x331.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xi2uWp2nG_RW745IEOplPw-1536x661.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1xi2uWp2nG_RW745IEOplPw-2048x882.png 2048w\" sizes=\"auto, (max-width: 2494px) 100vw, 2494px\"><\/figure>\n<p class=\"wp-block-paragraph\">Looking back at the loss output from training, it seems like the loss is still steadily declining. Maybe I just need to train for longer. Let\u2019s try that.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-makefile\"># Define the architecture: [input_dim, hidden1, ..., output_dim]\narchitecture = [2, 64, 64, 64, 1]  # Two inputs, three hidden layers, one output\nmodel = SimpleNN(architecture)\n\n# Train the model for twice as many epochs as before\nmodel.train(inputs, outputs, epochs=4000, lr=0.001)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"3c3c3c\" data-has-transparency=\"false\" style=\"--dominant-color: #3c3c3c;\" loading=\"lazy\" decoding=\"async\" width=\"2494\" height=\"668\" class=\"wp-image-597215 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1LuD6WcQFUxI3-kLmOxZUkw.png?resize=2494%2C668&#038;ssl=1\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1LuD6WcQFUxI3-kLmOxZUkw.png 2494w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1LuD6WcQFUxI3-kLmOxZUkw-300x80.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1LuD6WcQFUxI3-kLmOxZUkw-1024x274.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1LuD6WcQFUxI3-kLmOxZUkw-768x206.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1LuD6WcQFUxI3-kLmOxZUkw-1536x411.png 
1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1LuD6WcQFUxI3-kLmOxZUkw-2048x549.png 2048w\" sizes=\"auto, (max-width: 2494px) 100vw, 2494px\"><\/figure>\n<p class=\"wp-block-paragraph\">The results seem to be a bit better, but they aren\u2019t amazing.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"9983ae\" data-has-transparency=\"true\" style=\"--dominant-color: #9983ae;\" loading=\"lazy\" decoding=\"async\" width=\"2494\" height=\"1078\" class=\"wp-image-597216 has-transparency\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1QOcHdFMieEoEKvOdPjzMzg.png?resize=2494%2C1078&#038;ssl=1\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1QOcHdFMieEoEKvOdPjzMzg.png 2494w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1QOcHdFMieEoEKvOdPjzMzg-300x130.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1QOcHdFMieEoEKvOdPjzMzg-1024x443.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1QOcHdFMieEoEKvOdPjzMzg-768x332.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1QOcHdFMieEoEKvOdPjzMzg-1536x664.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/1QOcHdFMieEoEKvOdPjzMzg-2048x885.png 2048w\" sizes=\"auto, (max-width: 2494px) 100vw, 2494px\"><\/figure>\n<p class=\"wp-block-paragraph\">I\u2019ll spare you the details. I ran this a few times, and I got some decent results, but never a one-to-one match with the true function. In future articles I\u2019ll be covering some more advanced approaches data scientists use, like annealing and dropout, which can result in better and more consistent output. Still, though, we made a neural network from scratch and trained it to do something, and it did a decent job! 
Pretty neat!<\/p>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">In this article we avoided calculus like the plague while simultaneously forging an understanding of Neural Networks. We explored their theory, a little bit about the math, the idea of back propagation, and then implemented a neural network from scratch. We then applied a neural network to a toy problem, and explored some of the simple ideas data scientists employ to actually train neural networks to be good at things.<\/p>\n<p class=\"wp-block-paragraph\">In future articles we\u2019ll explore a few more advanced approaches to Neural Networks, so stay tuned! For now, you might be interested in a more thorough analysis of Gradients, the fundamental math behind back propagation.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><a href=\"https:\/\/towardsdatascience.com\/what-are-gradients-and-why-do-they-explode-add23264d24b\"><strong>What Are Gradients, and Why Do They Explode?<\/strong><\/a><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">You might also be interested in this article, which covers training a neural network using more conventional <a href=\"https:\/\/towardsdatascience.com\/tag\/data-science\/\" title=\"Data Science\">Data Science<\/a> tools like PyTorch.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><a href=\"https:\/\/towardsdatascience.com\/ai-for-the-absolute-novice-intuitively-and-exhaustively-explained-7b353a31e6d7\"><strong>AI for the Absolute Novice \u2013 Intuitively and Exhaustively Explained<\/strong><\/a><\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Join Intuitively and Exhaustively Explained<\/h2>\n<p class=\"wp-block-paragraph\">At IAEE you can find:<\/p>\n<ul class=\"wp-block-list\">\n<li>Long form content, like the article you just read<\/li>\n<li>Conceptual breakdowns of some of the most cutting-edge AI topics<\/li>\n<li>By-Hand 
walkthroughs of critical mathematical operations in AI<\/li>\n<li>Practical tutorials and explainers<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"d4a46a\" data-has-transparency=\"false\" style=\"--dominant-color: #d4a46a;\" loading=\"lazy\" decoding=\"async\" width=\"1260\" height=\"434\" class=\"wp-image-597217 not-transparent\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0tBoYEtqCyiht-7gI.jpeg?resize=1260%2C434&#038;ssl=1\" alt=\"Join IAEE\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0tBoYEtqCyiht-7gI.jpeg 1260w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0tBoYEtqCyiht-7gI-300x103.jpeg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0tBoYEtqCyiht-7gI-1024x353.jpeg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/0tBoYEtqCyiht-7gI-768x265.jpeg 768w\" sizes=\"auto, (max-width: 1260px) 100vw, 1260px\"><figcaption class=\"wp-element-caption\"><a href=\"https:\/\/iaee.substack.com\/\">Join IAEE<\/a><\/figcaption><\/figure>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/neural-networks-intuitively-and-exhaustively-explained-0153f85c1007\/\">Neural Networks \u2013 Intuitively and Exhaustively Explained<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Daniel Warfield<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/neural-networks-intuitively-and-exhaustively-explained-0153f85c1007\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Neural Networks \u2013 Intuitively and Exhaustively Explained An in-depth exploration of the most fundamental architecture in modern AI &#8220;The Thinking Part&#8221; by Daniel Warfield using MidJourney. 
All images by the author unless otherwise specified. Article originally made available on Intuitively and Exhaustively Explained. In this article we\u2019ll form a thorough understanding of the neural network, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[151,62,83,240,673,699],"tags":[1620,118,1619],"class_list":["post-1637","post","type-post","status-publish","format-standard","hentry","category-ai","category-aimldsaimlds","category-data-science","category-editors-pick","category-neural-networks","category-software-development","tag-brain","tag-neural","tag-neurons"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1637"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1637"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1637\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1637"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1637"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1637"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}