{"id":1232,"date":"2025-01-16T07:02:36","date_gmt":"2025-01-16T07:02:36","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/16\/a-12-step-visual-guide-to-understanding-nerf-representing-scenes-as-neural-radiance-fields-24a36aef909a\/"},"modified":"2025-01-16T07:02:36","modified_gmt":"2025-01-16T07:02:36","slug":"a-12-step-visual-guide-to-understanding-nerf-representing-scenes-as-neural-radiance-fields-24a36aef909a","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/16\/a-12-step-visual-guide-to-understanding-nerf-representing-scenes-as-neural-radiance-fields-24a36aef909a\/","title":{"rendered":"A 12-step visual guide to understanding NeRF (Representing Scenes as Neural Radiance Fields)"},"content":{"rendered":"<p>A 12-step visual guide to understanding NeRF (Representing Scenes as Neural Radiance Fields)<\/p>\n<div>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AuGydi7m_DdbStaPJYEugyQ.png?ssl=1\"><figcaption>NeRF overview\u200a\u2014\u200aImage by\u00a0Author<\/figcaption><\/figure>\n<h3>A Beginner\u2019s 12-Step Visual Guide to Understanding NeRF: Neural Radiance Fields for Scene Representation and View Synthesis<\/h3>\n<h4>A basic understanding of NeRF\u2019s workings through visual representations<\/h4>\n<h4><strong>Who should read this\u00a0article?<\/strong><\/h4>\n<p>This article aims to provide a basic, beginner-level understanding of NeRF\u2019s workings through visual representations. While various blogs offer detailed explanations of NeRF, these are often geared toward readers with a strong technical background in volume rendering and 3D graphics. In contrast, this article seeks to explain NeRF with minimal prerequisite knowledge, with an optional technical snippet at the end for curious readers. 
For those interested in the mathematical details behind NeRF, a list of further readings is provided at the\u00a0end.<\/p>\n<h4><strong>What is NeRF and How Does It\u00a0Work?<\/strong><\/h4>\n<p>NeRF, short for <em>Neural Radiance Fields<\/em>, is a 2020 paper introducing a novel method for rendering 2D images from 3D scenes. Traditional approaches rely on physics-based, computationally intensive techniques such as ray casting and ray tracing. These involve tracing a ray of light from each pixel of the 2D image back to the scene particles to estimate the pixel color. While these methods offer high accuracy (e.g., images captured by phone cameras closely approximate what the human eye perceives from the same angle), they are often slow and require significant computational resources, such as GPUs, for parallel processing. As a result, implementing these methods on edge devices with limited computing capabilities is nearly impossible.<\/p>\n<p>NeRF addresses this issue by functioning as a scene compression method. It uses an overfitted multi-layer perceptron (MLP) to encode scene information, which can then be queried from any viewing direction to generate a 2D-rendered image. When properly trained, NeRF significantly reduces storage requirements; for example, a simple 3D scene can typically be compressed into about 5\u00a0MB of\u00a0data.<\/p>\n<p>At its core, NeRF answers the following question using an\u00a0MLP:<\/p>\n<blockquote><p><em>What will I see if I view the scene from this direction?<\/em><\/p><\/blockquote>\n<p>This question is answered by providing a sampled 3D position together with the viewing direction (in terms of two angles (\u03b8, \u03c6), or a unit vector) to the MLP as input. The MLP returns an RGB value (directional emitted color) and a volume density, which are then processed through volumetric rendering to produce the final RGB value that the pixel sees. 
To create an image of a certain resolution (say HxW), this is repeated for each of the HxW pixels, one viewing ray per pixel, with the MLP queried at several points along each ray. Since the release of the first NeRF paper, numerous updates have been made to enhance rendering quality and speed. However, this blog will focus on the original NeRF\u00a0paper.<\/p>\n<h4><strong>Step 1: Multi-view input\u00a0images<\/strong><\/h4>\n<p>NeRF needs various images from different viewing angles to compress a scene. The MLP learns to interpolate these images for unseen viewing directions (novel views). The information on the viewing direction for an image is provided using the camera&#8217;s intrinsic and extrinsic matrices. The more images spanning a wide range of viewing directions, the better the NeRF reconstruction of the scene is. In short, the basic NeRF takes as input the camera images and their associated intrinsic and extrinsic camera matrices. (You can learn more about the camera matrices in the blog\u00a0below.)<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/what-are-intrinsic-and-extrinsic-camera-parameters-in-computer-vision-7071b72fb8ec\">What are Intrinsic and Extrinsic Camera Parameters in Computer Vision?<\/a><\/p>\n<h4><strong>Steps 2 to 4: Sampling, Pixel iteration, and Ray\u00a0casting<\/strong><\/h4>\n<p>Each image in the input images is processed independently (for the sake of simplicity). From the input, an image and its associated camera matrices are sampled. For each camera image pixel, a ray is traced from the camera center to the pixel and extended outwards. 
If the camera center is defined as o, and the viewing direction as a unit vector d, then the ray r(t) can be defined as r(t)=o+td, where t is the distance of the point r(t) from the center of the\u00a0camera.<\/p>\n<p>Ray casting is done to identify the parts of the scene that contribute to the color of the\u00a0pixel.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AhE6L5t4ma85_AnKSYfCVHA.jpeg?ssl=1\"><figcaption>Understanding NeRF\u200a\u2014\u200aSteps 1\u20134, Input, Sampling, Pixel iteration and ray casting\u200a\u2014\u200aImage by\u00a0Author<\/figcaption><\/figure>\n<h4><strong>Step 5: Ray\u00a0Marching<\/strong><\/h4>\n<p>Once the ray is cast, we sample n points along the ray. Theoretically, the ray can extend out infinitely, so to limit the ray we define a near plane r(t_n) and a far plane r(t_f), which lie at distances t_n and t_f from the camera center. These planes limit our search space. Only the space between these planes is considered for scene reconstruction, so the planes must be chosen to suit the scene under consideration.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Ab1ANy5k4cSFRJgl3sJv1Ug.png?ssl=1\"><figcaption>Near and far plane for NeRF\u200a\u2014\u200aImage by\u00a0Author<\/figcaption><\/figure>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AQy3BMG8G1d9lR6ZPF-frZQ.jpeg?ssl=1\"><figcaption>Understanding NeRF\u200a\u2014\u200aSteps 5 &amp; 6, Ray marching, Input to the MLP\u200a\u2014\u200aImage by\u00a0Author<\/figcaption><\/figure>\n<h4><strong>Steps 6 and 7: Multi-layer perceptron (MLP)<\/strong><\/h4>\n<p>Now for each pixel in the camera image, we have a viewing direction (\u03b8, \u03c6), which is the same for every point on the ray, and n 3D points from the scene that lie in that 
viewing direction ((x1, y1, z1), (x2, y2, z2),\u00a0\u2026, (xn, yn, zn)). From these parameters, we create n 5D vectors, which are used as input to the MLP, as shown above. The MLP then predicts n 4D vectors that contain the directional emitted color c (i.e. the RGB color c=(ri, gi, bi) contributed by the 3D position (xi, yi, zi) towards the pixel when viewed from the direction (\u03b8i, \u03c6i)), and a volumetric density \u03c3 (a scalar value used to determine the probability of a ray interacting with a particular point in space). \u03c3 indicates how \u201copaque\u201d a point in space is. High values of \u03c3 mean that the space is dense (e.g., part of an object), while low values indicate empty or transparent regions.<\/p>\n<p>Formally, the MLP F_\u0398 maps the inputs (x, d) to the outputs (c, \u03c3):<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/530\/1%2Av5Z-DIlhV530oAL_WFEumg.png?ssl=1\"><figcaption>MLP used for\u00a0NeRF<\/figcaption><\/figure>\n<p>where d is the viewing direction (either (\u03b8, \u03c6), or a 3D unit vector) of the ray, and <strong>x<\/strong> = (x, y, z) is the 3D position of the sampled point along the\u00a0ray.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AYgCSVGNMA-K158ZyR60HPg.png?ssl=1\"><figcaption>Understanding NeRF\u200a\u2014\u200aSteps 7 &amp; 8, MLP, Pixel Reconstruction\u200a\u2014\u200aImage by\u00a0Author<\/figcaption><\/figure>\n<h4><strong>Step 8: Pixel reconstruction<\/strong><\/h4>\n<p>The pixel color is reconstructed by integrating contributions along the ray that passes through the scene. 
For a ray parameterized as r(t)=o+td, the color C(r) of the ray (and thus the pixel) is computed using the volume rendering equation as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ASu5Lz-biZi95Y_5lKuAx9g.png?ssl=1\"><figcaption>Equation 1\u200a\u2014\u200aVolumetric rendering<\/figcaption><\/figure>\n<p>where \u03c3(r(t)) is the volumetric density of the point r(t) on the ray cast, c(r(t), d) is the directional emitted color of the point r(t), and t_n and t_f are the limits defined by the near and far planes. T(t) is the transmittance, representing the probability that light travels from the camera to depth t without being absorbed, and is calculated as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/984\/1%2A4TOHPgnjihWK4AkDra0xAA.png?ssl=1\"><figcaption>Equation 2: Transmittance<\/figcaption><\/figure>\n<p>Let&#8217;s first understand what transmittance is. The farther you move along the ray, the higher the probability that the ray is absorbed within the scene. Consequently, the transmittance is the exponential of the negative cumulative volumetric density, integrated from the near plane to the point t where the transmittance is being calculated.<\/p>\n<p>Equation 1 can be interpreted as follows: the color of the ray (and hence the pixel) is computed as a weighted sum of the emitted color at each point along the ray. 
Each point\u2019s contribution is weighted by two\u00a0factors:<\/p>\n<ol>\n<li>The probability that the ray reaches the point t without being absorbed (transmittance, T(t)).<\/li>\n<li>The probability that the point contains material capable of emitting or reflecting light (volumetric density, \u03c3(r(t))).<\/li>\n<\/ol>\n<p>This combination ensures that the rendered color accounts for both the visibility of the point and the physical presence of light-emitting or light-reflecting material.<\/p>\n<p>Since the MLP is only evaluated at a finite number of points on the ray r(t), we discretize the volume rendering equation above and apply it using the n points determined during ray marching.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/740\/1%2AyMvhXmojGz7qQ-OwIAm8EQ.png?ssl=1\"><figcaption>Equation 3\u200a\u2014\u200aDiscretized volumetric rendering<\/figcaption><\/figure>\n<p>This formula is similar, except that instead of directly using \u03c3 as a weight, we use \u03b1. While \u03c3 represents the <strong>volumetric density<\/strong> at a single point in space (which works for continuous spaces), \u03b1 represents the <strong>opacity<\/strong> over a discrete segment of the ray. It combines the local density \u03c3 and the sampling step size\u00a0\u0394ti as \u03b1i = 1 \u2212 exp(\u2212\u03c3i \u0394ti).<\/p>\n<p>The volume rendering takes the MLP output and calculates the pixel RGB color, which is then compared to the input pixel color. 
An important advantage of the volume rendering equation is its differentiability, enabling the MLP to be efficiently trained through backpropagation.<\/p>\n<h4><strong>Steps 9, 10, and 11: Image reconstruction, Loss calculation &amp; Optimization<\/strong><\/h4>\n<p>After estimating the pixel color through volume rendering, the same process is repeated for all pixels in the image to reconstruct the complete image. The reconstructed image is then compared to the input image, and a pixel-wise Mean Squared Error (MSE) loss is computed as\u00a0follows.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A53Les3b-8LgpOwC8g_kLzQ.png?ssl=1\"><figcaption>Equation 4\u200a\u2014\u200aLoss\u00a0function<\/figcaption><\/figure>\n<p>where N is the total number of pixels in the image, C_pred is the predicted pixel color, C_true is the actual pixel\u00a0color.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A_52DY_tkTXBpdQeQXjmdQQ.jpeg?ssl=1\"><figcaption>Understanding NeRF\u200a\u2014\u200aSteps 9 &amp; 10, Image reconstruction and loss calculation\u200a\u2014\u200aImage by\u00a0Author<\/figcaption><\/figure>\n<p>The two primary components involved in reconstructing an image from the input images in NeRF are the MLP and the volume rendering module, both of which are differentiable. This differentiability enables the use of backpropagation to optimize the system. Based on the calculated loss (e.g., pixel-wise Mean Squared Error), the gradient is propagated back through the volume rendering process to the MLP. 
The weights of the MLP are updated iteratively until the loss converges and the MLP is effectively trained to encode the\u00a0scene.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AfEzjpcqfCMkOZKvhF3t82A.jpeg?ssl=1\"><figcaption>Understanding NeRF\u200a\u2014\u200aStep 11, Optimization\u200a\u2014\u200aImage by\u00a0Author<\/figcaption><\/figure>\n<p>The NeRF paper enhances performance with techniques like <strong>stratified sampling<\/strong>, <strong>positional encoding<\/strong>, and <strong>separate dependencies for volumetric density (\u03c3) and emitted color (c)<\/strong>. Stratified sampling ensures robust ray integration, while positional encoding captures high-frequency details by mapping inputs to a higher-dimensional space. \u03c3 depends only on spatial position (x), modeling scene geometry, whereas c depends on both position (x) and viewing direction (d), capturing view-dependent effects like reflections. As this is a beginner\u2019s guide, the article will not delve into the details of these techniques, but they can be explored further in the original\u00a0paper.<\/p>\n<h4><strong>Step 12: Rendering an image from a novel viewpoint (inference)<\/strong><\/h4>\n<p>Now that we have a trained, scene-specific MLP that overfits to the scene under consideration, we can render 2D images from novel viewpoints. This is achieved by casting rays through each pixel of the target view, sampling points along these rays, and feeding their 3D coordinates and viewing directions into the MLP. The MLP predicts the volumetric density (\u03c3) and emitted color (c) for each sampled point, which are then aggregated using the classical volume rendering equation to compute the final pixel color. 
By repeating this process for every pixel in the image, the full 2D image is reconstructed, producing a photo-realistic rendering of the scene from the novel\u00a0view.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Arh8IhXkmLOdo5x4Xss_vng.jpeg?ssl=1\"><figcaption>Understanding NeRF\u200a\u2014\u200aStep 12, Rendering image from different viewpoint\u200a\u2014\u200aImage by\u00a0Author<\/figcaption><\/figure>\n<p>As opposed to other ML approaches that aim for a generalizable model (such as foundation models) able to solve a wide range of problems, the MLP in NeRF is trained for specificity. The MLP is overfitted to work only for the given scene (for example, an object under consideration).<\/p>\n<h4>Summary:<\/h4>\n<p>This article provides a visual guide to understanding NeRF for beginners. The article breaks down NeRF\u2019s workflow into 12 simple, easy-to-follow steps. Here\u2019s a\u00a0summary:<\/p>\n<ol>\n<li>\n<strong>Input<\/strong>: NeRF requires multi-view images of a scene, along with their corresponding camera matrices.<\/li>\n<li>\n<strong>Sampling<\/strong>: Start by selecting an image and its camera matrix to begin the\u00a0process.<\/li>\n<li>\n<strong>Pixel Iteration<\/strong>: For each pixel in the image, repeat the following steps.<\/li>\n<li>\n<strong>Ray Casting<\/strong>: Cast a ray r from the camera center through the pixel, as defined by the camera\u00a0matrix.<\/li>\n<li>\n<strong>Ray Marching<\/strong>: Sample n points along the ray r, between a near and a far\u00a0plane.<\/li>\n<li>\n<strong>Input to the MLP<\/strong>: Construct n 5D vectors, each containing the sampled position (x, y, z) and viewing direction (\u03b8, \u03c6), and feed them into the\u00a0MLP.<\/li>\n<li>\n<strong>MLP Output<\/strong>: The MLP predicts the color (r, g, b) and volumetric density \u03c3 for each sampled\u00a0point.<\/li>\n<li>\n<strong>Pixel Reconstruction<\/strong>: Use 
differentiable volume rendering to combine the predicted color and density of the sampled points to reconstruct the pixel\u2019s\u00a0color.<\/li>\n<li>\n<strong>Image Reconstruction<\/strong>: Iterate over all pixels to predict the entire\u00a0image.<\/li>\n<li>\n<strong>Loss Calculation<\/strong>: Compute the reconstruction loss between the predicted image and the ground truth input\u00a0image.<\/li>\n<li>\n<strong>Optimization<\/strong>: Leverage the differentiable nature of all components to use backpropagation, training the MLP to overfit the scene for all input\u00a0views.<\/li>\n<li>\n<strong>Rendering from Novel Viewpoints<\/strong>: Query the trained MLP to generate pixel colors for a new viewpoint and reconstruct the\u00a0image.<\/li>\n<\/ol>\n<p><strong>If this article was helpful to you or you want to learn more about Machine Learning and Data Science, follow <\/strong><a href=\"https:\/\/medium.com\/u\/a7cc4f201fb5\"><strong>Aqeel Anwar<\/strong><\/a><strong>, or connect with me on <\/strong><a href=\"https:\/\/www.linkedin.com\/in\/aqeelanwarmalik\/\"><strong><em>LinkedIn<\/em><\/strong><\/a><strong><em>. 
<\/em>You can also subscribe to my mailing\u00a0list.<\/strong><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/a-12-step-visual-guide-to-understanding-nerf-representing-scenes-as-neural-radiance-fields-24a36aef909a\">A 12-step visual guide to understanding NeRF (Representing Scenes as Neural Radiance Fields)<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p>Aqeel Anwar<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fa-12-step-visual-guide-to-understanding-nerf-representing-scenes-as-neural-radiance-fields-24a36aef909a\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A 12-step visual guide to understanding NeRF (Representing Scenes as Neural Radiance Fields) NeRF overview\u200a\u2014\u200aImage by\u00a0Author A Beginner\u2019s 12-Step Visual Guide to Understanding NeRF: Neural Radiance Fields for Scene Representation and View Synthesis A basic understanding of NeRF\u2019s workings through visual representations Who should read this\u00a0article? 
This article aims to provide a basic beginner level [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,1326,221,70,1327],"tags":[1328,1329,1236],"class_list":["post-1232","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-computer-graphics","category-computer-vision","category-machine-learning","category-nerf","tag-nerf","tag-scene","tag-visual"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1232"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1232"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1232\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}