Understanding Neural Radiance Fields (NeRFs)
Imagine being able to generate photorealistic 3D models of objects and scenes that can be viewed from any angle, with details so realistic that they are indistinguishable from reality. That's what Neural Radiance Fields (NeRFs) are capable of, and much more. With more than 50 NeRF-related papers at CVPR 2022, the original NeRF paper has become one of the most influential works in computer vision in recent years.
Neural fields
A neural field is a neural network that parametrizes a signal. In our case, this signal is a single 3D scene or object. Each network is trained to encode (capture) exactly one scene, and, unlike standard machine learning, the objective is to overfit the network to that particular scene. Essentially, a neural field embeds the scene into the weights of the network.
In computer graphics, 3D scenes are usually stored as voxel grids or polygon meshes. However, voxel grids are costly to store and polygon meshes are limited to representing hard surfaces. Neural fields are gaining popularity because they are efficient and compact representations of objects or scenes that are differentiable, continuous, and can have arbitrary dimensions and resolution. Neural radiance fields are a special case of neural fields that solve the view-synthesis problem.
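To make the idea concrete, here is a minimal sketch (our own illustration, not code from the paper) of a coordinate-based MLP that maps a 3D location to an RGB value; fitting it to a single scene stores that scene in its weights:

```python
import torch
import torch.nn as nn

# A tiny coordinate-based network: 3D location in, RGB value out.
# Training it to reproduce one scene "overfits" the scene into its weights.
neural_field = nn.Sequential(
    nn.Linear(3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 3), nn.Sigmoid(),
)

xyz = torch.rand(16, 3)      # a batch of query coordinates
rgb = neural_field(xyz)      # predicted signal values at those coordinates
```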
Neural Radiance Fields (NeRFs)
NeRFs, as proposed by Mildenhall et al., accept a single continuous 5D coordinate as input, consisting of a spatial location $(x, y, z)$ and a viewing direction $(\theta, \phi)$, and output the emitted color $\mathbf{c} = (r, g, b)$ and the volume density $\sigma$ at that location.
The network's weights are optimized to encode the representation of the scene so that the model can render novel views of it from any viewpoint in space.
Ray Marching
To gain a better grasp of the different stages of NeRF training, let's use the following 3D scene as a running example.
The training dataset consists of images of the scene (our ground truth) together with the camera pose from which each image was taken.
Now, for each camera pose, we "shoot" a ray from the camera (or viewer's eye) through every pixel of the image, resulting in $H \times W$ rays per image. Each ray is described by two quantities:
- $\mathbf{o}$: a vector denoting the origin of the ray.
- $\mathbf{d}$: a normalized vector denoting the direction of the ray.
An arbitrary point on the ray can then be defined as $\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$, where the scalar parameter $t$ controls how far along the ray the point lies.
This process of ray tracing is known as backward tracing, as it involves tracing the path of light rays from the camera to the object, as opposed to tracing from the light source to the object.
Input: A set of camera poses
Output: A bundle of rays for every pose
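To make this concrete, here is a minimal sketch of how the ray bundle for a single pose could be generated. It assumes a simple pinhole camera model in which the focal length `focal` (in pixels) and the 4x4 camera-to-world matrix `c2w` come from the dataset; the function name and details are ours, not the authors' code.

```python
import numpy as np

def get_rays(H, W, focal, c2w):
    """Generate one ray (origin, direction) per pixel of an H x W image."""
    # Pixel grid, offset to the pixel centers.
    i, j = np.meshgrid(np.arange(W, dtype=np.float32) + 0.5,
                       np.arange(H, dtype=np.float32) + 0.5, indexing="xy")
    # Ray directions in the camera frame (camera looks down the -z axis).
    dirs = np.stack([(i - W * 0.5) / focal,
                     -(j - H * 0.5) / focal,
                     -np.ones_like(i)], axis=-1)
    # Rotate the directions into the world frame and normalize them.
    rays_d = dirs @ c2w[:3, :3].T
    rays_d /= np.linalg.norm(rays_d, axis=-1, keepdims=True)
    # All rays share the camera center as their origin.
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)
    return rays_o, rays_d
```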
Sampling Query points
You may wonder what is done with the rays. We march along each ray from the camera through the scene by varying the parameter $t$ in $\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$, generating a set of 3D query points along the way.
By querying a trained neural network at these 3D points along the viewing ray, we can determine if they belong to the object volume and obtain their visual properties to render an image. However, sampling points along a ray is challenging, as too many non-object points won't provide useful information, and focusing only on high-density regions may miss interesting areas.
In our toy example, we sample uniformly along each ray by taking $N$ evenly spaced values of $t$ between a near bound $t_n$ and a far bound $t_f$.
Input: A bundle of rays for every pose
Output: A set of 3D query points for every ray
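As an illustration, and under the same assumptions as the ray sketch above, sampling $N$ evenly spaced query points per ray could look like this (the near/far bounds are placeholder values, not the paper's):

```python
import numpy as np

def sample_query_points(rays_o, rays_d, t_near=2.0, t_far=6.0, n_samples=64):
    """Sample n_samples 3D points uniformly along each ray.

    rays_o and rays_d have shape (..., 3).
    """
    # Evenly spaced depths, shared by every ray.
    t_vals = np.linspace(t_near, t_far, n_samples, dtype=np.float32)
    # r(t) = o + t * d, broadcast over all rays and all depths.
    points = rays_o[..., None, :] + rays_d[..., None, :] * t_vals[:, None]
    return points, t_vals
```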
Positional Encoding
Once we have collected the query points for every ray, we are in principle ready to feed them into our neural network. However, the authors found that the resulting renderings do a poor job of representing the high-frequency variations in color and geometry that make images perceptually sharp and vivid to the human eye.
This observation is consistent with Rahaman et al. who show that deep networks have a tendency to learn lower-frequency functions. They claim that mapping the inputs to a higher dimensional space using high-frequency functions before passing them to the network enables better fitting of data that contains high-frequency variation.
The authors therefore use a positional encoding built from sine and cosine functions of exponentially increasing frequencies:

$$\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right)$$

The function $\gamma(\cdot)$ is applied separately to each of the three coordinates of the spatial location $\mathbf{x}$ (with $L = 10$) and to the three components of the viewing direction $\mathbf{d}$ (with $L = 4$).
Input: A set of 3D query points for every ray
Output: Embeddings of query points
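A minimal implementation of this encoding, applied independently to each input coordinate, might look as follows (whether the raw input is also concatenated to the encoding varies between implementations; it is omitted here):

```python
import numpy as np

def positional_encoding(x, n_freqs=10):
    """Map each coordinate of x to sin(2^k * pi * x), cos(2^k * pi * x), k = 0..n_freqs-1.

    x has shape (..., D); the output has shape (..., 2 * D * n_freqs).
    """
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi       # frequencies 2^k * pi
    scaled = x[..., None] * freqs                     # shape (..., D, n_freqs)
    encoded = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return encoded.reshape(*x.shape[:-1], -1)         # flatten per-point features
```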
Neural Network inference
To achieve multiview consistency, the network is restricted to predict the volume density $\sigma$ as a function of the spatial location $\mathbf{x}$ alone, while the RGB color $\mathbf{c}$ is allowed to depend on both the location and the viewing direction $\mathbf{d}$.
The MLP architecture consists of 8 fully-connected layers, each with 256 channels and ReLU activations. A skip connection concatenates the input to the fifth layer's activation. The network takes the encoded query points $\gamma(\mathbf{x})$ as input and outputs the volume density $\sigma$ together with a 256-dimensional feature vector.
This feature vector is then concatenated with the embedded viewing direction $\gamma(\mathbf{d})$ and passed through one additional fully-connected layer with 128 channels, which outputs the view-dependent RGB color.
Both of these pieces of information combined allow us to compute the volume density profile for every ray as shown in the figure below.
*Figure: Example volume density profile of a single ray, output from a trained network that has learned to represent the yellow Lego bulldozer.*
Input: Embeddings of query points and viewing directions
Output: RGB color and volume density for every query point
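Below is a hedged sketch of this architecture in PyTorch. The input widths assume the encodings above ($L = 10$ for positions, $L = 4$ for directions), and the exact placement of the skip connection differs slightly between public implementations:

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch of the NeRF MLP: 8 x 256-unit layers with a skip connection,
    a density head, and a view-dependent color head."""

    def __init__(self, pos_dim=60, dir_dim=24, hidden=256):
        super().__init__()
        # First 8 layers operate on the encoded position gamma(x).
        self.layers = nn.ModuleList()
        for i in range(8):
            in_dim = pos_dim if i == 0 else hidden
            if i == 4:                          # skip connection: re-inject gamma(x)
                in_dim = hidden + pos_dim
            self.layers.append(nn.Linear(in_dim, hidden))
        self.sigma_head = nn.Linear(hidden, 1)          # volume density
        self.feature = nn.Linear(hidden, hidden)        # 256-d feature vector
        self.color_head = nn.Sequential(                # view-dependent color
            nn.Linear(hidden + dir_dim, 128), nn.ReLU(), nn.Linear(128, 3)
        )

    def forward(self, x_enc, d_enc):
        h = x_enc
        for i, layer in enumerate(self.layers):
            if i == 4:
                h = torch.cat([h, x_enc], dim=-1)
            h = torch.relu(layer(h))
        sigma = torch.relu(self.sigma_head(h))          # density is non-negative
        feat = self.feature(h)
        rgb = torch.sigmoid(self.color_head(torch.cat([feat, d_enc], dim=-1)))
        return rgb, sigma
```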
Volume Rendering
Now, it's time to turn this volume density profile along a ray into an RGB pixel value. Once we have computed the pixel values for all the rays, we will have a full $H \times W$ rendered image for the given camera pose.
The volume rendering process involves computing the accumulated radiance along the viewing ray as it passes through the neural radiance field. This is done by integrating the radiance values at each sampled point along the ray, weighted by the transmittance, i.e., the fraction of light that survives the medium between the camera and that point.
The expected color of a camera ray $\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$ with near and far bounds $t_n$ and $t_f$ is

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$
These complex-looking integrals can be approximated via numerical quadrature. The authors use a stratified sampling approach: the interval $[t_n, t_f]$ is partitioned into $N$ evenly spaced bins and one quadrature point is drawn uniformly at random from each bin,

$$t_i \sim \mathcal{U}\!\left[t_n + \frac{i-1}{N}(t_f - t_n),\; t_n + \frac{i}{N}(t_f - t_n)\right]$$
The use of stratified sampling allows for a continuous representation of the scene, despite using a discrete set of samples to estimate the integral. This is because the MLP is evaluated at continuous positions during optimization.
The approximated color of each ray is then computed as

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right)\mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),$$

where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples.
Input: RGB color and volume density for every query point
Output: Rendered images
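The quadrature above can be implemented in a few lines. The sketch below assumes the per-ray colors and densities come from the MLP sketch earlier, and treats the last interval along each ray as effectively infinite, as is common in open-source NeRF implementations:

```python
import torch

def render_rays(rgb, sigma, t_vals, ray_dirs):
    """Numerical quadrature of the volume rendering integral.

    rgb: (n_rays, n_samples, 3), sigma: (n_rays, n_samples),
    t_vals: (n_samples,), ray_dirs: (n_rays, 3).
    """
    # Distances between adjacent samples; the last interval is treated as infinite.
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, torch.full((1,), 1e10)])
    # Scale by the ray direction norm in case directions are not unit length.
    deltas = deltas[None, :] * ray_dirs.norm(dim=-1, keepdim=True)
    # alpha_i = 1 - exp(-sigma_i * delta_i): probability the ray stops in segment i.
    alpha = 1.0 - torch.exp(-sigma * deltas)
    # T_i = prod_{j < i} (1 - alpha_j): transmittance up to segment i.
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = trans * alpha
    # Expected color per ray.
    return (weights[..., None] * rgb).sum(dim=-2)
```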
Computing loss
The final step is to calculate the loss between the rendered image and the ground-truth image for each viewpoint in the dataset. This loss is a simple squared L2 loss between each pixel of the rendered image and the corresponding pixel of the ground truth, $\mathcal{L} = \sum_{\mathbf{r}} \big\|\hat{C}(\mathbf{r}) - C(\mathbf{r})\big\|_2^2$, summed over the rays in a batch.
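As a small, self-contained illustration (with random tensors standing in for real renders and ground-truth pixels), the loss and its backward pass look like this:

```python
import torch

def nerf_photometric_loss(rendered_rgb, true_rgb):
    """Mean squared error between rendered and ground-truth pixel colors."""
    return torch.mean((rendered_rgb - true_rgb) ** 2)

# Toy usage: random "pixels" in place of real renders and ground truth.
rendered = torch.rand(1024, 3, requires_grad=True)
target = torch.rand(1024, 3)
loss = nerf_photometric_loss(rendered, target)
loss.backward()   # in practice, gradients flow back through the renderer and MLP
```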
Credits: While writing this blog, I referred to dtransposed's blog post (the images are also from this post), and AI Summer's post.