This project explores how to generate convincing pictures from new viewing directions using Neural Radiance Fields.
The model consists of a positional encoding layer that converts a B x 2 tensor of coordinates into a tensor of positional encodings. Each coordinate (x, y) is encoded separately and then concatenated into the final B x (4L + 2) output tensor. This positional encoding layer is followed by three hidden linear layers of size 256 and a final output linear layer of size 3. Each linear layer is followed by a nn.ReLU activation, except for the last one, which is instead followed by a nn.Sigmoid activation in order to clamp the output RGB pixel values to lie between 0 and 1.
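The encoding layer described above can be sketched as follows. This is a minimal illustrative version, not the project's exact code; the function name and the use of `torch.pi`-scaled frequencies are assumptions consistent with the standard sin/cos positional encoding:

```python
import torch

def positional_encoding(x, L):
    """Encode a B x D coordinate tensor into a B x (2*L*D + D) tensor.

    Each coordinate is kept as-is and augmented with sin/cos pairs at
    L frequencies, so for D = 2 the output width is 4L + 2.
    (Illustrative sketch, not the project's actual implementation.)
    """
    out = [x]
    for i in range(L):
        freq = (2.0 ** i) * torch.pi
        out.append(torch.sin(freq * x))
        out.append(torch.cos(freq * x))
    return torch.cat(out, dim=-1)
```

For L = 10 and 2D inputs this produces the B x 42 tensor referenced above.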
Varying the learning rate produced the following loss and PSNR curves. I tried changing the original learning rate of 0.01 by factors of 10, but none of the other learning rates achieved a lower loss or higher PSNR. When the learning rate was increased to 0.1, the model failed to converge at all, and in the other cases it converged much more slowly compared to lr=0.01. In the end, a learning rate of 0.01 achieved the best PSNR of 26.39.
| Loss | PSNR |
|---|---|
| ![]() | ![]() |
Varying the length of the positional encoding produced the following loss and PSNR curves. I tried encoding lengths of 5, 10, 15, 20, and 25. As the encoding length increased from 5 to 10, the PSNR increased, but as encoding length increased from 10 to 25, the loss increased again and the PSNR decreased. L=10 produced the best final PSNR of 26.54, although L=15 produced similarly good results.
| Loss | PSNR |
|---|---|
| ![]() | ![]() |
The final result was generated using lr=0.01 and L=10 with the original model described above, trained over 3000 iterations. Here are the output images observed every 100 iterations in the first 1000 iterations.
| Original Image | Final Result |
|---|---|
| ![]() | ![]() |
A hyperparameter sweep was performed to find the best combination of learning rate and positional encoding length, and produced the following loss and PSNR curves. From the sweep, lr=0.001 and L=10 appear to be the best combination, producing a PSNR of 23.47.
| Loss | PSNR |
|---|---|
| ![]() | ![]() |
However, because the image was quite large, I ended up having to train the model over 100_000 iterations with lr=0.001 and L=10 in order to achieve a PSNR of 25.85:
| Loss | PSNR |
|---|---|
| ![]() | ![]() |
Here are the output images observed at every 100 iterations in the first 1000 iterations. Just like with the image of the fox, there weren't huge changes past the first 1000 iterations other than the image gradually becoming sharper.
| Original Image | Final Result |
|---|---|
| ![]() | ![]() |
The transform function is just a simple torch.matmul between the batched 3D homogeneous coordinates x_c and the camera-to-world matrix c2w. However, because the rows of x_c are the coordinates, transform right-multiplies x_c by the transpose of c2w.
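A minimal sketch of this row-vector convention (illustrative, not the project's exact code):

```python
import torch

def transform(x_c, c2w):
    """Apply a 4 x 4 transform to row-vector homogeneous points.

    x_c: B x 4 points stored as rows, c2w: 4 x 4 matrix.
    Because the points are rows, we right-multiply by the transpose,
    which is equivalent to c2w @ x for column vectors x.
    """
    return torch.matmul(x_c, c2w.T)
```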
The pixel_to_camera(K, uvs, s) function takes in a B x 2 matrix of pixel locations uvs and first turns it into a B x 3 matrix of 2D homogeneous coordinates. It then scales these coordinates by the depth, s_uvs = uvs * s, and finally calls transform(s_uvs, K_inv.T) to get the camera-space vectors at depth s. The matrix K is the camera's intrinsic matrix and is built once at the start of the script using data['focal'] and o_x = o_y = 100. K_inv is computed using torch.inverse.
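The same computation can be written standalone, without the transform helper. This is a sketch under the assumption that pixel values in [0, 1] map through the usual pinhole model x_c = s * K^{-1} [u, v, 1]^T:

```python
import torch

def pixel_to_camera(K, uvs, s):
    """Lift B x 2 pixel coordinates to camera-space points at depth s.

    K: 3 x 3 intrinsic matrix. (Illustrative sketch of the step
    described above, not the project's exact code.)
    """
    B = uvs.shape[0]
    uvs_h = torch.cat([uvs, torch.ones(B, 1)], dim=-1)  # B x 3 homogeneous pixels
    s_uvs = uvs_h * s                                   # scale by depth
    K_inv = torch.inverse(K)
    return s_uvs @ K_inv.T                              # row-vector form of K_inv @ uv
```

For example, with f = 100 and o_x = o_y = 100, the principal-point pixel (100, 100) at depth 2 maps to the camera-space point (0, 0, 2).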
In pixel_to_ray(K, c2ws, uvs), the camera's world-space coordinates ray_o are extracted from the batched c2ws matrices through ray_o = c2ws[:, :3, -1]. This is equivalent to computing ray_o = -inv(R_3x3) @ t from the rotation and translation of the world-to-camera matrix. Then, camera ray endpoints c_ray for each point in uvs are calculated with a placeholder depth of s=1 using c_ray = pixel_to_camera(K, uvs, 1). The endpoints c_ray are transformed into world-space coordinates using X_w = torch.bmm(c_ray, c2ws.transpose(1, 2)). Then, ray_o is subtracted from X_w, and this difference is divided by its norm to produce the ray's direction unit-vector ray_d.
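Putting the steps together, a self-contained sketch of this function might look like the following (the pixel lift is inlined here so the snippet runs on its own; the project's version calls pixel_to_camera instead):

```python
import torch

def pixel_to_ray(K, c2ws, uvs):
    """Pixel coordinates -> world-space ray origins and unit directions.

    K: 3 x 3 intrinsics, c2ws: B x 4 x 4 camera-to-world, uvs: B x 2.
    (Illustrative sketch of the steps described above.)
    """
    B = uvs.shape[0]
    rays_o = c2ws[:, :3, -1]                            # world-space camera origins
    # Lift pixels to camera space at placeholder depth s = 1.
    uvs_h = torch.cat([uvs, torch.ones(B, 1)], dim=-1)  # B x 3 homogeneous pixels
    x_c = uvs_h @ torch.inverse(K).T                    # B x 3 camera-space endpoints
    # Transform the endpoints from camera space to world space.
    x_c_h = torch.cat([x_c, torch.ones(B, 1)], dim=-1)  # B x 4 homogeneous points
    x_w = torch.bmm(x_c_h.unsqueeze(1), c2ws.transpose(1, 2)).squeeze(1)[:, :3]
    rays_d = x_w - rays_o
    rays_d = rays_d / rays_d.norm(dim=-1, keepdim=True)  # unit direction vectors
    return rays_o, rays_d
```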
In order to batch the sampling process in NeRFDataloader, three random B x 1 vectors are generated: i, y, and x. First, i contains values in the range [0, n) where n is the number of images in the dataset. y contains values in the range [0, height) and x contains values in the range [0, width). The ground truth pixel values are sampled using values = self.images[i, y, x], and the corresponding camera-to-world matrices are sampled using c2ws = self.c2ws[i]. The desired pixel coordinates x and y are concatenated together to form uvs, which is then passed along with c2ws into pixel_to_ray to generate batched rays_o and rays_d. In the end, rays_o, rays_d, and values are returned as a batch.
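The random sampling step can be sketched as below. The function name and return shape are assumptions for illustration; the actual loader additionally converts uvs into rays as described:

```python
import torch

def sample_batch(images, c2ws, batch_size):
    """Randomly sample pixels across all images of a dataset.

    images: n x H x W x 3 tensor, c2ws: n x 4 x 4 tensor.
    Returns B x 2 pixel coords, B x 4 x 4 c2ws, and B x 3 ground-truth
    colors. (Illustrative sketch of the NeRFDataloader sampling step.)
    """
    n, h, w, _ = images.shape
    i = torch.randint(0, n, (batch_size,))            # which image
    y = torch.randint(0, h, (batch_size,))            # which row
    x = torch.randint(0, w, (batch_size,))            # which column
    values = images[i, y, x]                          # B x 3 ground-truth colors
    uvs = torch.stack([x, y], dim=-1).float()         # B x 2 pixel coordinates
    return uvs, c2ws[i], values
```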
In sample_along_rays, the distances t along the rays to sample are generated using torch.linspace(near, far, n_samples). If perturb=True, then t is additionally jittered with some noise = torch.rand(batch_size, n_samples, 1) through t = t + noise * t_width. Finally, the batched samples along each ray are calculated using rays_o + rays_d * t and returned.
For convenience, the deltas between consecutive values in t are also returned. First, diff = t[:, 1:] - t[:, :-1] is calculated and then a column of t_width = (far - near) / n_samples values is appended to keep the dimensions of deltas consistent with n_samples and easily broadcastable later on.
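The sampling and delta computation described in the previous two paragraphs can be sketched together (signature and default near/far values are illustrative assumptions):

```python
import torch

def sample_along_rays(rays_o, rays_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    """Sample 3D points along each ray, optionally jittered within bins.

    rays_o, rays_d: B x 3 tensors. Returns points (B x N x 3) and
    deltas (B x N x 1). (Illustrative sketch of the steps above.)
    """
    B = rays_o.shape[0]
    t_width = (far - near) / n_samples
    t = torch.linspace(near, far, n_samples).expand(B, n_samples).unsqueeze(-1)
    if perturb:
        # jitter each sample by a random fraction of the bin width
        t = t + torch.rand(B, n_samples, 1) * t_width
    points = rays_o.unsqueeze(1) + rays_d.unsqueeze(1) * t  # B x N x 3
    # per-segment lengths; pad the last column with the nominal bin width
    diff = t[:, 1:] - t[:, :-1]
    deltas = torch.cat([diff, torch.full((B, 1, 1), t_width)], dim=1)
    return points, deltas
```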
With all the above functions implemented, the NeRFDataloader class was simple to implement. It is initialized with images and c2ws from the desired dataset, along with a length and batch_size. It implements __iter__ and stops iteration when it has sampled and returned length number of batches.
With the visualization code, the NeRFDataloader was able to produce the following plot with perturb=True. Because perturbation adds a random offset to each value in t, the last values in t get pushed beyond far whereas the plotted rays stop exactly at far, so some points appear to be floating off the end of the ray.
The network was split into several smaller subnets at each of the concatenation points using nn.Sequential as follows:
The PositionalEncoding module from Part 1 was modified to support any number of dimensions. The inputs x and rd are encoded using the 3D positional encoding to get x_enc and rd_enc respectively. Then, x_enc is passed through ffn1 to produce out1, which is concatenated with the original positional encoding using out1 = torch.concat([out1, x_enc], dim=-1). This is fed into ffn2 to produce out2. After this, the network splits off into two branches. For the density calculation, out2 is passed into density_ffn to produce a B x 1 tensor of density predictions density_out. For the rgb calculation, out2 is first passed through rgb_ffn1 to get rgb_out1. This is then concatenated with the original positional encoding for ray direction using rgb_out1 = torch.concat([rgb_out1, rd_enc], dim=-1) and passed into rgb_ffn2 to produce a B x 3 tensor of RGB color predictions rgb_out. Both density_out and rgb_out are returned from the forward pass.
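The architecture walked through above can be sketched with nn.Sequential subnets as follows. Layer widths (256, with a narrower 128-wide RGB head) and the exact layer counts are assumptions for illustration; the concatenation points and output heads match the description:

```python
import torch
import torch.nn as nn

def pos_enc(x, L):
    """N-D positional encoding: keep x and add sin/cos pairs at L frequencies."""
    out = [x]
    for i in range(L):
        freq = (2.0 ** i) * torch.pi
        out += [torch.sin(freq * x), torch.cos(freq * x)]
    return torch.cat(out, dim=-1)

class NeRFModel(nn.Module):
    """Sketch of the split architecture described above (widths assumed)."""

    def __init__(self, x_L=10, rd_L=4, hidden=256):
        super().__init__()
        x_dim = 3 + 2 * 3 * x_L    # 63 for L = 10
        rd_dim = 3 + 2 * 3 * rd_L  # 27 for L = 4
        self.x_L, self.rd_L = x_L, rd_L
        self.ffn1 = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.ffn2 = nn.Sequential(nn.Linear(hidden + x_dim, hidden), nn.ReLU())
        self.density_ffn = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())
        self.rgb_ffn1 = nn.Linear(hidden, hidden)
        self.rgb_ffn2 = nn.Sequential(nn.Linear(hidden + rd_dim, hidden // 2),
                                      nn.ReLU(),
                                      nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, x, rd):
        x_enc, rd_enc = pos_enc(x, self.x_L), pos_enc(rd, self.rd_L)
        out1 = self.ffn1(x_enc)
        out2 = self.ffn2(torch.cat([out1, x_enc], dim=-1))   # skip connection
        density_out = self.density_ffn(out2)                 # B x 1 densities
        rgb_out1 = self.rgb_ffn1(out2)
        rgb_out = self.rgb_ffn2(torch.cat([rgb_out1, rd_enc], dim=-1))  # B x 3
        return density_out, rgb_out
```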
The volumetric_render function takes in batched tensors sigmas, rgbs, and deltas. For batch size B and number of samples along the ray N, the sigmas (B x N x 1) and rgbs (B x N x 3) tensors come from the model predictions and represent the densities and corresponding colors at the predicted locations. The deltas (B x N x 1), representing the lengths of each segment along the ray, come from sample_along_rays.
First, for convenience, the sigmas and deltas are multiplied together to get prod = sigmas * deltas, which is reused in the following calculations. Next, the exponents of the T_i elements are computed using cumsum = torch.cumsum(prod, dim=1). However, because the exponent of T_i must be the sum of the sigma-delta products up until but excluding i, the current products must be subtracted from the cumulative sum to produce prev_sum = cumsum - prod. Thus, the T matrix containing all T_i's can be calculated using T = torch.exp(-prev_sum). The weight for the color at the current step is calculated using p = 1 - torch.exp(-prod) and the weighted colors at each step along the ray are produced using batched operations like so: colors = T * p * rgbs. Lastly, the weighted colors are summed together for each ray to produce a B x 3 tensor of colors corresponding to each of the B original rays.
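The steps above translate directly into a short function (a sketch following the description, not necessarily the project's exact code):

```python
import torch

def volumetric_render(sigmas, rgbs, deltas):
    """Discrete volume rendering along sampled ray segments.

    sigmas: B x N x 1, rgbs: B x N x 3, deltas: B x N x 1.
    Returns a B x 3 tensor of rendered ray colors.
    """
    prod = sigmas * deltas
    cumsum = torch.cumsum(prod, dim=1)
    prev_sum = cumsum - prod           # exclusive prefix sum for T_i
    T = torch.exp(-prev_sum)           # transmittance up to sample i
    p = 1 - torch.exp(-prod)           # probability of terminating at sample i
    return (T * p * rgbs).sum(dim=1)   # B x 3 accumulated colors
```

As a sanity check, a ray with a very dense first sample should render as that sample's color, since all later samples are fully occluded.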
Given rays_o and rays_d, the forward function first calls sample_along_rays to get points and deltas. The points and rays_d are passed into the model, which returns density_preds and color_preds as output. These two tensors are then passed along with deltas into volumetric_render to produce a prediction of the color of each ray.
Rendering an image from a view defined by a c2w matrix combines the functions implemented in the previous parts. First, the x and y positions of each pixel in the desired final image are concatenated together into a tensor uvs. This is passed into pixel_to_ray along with the c2w matrix to produce rays_o and rays_d. These are passed into forward and the result of the volumetric_render on the model's outputs is reshaped into the final image and returned.
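A sketch of this per-view rendering step, with the forward pass and pixel_to_ray passed in as callables so the snippet stands alone (both are assumptions standing in for the functions described earlier):

```python
import torch

def render_image(forward, pixel_to_ray, K, c2w, height, width):
    """Assemble per-pixel rays for one view and reshape the ray colors.

    forward(rays_o, rays_d) -> B x 3 colors; pixel_to_ray as described
    earlier. (Illustrative sketch; a real version would also chunk the
    rays to fit in memory.)
    """
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    uvs = torch.stack([xs.flatten(), ys.flatten()], dim=-1).float()  # (H*W) x 2
    c2ws = c2w.expand(uvs.shape[0], 4, 4)       # same view for every pixel
    rays_o, rays_d = pixel_to_ray(K, c2ws, uvs)
    colors = forward(rays_o, rays_d)            # (H*W) x 3 rendered colors
    return colors.reshape(height, width, 3)     # final image
```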
The model was trained across 5_000 gradient steps with a batch size of 10_000 rays per step. The 3D positions x were encoded using X_ENC_LEN=10 and the ray directions rd were encoded using RD_ENC_LEN=4. The Adam optimizer was used with a learning rate of 5e-4. The model predictions were passed into volumetric_render and nn.MSELoss was used to compare the render results with the ground truth pixel values for each ray.
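Since both MSE loss and PSNR are reported throughout, the conversion between them for pixel values in [0, 1] is worth writing down (this helper is illustrative, not part of the project code):

```python
import torch

def psnr(mse):
    """PSNR in dB for signals scaled to [0, 1]: PSNR = -10 * log10(MSE)."""
    return -10.0 * torch.log10(mse)
```

For instance, an MSE of 0.0032 corresponds to roughly 24.95 dB, consistent with the loss/PSNR pairs quoted below.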
After 20 minutes of training over 5000 gradient steps, the model achieved 0.0032 loss and 24.97 PSNR on the training set, and 0.0028 loss and 25.59 PSNR on the validation set.
| Training Loss | Training PSNR |
|---|---|
| ![]() | ![]() |
| Validation Loss | Validation PSNR |
|---|---|
| ![]() | ![]() |
Here are images of validation-set camera 0 taken at iterations 0, 100, 200, 500, 1000, 2000, and 5000:
Here is a video of the model's predictions on the c2ws from the test set:
After 3 hours of training over 50,000 gradient steps, the model achieved 0.00130 loss and 28.88 PSNR on the training set, and 0.00133 loss and 28.76 PSNR on the validation set.
| Training Loss | Training PSNR |
|---|---|
| ![]() | ![]() |
| Validation Loss | Validation PSNR |
|---|---|
| ![]() | ![]() |
Here is a video of the model's predictions on the c2ws from the test set:
The 50_000-iteration model checkpoint was finetuned with double the number of ray samples (NUM_SAMPLES=128), using the same sampling method described above. After finetuning for 5_000 iterations, the model achieved a training PSNR of 31.75 and a validation PSNR of 30.94:
| Training Loss | Training PSNR |
|---|---|
| ![]() | ![]() |
| Validation Loss | Validation PSNR |
|---|---|
| ![]() | ![]() |
Below is the final result of the finetuned model compared to the result at 50_000 iterations. In the results from the finetuned model, the visual clarity improves only slightly, but the amount of noise across the images is significantly lower and the finer details in the front loader are visibly more stable.
| 50,000 Iterations | Finetuned |
|---|---|
| ![]() | ![]() |
In order to produce a depth map, the volumetric rendering step was modified to use a white-to-black gradient of colors rgbs = torch.linspace(1, 0, num_samples) from near to far along the ray instead of the predicted colors from the model. However, the densities predicted by the model are still used in weighting these depth colors in order to calculate the expected depth color for each ray. In the following result, lighter colors represent positions closer to the camera whereas darker colors represent positions farther away from the camera:
| Original | Depth Map |
|---|---|
| ![]() | ![]() |
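The depth-map variant replaces only the color term of the volumetric render, which can be sketched as follows (an illustrative standalone version; the project modifies its existing render function instead):

```python
import torch

def render_depth(sigmas, deltas, n_samples):
    """Depth-map variant of the volumetric render.

    Replaces the model's predicted colors with a white-to-black ramp
    along the ray, so nearer samples contribute lighter values while
    the model's densities still provide the weighting. (Sketch.)
    """
    depth_colors = torch.linspace(1, 0, n_samples).reshape(1, n_samples, 1)
    prod = sigmas * deltas
    T = torch.exp(-(torch.cumsum(prod, dim=1) - prod))  # transmittance T_i
    p = 1 - torch.exp(-prod)                            # termination probability
    return (T * p * depth_colors).sum(dim=1)            # B x 1 grayscale depths
```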
The background color was injected into the render by weighting the desired background color bg_color by T_{n+1}, the probability that the ray does not terminate between near and far, and adding that weighted color to the output of the volumetric render for each ray. This is accomplished by first calculating bg_weights = torch.exp(-bg_prod), with bg_prod being the sum of all the sigma-delta products within each ray. Then, bg_weights * bg_color is added to the final output of the original volumetric render to produce the results shown below:
| ![]() | ![]() |
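The background injection above can be sketched as an extension of the volumetric render (an illustrative standalone version, with the base render inlined):

```python
import torch

def volumetric_render_with_bg(sigmas, rgbs, deltas, bg_color):
    """Volumetric render plus a background color weighted by T_{n+1}.

    bg_color: tensor of shape (3,). Each ray picks up bg_color with the
    probability that it never terminates between near and far. (Sketch.)
    """
    prod = sigmas * deltas
    T = torch.exp(-(torch.cumsum(prod, dim=1) - prod))
    p = 1 - torch.exp(-prod)
    colors = (T * p * rgbs).sum(dim=1)   # B x 3 output of the base render
    bg_prod = prod.sum(dim=1)            # B x 1 total sigma-delta mass per ray
    bg_weights = torch.exp(-bg_prod)     # T_{n+1}: ray escapes the volume
    return colors + bg_weights * bg_color
```

A fully transparent ray (all densities zero) then renders exactly as the background color.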