The StyleTransfer model takes four parameters: the content image c_img, the style image s_img, the layers used to calculate the content loss c_layers, and those used to calculate the style loss s_layers. First, to preserve the proportions of the content image, s_img is resized to the same size as c_img using torchvision.transforms.Resize. Following the methods in the paper, the VGG-19 weights (pretrained on ImageNet) are loaded from PyTorch's torchvision library with model = vgg19(weights=VGG19_Weights.IMAGENET1K_V1). Two types of loss probes, ContentLossProbe and StyleLossProbe (each implemented as an nn.Module), are then inserted after the layers specified in c_layers and s_layers, respectively. This is accomplished by iterating through the layers of the original VGG-19 model, checking whether each layer is one we want to probe, and inserting the appropriate probe after it. Any layers after the last probed layer are discarded to save computation time. Finally, because we want to train the image rather than the model weights, I froze the model weights using model.requires_grad_(False).
Each probe is initialized with the activation of the layer it probes (the layer immediately preceding it), which is later used in calculating the two types of losses. For example, to insert a ContentLossProbe after layer N, the content image c_img is fed through all the layers up to and including N, and the output content_features at layer N is used to initialize the ContentLossProbe. The same process initializes the StyleLossProbe instances using s_img.
In the forward pass of a ContentLossProbe, the output from the previous layer x and the (constant) extracted features from the content image content_features are used in a simple MSE loss function to compute the content loss at this layer of the model. (I found this to converge much faster than the loss = torch.sum((x - content_features) ** 2) / 2 proposed in the paper.) This loss is stored in the probe's self.loss variable and x is returned so that the probe acts like an identity layer.
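A minimal sketch of the content probe described above, assuming the MSE is computed with torch.nn.functional.mse_loss; the probe records the loss and passes its input through unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentLossProbe(nn.Module):
    def __init__(self, content_features):
        super().__init__()
        # detach: the content activations are constants, not part of the graph
        self.register_buffer("content_features", content_features.detach())
        self.loss = None

    def forward(self, x):
        self.loss = F.mse_loss(x, self.content_features)
        return x  # act as an identity layer
```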
In the forward pass of a StyleLossProbe, there is an additional step of computing Gram matrices. From an input tensor x of size (B, C, W, H) (with batch size B = 1 here, so the batch dimension can be ignored), the Gram matrix is computed by first flattening the feature maps into feature vectors using flat = x.view(C, W * H). Then, the feature correlations are computed with a matrix multiplication: corr = torch.matmul(flat, flat.transpose(0, 1)). Lastly, the feature correlation matrix is divided by a factor of C * W * H and returned as the Gram matrix. To compute the style loss, both x and style_features are passed into get_gram_matrix to produce two Gram matrices, x_gram and s_gram. These are then passed into the MSE loss function to calculate the final style loss, which is stored in the probe's self.loss variable. (I found MSE in general to be much better at converging compared to the loss = torch.sum((x_gram - s_gram) ** 2) / (4 * W * W * H * H) proposed in the paper, which did not produce visually appealing results.)
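The Gram-matrix computation and style probe can be sketched as follows, again assuming a batch size of 1. One small liberty taken here: since style_features is constant, its Gram matrix is precomputed once in the constructor rather than on every forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def get_gram_matrix(x):
    B, C, W, H = x.size()  # assumes B == 1
    flat = x.view(C, W * H)                        # one feature vector per channel
    corr = torch.matmul(flat, flat.transpose(0, 1))
    return corr / (C * W * H)                      # normalize by tensor size

class StyleLossProbe(nn.Module):
    def __init__(self, style_features):
        super().__init__()
        # precompute the (constant) style Gram matrix
        self.register_buffer("s_gram", get_gram_matrix(style_features.detach()))
        self.loss = None

    def forward(self, x):
        x_gram = get_gram_matrix(x)
        self.loss = F.mse_loss(x_gram, self.s_gram)
        return x  # act as an identity layer
```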
During training, after the final image is fed forward through the model, the total content and style losses are computed by summing together all the losses from their respective probes and weighting them by w_l, which was set to 1 / num_probes in the paper and is equivalent to taking the average. These totals c_loss and s_loss are weighted by their respective factors CONTENT_WEIGHT and STYLE_WEIGHT and then summed together to produce the final total_loss.
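The aggregation above can be sketched as a short helper; the function name total_style_transfer_loss is hypothetical, and c_probes / s_probes are assumed to be lists of the probes collected after the forward pass.

```python
def total_style_transfer_loss(c_probes, s_probes, content_weight, style_weight):
    # w_l = 1 / num_probes, i.e. average the per-probe losses
    c_loss = sum(p.loss for p in c_probes) / len(c_probes)
    s_loss = sum(p.loss for p in s_probes) / len(s_probes)
    return content_weight * c_loss + style_weight * s_loss
```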
The final image was initialized by cloning the content image and adding varying amounts of white noise to it. I originally tried starting from a pure white-noise image, but Adam had trouble replicating the structure of the content image. (LBFGS did not have this issue; see the next section for more details.) The pixel values of the image were optimized using Adam with learning rates determined from hyperparameter sweeps for each style-content pair. Every image was optimized across 1000 iterations to produce the following results:
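The optimization loop can be sketched as follows. The name optimize_image is hypothetical, and to keep the example self-contained the objective here is a toy MSE toward a target rather than the probed VGG-19 loss described above.

```python
import torch

def optimize_image(image, loss_fn, lr=0.01, iterations=200):
    # optimize the pixel values, not any model weights
    image = image.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(iterations):
        optimizer.zero_grad()
        loss = loss_fn(image)
        loss.backward()
        optimizer.step()
    return image.detach()

# Toy usage: pull a noisy "image" toward a target with an MSE objective.
torch.manual_seed(0)
target = torch.zeros(1, 3, 8, 8)
noisy = torch.randn(1, 3, 8, 8)
result = optimize_image(noisy, lambda img: ((img - target) ** 2).mean(), lr=0.05)
```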
| Content Images →<br>Style Images ↓ | Ballerina | Tübingen, Germany |
|---|---|---|
| Femme nue assise, Picasso | ![]() | ![]() |
| The Starry Night, Vincent van Gogh | ![]() | ![]() |
| The Shipwreck of the Minotaur, J. M. W. Turner | ![]() | ![]() |
Here are some videos that show how the images change over the training process:
| Content Images →<br>Style Images ↓ | Ballerina | Tübingen, Germany |
|---|---|---|
| Femme nue assise, Picasso | ![]() | ![]() |
| The Starry Night, Vincent van Gogh | ![]() | ![]() |
| The Shipwreck of the Minotaur, J. M. W. Turner | ![]() | ![]() |
One of the authors recommended using LBFGS instead of Adam as the optimizer (link), so I gave that a try. Unlike Adam, LBFGS was able to optimize the image starting from either complete white noise or the original content image with some added noise, producing similar results. In the following results, I initialized the final image to be the content image c_img with some added white noise determined by config['noise'], which I found to be much more consistent than starting from complete white noise. Overall, LBFGS produced far more stylistic results than Adam and more accurately captured the textures of the style image. The following results were from 600 iterations using conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 as the style layers and conv4_2 as the content layer. The style and content weights were tuned until the generated image was visually appealing. (The exact values are omitted for visual clarity; please see the .json files for the exact weights and other training configurations.)
As a side note, I learned that LBFGS tries to approximate the Hessian of the loss function in order to take better steps, which makes it far less efficient for higher-dimensional optimization than Adam. However, because we are optimizing a small image and not a deep neural network, the dimensionality of the problem should intuitively be low enough for LBFGS to outperform Adam.
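One practical difference worth noting: unlike Adam, torch.optim.LBFGS requires a closure that re-evaluates the loss, because it performs multiple function evaluations per step. A minimal sketch on a toy MSE objective (not the style-transfer loss itself):

```python
import torch

torch.manual_seed(0)
image = torch.randn(1, 3, 8, 8, requires_grad=True)
target = torch.zeros(1, 3, 8, 8)
optimizer = torch.optim.LBFGS([image], max_iter=20)

def closure():
    # LBFGS calls this repeatedly to re-evaluate the loss and gradients
    optimizer.zero_grad()
    loss = ((image - target) ** 2).mean()
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)
```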
| Content Images →<br>Style Images ↓ | Ballerina | Tübingen, Germany |
|---|---|---|
| Femme nue assise, Picasso | ![]() | ![]() |
| The Starry Night, Vincent van Gogh | ![]() | ![]() |
| The Shipwreck of the Minotaur, J. M. W. Turner | ![]() | ![]() |
| Der Schrei, Edvard Munch | ![]() | ![]() |
| Composition VII, Wassily Kandinsky | ![]() | ![]() |
Here are some videos that show how the images change over the training process:
| Content Images →<br>Style Images ↓ | Ballerina | Tübingen, Germany |
|---|---|---|
| Femme nue assise, Picasso | ![]() | ![]() |
| The Starry Night, Vincent van Gogh | ![]() | ![]() |
| The Shipwreck of the Minotaur, J. M. W. Turner | ![]() | ![]() |
| Der Schrei, Edvard Munch | ![]() | ![]() |
| Composition VII, Wassily Kandinsky | ![]() | ![]() |
The style transfer was performed with style losses on increasing subsets of layers, from [conv1_1] to [conv1_1, conv2_1, conv3_1, conv4_1, conv5_1]. As more style layers are probed, the texture of the style image becomes more apparent. When only conv1_1 is used in computing the style loss, the color transfers easily, but the texture of the background and the shading on the ballerina still roughly resemble those of the content image. However, when layers up to conv5_1 are used, the ballerina and background start to replicate the sharp lines, blocky texture, and splotchy colors of the Picasso painting.
| Style Layers | Result |
|---|---|
| conv1_1 | ![]() |
| conv1_1, conv2_1 | ![]() |
| conv1_1, conv2_1, conv3_1 | ![]() |
| conv1_1, conv2_1, conv3_1, conv4_1 | ![]() |
| conv1_1, conv2_1, conv3_1, conv4_1, conv5_1 | ![]() |
The style weight was varied from 1E3 to 1E7. At lower style weights, the color was transferred easily (although noisily), but none of the characteristics of the style image's texture carried over. As the style weight increased, coherent textures from the style image began to appear in some parts of the image, and finally throughout the whole image at a style weight of 1E6.
| Style Weight | Result |
|---|---|
| 1E3 | ![]() |
| 1E4 | ![]() |
| 1E5 | ![]() |
| 1E6 | ![]() |
| 1E7 | ![]() |
One interesting question I wanted to answer was the effect of image size on the quality of transferred textures and features. Because VGG-19 uses convolutional layers with kernels of fixed size, I suspected that there were limits to the scale invariance and receptive field of the convolutional neural network.
I ran the following experiment on tubingen.jpg and starry_night.jpg, which have an initial size of 512x384. The size of the images, and consequently the size of the final image, were varied to be both larger and smaller than the original. Layer conv4_2 was used as the content layer, and layers [conv1_1, conv2_1, conv3_1, conv4_1, conv5_1] were used for the style layers. The style weight was set to 1E8 and a small bit of uniform white noise was added to the content image to initialize the final image. These experiments produced the following results:
| Scale Factor | Result |
|---|---|
| 0.5 | ![]() |
| 0.75 | ![]() |
| 1.0 | ![]() |
| 1.5 | ![]() |
| 2.0 | ![]() |
| 3.0 | ![]() |
Interestingly, the quality, intensity, and colors of the style transfer do not change much, and all the images do have the distinct color and brush strokes from the painting. However, as the image size increases, the brush strokes become visibly smaller, as do the features such as the swirls in the sky. Furthermore, the style features such as the bright yellow spots on the walls and the dark strokes near the river stay in the same position in each image, regardless of scale.
From this, we can conclude that convolutional neural networks for style transfer have strong shift invariance but limited scale invariance. While the color and relative positions of the transferred styles are preserved, the scale of the details in the style are not.
Overall, the algorithm generalized well to custom landscapes and textures. Interestingly, it captured not only the texture of the brush strokes of each painting, but also the texture of the canvas itself!
| Content Image | Style Image | Final Result |
|---|---|---|
| Seattle, Washington<br>![]() | Les étangs, Claude Monet<br>![]() | ![]() |
| Zhangjiajie, China<br>![]() | The Seine at Argenteuil, Claude Monet<br>![]() | ![]() |
| The Sognefjord, Norway<br>![]() | Saatchi Art<br>![]() | ![]() |
Here are some videos that show how the images change over the training process:
| Video | Video | Video |
|---|---|---|
| ![]() | ![]() | ![]() |
As described in the paper, the discretized form of the synthetic photography equation when beta is fixed at 1 is equivalent to shifting and averaging the sub-aperture images. The datasets already provide these sub-aperture images, which were retrieved by sampling corresponding pixels across all microlenses. In order to refocus the image, the dataset was first reorganized into a tensor of size (17, 17, H, W, 3). The first two dimensions represent the location of the sampled pixel within each microlens, or the 2D location of the sub-aperture relative to the aperture. The other three dimensions are the contents of the image.
The refocusing algorithm is as follows: first, the position of the center-most sub-aperture is determined, which was center = (8, 8) across all the datasets. Then, for each sub-aperture image (positioned at (i, j)) in the dataset, the horizontal and vertical distances between that sub-aperture and center are calculated using x_dist = j - center[1] and y_dist = i - center[0]. For simplicity, these distances are then scaled by an arbitrary depth factor depth_factor to produce the shift offsets dx = int(x_dist * depth_factor) and dy = -int(y_dist * depth_factor) in pixels. Each sub-aperture image is shifted by dx and dy, and the average of the pixel values is computed using np.mean across the list of shifted images.
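The shift-and-average step can be sketched as follows. The function name refocus is hypothetical; lf is the (17, 17, H, W, 3) light field tensor described above, and np.roll is used for the shift here, so pixels wrap around at the borders in this simplified sketch.

```python
import numpy as np

def refocus(lf, depth_factor, center=(8, 8)):
    # lf: light field of shape (17, 17, H, W, 3)
    shifted = []
    for i in range(lf.shape[0]):
        for j in range(lf.shape[1]):
            dx = int((j - center[1]) * depth_factor)
            dy = -int((i - center[0]) * depth_factor)
            # shift vertically by dy (axis 0) and horizontally by dx (axis 1)
            shifted.append(np.roll(lf[i, j], (dy, dx), axis=(0, 1)))
    return np.mean(shifted, axis=0)
```

With depth_factor = 0, no sub-aperture is shifted and the result is simply the average of all 289 views.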
Although the synthetic photography equation uses an alpha term in the denominator of the shift, the alpha term scales the shifts nonlinearly, which was hard to work with. Thus, I decided instead to scale the distances by a linear factor depth_factor, which proved far easier to control.
| depth_factor=0.000 | depth_factor=1.034 | depth_factor=2.069 | depth_factor=3.000 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Video |
|---|
| ![]() |
| depth_factor=0.000 | depth_factor=0.897 | depth_factor=1.883 | depth_factor=2.600 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Video |
|---|
| ![]() |
| depth_factor=-1.200 | depth_factor=2.607 | depth_factor=5.779 | depth_factor=8.000 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Video |
|---|
| ![]() |
| depth_factor=-3.000 | depth_factor=-1.138 | depth_factor=1.138 | depth_factor=3.000 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Video |
|---|
| ![]() |
In order to change the apparent aperture of the camera, the refocus function from the previous part was modified to support an aperture parameter. When iterating through the 289 sub-aperture images, an additional check is added to skip images that are more than aperture grid positions away from the center position at (8, 8) (either horizontally or vertically). For some sub-aperture at position (i, j), if abs(i - center[0]) > aperture or abs(j - center[1]) > aperture is true, then this image is skipped during processing and omitted from the final average. For sub-aperture images within the aperture range, the processing follows the exact algorithm described in the previous part.
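The aperture-limited variant can be sketched as follows (the function name refocus_with_aperture is hypothetical, and np.roll again wraps at the borders in this simplified sketch): sub-apertures more than aperture grid positions from the center, horizontally or vertically, are skipped before the shift-and-average.

```python
import numpy as np

def refocus_with_aperture(lf, depth_factor, aperture, center=(8, 8)):
    # lf: light field of shape (17, 17, H, W, 3)
    shifted = []
    for i in range(lf.shape[0]):
        for j in range(lf.shape[1]):
            if abs(i - center[0]) > aperture or abs(j - center[1]) > aperture:
                continue  # outside the synthetic aperture; omit from the average
            dx = int((j - center[1]) * depth_factor)
            dy = -int((i - center[0]) * depth_factor)
            shifted.append(np.roll(lf[i, j], (dy, dx), axis=(0, 1)))
    return np.mean(shifted, axis=0)
```

At aperture = 0 only the central sub-aperture survives, reproducing a pinhole-like image; at aperture = 8 all 289 views are averaged, matching the previous part.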
Having more sub-aperture images in the final render introduces more noise into the image because sub-apertures that are farther apart have slightly different views, which contributes to the blurring effect in the above results. Having fewer images centered around the central sub-aperture maintains the sharpness of the render near the center but also reduces the blurring near the edges of the image, giving the impression of a smaller aperture and a larger depth of field.
depth_factor = 1.5.
| aperture=1.000 | aperture=3.069 | aperture=5.138 | aperture=7.000 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Video |
|---|
| ![]() |
depth_factor = 2.6.
| aperture=1.000 | aperture=3.069 | aperture=5.138 | aperture=7.000 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Video |
|---|
| ![]() |
depth_factor = 8.0.
| aperture=0.483 | aperture=2.414 | aperture=4.586 | aperture=7.000 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Video |
|---|
| ![]() |
depth_factor = 3.0.
| aperture=0.483 | aperture=2.414 | aperture=4.586 | aperture=7.000 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Video |
|---|
| ![]() |