When you build a deep learning product from scratch, you always end up asking yourself the same question: "How will I ship and maintain my deep learning models in production?" If you are a data scientist or a deep learning researcher, maintaining deployed products is by far the least exciting part of the process.
In this guide, I'll show you how I managed to ship my image super-resolution project with minimal devops and maintenance (see the final demo here).

Here you'll learn how I used Algorithmia, PyTorch, Django, and Docker, as well as my own library, to create an end-to-end product that people can enjoy and use whenever they want.

I'm writing this blog post series on the recommendation of Jeremy Howard and Rachel Thomas, the authors of fast.ai, an amazing course on deep learning.

This post is divided into 3 parts:

All these parts are independent, so you may want to jump directly to the one that interests you if you only want to know about the devops or the paper implementation part.

1. Understanding the big picture

When I started this journey I told myself: "Hey, I think I'm confident enough today to start implementing papers from scratch, so let's have some fun and jump right onto a GAN paper", and this is what I did with SRGAN and SRPGAN. Later I found that I should have started with papers based on more traditional deep learning architectures, but I'll come back to that later.

Before you start implementing the papers you'll probably want to understand how a vanilla GAN works, as our papers build on this kind of network. Here is a good introduction to how they work.

As I have read in a lot of places, when you start reading a paper, begin with the abstract and then do a first reading pass over the whole paper while assuming all the maths are correct. What you want from this first pass is to understand the general idea of how everything works and is glued together. So I would advise you to do exactly that for the 2 papers. Start with SRGAN for now; I'll tell you when to jump onto SRPGAN, as the second paper builds on the work of the first one.

So let's start with SRGAN.
From the paper you can gather that:

  • There is a whole family of papers on single image super-resolution (SISR), and the authors differentiate their work from the rest by pointing out that while most SISR papers get a high score on a given metric like SSIM or PSNR (metrics which compare the model outputs with the original high-resolution images), their outputs lack perceptual details (details perceived by the naked human eye) and tend to be blurry. In other words, researchers publishing papers on SISR tend to get a high score on the metrics used by their peers but don't produce really convincing high-resolution images that we humans would qualify as "HD". In contrast, the SRGAN authors propose a new approach that uses generative adversarial networks (GANs) to recreate the lost image details.
  • The loss function combines features extracted from the VGG network with an MSE loss, instead of using the MSE loss exclusively, which would smooth the images and lead to poor perceptual quality.
  • There are 2 networks, a generator and a discriminator, as you would expect from a paper using GANs. The generator takes low-resolution images (LR images) as input and outputs super-resolution images (SR images). The discriminator takes either SR images or high-resolution images (HR images) and outputs a label between 0 and 1 for each.
  • To obtain an LR image you downscale an HR image by a chosen factor, such that once this LR image goes through the generator, the resulting SR image has the same dimensions as the original HR image.
  • The SR and HR images are compared to calculate two metrics: the structural similarity (SSIM) and the Peak Signal to Noise Ratio (PSNR). These 2 metrics are our watchdogs for ensuring the networks are indeed going in the right direction.
  • There are 5 different losses:
    • The MSE loss (helps find pixel averages but tends to create overly smooth images)
    • The VGG loss as noted above (restores high-level details of the images by calculating the Euclidean distance between the feature maps of the VGG network)
    • The Adversarial loss (used to fool the discriminator network)
    • The Content loss (a combination of the MSE loss + the VGG loss)
    • The Perceptual loss (a combination of the Adversarial loss + the Content loss)
  • The MOS (Mean Opinion Score) testing is a perceptual test in which human participants are asked to rate the quality of pictures (we cannot really reproduce this on our side).
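
To make the metrics a bit less abstract, here is a minimal sketch of PSNR in PyTorch. It assumes the images are float tensors scaled to [0, 1]; the function name psnr and the max_val parameter are mine, not the paper's. SSIM is more involved and is better taken from an existing library.

```python
import torch


def psnr(sr, hr, max_val=1.0):
    """Peak Signal to Noise Ratio in dB between two image batches.

    Assumes both tensors have the same shape and values in [0, max_val].
    """
    mse = torch.mean((sr - hr) ** 2)
    # PSNR = 10 * log10(MAX^2 / MSE); higher means closer to the reference
    return 10.0 * torch.log10(max_val ** 2 / mse)


torch.manual_seed(0)
hr = torch.rand(2, 3, 32, 32)
slightly_off = (hr + 0.01 * torch.randn_like(hr)).clamp(0, 1)
very_off = (hr + 0.2 * torch.randn_like(hr)).clamp(0, 1)

# The less noisy reconstruction should get the higher PSNR
print(psnr(slightly_off, hr).item(), psnr(very_off, hr).item())
```

Note that PSNR is just a rescaled MSE, which is exactly why a model can score well on it and still look blurry.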

2. Gluing the components together

Now that we have read the whole paper once, we pretty much understand the purpose of every component. Next we need to glue them all together. The first thing I would advise people to do is to implement the model architectures and ensure the inputs give the desired output dimensions (for instance, the SR images coming out of the generator should be the same size as the HR images, and the output of the discriminator should be a scalar from a sigmoid function).
(Figure: SRGAN architecture)
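
A shape check like the one described above can be sketched as follows. TinyGenerator and TinyDiscriminator are toy stand-ins I made up to keep the example short; they are not the SRGAN architectures, only the input and output shapes are meant to match.

```python
import torch
import torch.nn as nn


class TinyGenerator(nn.Module):
    """Toy stand-in for the SRGAN generator (NOT the paper's architecture)."""

    def __init__(self, upscale_factor=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.PReLU(),
            # Produce upscale_factor**2 values per output channel, then
            # rearrange them spatially with PixelShuffle
            nn.Conv2d(16, 3 * upscale_factor ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(upscale_factor),
        )

    def forward(self, x):
        return self.body(x)


class TinyDiscriminator(nn.Module):
    """Toy stand-in for the discriminator: one probability per image."""

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.body(x)


hr_images = torch.rand(2, 3, 64, 64)
lr_images = torch.rand(2, 3, 16, 16)  # same batch, 4x smaller

sr_images = TinyGenerator(upscale_factor=4)(lr_images)
d_out = TinyDiscriminator()(sr_images)

assert sr_images.shape == hr_images.shape    # SR must match HR dimensions
assert d_out.shape == (2, 1)                 # one score per image
assert ((d_out >= 0) & (d_out <= 1)).all()   # sigmoid keeps scores in [0, 1]
```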

Once you have done that, you'll want your generator to start producing "okay" outputs by just feeding it LR images, getting SR images, optimizing on the MSE loss and calculating the SSIM and PSNR over a few training passes. As you learnt in section 3.2 of the paper, the authors "employed the trained MSE-based SRResNet network as initialization for the generator when training the actual GAN to avoid undesired local optima.". In other words, they optimized their generator on the MSE loss before doing any adversarial training. So you should ensure that this works first. At the same time, you'll ensure your generator behaves as you want.
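
Here is a minimal sketch of that MSE-only warm-up. The one-layer generator and the random tensors are placeholders I chose so the snippet runs on its own; in the real project you would plug in SRResNet and a proper data loader.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy generator standing in for SRResNet (an assumption for brevity)
netG = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, padding=1),
    nn.PixelShuffle(4),  # 48 channels -> 3 channels, 4x larger
)
optimizer = torch.optim.Adam(netG.parameters(), lr=1e-3)

# Random tensors standing in for a real (LR, HR) batch
hr_images = torch.rand(4, 3, 32, 32)
lr_images = F.interpolate(hr_images, size=(8, 8), mode="bicubic",
                          align_corners=False)

losses = []
for step in range(50):
    sr_images = netG(lr_images)
    loss = F.mse_loss(sr_images, hr_images)  # pixel-wise loss only, no GAN yet
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"MSE went from {losses[0]:.4f} to {losses[-1]:.4f}")
```

If the loss does not go down on a fixed batch like this, something is wrong with the generator or the data pipeline, and adding a discriminator on top will only hide the problem.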

The second thing you want to do is to test your discriminator architecture. The job of the discriminator is simple: it outputs a single number between 0 and 1, 0 meaning "my image is a fake" and 1 meaning "my image is legit". So now that you have both the generator outputs and the discriminator outputs, you want to see whether training the discriminator makes it classify the images accordingly. Of course, when you run the first epochs it won't be good at classifying the right things, but after optimizing it a bit against the cross-entropy loss you will start to see some improvements.
Here is the pseudo code to illustrate what I mean:

import torch
import torch.nn.functional as F

# hr_images is a Tensor of size [batch_size, channels, height, width]
# sr_images is a Tensor of the same size as hr_images
sr_images = self.netG(lr_images)  # Your generator output
d_hr_out = self.netD(hr_images)   # Sigmoid output
# detach() so the discriminator update does not backpropagate into the generator
d_sr_out = self.netD(sr_images.detach())  # Sigmoid output

# HR images are labelled as ones (real), SR images as zeros (fake)
d_hr_loss = F.binary_cross_entropy(d_hr_out, torch.ones_like(d_hr_out))
d_sr_loss = F.binary_cross_entropy(d_sr_out, torch.zeros_like(d_sr_out))
d_loss = d_hr_loss + d_sr_loss

As you can see we are using F.binary_cross_entropy to flag the HR images as ones and the SR images as zeros, then adding the 2 losses together to get the discriminator loss. Once you do that you want to run a backward pass and check that your discriminator trains correctly (by training it for, say, 20 epochs and then checking that d_hr_out indeed goes toward 1 and d_sr_out goes toward 0). Once you have ensured your generator and discriminator work correctly you can start writing the other losses and tuning your models the same way the authors of the paper did.
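
The check described above could look roughly like this. Everything here is a stand-in: the discriminator is a tiny toy network, and the "real" and "fake" batches are random tensors that I made deliberately easy to tell apart (bright versus dark), so the only thing being tested is the training step itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy discriminator (an assumption, far smaller than the paper's)
netD = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 1),
    nn.Sigmoid(),
)
optimizer_d = torch.optim.Adam(netD.parameters(), lr=1e-3)

# Stand-ins: "real" images are bright, "fake" images are dark
hr_images = torch.rand(8, 3, 32, 32) * 0.5 + 0.5
sr_images = torch.rand(8, 3, 32, 32) * 0.5

for step in range(200):
    d_hr_out = netD(hr_images)
    # In a real loop sr_images comes from the generator; detach() keeps
    # this step from back-propagating into it
    d_sr_out = netD(sr_images.detach())
    d_loss = (F.binary_cross_entropy(d_hr_out, torch.ones_like(d_hr_out))
              + F.binary_cross_entropy(d_sr_out, torch.zeros_like(d_sr_out)))
    optimizer_d.zero_grad()
    d_loss.backward()
    optimizer_d.step()

# Mean score on the real batch should end up above the fake batch
print(d_hr_out.mean().item(), d_sr_out.mean().item())
```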

Now onto adversarial training. I have to admit I helped myself a bit here by finding an existing implementation of the paper in TensorFlow and mapping the Python code to the maths. So for the rest of the code (adversarial, VGG, perceptual and content losses) I basically just replicated in PyTorch what this guy did in TensorFlow. Why? Simply because reinventing the wheel was, in my opinion, a waste of time and I would rather put my focus elsewhere. It's still important to understand how everything works by doing little experiments, but once you get the "Aha" moment and you understand how the little pieces work together, you're ready to move to stage 2.

3. Stage 2

Speaking of which, what do I mean by "Stage 2" exactly? Well, it's the part where you have no already existing implementation to help you cheat. Of course, translating an implementation from one framework to another is fun, but this isn't the real challenge here. The real challenge is to implement a paper with NO existing implementation.

If you had started by reading the SRPGAN paper directly like I did, you would have realized that you needed a working implementation of the SRGAN paper first. I've spared you from reading the papers in the wrong order, as you would have been confused by references to concepts from the original paper (such as the Generator/Discriminator architectures) without knowing them.

From now on I'll only talk about the SRPGAN paper and I will assume you've read it. You can find my implementation of the paper here.

Once you've read the paper you'll probably figure a few things out, such as:

  • The paper builds on the SRGAN implementation I've linked above
  • They modified the network architecture a bit (by using InstanceNorm instead of BatchNorm, leaky ReLU, and strided convolutions)
  • They removed the VGG loss and used their own loss (called Perceptual loss), made of the difference between the Discriminator's feature maps for the SR image and for the HR image
  • They used the Charbonnier loss instead of MSE for the content loss

So now let's move onto the interesting part and start implementing the points above.

Making sense of the maths

Ah finally, the part you people were waiting for.
As a coder, I'm still not very at ease with the math literature, so while I'm struggling to understand the meaning of Greek letters I find it more convenient to tie a concept to some Python code. This way my mind picks up the patterns between the two, and the more I do that the more I start to make sense of the Greek letters.

There is also some hacking involved. By hacking I mean: "Don't try to reinvent the wheel". For instance, in section 3.1.2 of the paper, they explain how Instance normalization works. While you could try to implement it by yourself and waste hours and a lot of sweat, you could just as well use the already implemented PyTorch layer.
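
To show how little you gain by rewriting it, here is a small comparison between nn.InstanceNorm2d and a hand-written version of the same normalization. The hand-written part is only my reading of the standard formula (per-sample, per-channel mean and biased variance, without learnable scale and shift), not code from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8)

# Built-in layer: normalizes each channel of each sample independently
layer = nn.InstanceNorm2d(num_features=3)  # affine=False, eps=1e-5 by default
builtin = layer(x)

# Hand-written equivalent, for comparison only
mean = x.mean(dim=(2, 3), keepdim=True)
var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)

print(torch.allclose(builtin, manual, atol=1e-5))
```

Both give the same numbers, so the one-liner wins.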

The content loss (charbonnier loss)

In their paper, instead of using the MSE loss they used the Charbonnier loss, which is defined by:

$$l_y(y, \hat{y}) = E_{z,y \sim P_{data}(z,y)}\left(\rho(y-G(z))\right)$$

As said in the paper, \(G(z)\) is the output of the generator (in other words, the SR image) and \(y\) the ground truth (the HR image). So we subtract the two and pass the result into \(\rho\), which is defined by \(\rho(x) = \sqrt{x^2 + \varepsilon^2}\) where \(\varepsilon = 10^{-8}\) (paper section 4.1). The \(E_{z,y \sim P_{data}(z,y)}\) wrapped around \(\rho\) confused me at first: I read it as some kind of entropy and wondered whether I should use F.binary_cross_entropy(z, y), where \(z\) is the LR image and \(y\) the HR image. It is actually the expectation over pairs \((z, y)\) drawn from the data distribution, which in practice just means averaging over the training batch. So we can forget about cross entropy and consider the Charbonnier loss to be a drop-in replacement for the MSE loss. By the way, we could just google it, maybe someone already implemented it for us? Oh jeez, here you have it, and it's from an implementation of a SISR paper, so let's just adapt this code to ours!
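
For reference, a minimal sketch of the Charbonnier loss that follows the formula above directly, treating the E as a plain mean over all elements of the batch. The function name and argument order are my own choice and this is not necessarily identical to the implementation I adapted.

```python
import torch


def charbonnier_loss(prediction, target, eps=1e-8):
    """Mean of sqrt((target - prediction)^2 + eps^2) over all elements."""
    diff = target - prediction
    return torch.mean(torch.sqrt(diff * diff + eps * eps))


hr = torch.zeros(1, 3, 4, 4)
sr = torch.full((1, 3, 4, 4), 3.0)

# Every pixel is off by 3, so the loss is essentially |3| (eps is negligible)
print(charbonnier_loss(sr, hr).item())
```

With such a small eps it behaves like an L1 loss that stays differentiable at zero, which is why it is gentler on edges than MSE.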

The perceptual loss

Now onto the perceptual loss, no escape from this one! So the perceptual loss is defined by:

$$l_p=\sum\limits_{i=1}^{L} E_{z,y \sim P_{data}(z,y)}\left(\Vert\phi_i(y)-\phi_i(G(z))\Vert\right)$$

To me it basically says: take the feature maps of the discriminator given an HR image and those given the output of the generator, pass each pair into the Charbonnier loss (the double \(\Vert\) is the Charbonnier loss), then sum the results over the \(L\) layers. In Python you get code like this:

# Perceptual loss
perceptual_loss = 0
for hr_feat_map, sr_feat_map in zip(d_hr_feat_maps, d_sr_feat_maps):
    perceptual_loss += self.charbonnier(sr_feat_map, hr_feat_map, eps=1e-8)

Here d_hr_feat_maps and d_sr_feat_maps are the feature maps of the discriminator. This is what the discriminator model looks like:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Discriminator(nn.Module):
    def __init__(self, input_shape):
        super(Discriminator, self).__init__()

        self.block1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=(4, 4), stride=2, padding=1),
            nn.LeakyReLU(0.2),
        )

        self.block2 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=(4, 4), stride=2, padding=1),
            nn.LeakyReLU(0.2),
        )

        self.block3 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=(4, 4), stride=2, padding=1),
            nn.LeakyReLU(0.2),
        )

        self.block4 = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=(4, 4), stride=2, padding=1),
            nn.LeakyReLU(0.2),
        )

        self.block5 = nn.Sequential(
            nn.Conv2d(512, 1024, kernel_size=(4, 4), stride=2, padding=1),
            nn.LeakyReLU(0.2),
        )

        self.block6 = nn.Sequential(
            nn.Conv2d(1024, 2048, kernel_size=(4, 4), stride=2, padding=1),
            nn.LeakyReLU(0.2),
        )

        self.block7 = nn.Sequential(
            nn.Conv2d(2048, 1024, kernel_size=(1, 1), stride=1, padding=1),
            nn.LeakyReLU(0.2),
        )

        self.block8 = nn.Conv2d(1024, 512, kernel_size=(1, 1), stride=1, padding=1)

        self.block9 = nn.Sequential(
            nn.Conv2d(512, 128, kernel_size=(1, 1), stride=1, padding=1),
            nn.LeakyReLU(0.2),
        )

        self.block10 = nn.Sequential(
            nn.Conv2d(128, 128, kernel_size=(3, 3), stride=1, padding=0),
            nn.LeakyReLU(0.2)
        )

        self.block11 = nn.Conv2d(128, 512, kernel_size=(3, 3), stride=1, padding=1)

        in_size = self.infer_lin_size(input_shape)

        self.out_block = nn.Sequential(
            nn.Flatten(),  # [batch, C, H, W] -> [batch, C * H * W]
            nn.Linear(in_size, 1),
            nn.Sigmoid(),
        )

    def infer_lin_size(self, shape):
        # Run a dummy batch through the conv blocks to find out how many
        # features reach the final Linear layer
        bs = 1
        dummy = torch.rand(bs, *shape)
        model = nn.Sequential(
            self.block1,
            self.block2,
            self.block3,
            self.block4,
            self.block5,
            self.block6,
            self.block7,
            self.block8,
            self.block9,
            self.block10,
            self.block11,
        )
        with torch.no_grad():
            size = model(dummy).view(bs, -1).size(1)
        return size

    def forward(self, x):
        feature_maps = []

        x = self.block1(x)
        feature_maps.append(x)

        x = self.block2(x)
        feature_maps.append(x)

        x = self.block3(x)
        feature_maps.append(x)

        x = self.block4(x)
        feature_maps.append(x)

        x = self.block5(x)
        feature_maps.append(x)

        x = self.block6(x)
        feature_maps.append(x)

        x = self.block7(x)
        feature_maps.append(x)

        block8 = self.block8(x)

        x = self.block9(block8)
        feature_maps.append(x)

        x = self.block10(x)
        feature_maps.append(x)

        block11 = self.block11(x)

        final_block = F.leaky_relu(block8 + block11, 0.2)
        feature_maps.append(final_block)

        out = self.out_block(final_block)
        return out, feature_maps

So now you may wonder: "How did you know the \(\Vert\) was the Charbonnier loss?" Well... I didn't at first glance. I had to ask on a forum, and a guy told me the \(\Vert\) represents the \(L^2\) norm... but wait a second, earlier in the paper (section 3.2.2) the authors said they compared \(L^1\), \(L^2\) and the Charbonnier loss, and we are actually using the latter... So the \(\Vert something \Vert\) must be the Charbonnier loss!

The adversarial loss

In the paper the adversarial loss is defined by:
$$l_a(G, D)=E_{z,y \sim P_{data}(z,y)}[\log D(z,y)] + E_{z \sim P_{data}(z)}[\log(1-D(z,G(z)))]$$

Compared to the original SRGAN paper the only difference is that the discriminator is now receiving the LR image so it can "encourage the generator to generate the solution that resides on the manifold of the HR image".
To be honest, here I was not sure how I should pass \(z\) (the LR image) to the discriminator along with the HR image, as their sizes don't match.

I could have studied this a bit deeper, by reading about how Conditional GANs work for instance, but here I chose to keep the unconditional GAN approach, which will probably work too!
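
For the curious, one common way conditional GANs deal with this kind of size mismatch is to resize the condition to the size of the other input and concatenate the two along the channel axis. I did not go this route, and I am not claiming this is what the SRPGAN authors do; the snippet below is only a sketch of that general idea with made-up tensor sizes.

```python
import torch
import torch.nn.functional as F

lr_images = torch.rand(2, 3, 16, 16)   # z in the paper's notation
hr_images = torch.rand(2, 3, 64, 64)   # y (or G(z) for the fake branch)

# Bring z to the spatial size of y, then stack along the channel axis
lr_upsampled = F.interpolate(lr_images, size=hr_images.shape[-2:],
                             mode="bicubic", align_corners=False)
d_input = torch.cat([lr_upsampled, hr_images], dim=1)

# 3 + 3 = 6 channels: the discriminator's first Conv2d would need in_channels=6
print(d_input.shape)
```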

The optimization

Now the final piece. The discriminator loss is defined by:

$$l_d=-l_a(G,D)+\lambda l_p$$

where \(l_a(G,D)\) is the adversarial loss given the generator and discriminator outputs, and \(l_p\) is the perceptual loss.
Here, instead of following the paper to the letter, I chose once again to keep the discriminator loss as it was in the original SRGAN paper. At least I'm sure it works, and the few experiments I ran with the formula above didn't give me convincing results.

Now for the generator loss:

$$l_g=l_a(G,D)+l_p+l_y$$

(I removed \(\lambda_1\) and \(\lambda_2\) as they are both equal to 1 (section 4.1).)

This translates directly to:

g_loss = adversarial_loss + perceptual_loss + content_loss

With my adversarial loss defined by:

adversarial_loss = 0.001 * F.binary_cross_entropy(d_sr_out, torch.ones_like(d_sr_out))

As you can see I added a scaling factor to put it on the same scale as the other losses. So maybe they meant to write:

$$l_g=\lambda l_a(G,D)+\lambda_1l_p+\lambda_2l_y$$

?
I can't tell. Anyway, I found my results to be pretty stable with this scaling factor.
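
Put together, the generator loss with its scaling factor can be sketched as a small helper. The function name and the adv_weight parameter are mine; the 0.001 default is simply the value I ended up using, not something taken from the paper.

```python
import torch
import torch.nn.functional as F


def generator_loss(d_sr_out, perceptual_loss, content_loss, adv_weight=1e-3):
    """Weighted sum of the three generator terms.

    adv_weight plays the role of the extra scaling factor on the
    adversarial term.
    """
    # Label SR outputs as "real" (ones): the generator is rewarded when
    # the discriminator is fooled
    adversarial_loss = F.binary_cross_entropy(d_sr_out,
                                              torch.ones_like(d_sr_out))
    return adv_weight * adversarial_loss + perceptual_loss + content_loss


d_sr_out = torch.full((4, 1), 0.5)   # discriminator is undecided
g_loss = generator_loss(d_sr_out, torch.tensor(1.0), torch.tensor(2.0))
print(g_loss.item())
```

With an undecided discriminator the adversarial term is ln(2), about 0.69, so after scaling it contributes less than a thousandth of the total here. That gives an idea of how small its influence is compared to the other two terms.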

Final results and paper reproducibility

There you have it!
Now onto the testing. As in the original paper, I used the same number of training steps for the initial generator training and for the adversarial training.
After running the model for 2000 epochs I got convincing results, but not with SSIM and PSNR scores as high as the paper's. There are probably some parts of my implementation which need to be optimized, and given that I skipped a few details of the paper (like the conditional GAN) this is to be expected. (Don't hesitate to submit a pull request if you want to add a few optimizations to the algorithm or fix a few things.) But it works! And the results are convincing enough that I can be glad I didn't spend a month and a half on these 2 papers for nothing!

It's funny though: as Jeremy from fast.ai said in lesson 1 of part 2 of his course, "The research level code is just good enough that they were able to run their particular experiments", and I was indeed stuck for a long time trying to implement what is not there... Let me explain.
As a picture is worth a thousand words, let me show you some:
(Image: collage2)

(Image: collage1)

The image on the left is the original image upscaled 4x via bicubic interpolation and the one on the right is the super-resolution image with the same dimensions.
When you compare the caiman in the first picture you can clearly see a difference between the two versions: the one on the right is what we would call "HD".
But for the second picture... well, it's not really obvious which one should be considered HD quality. Honestly, I would classify them both as "crappy". So you may wonder: why is that? Why did the algorithm work for one image and not the other?
Well, simple answer: the mess-free setup of academic papers does not reflect the real world (I'm kidding, people from academia are doing an awesome job and we wouldn't have got this far without them :) ).

So basically what is happening here is that the first image was originally a high definition picture which was downscaled to a low-resolution image using bicubic interpolation. This low-resolution image was then passed through our GAN generator, which reconstructed the lost details while upscaling it by a given factor (x4 for instance).

The same process was applied to the second image, with one crucial exception: this low-resolution image is not the result of a bicubic downscaling. It's just low resolution by nature, and I don't know what process made it that way. The sad part of this story is that most images out there will be like the second picture, so the results won't be very stunning...
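
To make the difference concrete, this is roughly how an evaluation LR image should be produced so that it matches what the generator saw during training. It is a sketch under my own assumptions: the helper name is made up, the input is a random tensor standing in for a real HR crop, and PyTorch's bicubic resize may not exactly match the resizing used to build a given dataset.

```python
import torch
import torch.nn.functional as F


def make_lr_from_hr(hr_images, scale=4):
    """Bicubic-downscale an HR batch [B, C, H, W] by `scale`."""
    _, _, height, width = hr_images.shape
    lr = F.interpolate(hr_images, size=(height // scale, width // scale),
                       mode="bicubic", align_corners=False)
    # Bicubic interpolation can overshoot, so clamp back to the valid range
    return lr.clamp(0.0, 1.0)


hr_images = torch.rand(1, 3, 96, 96)      # stand-in for a DIV2K crop
lr_images = make_lr_from_hr(hr_images)    # the kind of input the generator knows
print(lr_images.shape)
```

An image that is low resolution "by nature" never went through this function, so the generator is being asked to undo a degradation it has never seen.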

Another sad story is that I lost a lot of time evaluating my results on the second image. My network was converging, my losses seemed ok, and everything seemed to work overall, but the perceptual result was not good on the second picture.
At some point, I told myself: "I'm tired of this, let's try a different implementation, download the code and see what the results are".
And to my surprise, the results were more or less the same as what I had, and at that very moment I realized there was something wrong with my evaluation image. Then I tried images from the validation set of the DIV2K dataset and magically everything worked... At least that's a lesson I won't easily forget.

Conclusion

That was a really long journey but I'm glad I finally made it. There are a few lessons I learned along the way that should be useful to people trying to implement their first papers:

  • Don't start with a complicated paper like the ones dealing with GANs. By nature, these models are complicated to train and you don't want to add this burden on top of the complexity of implementing a paper if you are not used to it.
  • Try to find shortcuts and hacks to reach your goals. Don't reinvent the wheel: don't try to reimplement InstanceNorm, for instance, even if the paper gives you the formula to do so. By using a concept already implemented in a well-known library you ensure that you didn't mess up its implementation.
  • List the different concepts of the paper (like the different loss functions) and again, don't reinvent the wheel. Are they using a GAN architecture? Fine, let's find a GAN implementation for my framework. They used the Charbonnier loss? No problem, let's google it. Of course, it's important not to just copy-paste the code blindly: you still have to understand how the code you just copied works and how it fits with the code you already have. Also be skeptical about the code you find on the internet. Create yourself a mind map of what the code and the math formulas in the paper say.
  • Don't trust the paper results blindly. As said earlier, maybe the authors had a particular setup or omitted to mention a few optimizations that led them to their results.
  • Try to replicate the authors' work based on their experiments and datasets, otherwise you will lose countless hours trying to figure out why it is not working when in fact... it is.

I hope you enjoyed reading this blog post. Stay tuned for parts 2 and 3. In part 2 we will talk about the journey of creating an endpoint for PyTorch models (a lot of interesting stuff in there).