When developing a deep learning product from scratch, you always end up asking yourself the same question: "How will I ship and maintain my deep learning models in production?" If you are a data scientist or a deep learning researcher, maintaining deployed products is by far the least exciting part of the process.
In this guide, I'll show you how I managed to ship my image super-resolution project with minimal devops and maintenance (see the final demo here).

Here you'll learn how I used Algorithmia, PyTorch, Django and Docker, as well as my own library, to create an end-to-end product that people can use and enjoy whenever they want.

This post is divided into 3 parts:

All these parts are independent, so you may want to jump directly to the one that interests you.

Creating the entry point

At this point, I assume you have a working algorithm that you want to serve through an API. The first thing to do is create the entry point of your algorithm: the function that loads your model, runs inference on a given input and returns the result.

What you want here is a script with minimal dependencies, nothing too complicated. For example, take a look at my script for the SRPGAN implementation:

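import os

from torch.utils.data import DataLoader
from torchvision import transforms

# Generator, Learner, ClassifierCore, ModelSaverCallback and EvalDataset
# come from my torchlite library (their imports are omitted here).


def srpgan_eval(images, generator_file, upscale_factor, use_cuda, num_workers=os.cpu_count()):
    """
    Upscale a list of images to super resolution and return them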
    Args:
        num_workers (int): Number of processors to use
        use_cuda (bool): Whether or not to use the GPU
        upscale_factor (int): Either 2, 4 or 8
        images (list): List of Pillow images
        generator_file (file): The generator saved model file
    Returns:
        list: A list of SR images
    """
    netG = Generator(upscale_factor)
    learner = Learner(ClassifierCore(netG, None, None), use_cuda=use_cuda)
    ModelSaverCallback.restore_model_from_file(netG, generator_file, load_with_cpu=not use_cuda)
    eval_ds = EvalDataset(images)
    # One batch at a time as the pictures may differ in size
    eval_dl = DataLoader(eval_ds, 1, shuffle=False, num_workers=num_workers)

    images_pred = []
    predictions = learner.predict(eval_dl, flatten_predictions=False)
    tfs = transforms.Compose([
        transforms.ToPILImage(),
    ])
    for pred in predictions:
        pred = pred.view(pred.size()[1:])  # Remove batch size == 1
        images_pred.append(tfs(pred.cpu()))

    return images_pred

Here the function just takes in a list of inputs (images), a serialized model (generator_file) and a few parameters for the model and the runtime: upscale_factor tells the model how much to upscale the inputs, use_cuda runs the model on the GPU and num_workers sets the number of workers for the preprocessing step. Once you have a function like this, you can move on to the next step.

Choosing a platform to host the model

Now onto the fun part: serving your model. You may want to use Flask here, which is probably the easiest framework for quickly deploying a model, but I'll talk about Algorithmia instead. My primary concern when deploying to production was to use a serverless platform that automatically scales as requests come in. I also wanted, if possible, a platform with monthly free credits, so that if my API is not called much in a given month I don't lose a penny, as this project won't generate any income.

My first option for a serverless platform was of course AWS Lambda, but that wouldn't work in my case because of its restrictions. One of them is that you cannot use more than 3 GB of RAM, or your instance gets killed automatically. At first, I told myself that maybe I could tweak my Generator model a bit to stay within this limit, but I quickly realized that this would never work because PyTorch takes an insane amount of RAM when run on the CPU. See the issue I opened here (TL;DR: with an input of size [16, 64, 224, 224] (batch_size, channels, height, width) passed through a simple convnet, you use ~1.4 GB of GPU memory, or 17 GB of RAM when run on the CPU). And even if the PyTorch developers say they are working on a solution, it will probably not arrive anytime soon given the release cycle of PyTorch and the fact that the devs have known about this issue since November 2015.
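
To put these numbers in perspective, here is a small back-of-the-envelope sketch that computes the size of the input tensor alone, assuming float32 values (4 bytes each). The `tensor_bytes` helper is a name I made up for illustration. It only counts the input, not the intermediate activations or the framework's internal buffers, so it is a lower bound and not an explanation of the 17 GB figure.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (4 bytes per element for float32)."""
    return prod(shape) * bytes_per_element


input_shape = (16, 64, 224, 224)  # (batch_size, channels, height, width)
size = tensor_bytes(input_shape)
print(f"{size} bytes = {size / 2**20:.0f} MiB")  # 205520896 bytes = 196 MiB
```

The input itself is under 200 MiB, so most of the memory goes to whatever the convolution layers allocate while processing it.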

At this point you may wonder: why don't you simply convert your model to the ONNX format and use another framework like Caffe2, which doesn't have these kinds of issues? Well, that seems like a perfect solution, except that it's not... To be exported to ONNX, my model must not fall into either of these categories:

  • Models with dynamic input sizes
  • Models that are themselves dynamic (RNNs, for example)

I fall into the first category... You can imagine the look on my face...

So my only choice was to run inference on GPUs. Considering my criteria (monthly free credits and a serverless platform), the only remaining option I found was Algorithmia. It turned out not to be the best choice, but I couldn't know what pitfalls I would run into before trying the platform. If you don't want to make the same mistakes I did, read on.

Serving the model on Algorithmia

So now onto writing the model entry point. When you create a new algorithm on Algorithmia, you are asked to choose between multiple options. I always check "Requires full access to the internet", "Can call other algorithms" and "Advanced GPU": as I didn't find a way to change these options after the algorithm is created, you're better off having them activated. Once you've done that, you can clone your repository to get a script like this:

import Algorithmia

# API calls will begin at the apply() method, with the request body passed as 'input'
# For more details, see algorithmia.com/developers/algorithm-development/languages
def apply(input):
    return "hello {}".format(input)

When an API call is triggered, the JSON ends up in that input parameter as a dict object. When the apply() function returns, the client gets the API call response back.
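
To make this flow concrete, here is a minimal local sketch that mimics what happens around apply(). The request body and the explicit JSON parsing are stand-ins I made up for illustration; on the platform, Algorithmia handles that step for you.

```python
import json


def apply(input):
    return "hello {}".format(input)


# Hypothetical request body, as a client would send it
request_body = '{"name": "world"}'

# The platform parses the JSON and hands the resulting dict to apply()
response = apply(json.loads(request_body))
print(response)  # hello {'name': 'world'}
```
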
From there you'll need to call the model entry point you created in the previous section. Simply copy your model code into the Algorithmia project folder and call it from there. As for me, my goal was to make my project available through a PyPI library, so I just changed the requirements.txt file at the root of the Algorithmia folder to include the dependency.
Once I did that I had to write the code for the apply function.
Here is what it looks like:

from datetime import datetime
import base64
import requests
from io import BytesIO
import Algorithmia
from Algorithmia.acl import ReadAcl
from torchlite import eval
from pathlib import Path
from PIL import Image
import uuid


class AlgorithmError(Exception):
    """Define error handling class."""

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return repr(self.value).replace("\\n", "\n")


# Note that you don't pass in your API key when creating an algorithm
client = Algorithmia.client()

def save_img_to_folder(local_image, local_image_name, cloud_directory, local_img_name_prefix):
    start_time = datetime.now()
    local_image_name = Path(local_image_name)
    # Fall back to PNG when Pillow has no format for the image (e.g. after a resize)
    img_format = local_image.format if local_image.format else 'PNG'
    suffix = local_image_name.suffix if img_format != 'PNG' else '.png'
    img_byte_arr = BytesIO()
    local_image.save(img_byte_arr, img_format)
    uri = "data://" + cloud_directory + "/" + local_image_name.stem + "_" + local_img_name_prefix + suffix
    data = client.file(uri).put(img_byte_arr.getvalue())
    print('save_img_to_folder time (hh:mm:ss.ms) {}\n'.format(datetime.now() - start_time))
    return "https://algorithmia.com" + data.url, uri


def apply(input):
    """
    Takes a json input in this form:
        {
            "image_url": "https://www.cnewyork.net/wp-content/uploads/2015/02/GeoffreyWojciechowski3.jpg",
            "upscale_factor": "4"
        }
        or in this form:
        {
            "image_base64": "base64_img",
            "image_name": "image_name",
            "upscale_factor": "4"
        }
        Where "image_base64" is an encoded image in base64 with UTF-8 encoding.
        Ex to encode with python:
            with open(image_path, "rb") as image_file:
                base64_bytes = base64.b64encode(image_file.read())
                base64_string = base64_bytes.decode('utf-8')

            json = {"image_base64": base64_string, "image_name": "my_image", "upscale_factor": "4"}
    Args:
        input (dict): The parsed json

    Returns:
        dict: A dict in the form :
            {"sr_image_url": url, "sr_image_uri": uri,
            "original_image_url": url, "original_image_uri": uri,
            "upscale_factor": upscale_factor}
    """

    total_time = datetime.now()
    valid_json = False
    image = None
    image_name = None
    retrieve_start_time = datetime.now()
    generator_model = client.file("data://Ekami/torchlite_srpgan/Generator-0.1.pth").getFile()
    print('Total generator retrieve time (hh:mm:ss.ms) {}\n'.format(datetime.now() - retrieve_start_time))

    if "image_url" in input and ("image_base64" in input or "image_name" in input):
        raise AlgorithmError("You provided both an image_url and image_base64. Please choose only one.")
    elif "image_url" in input:
        valid_json = True
        image_url = input["image_url"]
        # Download the image and fail early on HTTP errors
        image_response = requests.get(image_url)
        image_response.raise_for_status()
        # Retrieve input information
        image = Image.open(BytesIO(image_response.content))
        image_name = Path(image_url).name
    elif "image_base64" in input and "image_name" in input:
        valid_json = True
        image_name = input["image_name"]
        image_base64 = input["image_base64"]
        image = Image.open(BytesIO(base64.b64decode(image_base64)))

    if valid_json:
        # Instantiate a DataDirectory object, set your data URI and call create
        srgan_directory = client.dir("data://.my/srgan_results")
        # Create your data collection if it does not exist
        if not srgan_directory.exists():
            srgan_directory.create(acl=ReadAcl.public)

        upscale_factor = input.get("upscale_factor")
        if not upscale_factor:
            upscale_factor = 4

        upscale_factor = int(upscale_factor)
        short_uuid = uuid.uuid4().hex[:8]

        # Resize original image to bicubic x upscale_factor
        bicubic_original_img = image.resize((image.width * upscale_factor, image.height * upscale_factor),
                                            Image.BICUBIC)

        # Save original image in dir
        origin_url, origin_uri = save_img_to_folder(bicubic_original_img, image_name,
                                                    srgan_directory.path, "original-" + short_uuid)

        # Run the SRPGAN generator on the GPU and keep the single result
        sr_img = eval.srpgan_eval([image], generator_model.buffer, upscale_factor, use_cuda=True)[0]

        # Save SR image
        sr_url, sr_uri = save_img_to_folder(sr_img, image_name, srgan_directory.path, "sr-" + short_uuid)

        print('Total time (hh:mm:ss.ms) {}\n'.format(datetime.now() - total_time))
        return {"sr_image_url": sr_url, "sr_image_uri": sr_uri,
                "original_image_url": origin_url, "original_image_uri": origin_uri,
                "upscale_factor": upscale_factor}
    else:
        raise AlgorithmError("Please provide a well formed API call")

The code basically loads the received image (from a URL or from base64-encoded data) into memory, loads the pretrained model, runs it on the given input and returns the result.
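
On the client side, the second input form described in the docstring can be built with the standard library alone. The sketch below is only an illustration: `build_base64_payload` is a hypothetical helper name, and the image bytes are a stand-in for what a real client would read from disk.

```python
import base64
import json


def build_base64_payload(image_bytes, image_name, upscale_factor=4):
    """Build the JSON body for the "image_base64" form of the API call."""
    return json.dumps({
        "image_base64": base64.b64encode(image_bytes).decode("utf-8"),
        "image_name": image_name,
        "upscale_factor": str(upscale_factor),
    })


# Stand-in bytes; a real client would use open(path, "rb").read()
fake_image_bytes = b"\x89PNG fake image bytes"
body = build_base64_payload(fake_image_bytes, "my_image.png")

# The server side reverses the encoding with base64.b64decode()
decoded = json.loads(body)
assert base64.b64decode(decoded["image_base64"]) == fake_image_bytes
```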

As you can see, my pretrained model is stored at data://Ekami/torchlite_srpgan/Generator-0.1.pth, which is online storage offered by Algorithmia. On their advanced algorithm design page, they tell you to preload your model outside the scope of the apply() function. I don't recommend doing that unless you want to run into the same problems I did, where random errors like this one popped up:

Algorithmia.algo_response.AlgoException: invalid load key, '@'.

At this point, you may say: "That's great! Everything works now, right?" Well yeah... kind of. A few things Algorithmia doesn't tell you are how their serverless stack works and what hardware they use.

During development, the API calls were really slow, and I only found the source of this slowdown later on.
An engineer from Algorithmia told me that each time an API call is made, a zip file containing all the dependencies of your project is unzipped into a Docker container created for that call. That makes the initialization of the serverless stack very slow, as the PyTorch dependency is ~1.3 GB. So, in the end, I get slow API calls (about 2 min per call) while the inference itself takes less than 2 s...
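
A simple way to separate the platform's startup cost from the inference time is to time each stage individually. The helper below is a generic sketch using only the standard library; the `timed` name and the placeholder workload are mine and not part of the Algorithmia code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration (in seconds) of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("inference", timings):
    sum(range(1000))  # placeholder for the real model call

print(timings)
```

Comparing such per-stage timings with the total round-trip time seen by the client shows how much of a call is spent outside your own code.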

Another kink I found concerns feedback on API calls. When you launch a request that takes more than 10 s to run, you usually want some sort of feedback on the progress of the execution, but the way Algorithmia is built doesn't give you this flexibility. What I wanted was this: I make an API call, the call immediately returns a pub/sub channel to the client, the client subscribes to this channel and gets constant feedback on the progress of the execution, as well as the response output when everything is done.
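
The pattern I had in mind can be sketched in a few lines. This is an in-process toy, not something Algorithmia offers: the `channels` dict stands in for a real pub/sub broker, and `start_job`, `run_job` and `subscribe` are hypothetical names chosen for the example.

```python
import queue
import threading
import uuid

# Hypothetical in-memory stand-in for a pub/sub broker: one queue per channel
channels = {}


def run_job(channel_id, steps):
    """Worker: publish a progress message after each step, then the final result."""
    channel = channels[channel_id]
    for done, step in enumerate(steps, start=1):
        step()  # do one unit of work
        channel.put({"type": "progress", "done": done, "total": len(steps)})
    channel.put({"type": "result", "payload": "finished"})


def start_job(steps):
    """API call: return a channel id right away and do the work in the background."""
    channel_id = uuid.uuid4().hex
    channels[channel_id] = queue.Queue()
    threading.Thread(target=run_job, args=(channel_id, steps), daemon=True).start()
    return channel_id


def subscribe(channel_id):
    """Client side: yield messages until the final result arrives."""
    channel = channels[channel_id]
    while True:
        message = channel.get(timeout=10)
        yield message
        if message["type"] == "result":
            break


# Two dummy steps standing in for model loading and inference
for message in subscribe(start_job([lambda: None, lambda: None])):
    print(message)
```

The call to start_job() returns before any work is done, which is exactly the behavior I was missing: the client is never left waiting blindly for a single long response.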

Lastly, you do not know what GPU they are using in the background. This is very important to know, as models that require too much memory won't work. I found that my algorithm did not accept pictures above 700x700 in size (as they get upscaled 4x), which I believe corresponds to roughly 12 GB of GPU memory, based on experiments on my 1080 Ti.
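
To see why the limit is hit so quickly, it helps to look at how the number of pixels grows with the upscale factor. The sketch below is plain arithmetic; the 64-channel float32 feature map is a hypothetical figure chosen for illustration, not a measurement of my Generator.

```python
def feature_map_bytes(height, width, channels, bytes_per_element=4):
    """Bytes needed for one dense feature map (4 bytes per element for float32)."""
    return height * width * channels * bytes_per_element


upscale_factor = 4
input_side = 700
output_side = input_side * upscale_factor

# Upscaling each side by 4 multiplies the pixel count by 16
pixel_ratio = (output_side * output_side) // (input_side * input_side)

rgb_output = feature_map_bytes(output_side, output_side, channels=3)
# Hypothetical: a single 64-channel float32 activation at output resolution
hidden_map = feature_map_bytes(output_side, output_side, channels=64)

print(output_side, pixel_ratio)
print(f"RGB output: {rgb_output / 2**20:.1f} MiB")
print(f"One 64-channel map: {hidden_map / 2**30:.2f} GiB")
```

Under that assumption, a handful of activations at output resolution is enough to fill the memory of a consumer GPU.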

A new alternative to Algorithmia that I found is Paperspace Gradient (affiliate link), but it was released just as I finished my whole project...

Conclusion

So here we are. A barely working model but hey... it works! I can say it's acceptable for a free demo or MVP but definitely not for a real production system.

So to recap, here are the things to take away from this blog post:

As of version 0.3.1 of PyTorch, you don't want to use it in production if:

  • You have a model which accepts dynamic input sizes and/or uses dynamic neural network architectures such as RNNs, and you plan to use another framework like Caffe2 to serve your model in production. ONNX won't save you here.
  • You want to run inference on the CPU because using GPU instances or buying physical GPUs is too expensive, or because you'd rather add more RAM than buy a new GPU when your model needs more memory than initially planned (the issue is still there in PyTorch 0.4.0).

On the contrary, if you have the latest DGX-2 from Nvidia, then you're probably fine with PyTorch.

As for Algorithmia, you don't want to use their v1 if:

  • You're using PyTorch (or any framework whose size is above ~10 MB)
  • You need progressive feedback on your algorithm's execution
  • You want to use custom metrics to monitor your model

PyTorch is a framework that I enjoy working with compared to TensorFlow, but it's definitely not production ready for a lot of use cases.

Again don't hesitate to test the demo here. Here is an example of what you could get:

collage4

I hope you enjoyed the read, stay tuned for the final part!