/2021sp-pset-4-CalebEverett

Problem set implementing data pipeline using Luigi.

Primary language: Python

Pset 4

*Apologies for the links to the private classroom repo.

Overview

This assignment was about developing robust data processing pipelines with Luigi and about machine learning operations. The end product is a pipeline that downloads an image from S3 and applies neural style transfer to it using pre-trained deep learning models.

New Atomic Write

The first problem of this assignment was the development of a new atomic write function. It was written as a Luigi Task that writes atomically by way of an intermediate temporary file with the same extension as the target file. The solution is implemented in the csci_utils package so that it can be reused in other problem sets.
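The idea can be sketched with the standard library alone. This is a minimal illustration, not the csci_utils implementation: the name `atomic_write_path` is hypothetical, and the real version is wired into Luigi rather than being a bare context manager.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def atomic_write_path(target):
    """Yield a temporary path that keeps the suffix of ``target``.

    The temporary file is moved onto ``target`` only if the block finishes
    without raising, so readers never see a partially written file.
    (Hypothetical helper for illustration.)
    """
    target = Path(target)
    suffix = "".join(target.suffixes)  # keeps compound suffixes such as ".tar.gz"
    # Create the temp file next to the target so the final rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(suffix=suffix, dir=target.parent)
    os.close(fd)  # the caller reopens the path in whatever mode it needs
    try:
        yield tmp_name
        os.replace(tmp_name, target)  # atomic when both paths share a filesystem
    finally:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)  # clean up after a failed write
```

Keeping the extension on the temporary file matters when the writer infers the format from the file name, as many image and model libraries do.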

External Tasks

The next problem was to develop external tasks to verify the existence of files in S3. The original assignment called for two separate tasks, one for an image and one for a pre-trained model, but I ended up consolidating these into a single S3FileExistsTask that can be used for both purposes.
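As a rough sketch of what such an external task checks, the class below is a plain-Python stand-in, not the actual S3FileExistsTask. It assumes `client` behaves like `boto3.client("s3")`; in Luigi the same role would be played by an external task whose output is an S3 target. The class name `S3FileExistsSketch` is hypothetical.

```python
class S3FileExistsSketch:
    """Stand-in for an external task: no run(), only a completeness check."""

    def __init__(self, client, bucket, key):
        self.client = client  # assumed to behave like boto3.client("s3")
        self.bucket = bucket
        self.key = key

    def complete(self):
        # Keys are listed in lexicographic order, so if the exact key exists
        # it is the first result for its own prefix.
        response = self.client.list_objects_v2(
            Bucket=self.bucket, Prefix=self.key, MaxKeys=1
        )
        contents = response.get("Contents", [])
        return bool(contents) and contents[0]["Key"] == self.key
```

Because the bucket and key are ordinary arguments, one class covers both the image and the model, which is the consolidation described above.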

Luigi Downloads

We were also required to develop Luigi tasks to download an image and model from S3 to a local host. I also decided to consolidate these into a single S3DownloadTask that can be used to download any file from S3. Both this task and the previous one are parameterized to take an S3 bucket and paths to be generalizable to future problem sets.
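A generic download step might look like the following. This is a sketch under assumptions, not the S3DownloadTask itself: `client` is assumed to behave like `boto3.client("s3")`, and the function name and the ".part" temporary naming are made up for illustration.

```python
import os


def download_if_missing(client, bucket, key, local_path):
    """Download ``key`` from ``bucket`` to ``local_path`` unless it already exists.

    Hypothetical helper; ``client`` is assumed to expose ``download_file``
    the way a boto3 S3 client does.
    """
    if os.path.exists(local_path):
        return local_path  # already satisfied, nothing to do
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    # Download to a sibling temp file that keeps the extension, then rename,
    # so an interrupted download never looks like a finished one.
    root, ext = os.path.splitext(local_path)
    tmp_path = f"{root}.part{ext}"
    client.download_file(bucket, key, tmp_path)
    os.replace(tmp_path, local_path)
    return local_path
```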

Stylizing - Microsciences Approach

My solution to the stylizing includes the following functionality:

  1. PyTorch models are packaged as MLflow models using package_models.py. This downloads the required pre-trained models, loads them as PyTorch models and then saves them as MLflow models.
  2. stylize.py calls the MLflow models for predictions and handles pre- and post-processing.
  3. A Dockerfile creates an image with the packaged models and the stylize.py module in it.
  4. A Stylize Luigi Task builds the image and runs it using the Docker SDK, accepting command line arguments, to stylize an image. The image file is downloaded from S3 if it exists there but isn't present locally.
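The build-and-run step in item 4 can be sketched as below. This is not the Stylize task's code: `client` is assumed to behave like the object returned by `docker.from_env()` in the Docker SDK for Python, and the function name, tag, and arguments are hypothetical.

```python
def build_and_run(client, context_dir, tag, args, volumes=None):
    """Build the image from ``context_dir`` and run it once with ``args``.

    Hypothetical helper; ``client`` is assumed to expose ``images.build`` and
    ``containers.run`` like a Docker SDK client.
    """
    # Building on every call is cheap when nothing changed, because Docker
    # reuses cached layers.
    client.images.build(path=context_dir, tag=tag)
    # ``command`` is appended to the image's entry point; ``remove=True``
    # deletes the container once it exits.
    return client.containers.run(
        tag,
        command=list(args),
        volumes=volumes or {},
        remove=True,
    )
```

A call could look like `build_and_run(docker.from_env(), ".", "stylize:latest", ["--image", "luigi.jpg"])`, where the tag and flag are made up for the example.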

A couple of cool features of this set up:

  • Using the Docker SDK to call docker from within the Luigi task avoids having to use an external task.
  • Docker appends any arguments passed to the run command after the image name to the entry point, which is an easy way to pass additional arguments to the stylize module inside the container.
  • Packaging the models inside the Docker image as it is being built completely encapsulates them. Having the models in the MLflow format in addition to being inside a Docker container is probably overkill, but was useful to gain experience with both approaches.
  • The Stylize task calls for the image to be built every time it is run, to make sure that it exists before calling it to stylize an image, but if it already exists, Docker won't rebuild it.
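To illustrate the entry point behaviour, here is a sketch of what the receiving side of such a command line could look like. The flag names (`--image`, `--model`, `--output`) and their defaults are assumptions for the example and are not taken from stylize.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that Docker appends to the entry point.

    Illustrative only: the real stylize.py may use different flags.
    """
    parser = argparse.ArgumentParser(description="Stylize an image (illustrative CLI).")
    parser.add_argument("--image", required=True, help="path of the image to stylize")
    parser.add_argument("--model", default="default", help="name of the packaged model")
    parser.add_argument("--output", default="stylized.jpg", help="where to write the result")
    return parser.parse_args(argv)
```

With an exec-form entry point that runs the module, everything after the image name in the run command arrives in `argv`, so the container can be driven like a normal command line tool.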

Updown

The ability to force a download of the original image file even if it already exists was implemented first in csci_utils as a ForceableTask that other tasks inherit from. It has an optional parameter, force, which if set to True removes any of the task's outputs on init. When Luigi then checks whether the task is complete, it finds the outputs missing and runs the task. This functionality is included in Pset 4 in the main Stylize task, which also takes the optional force parameter and passes it through to the S3DownloadTask.
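The mechanism can be sketched without Luigi. The classes below are plain-Python stand-ins, not the csci_utils code: in Luigi, `force` would be a task parameter and the base class would be a Luigi task, and the names `ForceableSketch` and `WriteGreeting` are hypothetical.

```python
import os


class ForceableSketch:
    """Stand-in base class: deletes existing outputs on init when forced."""

    def __init__(self, force=False):
        self.force = force
        if self.force:
            for path in self.output_paths():
                if os.path.exists(path):
                    os.remove(path)  # makes complete() report False again

    def output_paths(self):
        raise NotImplementedError

    def complete(self):
        # Mirrors the scheduler's check: done only if every output exists.
        return all(os.path.exists(path) for path in self.output_paths())


class WriteGreeting(ForceableSketch):
    """Toy task with a single file output."""

    def __init__(self, path, force=False):
        self.path = path  # must be set before the base class inspects outputs
        super().__init__(force=force)

    def output_paths(self):
        return [self.path]

    def run(self):
        with open(self.path, "w") as fh:
            fh.write("hello")
```

Removing outputs at construction time means the scheduler needs no special case: a forced task simply looks incomplete.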

Submit

Submitting the quiz required uploading a file. The canvasapi uploader class was broken. I added a file upload method to my SubmissionManager class in csci_utils using requests to access the Canvas REST API directly.
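For orientation, the Canvas file upload flow has three steps: request an upload slot, post the file to the returned URL, then confirm at the returned location. The sketch below follows that flow under assumptions: `http` is assumed to behave like the `requests` module, the function name is hypothetical, and this is not the SubmissionManager method itself.

```python
import os


def upload_submission_file(http, base_url, token, course_id, assignment_id, file_path):
    """Upload a file for an assignment submission and return its file JSON.

    Hypothetical helper; ``http`` is assumed to expose ``post`` and ``get``
    with the same keyword arguments as the requests library.
    """
    auth = {"Authorization": f"Bearer {token}"}
    name = os.path.basename(file_path)

    # Step 1: tell Canvas about the file and receive an upload target.
    ticket_resp = http.post(
        f"{base_url}/api/v1/courses/{course_id}/assignments/{assignment_id}"
        "/submissions/self/files",
        headers=auth,
        data={"name": name, "size": os.path.getsize(file_path)},
    )
    ticket_resp.raise_for_status()
    ticket = ticket_resp.json()

    # Step 2: send the bytes to the upload URL, without the access token.
    with open(file_path, "rb") as fh:
        upload_resp = http.post(
            ticket["upload_url"],
            data=ticket["upload_params"],
            files={"file": (name, fh)},
            allow_redirects=False,
        )
    upload_resp.raise_for_status()

    # Step 3: confirm the upload by fetching the location Canvas points to.
    confirm_resp = http.get(upload_resp.headers["Location"], headers=auth)
    confirm_resp.raise_for_status()
    return confirm_resp.json()
```

The returned file record would then be referenced when creating the submission itself; that last call is omitted here.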

Testing

All of the above functionality is fully tested, including the model packaging code.

  • The file upload method test mocks the Canvas API for the required multi-step post process.
  • The tests for the S3 Luigi tasks mock the S3 API.
  • While most of the functionality called in the Stylize task consists of composed functions tested elsewhere, its test is a full cycle from downloading an image all the way through to stylizing. The final test takes a hash of the stylized image and asserts it is equal to a known hash value.
  • The model packaging and stylizing code is tested in a similar manner, with the final test comparing a hash of the stylized image against a known value.
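The hash comparison used in the final tests can be sketched as follows. The helper name `file_sha256` is hypothetical, and the choice of SHA-256 is an assumption, since the hash function used in the actual tests is not stated here.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A test would then assert that `file_sha256(stylized_path)` equals a digest recorded from a known-good run. This only works if the stylized output is byte-for-byte reproducible in the test environment.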