Finding a model

As we began the project, the first question was “Which model should we use?” I’ve never worked in a scientific field that is moving as fast as AI is. It is nearly a full-time job just to stay up to date with the research. So I read a pile of papers and settled on TridentNet, a recent object-detection model implemented in Detectron2, the detection framework created by Facebook Research. I figured it would be yesterday’s news soon enough, but it seemed to strike the right balance between achieving state-of-the-art results and being mature enough to make deployment possible.
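To give a sense of the starting point, here is a minimal inference sketch with Detectron2. The TridentNet-specific pieces (the tridentnet package and its config file) live in the projects/TridentNet folder of the Detectron2 repository and may differ between versions; the config path and weight file below are placeholders, not the exact ones I used.

```python
# Minimal Detectron2 inference sketch. Assumes detectron2 is installed and the
# TridentNet project folder from the detectron2 repo is on the Python path.
import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

from tridentnet import add_tridentnet_config  # from detectron2/projects/TridentNet

cfg = get_cfg()
add_tridentnet_config(cfg)                     # register TridentNet-specific options
cfg.merge_from_file("configs/tridentnet_fast_R_50_C4_3x.yaml")  # placeholder config path
cfg.MODEL.WEIGHTS = "model_final.pth"          # path to your trained weights
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5    # confidence threshold for detections

predictor = DefaultPredictor(cfg)
image = cv2.imread("example.jpg")              # BGR image, as Detectron2 expects
outputs = predictor(image)
print(outputs["instances"].pred_boxes)         # predicted bounding boxes
```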

Getting started in the cloud

Microsoft’s cloud computing platform, Azure, like Amazon’s AWS or Google Cloud, has a huge variety of offerings aimed at everyone from individuals to the multinational companies that are the world’s heaviest data users. The AI4E program was very helpful in getting me oriented to Azure’s machine learning offerings. The lessons I learned were:

  1. Use a Data Science Virtual Machine to start with. It’s a VM that has one or more GPUs and comes with most of the libraries you need for machine learning pre-installed. VMs are great for developing code (and Google’s Colab is even easier). Treat VMs as disposable: Python environments tend to get outdated and corrupted over time under heavy development, and it’s nice to be able to start again with a fresh machine. It makes sense to add an SSD data disk, but you’ll discover that storage is one of the largest costs on your account. Blob storage is cheaper but not as convenient, and it’s not really worth switching to unless a) you have a vast amount of data or b) you are working with an AML workspace (there’s a short blob-storage sketch after this list).
  2. I work in Python and PyTorch, and I use the wonderful interactive Jupyter notebook ecosystem (JupyterLab, to be specific) to develop code. However, JupyterLab occasionally introduces bugs or incompatibilities, and it certainly slows code down, so once your program is working and you need it to run fast, you’re better off calling Python code from the command line. I also use a package called nbdev that makes it easy to develop in a Jupyter notebook and then export the code as .py files (a small example of its style follows this list). However, nbdev requires you to work in a git repository that is set up with a special template, and getting used to its workflow requires a significant mind shift. The best general editor (really more of an IDE) that I’ve found is Microsoft’s Visual Studio Code. It can be used on your local machine and can also connect to a remote headless VM. Supposedly it can even be used to debug code running in a container on a remote VM, but I haven’t tried that (the Azure documentation for it runs to many dense and forbidding pages). At a minimum, it’s a good way to explore code.
  3. Once I had a model running, I gradually discovered the benefits of using an Azure Machine Learning (AML) workspace. It took quite a bit of work to learn the Python SDK (azureml-sdk) and even more to get comfortable building a containerized Docker environment, but it paid dividends when I needed to scale up for heavy training or inference (a sketch of a typical submission follows this list). AML was brand new and very buggy when I started, but it has since settled down, and the documentation is good. The 05_aml_pipeline.ipynb notebook shows how I used it, and the Dockerfile I built is at trident_project/docker_files in the repository.
  4. Deployment is a specialty. There are people called “DevOps engineers” who do nothing but figure out how to scale up software. I recommend easing into it gradually; don’t assume that you need it at first. It’s pretty amazing what you can accomplish with a single, cheap VM. Also ask yourself seriously what your endpoint is: do you really need to maintain a website with a backend model and database infrastructure, or can you get by sharing a VM with teammates, or producing batch results from time to time? When you eventually do need something bigger, Azure has all the tools you’ll need to run multiple containers with Kubernetes, recover from interrupted jobs, distribute your data, build pipelines, expose web endpoints, and so on (ditto for Amazon and Google, of course).
  5. AML accounts and permissions are a big headache, just as they are with AWS. Over the last two years I’ve had to switch between accounts, copy data, and rebuild virtual machines at least six times, and I’ve been shut out of accounts for a total of more than two months by a combination of issues that included some serious technical-support problems. That’s a lot of wasted time, and it makes you yearn for the simplicity of working on a local machine. But working in the cloud is inevitable, and Azure’s services are as good as any.
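For point 1, this is roughly what working with blob storage from Python looks like using the azure-storage-blob package (v12 API). The connection string, container, and file names are placeholders, not my actual setup.

```python
# Sketch of moving files to and from Azure Blob Storage with azure-storage-blob.
# The connection string, container, and blob names here are placeholders.
from azure.storage.blob import BlobServiceClient

conn_str = "<your-storage-account-connection-string>"
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("training-images")

# Upload a local file as a blob
with open("IMG_0001.jpg", "rb") as f:
    container.upload_blob(name="raw/IMG_0001.jpg", data=f, overwrite=True)

# Download it back to local disk
with open("IMG_0001_copy.jpg", "wb") as f:
    f.write(container.download_blob("raw/IMG_0001.jpg").readall())
```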
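For point 2, the heart of the nbdev workflow is marking notebook cells for export; the library then collects those cells into .py modules. The directive syntax and command names have changed between nbdev versions, so treat the details below as an illustration rather than the exact incantation (the function itself is just a made-up example).

```python
# A notebook cell in an nbdev project. The export directive at the top of the
# cell marks it for export into the package's .py files. Older nbdev releases
# used "#export"; newer ones use "#| export" -- check the version you have.
#export
def count_detections(outputs, score_thresh=0.5):
    "Count predicted objects above a score threshold in a Detectron2 output dict."
    instances = outputs["instances"]
    return int((instances.scores > score_thresh).sum())
```

Running the nbdev export command from the repository root (nbdev_build_lib in older versions, nbdev_export in newer ones) then regenerates the .py files from all exported cells.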
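For point 3, a training submission with the v1 azureml-sdk (the SDK that was current when I did this work) has roughly the shape below. The compute target name, Dockerfile path, and script name are placeholders, and the exact SDK calls vary a bit between releases.

```python
# Sketch of submitting a containerized training run with the v1 azureml-sdk.
# Compute target, Dockerfile path, and script names below are placeholders.
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

ws = Workspace.from_config()                   # reads config.json downloaded from the portal

# Build the run environment from a Dockerfile (e.g. the one in trident_project/docker_files)
env = Environment.from_dockerfile(name="trident-env",
                                  dockerfile="docker_files/Dockerfile")

config = ScriptRunConfig(
    source_directory="trident_project",        # code to snapshot and ship to the compute
    script="train.py",                         # entry point (placeholder name)
    compute_target="gpu-cluster",              # an AmlCompute cluster defined in the workspace
    environment=env,
)

run = Experiment(ws, name="tridentnet-training").submit(config)
run.wait_for_completion(show_output=True)      # stream logs until the run finishes
```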

An AML workspace integrates a lot of different machine learning functionality (image from Microsoft)

Keeping track of code

As I built one VM after another and switched back and forth between accounts, I realized that I had a seriously tangled mess of code to keep track of. My primary need was to keep the versions of code in sync across all of the machines I worked on, which included a laptop and 3 or 4 remote VMs on different accounts; working with collaborators was a secondary concern. I fell back on git and GitHub, the world standard. Git is superb but it is not simple; as someone once said, you never really stop learning how to use it. My solution was to create a private GitHub repository for keeping my own mess straight, sync it with each of the remote machines, and then push a clean master copy from that repository to another repository that I shared with Howard. It’s been a great system, but I have to remember to always pull from the central repository before beginning work on a new machine, and to push to it before shutting a machine down. The routine is sketched below.

My git workflow
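In code terms, the routine on each machine amounts to something like the following small helper that shells out to git. The remote names ("origin" for my private repository, "shared" for the one Howard and I both use) are hypothetical stand-ins for my actual setup.

```python
# Sketch of the sync discipline, expressed as a helper that shells out to git.
# Remote names ("origin", "shared") are hypothetical stand-ins.
import subprocess

def git(*args):
    "Run a git command in the current repository and fail loudly on errors."
    subprocess.run(["git", *args], check=True)

def start_of_session():
    "Before working on any machine: bring it up to date with the private repo."
    git("pull", "origin", "master")

def end_of_session(message="end-of-session sync"):
    "Before shutting a machine down: commit and push everything back."
    git("add", "-A")
    git("commit", "-m", message)   # raises if there is nothing to commit
    git("push", "origin", "master")

def publish_to_shared_repo():
    "Push a clean master from the private repository to the shared one."
    git("push", "shared", "master")
```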