🎁 Give the #1 gift request of 2024... a ZTM membership gift card! 🎁

Machine Learning Monthly Newsletter 💻🤖

Daniel Bourke

32nd issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.

Hey everyone!

Daniel here, I'm a Machine Learning Engineer who teaches the following beginner-friendly machine learning courses:

  1. Complete Machine Learning and Data Science Bootcamp: Zero to Mastery
  2. Get TensorFlow Developer Certified: Zero to Mastery
  3. NEW: PyTorch for Deep Learning: Zero to Mastery

I also write regularly about machine learning on my own blog, as well as make videos on the topic on YouTube.

Enough about me!

You're here for this month's Machine Learning Monthly Newsletter. Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

What you missed in August as a Machine Learning Engineer...

My work 👇

The Zero to Mastery PyTorch course is live!

Easily one of the most requested courses is officially live. Inside, you'll learn and practice writing code with PyTorch, the most popular deep learning framework, in a hands-on and beginner-friendly way.

PyTorch Paper Replicating and PyTorch Model Deployment materials completed.

The final two milestone projects of the ZTM PyTorch course have been completed and the videos are being edited as you read this. Expect to see them on your ZTM Academy dashboard within the next couple of weeks.

In the meantime, you can read the materials on learnpytorch.io.

From the Internet 🌐

What is machine learning model quantization?

I've been seeing a lot about model optimization lately.

How do you prepare your model in the best way possible for the most efficient results?

As in, if you want to deploy your model to a system for it to perform inference, how do you make that inference happen as fast as possible?

In the case of a Tesla self-driving car, model optimization might mean getting the model to make predictions fast whilst keeping energy usage low and accuracy high.

One of the best ways to optimize a model for deployment is via quantization.

Quantization in machine learning is the practice of reducing how much memory the parts of a neural network (weights, biases, activations) require.

Why would you want to do this?

Well, if a neural network requires 1GB of space (this number is arbitrary; some models require more, some less) and your device only has 512MB of space, the neural network won't be able to run on the device.

So one of the steps you might take is to quantize the model in an effort to reduce how much space it takes up.

This practice of reducing space is often referred to as "reducing precision".

What's precision?

Computers don't represent numbers exactly.

Instead, they use varying degrees of precision, with different combinations of 0's and 1's used to represent a number.

The more 0's and 1's used to represent a number, the higher the precision, but also the more storage required.

For example, if using 32-bit precision (also called float32 or single-precision floating-point format), a computer represents a number using 32 bits:

number (float32) = 01010101010101010101010101010101

The above example is a made-up sequence of 32 0's and 1's to represent some number.

Many deep learning libraries (such as PyTorch) use float32 as the default datatype.
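To see this in a quick PyTorch sketch (a made-up example, nothing from the course materials), you can check the default datatype and how many bytes each element takes up:

```python
import torch

# PyTorch creates floating point tensors as float32 (32 bits = 4 bytes per element) by default
x = torch.rand(3)
print(x.dtype)           # torch.float32
print(x.element_size())  # 4 bytes per element

# Casting to float16 halves the memory per element, at the cost of some precision
x_half = x.to(torch.float16)
print(x_half.dtype)           # torch.float16
print(x_half.element_size())  # 2 bytes per element
print(x[0].item(), x_half[0].item())  # the float16 value is a slightly rounded version of the float32 value
```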

However, due to the flexibility of neural networks, you can often use lower precision datatypes to represent numbers without many tradeoffs in performance (of course this will vary from problem to problem).

The float16 (half-precision floating-point format) datatype uses 16 bits to represent a number:

number (float16) = 0101010101010101

The number gets represented by fewer bits but is often still precise enough to be used in a neural network.

Using lower precision often speeds up training (fewer numbers to manipulate) and reduces model storage size (fewer bits representing a network).

PyTorch offers a way to use float16 representation during training through torch.amp, where amp stands for Automatic Mixed Precision, meaning PyTorch will automatically use mixed forms of precision (float32 and float16) where possible to improve training speed whilst attempting to maintain model performance.
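Here's a rough sketch of what a mixed precision training loop can look like (assuming a CUDA GPU is available; the tiny model and data are made up purely for illustration):

```python
import torch

# Hypothetical model, optimizer and data purely for illustration
model = torch.nn.Linear(10, 2).to("cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
X = torch.rand(32, 10, device="cuda")
y = torch.randint(0, 2, (32,), device="cuda")

# GradScaler scales the loss to stop gradients underflowing to zero in float16
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    optimizer.zero_grad()
    # Inside autocast, eligible operations run in float16, the rest stay in float32
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        y_pred = model(X)
        loss = loss_fn(y_pred, y)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then takes an optimizer step
    scaler.update()                # adjusts the scale factor for the next iteration
```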

Finally, one of the most aggressive forms of reducing precision is to convert neural network components to int8 representation (this is the datatype most often used when the term quantization is used), where a number is represented by 8 bits:

number (int8) = 01010101

Again, fewer bits to represent a single number.

So stepping through the examples above, we've gone from float32 to float16 to int8.

This is one of the main pros of quantization: a 4x reduction in the number of bits needed to represent a number.

Scale this process throughout all of the elements in a neural network (weights, activations, biases) and you often get a smaller model size and faster inference time (lower latency).
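To make this concrete, here's a minimal sketch of post-training dynamic quantization in PyTorch (the little model and the size-measuring helper are made up for illustration, they aren't from any particular project):

```python
import io
import torch

# A small hypothetical float32 model purely for illustration
model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Dynamic quantization converts the weights of the listed layer types to int8
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,
    {torch.nn.Linear},  # which layer types to quantize
    dtype=torch.qint8,
)

def model_size_mb(model: torch.nn.Module) -> float:
    """Rough size of a model's saved state_dict in megabytes."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

print(f"float32 model: {model_size_mb(model_fp32):.2f} MB")
print(f"int8 model:    {model_size_mb(model_int8):.2f} MB")  # roughly 4x smaller
```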

However, because the neural network is now performing inference using "less precise" numerical representations, you also often get a degradation in performance.

For example, take these two model scenarios:

  1. Non-quantized model (float32) – Model size 100MB, 99% accuracy, 10ms inference time.
  2. Quantized model (int8) – Model size 50MB, 97.5% accuracy, 2ms inference time.

I've made these numbers up but they illustrate what you can often expect with quantization.

The quantized model is half the size of the non-quantized model and performs inference 5x faster, but achieves 1.5% lower accuracy.

So you take a hit on performance but you get nice gains on storage requirements and inference time.

These improvements in storage size and inference time may be critical if you're trying to deploy your model to a device with limited compute power (such as a mobile device).

However, if you're going for the best performance possible with unlimited compute power, you'd likely opt for the biggest model you can.

For more on quantization and different datatypes, I'd recommend the following:

Practical Model Quantization

In light of the above, the following are a collection of resources for applying quantization in practice.

  • Practical Quantization in PyTorch – A blog post from the PyTorch team detailing how to perform several quantization methods in PyTorch.
  • Quantization in TensorFlow – A tutorial from the TensorFlow documentation for performing model quantization in TensorFlow.
  • Quantization with ONNX – ONNX stands for Open Neural Network Exchange. If you're deploying to a small computing device (such as a Raspberry Pi) or to a web browser (via ONNX Runtime), you'll likely run into ONNX.
  • Post-training Quantization of PyTorch models with OpenVINO and NNCF (Neural Network Compression Framework) – OpenVINO is a framework for speeding up model inference on CPUs. Quantizing an OpenVINO model improves that performance even further. For example, a ResNet50 model went from 1060.16 FPS (float32) to 3172.54 FPS (int8), an increase of ~3x!
  • Moving ML inference from the Cloud to the Edge (a significant speedup in inference time) – An excellent blog post detailing the benefits of quantizing a model and deploying it to the edge (running it in the browser) rather than in the cloud (much faster inference time).
    • Bonus: Many of the concepts in this blog post are referenced in the ZTM PyTorch course section 09. PyTorch Model Deployment, such as the benefit of the cloud offering unlimited compute but making you wait for inference to happen, whereas running on-device saves on compute resources but often requires a smaller model.
  • LLM.int8() and Emergent Features – LLM.int8() is a new research paper showing how quantization (converting most of the parameters of a model to int8) can be used to make LLMs (large language models) much more accessible to all. For example, running an LLM like GPT-3 with 175B parameters usually requires 100s, if not 1000s, of GPUs. With the help of quantization and the techniques in LLM.int8(), these models can now be run on consumer-accessible hardware.


Hardware requirements for running large language models (LLMs). Note that prior to LLM.int8() these models were often only usable on much larger compute resources. Source: LLM.int8() paper.

Stable Diffusion Public Release: Generate images with text!

Stable Diffusion is a machine learning model capable of generating images from text prompts.

If you've seen DALL·E by OpenAI, then consider Stable Diffusion the open-source version.


Images generated from the prompt "fun machine learning course being taught in the city of Atlantis". Source: Stable Diffusion Hugging Face Space Demo.

This is a super exciting release in the world of AI. And it's very telling of where the field is going: more open, more opportunities for all.

Stable Diffusion is also the model powering the new dreamstudio.ai, an interface for generating images from text prompts:


Generating an image with the prompt "fun machine learning course being taught in the city of Atlantis". Source: DreamStudio.

And perhaps the best of all, you can find all of the code for the model(s) on GitHub.

Beware the data science pin factory, become a full stack DS generalist

A classic essay from 2019 (yes, 2019 is now considered classic in the world of machine learning) by Eric Colson from Stitch Fix.

Eric argues that data science is not an assembly line (like a pin factory where everyone has a very specific role) but rather an environment that requires much trial and error across a wide range of topics.

In an assembly line, you can optimize things because you know your outcomes.

However, in data science, you can't necessarily optimize a process because you don't know your outcomes.

The cure?

Plenty of trial and error.

For data scientists and machine learning engineers, experiment often (especially if failure is low cost) and create a demo as soon as possible.

And if the cost of failure is high, use tried-and-tested practices.

What should you monitor in machine learning systems?

Once you've deployed a machine learning model, you'll likely want to know how it's performing in production.

This practice is called machine learning monitoring.

However, monitoring doesn't stop with just the model. In Monitoring ML systems in production. Which metrics should you track? by EvidentlyAI (a tool for monitoring ML systems), the authors extend ML model monitoring to ML system monitoring with four parts:

  1. ML system health monitoring – machine learning systems are a combination of ML-specific and non-ML-specific software, so how are the non-ML-specific parts performing?
  2. ML data quality monitoring – poor data in, poor results out. How does the data being fed into your model compare to the data it was trained on? (see the code sketch below)
  3. ML model quality monitoring – what do the outputs of your model look like? Are they in the right format/shape for the system you're building? What happens if something unexpected goes into your model? How does your model perform on different segments of the data compared to other segments?
  4. Business metrics and KPIs – a model performing at 99.9999% accuracy doesn't mean much if it doesn't help the business in some way. Monitoring how a model influences business metrics often depends on all of the other systems functioning correctly.


Table of different things to monitor in a machine learning system along with who's involved. Source: EvidentlyAI blog.
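And as a toy illustration of point 2 (a hand-rolled sketch with made-up data, not EvidentlyAI's actual API), a basic data quality check might compare missing values and feature means between training data and production data:

```python
import pandas as pd

# Hypothetical reference (training) data and current (production) data
train_df = pd.DataFrame({"price": [10.0, 12.5, 9.9, 11.2], "num_items": [1, 3, 2, 2]})
prod_df = pd.DataFrame({"price": [55.0, 60.1, None, 58.3], "num_items": [1, 2, 2, 1]})

# A very simple data quality check: missing values and mean shift per feature
for column in train_df.columns:
    missing = prod_df[column].isna().mean()
    train_mean, prod_mean = train_df[column].mean(), prod_df[column].mean()
    shift = abs(prod_mean - train_mean) / (abs(train_mean) + 1e-8)
    print(f"{column}: {missing:.0%} missing, mean shifted by {shift:.0%}")
    if missing > 0.05 or shift > 0.30:
        print(f"  -> alert: investigate '{column}' before trusting the model's outputs")
```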

Can you explain what you're building to a 5-year-old?

Eugene Yan's latest blog post, Simplicity is An Advantage but Sadly Complexity Sells Better, discusses how too often software and technology projects drown in complexity because it looks and sounds cool.

Whereas simplicity offers several benefits: easier onboarding for new staff/users (if you keep things simple, people can learn how to use/build on them faster) and a higher probability of longevity (simplicity often means using battle-tested tools such as Instagram scaling to 10+ million users with PostgreSQL).

And when it comes to machine learning techniques, even with all the latest breakthroughs, a lot of traditional techniques outperform newer, more complex methods:


Older/simpler machine learning techniques can often outperform newer/more complex techniques across a range of problem types. Source: Eugene Yan blog.

Research Papers 📰

Two cool papers caught my eye this month (out of the many curated from Paperswithcode.com):

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Robotics meets language modelling. Google AI and Everyday Robots partner up to create a robot with a single arm and many cameras to act as a language model's hands and eyes.

Given an instruction such as "I spilled my drink, can you help?", the robot interprets it using the language model and then proceeds to execute actions such as finding and picking up a sponge using its cameras and arm.

See the blog post and project website for cool video demos.

A Walk in the Park: Learning to Walk in 20 Minutes With Model-Free Reinforcement Learning

Another robotics paper, this time using reinforcement learning to teach an off-the-shelf robot dog to walk in ~20 minutes in a variety of environments, such as outdoors and on dirt.

Very cool to see reinforcement learning coming to the real world. A few years ago, I thought reinforcement learning only really worked in video games, but more and more research is proving this wrong.

See the paper website for more cool video demos.

YouTube videos 📺

  • Stability AI (the creators of Stable Diffusion) founder Emad Mostaque talks with Yannic Kilcher about what Stability AI is, why it was founded and what they plan to do in the future.
  • The Full Stack Deep Learning (a resource for learning all of the pieces of the puzzle that go into making machine learning-powered applications) 2022 curriculum is well underway and you can see the video lectures on YouTube.
  • MLOps Zoomcamp is a recently created free online course on MLOps (machine learning operations), you can find a playlist of all the videos on YouTube as well as all the code and materials on GitHub.
  • Tesla's head of Autopilot, Ashok Elluswamy, gave a talk on many of the computer vision updates coming to Tesla's self-driving cars at CVPR (Computer Vision and Pattern Recognition Conference) 2022 and it's now available to watch on YouTube. Watching this and seeing what's possible with computer vision not only amazed me but also got me excited for Tesla's upcoming AI day on September 30, 2022.

See you next month!

What a massive month for the ML world in August!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month, Daniel

www.mrdbourke.com | YouTube

By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

More from Zero To Mastery

ZTM Career Paths: Your Roadmap to a Successful Career in Tech

Whether you're a beginner or an experienced professional, figuring out the right next step in your career or changing careers altogether can be overwhelming. We created ZTM Career Paths to give you a clear step-by-step roadmap to a successful career.

Top 7 Soft Skills For Developers & How To Learn Them

Your technical skills will get you the interview. But soft skills will get you the job and advance your career. These are the top 7 soft skills all developers and technical people should learn and continue to work on.

Python Monthly Newsletter 💻🐍

33rd issue of Andrei Neagoie's must-read monthly Python Newsletter: Meta endorsed languages, tabs vs spaces, and Async for Python. All this and more. Read the full newsletter to get up-to-date with everything you need to know from last month.