42nd issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Hey there, Daniel here.
I’m a Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, I've done my best to keep things to the point.
Enough about me! You're here for this month's Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
I-JEPA (Image Joint Embedding Predictive Architecture) is a self-supervised learning model that learns high-quality image representations in far less training time than previous methods (2-10x less).
The model learns to “fill in the gap” on multiple blanked-out patches of an image.
Similar to how playing peek-a-boo with a child can help them learn different representations of your face.
The oldest learning trick in the book! Hide and seek comes to machine learning.
The research is one of the first big steps in Yann LeCun’s (head of AI research at Meta) vision to make AI systems learn more like humans.
I-JEPA architecture overview. Given the context of an image, can you predict the patches? Source: Meta AI blog.
Links: Blog post, paper, code and models on GitHub
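If you want a feel for the objective, here's a heavily simplified PyTorch sketch of the "predict the masked patch embeddings from the visible ones" idea. It is not the actual I-JEPA code (the real model uses ViT encoders, an EMA target encoder and careful masking strategies), just the shape of the loss: prediction happens in embedding space rather than pixel space.

```python
import torch
import torch.nn as nn

# Toy setup: an image split into 16 patches, each patch projected into 64 dims.
# I-JEPA predicts the *representations* of masked patches from the visible ones.
num_patches, embed_dim = 16, 64

context_encoder = nn.Linear(768, embed_dim)   # stands in for the ViT context encoder
target_encoder = nn.Linear(768, embed_dim)    # stands in for the (EMA) target encoder
predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                          nn.Linear(embed_dim, embed_dim))

patches = torch.randn(1, num_patches, 768)    # fake patchified image
mask = torch.zeros(num_patches, dtype=torch.bool)
mask[[3, 4, 7, 11]] = True                    # patches to "blank out"

context_embeds = context_encoder(patches[:, ~mask])   # embeddings of visible patches
with torch.no_grad():
    target_embeds = target_encoder(patches[:, mask])  # what we want to predict

# Predict the masked patch embeddings from the pooled context (a big simplification).
pooled_context = context_embeds.mean(dim=1, keepdim=True)
predicted = predictor(pooled_context).expand_as(target_embeds)

loss = nn.functional.mse_loss(predicted, target_embeds)
loss.backward()
print(f"Fill-in-the-gap loss: {loss.item():.4f}")
```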
Roboflow helps you organise your computer vision resources. From images to labelling to model training and deployment.
Two fantastic resources from them are:
Now you can not only get data for your computer vision problems (from Roboflow Datasets) but you can also use large foundation models to help with labelling it (from Roboflow Autodistill).
The auto-labelling is a huge opportunity.
Thanks to recent models such as SAM (Segment Anything) and GroundedSAM (Segment Anything with natural language grounding), you can pass in an image + text input and have the image automatically labelled.
For example, say you wanted to create a production line analyzer to detect milk bottles and caps, you could pass in an image of the production line along with the text [“bottle”, “cap”] and get back bounding box predictions for the input image!
You could then train a new model on these auto-generated labels and improve them over time!
An example of generating automatic bounding box predictions based on text-label inputs. Source: Roboflow blog.
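If you wanted to try the milk bottle example yourself, the snippet below roughly follows the Autodistill quickstart (treat the exact package and function names as an assumption and double-check the current docs; the image folder path is made up):

```python
# pip install autodistill autodistill-grounded-sam
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM

# Map text prompts to the class names we want in the auto-labelled dataset.
ontology = CaptionOntology({"bottle": "bottle", "cap": "cap"})

# GroundedSAM acts as the large "base" model that does the auto-labelling.
base_model = GroundedSAM(ontology=ontology)

# Label every image in the folder; the annotations get written out as a dataset
# you can train a smaller model on.
base_model.label("./production_line_images", extension=".jpg")
```

From there you could train a smaller target model (e.g. YOLOv8) on the generated dataset and keep improving the labels over time, as described above.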
I enjoyed reading two fun tech blog posts from Instacart on how they power their search.
Instacart’s business is to enable you to buy what you want from a store nearby and get it delivered to you.
So good quality search is fundamental.
How do they do it?
A mix of traditional matches (e.g. “bananas” → “bananas”) and embedding matches for semantic searches (e.g. “german foods” → “sauerkraut”, “pretzel”).
The blog post How Instacart Uses Embeddings to Improve Search Relevance has some fantastic takeaways on how to build a production-level semantic search system:
Example of before and after implementing the new embedding-powered search engine. Given the search query “german foods”, previously the results only returned food items with “german” in them. Afterwards, the results include items that are semantically related to the query but don’t necessarily explicitly contain it, for example, “soft pretzel”. Source: Instacart Tech blog.
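As a toy version of the embedding-matching idea (my sketch, not Instacart's system), you could embed the query and product names with a free Sentence Transformers model and rank products by cosine similarity:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, free, general-purpose embedding model

products = ["sauerkraut", "soft pretzel", "bratwurst", "banana", "toilet paper"]
product_embeddings = model.encode(products, convert_to_tensor=True)

query_embedding = model.encode("german foods", convert_to_tensor=True)

# Cosine similarity between the query and every product, highest first.
scores = util.cos_sim(query_embedding, product_embeddings)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{products[idx]}: {scores[idx].item():.3f}")
```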
One big thing you notice about product development is that once you’ve developed a model, actually deploying it and seeing how people interact with it is a totally different form of evaluation than just pure metrics.
This is where Instacart uses Machine Learning-Driven Autocomplete to Help People Fill Their Carts.
As in, once you’ve retrieved some results given a search query, how should you display them?
Or as in, once you’re typing in the search bar, say the letters “pa”, how should the model know that you might mean “paper” or “parmesan cheese”?
Or even if you type in “mac and cheese” and there are 1000 products similar to “mac and cheese” which should you display/not display (removing duplicates, increasing diversity etc)?
Example of how semantic deduplication happens in the search results. If a result is semantically very similar to another, it gets removed to show a more diverse set of results (rather than multiples of similar items). Source: Instacart Tech blog.
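A minimal sketch of that deduplication idea (again my simplification, not Instacart's code): drop any result whose embedding is too similar to one you've already kept.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

results = ["mac and cheese", "macaroni and cheese", "mac & cheese dinner", "cheddar cheese block"]
embeddings = model.encode(results, convert_to_tensor=True)

kept, kept_embeddings = [], []
SIMILARITY_THRESHOLD = 0.9  # tune this: higher lets more near-duplicates through

for result, embedding in zip(results, embeddings):
    # Keep the result only if it isn't too similar to anything already kept.
    if all(util.cos_sim(embedding, kept_emb).item() < SIMILARITY_THRESHOLD
           for kept_emb in kept_embeddings):
        kept.append(result)
        kept_embeddings.append(embedding)

print(kept)  # near-duplicates of "mac and cheese" get filtered out
```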
And because some queries are far more common than others, how can you save their results (store them in a cache) to serve them quicker?
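And the caching idea in its tiniest possible form (a real system would use a shared cache like Redis with expiry rather than an in-process one):

```python
import time
from functools import lru_cache

def expensive_search_backend(query: str) -> list:
    time.sleep(0.5)  # stand-in for the real retrieval + ranking pipeline
    return [f"result for '{query}' #{i}" for i in range(3)]

@lru_cache(maxsize=10_000)
def cached_search(query: str) -> tuple:
    # Popular queries ("milk", "bananas") hit the cache and skip the slow path entirely.
    return tuple(expensive_search_backend(query))

cached_search("milk")  # slow: goes to the backend
cached_search("milk")  # fast: served straight from the cache
```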
In the ML-powered autocomplete blog post, Instacart discuss how they tie together three models to achieve better search results:
Auto-labelling is the hot trend right now.
Or at least the workflow of auto-labelling to begin with and then improving the labels over time.
In other words: bootstrap labels with a large model → use them to train another model in a noisy supervised way → use the predictions of the new model to find out which parts of the labelling can be improved.
Jimmy Whitaker shares a blog post on how to do the above workflow with GPT-4 in the context of labelling text data for classification, named entity recognition (NER), sentiment analysis and more!
At the current prices of the GPT-4/GPT-3.5-turbo API, labelling 1,000 samples could cost from $0.21 to $3.18 (note: prices may change).
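As a flavour of what that looks like for sentiment labelling (my own prompt and helper function, not from Jimmy's post, and using the openai Python package as it looked at the time):

```python
# pip install "openai<1.0"  (pre-1.0 interface shown; assumes OPENAI_API_KEY is set)
import openai

def label_sentiment(text: str) -> str:
    """Ask the model to label one example; in practice you'd batch these and spot-check the results."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Reply with exactly one word: positive, negative or neutral."},
            {"role": "user", "content": f"Label the sentiment of this review: {text}"},
        ],
        temperature=0,  # keep the labels as deterministic as possible
    )
    return response["choices"][0]["message"]["content"].strip().lower()

print(label_sentiment("The delivery was fast and the bottles arrived intact!"))
```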
When GPT-4 was released, nothing specific about the architecture details was announced (unlike previous versions of GPT).
But people talk.
GPT-3 had 175B parameters. And GPT-4 is rumoured to be 8x 220B (~1.75T total) parameter models all working together.
If it’s true, this trick/technique is called “Mixture of Experts” or MoE for short.
You can think of it as wisdom of the crowd.
The logic is if one model is pretty good, then the combination of multiple models must be better.
You can achieve this by training multiple similar models but varying the training setup, data, architecture, initialization and other hyperparameters slightly across each.
One model might specialize in everything to do with legal questions and another might be very good at coding questions. Combining gets the best of them all.
Read more on the Weights & Biases blog post by Brett Young.
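For intuition (and assuming the rumour is even in the right ballpark), here's a toy mixture-of-experts layer: a small router learns to weight the outputs of several expert MLPs. Real MoE layers route sparsely (only the top-k experts run per token) to save compute, which this sketch skips.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """A toy mixture-of-experts layer: a router softly weights several expert MLPs."""

    def __init__(self, dim: int = 32, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)  # decides how much each expert handles each token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.router(x).softmax(dim=-1)                             # (batch, tokens, num_experts)
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, tokens, dim, num_experts)
        return (expert_outputs * weights.unsqueeze(-2)).sum(dim=-1)          # weighted sum over experts

tokens = torch.randn(2, 10, 32)  # (batch, tokens, dim)
print(TinyMoE()(tokens).shape)   # torch.Size([2, 10, 32])
```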
Two sensational case studies on how Apple performs on-device photo analysis:
The first is what powers the ability to extract salient (the most prominent) subjects from a photo and turn them into stickers or images.
The model uses an EfficientNetV2 backbone to extract features from an image and then create segmentation masks in under 10ms on an iPhone 14.
I love the discussion on training data creation.
The team used synthetic data in addition to real-world data.
In other words, they produced 2D and 3D synthetically generated segmentation data to enhance their dataset and make it more general across classes or images where real-world examples didn’t exist.
Another tidbit on evaluation was that metrics can show one thing but until you try the feature for yourself, you’re not really going to know how well it works.
So alongside metrics, they employed a team of human annotators to select and rate the best-quality outputs so researchers could review them further.
The second discusses Apple’s Neural Scene Analyzer (ANSA).
Which is a fancy way of describing a model that pulls a bunch of information out of a photo.
It’s an excellent example of how small models can accomplish a lot if trained and tuned in a focused way.
The deployed model is a version of MobileNetV3 and ends up having 26M parameters after pruning, a 24.6MB memory footprint and performs inference in 9.7ms.
And the workflow is: image → model → features → smaller model heads for different outputs (tags, faces, landmarks, objects).
Table comparing the different architectures for vision backbones. When deploying models to mobile devices, size is a limiting factor. But notice how MobileNetV3 achieves ~90% of the performance of ViT-B/16 with nearly 10x fewer parameters. Source: Apple Machine Learning blog.
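As a rough sketch of the image → shared features → multiple heads workflow above (hypothetical head sizes, nothing like Apple's actual implementation), using torchvision's MobileNetV3:

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

class PhotoAnalyzer(nn.Module):
    """One shared backbone feeding several lightweight task-specific heads."""

    def __init__(self):
        super().__init__()
        backbone = mobilenet_v3_large(weights=None)  # weights=None keeps the sketch download-free
        self.features = backbone.features            # shared feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        feature_dim = 960                            # channels out of MobileNetV3-Large's feature extractor
        # Hypothetical heads and output sizes; the real model predicts tags, faces, landmarks, objects, etc.
        self.tag_head = nn.Linear(feature_dim, 1000)
        self.landmark_head = nn.Linear(feature_dim, 500)
        self.face_head = nn.Linear(feature_dim, 128)

    def forward(self, image: torch.Tensor) -> dict:
        features = self.pool(self.features(image)).flatten(1)  # image -> shared features
        return {
            "tags": self.tag_head(features),
            "landmarks": self.landmark_head(features),
            "face_embedding": self.face_head(features),
        }

outputs = PhotoAnalyzer()(torch.randn(1, 3, 224, 224))
print({name: tensor.shape for name, tensor in outputs.items()})
```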
Apple hosted their World Wide Developers Conference for 2023 at the start of June.
And of course, the Vision Pro was announced (which uses a bunch of machine learning for its operations).
But there was also a bunch of cool machine learning updates on the developer side of things (tip: search/filter for "machine learning").
Such as my favourite: model compression in coremltools (Apple's open-source framework for converting models to run on Apple devices).
Model compression makes your model smaller (less storage) and faster at inference without compromising much on performance.
There are two main approaches for this: post-training compression (compress a model after it's already trained) and training with compression (build compression into the training loop).
Example of how post-training quantization can work well but fails at higher compression amounts. For the most compression, training with compression is recommended. Source: Use Core ML Tools for machine learning model compression talk.
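For example, post-training weight quantization of an existing Core ML model looks roughly like this (assuming coremltools 7+, and "MyModel.mlpackage" is a placeholder path; check the model optimization guide linked below for the exact API):

```python
# pip install "coremltools>=7.0"
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Load an existing Core ML model (placeholder path).
model = ct.models.MLModel("MyModel.mlpackage")

# Quantize the weights to 8-bit after training — no retraining required.
config = OptimizationConfig(global_config=OpLinearQuantizerConfig(mode="linear_symmetric"))
compressed_model = linear_quantize_weights(model, config=config)
compressed_model.save("MyModel_quantized.mlpackage")
```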
For more on preparing your models for Apple device deployment, see the coremltools GitHub repo and the coremltools model optimization guide.
The OpenAI API now has the ability to call functions in a language model.
For example, if you ask, “what’s the weather in California?”, it can return whether or not it thinks it should use a function such as get_weather_in_city().
It doesn’t actually execute the function, it only lets you know whether it thinks one should be used.
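Here's roughly what that looks like with the openai Python package as it was at launch (the function schema is illustrative and get_weather_in_city is the hypothetical function from above; the client interface has since been updated):

```python
import json
import openai  # pre-1.0 interface; assumes OPENAI_API_KEY is set

functions = [{
    "name": "get_weather_in_city",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in California?"}],
    functions=functions,
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model only *suggests* the call — executing it is up to you.
    print(message["function_call"]["name"])                   # get_weather_in_city
    print(json.loads(message["function_call"]["arguments"]))  # {"city": "California"}
```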
There are also some price reductions, such as 75% cheaper embeddings (though you can still get similar quality embeddings with Sentence Transformers for free) and 25% cheaper input tokens on gpt-3.5-turbo.
See the blog post and example notebook for more.
Let’s go quickfire on the research and open-source!
How RoboCat trains: start with 100-1000 demonstrations, practice the demonstrations, create new demonstrations through self-generation, retrain again on the new demonstrations, repeat. Source: DeepMind blog.
LordFoogThe2st has some great advice from the ZTM #machinelearning-ai Discord channel: experiment, experiment, experiment!
What a massive month for the ML world in June!
As always, let me know if there's anything you think should be included in a future post.
Liked something here? Leave a comment below.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.