42nd issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Hey there, Daniel here.
I’m a Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, I've done my best to keep things to the point.
Enough about me! You're here for this month's Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
I-JEPA (Image Joint Embedding Predictive Architecture) is a self-supervised learning model that learns high-quality image representations in far less training time than previous methods (2-10x less).
The model learns to “fill in the gap” on multiple blanked-out patches of an image.
Similar to how playing peek-a-boo with a child can help them learn different representations of your face.
The oldest learning trick in the book! Hide and seek comes to machine learning.
The research is one of the first big steps in Yann LeCun’s (head of AI research at Meta) vision to make AI systems learn more like humans.
I-JEPA architecture overview. Given the context of an image, can you predict the patches? Source: Meta AI blog.
Links: Blog post, paper, code and models on GitHub
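If you want a feel for the objective, here's a heavily simplified PyTorch sketch of the "predict the masked patch embeddings from the visible ones" idea. It is not the actual I-JEPA code (the real model uses ViT encoders, an EMA target encoder and careful masking strategies), just the shape of the loss: prediction happens in embedding space rather than pixel space.

```python
import torch
import torch.nn as nn

# Toy setup: an image split into 16 patches, each patch projected into 64 dims.
# I-JEPA predicts the *representations* of masked patches from the visible ones.
num_patches, embed_dim = 16, 64

context_encoder = nn.Linear(768, embed_dim)   # stands in for the ViT context encoder
target_encoder = nn.Linear(768, embed_dim)    # stands in for the (EMA) target encoder
predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                          nn.Linear(embed_dim, embed_dim))

patches = torch.randn(1, num_patches, 768)    # fake patchified image
mask = torch.zeros(num_patches, dtype=torch.bool)
mask[[3, 4, 7, 11]] = True                    # patches to "blank out"

context_embeds = context_encoder(patches[:, ~mask])   # embeddings of visible patches
with torch.no_grad():
    target_embeds = target_encoder(patches[:, mask])  # what we want to predict

# Predict the masked patch embeddings from the pooled context (a big simplification).
pooled_context = context_embeds.mean(dim=1, keepdim=True)
predicted = predictor(pooled_context).expand_as(target_embeds)

loss = nn.functional.mse_loss(predicted, target_embeds)
loss.backward()
print(f"Fill-in-the-gap loss: {loss.item():.4f}")
```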
Roboflow helps you organise your computer vision resources. From images to labelling to model training and deployment.
Two fantastic resources from them are:
Now you can not only get data for your computer vision problems (from Roboflow Datasets) but you can also use large foundation models to help with labelling it (from Roboflow Autodistill).
The auto-labelling is a huge opportunity.
Thanks to recent models such as SAM (Segment Anything) and GroundedSAM (Segment Anything with natural language grounding), you can pass in an image + text input and have the image automatically labelled.
For example, say you wanted to create a production line analyzer to detect milk bottles and caps, you could pass in an image of the production line along with the text [“bottle”, “cap”] and get back bounding box predictions for the input image!
You could then train a new model on these auto-generated labels and improve them over time!
An example of generating automatic bounding box predictions based on text-label inputs. Source: Roboflow blog.
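If you wanted to try the milk bottle example yourself, the snippet below roughly follows the Autodistill quickstart (treat the exact package and function names as an assumption and double-check the current docs; the image folder path is made up):

```python
# pip install autodistill autodistill-grounded-sam
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM

# Map text prompts to the class names we want in the auto-labelled dataset.
ontology = CaptionOntology({"bottle": "bottle", "cap": "cap"})

# GroundedSAM acts as the large "base" model that does the auto-labelling.
base_model = GroundedSAM(ontology=ontology)

# Label every image in the folder; the annotations get written out as a dataset
# you can train a smaller model on.
base_model.label("./production_line_images", extension=".jpg")
```

From there you could train a smaller target model (e.g. YOLOv8) on the generated dataset and keep improving the labels over time, as described above.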
I enjoyed reading two fun tech blog posts from Instacart on how they power their search.
Instacart’s business is to enable you to buy what you want from a store nearby and get it delivered to you.
So good quality search is fundamental.
How do they do it?
A mix of traditional matches (e.g. “bananas” → “bananas”) and embedding matches for semantic searches (e.g. “german foods” → “sauerkraut”, “pretzel”).
The blog post How Instacart Uses Embeddings to Improve Search Relevance has some fantastic takeaways on how to build a production-level semantic search system:
Example of before and after implementing the new embedding-powered search engine. Given the search query “german foods”, previously the results only returned food items with “german” in them. Afterwards, the results include items that are semantically related to the query but don’t necessarily explicitly contain it, for example, “soft pretzel”. Source: Instacart Tech blog.
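As a toy version of the embedding-matching idea (my sketch, not Instacart's system), you could embed the query and product names with a free Sentence Transformers model and rank products by cosine similarity:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, free, general-purpose embedding model

products = ["sauerkraut", "soft pretzel", "bratwurst", "banana", "toilet paper"]
product_embeddings = model.encode(products, convert_to_tensor=True)

query_embedding = model.encode("german foods", convert_to_tensor=True)

# Cosine similarity between the query and every product, highest first.
scores = util.cos_sim(query_embedding, product_embeddings)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{products[idx]}: {scores[idx].item():.3f}")
```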
One big thing you notice about product development is that once you’ve developed a model, actually deploying it and seeing how people interact with it is a totally different form of evaluation than just pure metrics.
This is where Instacart uses Machine Learning-Driven Autocomplete to Help People Fill Their Carts.
As in, once you’ve retrieved some results given a search query, how should you display them?
Or as in, once you’re typing in the search bar, say the letters “pa”, how should the model know that you might mean “paper” or “parmesan cheese”?
Or even if you type in “mac and cheese” and there are 1000 products similar to “mac and cheese” which should you display/not display (removing duplicates, increasing diversity etc)?
Example of how semantic deduplication happens in the search results. If a result is semantically very similar to another, it gets removed to show a more diverse set of results (rather than multiples of similar items). Source: Instacart Tech blog.
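A minimal sketch of that deduplication idea (again my simplification, not Instacart's code): drop any result whose embedding is too similar to one you've already kept.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

results = ["mac and cheese", "macaroni and cheese", "mac & cheese dinner", "cheddar cheese block"]
embeddings = model.encode(results, convert_to_tensor=True)

kept, kept_embeddings = [], []
SIMILARITY_THRESHOLD = 0.9  # tune this: higher lets more near-duplicates through

for result, embedding in zip(results, embeddings):
    # Keep the result only if it isn't too similar to anything already kept.
    if all(util.cos_sim(embedding, kept_emb).item() < SIMILARITY_THRESHOLD
           for kept_emb in kept_embeddings):
        kept.append(result)
        kept_embeddings.append(embedding)

print(kept)  # near-duplicates of "mac and cheese" get filtered out
```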
And because some queries are far more common than others, how can you save their results (store them in a cache) to serve them quicker?
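And the caching idea in its tiniest possible form (a real system would use a shared cache like Redis with expiry rather than an in-process one):

```python
import time
from functools import lru_cache

def expensive_search_backend(query: str) -> list:
    time.sleep(0.5)  # stand-in for the real retrieval + ranking pipeline
    return [f"result for '{query}' #{i}" for i in range(3)]

@lru_cache(maxsize=10_000)
def cached_search(query: str) -> tuple:
    # Popular queries ("milk", "bananas") hit the cache and skip the slow path entirely.
    return tuple(expensive_search_backend(query))

cached_search("milk")  # slow: goes to the backend
cached_search("milk")  # fast: served straight from the cache
```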
In the ML-powered autocomplete blog post, Instacart discuss how they tie together three models to achieve better search results:
Auto-labelling is the hot trend right now.
Or at least the workflow of auto-labelling to begin with and then improving the labels over time.
In other words: bootstrap labels with a large model → use them to train another model in a noisy supervised way → use the predictions of the new model to find out which parts of the labelling can be improved.
Jimmy Whitaker shares a blog post on how to do the above workflow with GPT-4 in the context of labelling text data for classification, named entity recognition (NER), sentiment analysis and more!
At the current prices of the GPT-4/GPT-3.5-turbo API, labelling 1,000 samples could cost from $0.21 to $3.18 (note: prices may change).
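As a flavour of what that looks like for sentiment labelling (my own prompt and helper function, not from Jimmy's post, and using the openai Python package as it looked at the time):

```python
# pip install "openai<1.0"  (pre-1.0 interface shown; assumes OPENAI_API_KEY is set)
import openai

def label_sentiment(text: str) -> str:
    """Ask the model to label one example; in practice you'd batch these and spot-check the results."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Reply with exactly one word: positive, negative or neutral."},
            {"role": "user", "content": f"Label the sentiment of this review: {text}"},
        ],
        temperature=0,  # keep the labels as deterministic as possible
    )
    return response["choices"][0]["message"]["content"].strip().lower()

print(label_sentiment("The delivery was fast and the bottles arrived intact!"))
```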
When GPT-4 was released, nothing specific about the architecture details was announced (unlike previous versions of GPT).
But people talk.
GPT-3 had 175B parameters. And GPT-4 is rumoured to be 8x 220B (~1.75T total) parameter models all working together.
If it’s true, this trick/technique is called “Mixture of Experts” or MoE for short.
You can think of it as wisdom of the crowd.
The logic is if one model is pretty good, then the combination of multiple models must be better.
You can achieve this by training multiple similar models but varying the training setup, data, architecture, initialization and other hyperparameters slightly across each.
One model might specialize in everything to do with legal questions and another might be very good at coding questions. Combining gets the best of them all.
Read more on the Weights & Biases blog post by Brett Young.
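For intuition (and assuming the rumour is even in the right ballpark), here's a toy mixture-of-experts layer: a small router learns to weight the outputs of several expert MLPs. Real MoE layers route sparsely (only the top-k experts run per token) to save compute, which this sketch skips.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """A toy mixture-of-experts layer: a router softly weights several expert MLPs."""

    def __init__(self, dim: int = 32, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)  # decides how much each expert handles each token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.router(x).softmax(dim=-1)                             # (batch, tokens, num_experts)
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, tokens, dim, num_experts)
        return (expert_outputs * weights.unsqueeze(-2)).sum(dim=-1)          # weighted sum over experts

tokens = torch.randn(2, 10, 32)  # (batch, tokens, dim)
print(TinyMoE()(tokens).shape)   # torch.Size([2, 10, 32])
```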
Two sensational case studies on how Apple performs on-device photo analysis:
The first is what powers the ability to extract salient (the most prominent) subjects from a photo and turn them into stickers or images.
The model uses an EfficientNetV2 backbone to extract features from an image and then create segmentation masks in under 10ms on an iPhone 14.
I love the discussion on training data creation.
The team used synthetic data in addition to real-world data.
In other words, they produced 2D and 3D synthetically generated segmentation data to enhance their dataset and make it more general across classes or images where real-world examples didn’t exist.
Another tidbit on evaluation was that metrics can show one thing but until you try the feature for yourself, you’re not really going to know how well it works.
So alongside metrics, they employed a team of human annotators to select and rate the best-quality outputs so researchers could review them further.
The second discusses Apple’s Neural Scene Analyzer (ANSA).
Which is a fancy way of describing a model that pulls a bunch of information out of a photo.
It’s an excellent example of how small models can accomplish a lot if trained and tuned in a focused way.
The deployed model is a version of MobileNetV3 and ends up having 26M parameters after pruning, a 24.6MB memory footprint and performs inference in 9.7ms.
And the workflow is: image → model → features → smaller model heads for different outputs (tags, faces, landmarks, objects).
Table comparing the different architectures for vision backbones. When deploying models to mobile devices, size is a limiting factor. But notice how MobileNetV3 achieves ~90% of the performance of ViT-B/16 with nearly 10x fewer parameters. Source: Apple Machine Learning blog.
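As a rough sketch of the image → shared features → multiple heads workflow above (hypothetical head sizes, nothing like Apple's actual implementation), using torchvision's MobileNetV3:

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

class PhotoAnalyzer(nn.Module):
    """One shared backbone feeding several lightweight task-specific heads."""

    def __init__(self):
        super().__init__()
        backbone = mobilenet_v3_large(weights=None)  # weights=None keeps the sketch download-free
        self.features = backbone.features            # shared feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        feature_dim = 960                            # channels out of MobileNetV3-Large's feature extractor
        # Hypothetical heads and output sizes; the real model predicts tags, faces, landmarks, objects, etc.
        self.tag_head = nn.Linear(feature_dim, 1000)
        self.landmark_head = nn.Linear(feature_dim, 500)
        self.face_head = nn.Linear(feature_dim, 128)

    def forward(self, image: torch.Tensor) -> dict:
        features = self.pool(self.features(image)).flatten(1)  # image -> shared features
        return {
            "tags": self.tag_head(features),
            "landmarks": self.landmark_head(features),
            "face_embedding": self.face_head(features),
        }

outputs = PhotoAnalyzer()(torch.randn(1, 3, 224, 224))
print({name: tensor.shape for name, tensor in outputs.items()})
```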
Apple hosted their World Wide Developers Conference for 2023 at the start of June.
And of course, the Vision Pro was announced (which uses a bunch of machine learning for its operations).
But there was also a bunch of cool machine learning updates on the developer side of things (tip: search/filter for "machine learning").
Such as my favourite: model compression in coremltools (Apple's open-source framework for converting models to run on Apple devices).
Model compression makes your model smaller (less storage) and faster at inference without compromising much on performance.
There are two main approaches for this: post-training compression (compress a model after it's already trained) and training with compression (build compression into the training loop).
Example of how post-training quantization can work well but fails at higher compression amounts. For the most compression, training with compression is recommended. Source: Use Core ML Tools for machine learning model compression talk.
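For example, post-training weight quantization of an existing Core ML model looks roughly like this (assuming coremltools 7+, and "MyModel.mlpackage" is a placeholder path; check the model optimization guide linked below for the exact API):

```python
# pip install "coremltools>=7.0"
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Load an existing Core ML model (placeholder path).
model = ct.models.MLModel("MyModel.mlpackage")

# Quantize the weights to 8-bit after training — no retraining required.
config = OptimizationConfig(global_config=OpLinearQuantizerConfig(mode="linear_symmetric"))
compressed_model = linear_quantize_weights(model, config=config)
compressed_model.save("MyModel_quantized.mlpackage")
```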
For more on preparing your models for Apple device deployment, see the coremltools GitHub repo and the coremltools model optimization guide.
The OpenAI API now has the ability to call functions in a language model.
For example, if you ask, “what’s the weather in California?”, it can return whether or not it thinks it should use a function such as get_weather_in_city().
It doesn’t actually execute the function, it only lets you know whether it thinks one should be used.
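Here's roughly what that looks like with the openai Python package as it was at launch (the function schema is illustrative and get_weather_in_city is the hypothetical function from above; the client interface has since been updated):

```python
import json
import openai  # pre-1.0 interface; assumes OPENAI_API_KEY is set

functions = [{
    "name": "get_weather_in_city",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in California?"}],
    functions=functions,
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model only *suggests* the call — executing it is up to you.
    print(message["function_call"]["name"])                   # get_weather_in_city
    print(json.loads(message["function_call"]["arguments"]))  # {"city": "California"}
```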
There are also some price reductions, such as 75% cheaper embeddings (though you can still get similar quality embeddings with Sentence Transformers for free) and 25% cheaper input tokens on gpt-3.5-turbo.
See the blog post and example notebook for more.
Let’s go quickfire on the research and open-source!
How RoboCat trains: start with 100-1000 demonstrations, practice the demonstrations, create new demonstrations through self-generation, retrain again on the new demonstrations, repeat. Source: DeepMind blog.
LordFoogThe2st has some great advice from the ZTM #machinelearning-ai Discord channel: experiment, experiment, experiment!
What a massive month for the ML world in June!
As always, let me know if there's anything you think should be included in a future post.
Liked something here? Leave a comment below.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.