56th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
A new blog called learnml.io — I’ve long had my own personal website (mrdbourke.com) where I’ve published machine learning-related articles in the past. But I decided to create a more ML-focused blog/resource where I can write about anything ML.
The website is bare for now.
But I’ll be creating blog posts, resources and tutorials about tidbits I learn in the ML world.
The first post is live and it’s called The importance of a test set.
Inside, I talk about how making a good test set is one of the most important things for any custom ML project (if you can’t test your model well, how can you ship it?).
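As a quick taste of the kind of thing the post covers, here’s a minimal sketch (not from the post itself) of building a leakage-free test split with scikit-learn, where all samples from the same source (e.g. the same user or photo session) stay on one side of the split. The data and group labels below are made up:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 10 samples collected from 4 different users.
X = np.arange(10).reshape(-1, 1)
y = np.random.randint(0, 2, size=10)
groups = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])  # which user each sample came from

# Keep every sample from a given user on one side of the split so the test set
# measures performance on unseen users (no leakage between train and test).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

print("Train groups:", set(groups[train_idx]))
print("Test groups:", set(groups[test_idx]))
```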
Comparison of internet data versus real-life data. Internet data is good to get started on many ML problems but to make it work in the real world, you’re going to need real world samples. Image is from a talk I gave on MLOps lessons learned building Nutrify.
Lucas Beyer is one of my favourite ML researchers. If you’re not following him on X/Twitter, you should.
He’s also one of the co-creators of the ViT architecture.
In On the speed of ViTs and CNNs, Lucas examines how ViTs and CNNs compare in samples per second across various batch sizes.
His findings show that ViTs can perform as fast or sometimes faster than CNNs at various batch sizes and image resolutions.
ViT architecture versus other kinds of CNN-like architectures (ConvNeXt and NFNet). Source: Lucas Beyer blog.
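If you want a rough feel for these numbers on your own hardware, here’s a minimal sketch (not Lucas’s benchmarking code) using timm and PyTorch to measure images per second for a ViT and a ConvNeXt at a given batch size and resolution:

```python
import time
import torch
import timm

def throughput(model_name, batch_size=32, img_size=224, steps=5):
    """Rough images/second for a forward pass (use a GPU for realistic numbers)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = timm.create_model(model_name, pretrained=False, num_classes=0).eval().to(device)
    x = torch.randn(batch_size, 3, img_size, img_size, device=device)
    with torch.inference_mode():
        model(x)  # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(steps):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return (batch_size * steps) / elapsed

for name in ["vit_base_patch16_224", "convnext_base"]:
    print(f"{name}: {throughput(name):.1f} images/sec")
```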
Another of my favourite takeaways was his claim about different image sizes:
My conservative claim is that you can always stretch to a square, and for:
This is important because depending on the problem you’re working on, a different image resolution may impact your results.
And using a higher image size often results in more compute required/longer training times.
If you’re training computer vision models, I’d highly recommend reading the blog post, it’s well worth it.
These are my favourite kind of stories.
A company or startup has a large data resource (or chooses to specialize in a certain thing) and then takes existing research and applies it to make their product better.
Pinterest is a visual platform which, among many other things, helps people find products they’d like to use.
So one of their biggest revenue generators is product advertisement.
Their latest research shows how they used generative models to create Pinterest Canvas, a model which is able to take a target product image and generate various backgrounds and styles while keeping the product intact.
How?
Using their own large custom image dataset (even after excessive filtering of their visual data, they were left with 1.5 billion high-quality text-image pairs), they trained a base generation model which is capable of creating high quality images from text.
They then fine-tuned that model to be able to segment a product from an image and recreate its background.
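Pinterest Canvas itself isn’t publicly available, but the general idea of keeping a subject fixed and regenerating its background can be sketched with an off-the-shelf inpainting pipeline from diffusers. The model choice, file names and prompt below are placeholders, not Pinterest’s setup:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Off-the-shelf inpainting model as a stand-in for Pinterest's fine-tuned generator.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

product = Image.open("product.png").convert("RGB")  # hypothetical product photo
mask = Image.open("background_mask.png")            # white = regenerate (background), black = keep (product)

result = pipe(
    prompt="the product on a rustic wooden table in soft morning light",
    image=product,
    mask_image=mask,
).images[0]
result.save("product_new_background.png")
```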
To create a personalized product image generation model, Pinterest used a combination of different embeddings across text and images before merging them in their own text-to-image generation model Pinterest Canvas. Source: Pinterest Engineering Blog.
People who would like to advertise their products on Pinterest are now able to alter their product photography with different styles to appeal to those with different interests.
Answer.AI is a newish company with the goal of shipping useful AI products.
As a small AI developer, this is inspiring to me.
And so far, they’ve been delivering.
I’ve been reading their various blog posts and learning about the potential of small embedding models (answerai-colbert-small-v1 shows that small doesn’t mean un-useful!), how to create things on the web without the complexity of the web (FastHTML) and how you can efficiently fine-tune Llama on a single GPU (with DoRA - Weight-Decomposed Low-Rank Adaptation).
For some idea on how research and development is done at Answer AI, I’d highly recommend checking out the following:
If you want to do AI research & development, working on small but useful projects, I’d highly recommend seeing how Answer.AI does it and copying their approach (e.g. build in public).
With open-source LLMs gaining more and more traction, it’s starting to make more and more sense to customize them to your own problems.
This can be done quite well with larger LLMs such as GPT-4o, Claude and Gemini using in-context learning (e.g. putting examples into your prompts so the LLM can mimic them).
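For example, here’s a minimal sketch of in-context learning with the OpenAI Python client (the sentiment task and the examples in the prompt are made up):

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in your environment

client = OpenAI()

# In-context learning: show the model a few labelled examples in the prompt
# so it mimics the pattern on a new input, no fine-tuning required.
messages = [
    {"role": "system", "content": "Classify the sentiment of customer reviews as positive or negative."},
    {"role": "user", "content": "Review: 'Arrived quickly and works perfectly.'"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: 'Broke after two days, very disappointed.'"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: 'The battery life is far better than I expected.'"},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)  # expected: "positive"
```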
However, these models are closed-source and require your data (often one of your business’s most valuable resources) to be sent to a third party for processing.
The solution?
Fine-tune an open-source LLM to suit your needs!
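If you haven’t done this before, here’s a rough sketch of what a parameter-efficient (LoRA) fine-tune can look like with Hugging Face transformers, peft and datasets. The model name, data file and hyperparameters are placeholders, not taken from Meta’s series:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Meta-Llama-3.1-8B"  # placeholder, any open-source LLM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Wrap the base model with small trainable LoRA adapters (the base weights stay frozen).
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Toy dataset: swap in your own domain-specific text.
dataset = load_dataset("text", data_files={"train": "my_domain_data.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llm-finetune", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```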
And Meta’s three part series can help you out:
Dropbox is one of the most popular pieces of software on the planet.
Their mission is to store and sort all of your files.
But as files grow (in size and number), this is no easy feat.
However, AI makes this easier.
In a recent post on their tech blog, the Dropbox team explained that one method for making search better is to turn all files into text.
Even for video files, you can go from video -> audio file -> audio transcript -> text -> embeddings.
Dropbox makes over 300 file types searchable with text by… turning them into text! And then embedding the text to make it semantically searchable. Source: Dropbox Tech Blog.
By doing this, you enable semantic searching (e.g. searching across embeddings) across a wide range of files.
They then use the results of these searches to power AI-based features such as Q&A with sources (answer questions based on your own documents) and summarization.
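As a toy version of that idea (nothing to do with Dropbox’s internal stack), here’s a minimal sketch that turns a video into text with openai-whisper, embeds it alongside other documents with sentence-transformers, and then searches them semantically. The file names and query are hypothetical:

```python
import whisper  # pip install openai-whisper (needs ffmpeg installed)
from sentence_transformers import SentenceTransformer, util

# 1. video -> audio -> transcript -> text (whisper extracts the audio via ffmpeg)
transcript = whisper.load_model("base").transcribe("team_meeting.mp4")["text"]

# 2. text -> embeddings, for the transcript plus other "files" already turned into text
documents = [
    transcript,
    "Q3 budget spreadsheet exported as text...",
    "Onboarding guide for new hires...",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model as a stand-in
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

# 3. semantic search: embed the query and rank documents by cosine similarity
query_embedding = embedder.encode("what did we decide about next quarter's budget?",
                                  convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = scores.argmax().item()
print(f"Best match (score {scores[best].item():.3f}): {documents[best][:80]}...")
```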
I also liked their text-chunking technique of collecting semantically similar chunks by clustering them with K-means. This means groups/paragraphs of text in a document which have similar meaning can be returned together rather than only sequentially.
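A stripped-down version of that chunk-grouping idea might look like this (again, just an illustration with made-up chunks, not Dropbox’s code):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical chunks from one document, deliberately mixing two topics.
chunks = [
    "Q3 revenue grew 12% year over year.",
    "The new onboarding flow reduced churn in trial accounts.",
    "Q4 revenue is forecast to grow another 15%.",
    "Churn improvements came mostly from better in-app tutorials.",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(chunks)

# Group semantically similar chunks so they can be returned together at query time.
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(embeddings)
for label, chunk in sorted(zip(labels, chunks)):
    print(label, chunk)
```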
One of my favourite websites for explaining how Convolutional Neural Networks (CNNs) work is CNN Explainer.
And now the same people have created Transformer Explainer!
And my gosh is it fun to use.
If you’re looking for an excellent resource to learn about the Transformer architecture (the same neural network architecture powering ChatGPT and similar LLMs), be sure to check it out.
Example of Transformer Explainer running in the browser showing the computations of the multi-head self attention operation. Source: Transformer Explainer website.
Since the initial release of the attention mechanism (one of the main building blocks of the Transformer architecture), there have been many iterations of it.
And the new torch.nn.attention.flex_attention module aims to make it easier to implement both existing attention variants and newer ones in just a few lines of PyTorch.
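Here’s a small sketch of the kind of thing it enables (requires a recent PyTorch release, 2.5+). The score_mod function below adds a made-up relative-position bias to the attention scores before the softmax:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# A score_mod lets you rewrite each attention score before the softmax.
# This one adds a (made-up) linear relative-position bias, ALiBi-style.
def relative_bias(score, batch, head, q_idx, kv_idx):
    return score + 0.1 * (kv_idx - q_idx)

batch, heads, seq_len, head_dim = 2, 4, 128, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

out = flex_attention(q, k, v, score_mod=relative_bias)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```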
The PyTorch team have also released an attention-gym repo on GitHub full of examples to learn from.
VLMs are models which merge the vision and language modalities (see VLMs explained as mentioned in the July 2024 edition of AI/ML monthly).
With all the movements in the open-source LLM world, it’s good to see that open-source VLMs are thriving too.
Namely:
General architecture overview of many modern VLMs. A vision encoder (e.g. SigLIP) is used to encode images and a text encoder (e.g. the embedding layer of Llama-3.1-8B) is used to encode text. These encodings are then fused/concatenated and fed to the rest of the LLM to decode the output tokens. Source: Idefics 3 paper.
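To make that architecture concrete, here’s a minimal sketch of running an open-source VLM locally with Hugging Face transformers (LLaVA 1.5 is used purely as an example and the image path is hypothetical):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example open-source VLM
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

image = Image.open("plate_of_food.jpg")  # hypothetical image
prompt = "USER: <image>\nWhat ingredients can you see in this photo? ASSISTANT:"

# The processor turns the image into vision-encoder inputs and the text into tokens,
# then the language model decodes output tokens conditioned on both.
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```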
Phew!
What a month in the VLM space!
One thing I noticed is the trend of using SigLIP as a vision encoder.
I use SigLIP almost every day and it’s an incredible model.
Bonus: For those wanting to learn more about it, I’d highly recommend this talk from one of the authors (Lucas Beyer, the same Lucas from the ViT vs. CNN post above).
A quote from the talk that stood out to me was:
The more precise your text is, the higher your matching score.
Notice how the matching score goes through the roof when you describe the image with precision. Something to keep in mind if you’re using the SigLIP model. Source: Lucas Beyer SigLIP talk on Cohere YouTube Channel.
This was referring to image-text similarity matching with SigLIP. If you can describe your image visually with text, chances are, SigLIP can match it correctly. This is a really powerful technique for things such as zero-shot labelling and clustering.
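Here’s a small sketch of that kind of image-text matching with a SigLIP checkpoint from Hugging Face (the image path and descriptions are made up), comparing a vague label against a precise one:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("ramen_photo.jpg")  # hypothetical image
texts = [
    "a photo of food",  # vague
    "a photo of a bowl of ramen with a soft-boiled egg and spring onions",  # precise
]

# SigLIP expects padding="max_length" and scores each image-text pair independently
# with a sigmoid (rather than a softmax across all labels, as in CLIP).
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits_per_image

probs = torch.sigmoid(logits)[0]
for text, prob in zip(texts, probs):
    print(f"{prob.item():.4f} | {text}")
```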
NVIDIA’s best practices for compressing a bigger LLM into a smaller LLM. Source: Compact Language Models via Pruning and Knowledge Distillation paper.
- FLUX.1-schnell (Apache 2.0) and FLUX.1-dev (non-commercial). From my brief experimenting, these are incredibly good models. They’re from the team who created the original Stable Diffusion models, you can read their launch post on their blog.
- stable-fast-3d enabling fast 3D asset generation from a single image.
- Grounded-SAM-2 enabling automatic segmentation labelling by combining Florence-2 (for text-based captions, see AI/ML monthly June 2024 for more), GroundingDINO (for boxes) and SAM-2 (for segmentation masks).

What a massive month for the ML world in August!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.