[May 2025] AI & Machine Learning Monthly Newsletter

Daniel Bourke

65th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in May 2025 as an A.I. & Machine Learning Engineer... let's get you caught up!

My work

Apache 2.0 object detection models — I wrote a short blog post collecting different Apache 2.0 object detection models. These are models I've tested and found to perform well on various object detection tasks, on par with or better than YOLO-based equivalents.

This is part of an upcoming object detection project I’m working on for ZTM.

From the Internet

Netflix share how they created a foundation model for recommendations.

Traditionally Netflix uses many recommendation models to produce "what to watch next" items on their apps.

However, using and updating many models can lead to many challenges.

If one breaks, chances are, its errors will cascade.

It also means that many different features often have to be engineered to fit the model's requirements rather than the model learning the required features directly from the data.

In a recent blog post, they share how they adopted LLM-like strategies to create a foundation model for recommendations.

From data filtering to structuring to weighting to encoding strategies to model training and scaling.

It turns out, just like LLMs, foundation models for recommendation systems improve with scale (e.g. more parameters = better performance).

The benefit of having a foundation model for recommendations is that there's now a single source which can be used for various use cases: fine-tuning for specific tasks, using directly for predictions, or leveraging for embeddings.
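To make the "single source" idea concrete, here's a minimal PyTorch-style sketch (my own illustration, not Netflix's code) of the three ways a pretrained foundation recommender can be reused: extracting embeddings, predicting directly, and fine-tuning a task-specific head. The FoundationRecommender class and its sizes are entirely hypothetical.

```python
import torch
from torch import nn

class FoundationRecommender(nn.Module):
    """Hypothetical pretrained foundation model: item histories in, next-item scores out."""
    def __init__(self, num_items: int, dim: int = 256):
        super().__init__()
        self.item_embeddings = nn.Embedding(num_items, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, num_items)  # next-item prediction head

    def embed(self, item_history: torch.Tensor) -> torch.Tensor:
        # Use case 1: leverage for embeddings (e.g. as features for downstream models).
        hidden = self.encoder(self.item_embeddings(item_history))
        return hidden.mean(dim=1)

    def forward(self, item_history: torch.Tensor) -> torch.Tensor:
        # Use case 2: use directly for predictions (scores over the whole catalogue).
        return self.head(self.embed(item_history))

# Use case 3: fine-tune for a specific task by freezing the backbone and training a new head.
model = FoundationRecommender(num_items=10_000)
for param in model.parameters():
    param.requires_grad = False
task_head = nn.Linear(256, 1)  # e.g. predict probability of finishing a show
scores = task_head(model.embed(torch.randint(0, 10_000, (8, 20))))
print(scores.shape)  # torch.Size([8, 1])
```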

netflix-strategies

Netflix moving from a model-centric approach to a data-centric approach. In other words, freeze the model and iterate on the data.

Philipp Schmid writes about the differences between AI workflows and AI agent patterns.

In essence, use AI workflows for predictable tasks where the required steps are known and use AI agents when flexibility and model-driven decision-making are required.

In the article, Phil writes about:

  • Prompt chaining (feed the outputs of one AI model into another)
  • Routing (use one AI model to route requests to a specific model)
  • Parallelisation (call multiple AI models at once and combine the answers)
  • Reflection (AI model reflects on its answers before delivering a final one)
  • Tool use (AI model picks a tool to use to complete a task)
  • Planning (use an AI model to plan tasks to complete)

See the full article for more.

Bonus: See agentrecipes.com (free) for an excellent collection of code examples for different kinds of AI patterns.
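To make a couple of these patterns concrete, here's a minimal sketch of prompt chaining and routing. The call_llm function is a hypothetical stand-in for whatever LLM client you use (this isn't code from Phil's article or agentrecipes.com).

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM provider's API call; swap in a real client."""
    return f"[placeholder response to: {prompt[:40]}...]"

# Prompt chaining: feed the output of one call into the next.
def summarise_then_translate(document: str) -> str:
    summary = call_llm(f"Summarise the following document in 3 sentences:\n{document}")
    return call_llm(f"Translate this summary into French:\n{summary}")

# Routing: use one model call to decide which specialised prompt handles the request.
ROUTES = {
    "billing": "You are a billing support specialist. Answer: ",
    "technical": "You are a senior support engineer. Answer: ",
    "general": "You are a friendly general assistant. Answer: ",
}

def route_request(user_message: str) -> str:
    category = call_llm(
        "Classify this message as one of 'billing', 'technical' or 'general'. "
        f"Reply with the single word only.\nMessage: {user_message}"
    ).strip().lower()
    prompt_prefix = ROUTES.get(category, ROUTES["general"])  # fall back to general
    return call_llm(prompt_prefix + user_message)

print(route_request("My last invoice was charged twice."))
```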

Spotify shares how they built Text2Tracks, a system for retrieving songs based on natural language searches.

It’s Christmas and you want “Christmas songs” but there’s a lot in that search term.

Do you want the tracks with “Christmas” in their name? Or do you want the popular Christmas songs you hear when visiting the shops at Christmas time?

Or how about “Sunday afternoon BBQ chill”?

It’s hard to match songs based on their title to a query like that.

To build Text2Tracks, Spotify fine-tuned an LLM to generate track IDs based on a query rather than generating Artist Name and Song Title pairs directly.

text2tracks-overview

Spotify’s Text2Tracks overview: going from input prompt → recommended track IDs.

They found that generating "Artist Track Name" style outputs could take too long (when titles are long) and that song titles often fail to capture the overall feel of a song.

Instead, they found the two best options were "Artist Track Integer" IDs, for example "artist_001_track_001" (a surprisingly simple yet effective approach), and Semantic IDs, which learn a vector representation from collaborative filtering embeddings (built by mining patterns of songs appearing together in playlists).

This Semantic ID technique turned out to work 2.7x better than BM25, a popular technique for keyword indexing.
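As a rough illustration of how Semantic IDs can be derived from collaborative filtering embeddings, here's a sketch that quantises track vectors into a small discrete vocabulary an LLM could learn to generate. The embeddings, cluster count and ID format are all made up for the example; Spotify's actual approach is described in their post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Pretend collaborative-filtering embeddings: tracks that co-occur in playlists sit close together.
rng = np.random.default_rng(42)
track_embeddings = rng.normal(size=(1_000, 64))  # 1,000 tracks, 64-d vectors (made up)

# Quantise the continuous embeddings into a small vocabulary of discrete "semantic ID" tokens
# that a language model can learn to generate as its output.
kmeans = KMeans(n_clusters=32, random_state=42, n_init=10).fit(track_embeddings)
semantic_ids = [f"sem_{cluster:03d}" for cluster in kmeans.labels_]

# A track is now represented by a short discrete code rather than its free-text title.
print(semantic_ids[:5])
```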

LinkedIn share how they improved their job posting embeddings with LLMs.

LinkedIn’s JUDE stands for Job Understanding Data Expert.

The goal is to match job postings with candidates and vice versa.

This is no easy task.

Job postings can have large amounts of text, different locations, different requirements, slightly different titles.

And applicants can have almost an infinite variety of attributes.

A naive way to match would be straight text matching.

But at LinkedIn's scale of 1 billion users and 10 million+ job postings, embeddings often come into play to find the right match.

JUDE uses a LoRA fine-tuned Mistral LLM (as well as others if needed) to generate job posting and candidate embeddings.

When someone posts a job, the model is able to create embeddings in ~300ms. There's also a smart caching system which checks for significant changes in a posting before making calls to the LLM. If the change is significant enough, new embeddings are generated; if not, the existing ones are kept. This caching system reduces inference volume by up to 6x compared to updating embeddings every time.
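Here's a minimal sketch of the only-re-embed-on-significant-change idea (my own illustration of the concept rather than LinkedIn's implementation; the overlap-based change check and threshold are made up).

```python
from difflib import SequenceMatcher

embedding_cache: dict[str, tuple[str, list[float]]] = {}  # job_id -> (last_text, embedding)

def embed_with_llm(text: str) -> list[float]:
    """Hypothetical stand-in for the (expensive) LLM embedding call."""
    return [float(len(text))]  # placeholder vector

def get_job_embedding(job_id: str, posting_text: str, change_threshold: float = 0.05) -> list[float]:
    cached = embedding_cache.get(job_id)
    if cached is not None:
        last_text, last_embedding = cached
        similarity = SequenceMatcher(None, last_text, posting_text).ratio()
        if (1 - similarity) < change_threshold:
            return last_embedding  # change not significant, skip the LLM call
    embedding = embed_with_llm(posting_text)
    embedding_cache[job_id] = (posting_text, embedding)
    return embedding

print(get_job_embedding("job_123", "Senior ML Engineer, Sydney, PyTorch required."))
```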

Results from the JUDE LLM-based embedding system led to +2.07% qualified applications, -5.13% dismiss to apply and +1.91% total job applications.

Mixedbread.ai explores the OCR ceiling, or where RAG systems fail because of poor retrieval.

Retrieval Augmented Generation (RAG) systems are useful across a wide range of industries.

Essentially, if you’ve got custom documentation or documents, chances are, RAG can be implemented somewhere.

However, the quality of your RAG system depends on the quality of:

  1. Your Retrieval (the “R” in RAG), or the ability of your system to retrieve the right documents based on a given query.
  2. Your Generation (the “G” in RAG), or the ability of your system to generate outputs based on the context provided by retrieval.

Because it is a cascading system, a poor result in retrieval will almost certainly lead to poorer generation.

What if you had the perfect conditions?

Perfect retrieval and perfect text as input to a generation model?

Well Mixedbread started there and worked backwards.

They found that even in the case of having perfect text (from human labellers), text-only retrieval systems fall short when trying to find the right document.

The solution?

Embed the document and the page in a multimodal embedding (Mixedbread offers this via their mxbai-omni-v0.1 model).

Embedding the whole page captures nuances such as diagrams, pictures, figures and handwriting that traditional OCR and text-only methods miss out on.

Even when compared to the ideal case of perfect retrieval and perfect text, multimodal embeddings performed only slightly worse on a corpus of 8,000+ diverse documents.

TL;DR: Visual embedding retrieval = best retrieval results, but perfect text is still best for extraction (versus directly extracting information from an image).
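To show the shape of page-level multimodal retrieval, here's a minimal sketch. The embed_page_image and embed_text functions are hypothetical placeholders standing in for a multimodal embedding model (such as Mixedbread's); in a real system both would map into the same shared embedding space.

```python
import numpy as np

def embed_page_image(page_image_path: str) -> np.ndarray:
    """Hypothetical placeholder for a multimodal (image + text) embedding model."""
    rng = np.random.default_rng(abs(hash(page_image_path)) % (2**32))
    return rng.normal(size=128)

def embed_text(text: str) -> np.ndarray:
    """Hypothetical placeholder for the same model's text encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Index whole page images so figures, diagrams and handwriting contribute to retrieval,
# rather than only the text an OCR step managed to extract.
pages = ["report_page_1.png", "report_page_2.png", "report_page_3.png"]
page_index = {page: embed_page_image(page) for page in pages}

query_embedding = embed_text("What does the handwritten note on the budget page say?")
best_page = max(page_index, key=lambda page: cosine(query_embedding, page_index[page]))
print(f"Retrieved page: {best_page}")
```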

example-where-rag-fails-handwriting

Example of where pure OCR-based RAG can fail: handwriting. VLMs tend to perform quite well here. Source: Mixedbread blog.

NVIDIA NeMo adds day-0 support for Hugging Face Models.

The NVIDIA NeMo framework helps optimize models for performance on NVIDIA GPUs and now integrates with Hugging Face text generation and vision language models. The update brings speed benefits to Hugging Face models, with a path that goes from Model ID → NeMo Framework AutoModel → NVIDIA NeMo.

NVIDIA Kaggle Grandmaster shares competition tips with cuML.

Kaggle Grandmaster Chris Deotte shares the strategy for winning a recent tabular competition to predict podcast listening time. By leveraging cuML (a framework for speeding up machine learning models on GPUs), Chris was able to test 500 different model types before stacking together 75 models across three different tiers.

Using cuML (discussed in AI + ML Monthly April 2025) meant that more models could be tried in less time.
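cuML mirrors the scikit-learn API, which is what makes it quick to iterate over many models. Below is a minimal sketch assuming you have an NVIDIA GPU with cuML installed; the synthetic data is a stand-in for the podcast listening-time dataset.

```python
# Requires an NVIDIA GPU with cuML installed (e.g. via RAPIDS).
import cuml
import numpy as np
from cuml.ensemble import RandomForestRegressor
from cuml.linear_model import Ridge
from sklearn.model_selection import train_test_split

cuml.set_global_output_type("numpy")  # return predictions as NumPy arrays

# Synthetic tabular data as a stand-in for the competition dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(50_000, 20)).astype(np.float32)
y = (3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=50_000)).astype(np.float32)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# GPU-accelerated models follow the familiar fit/predict pattern,
# which makes iterating over many candidate models much faster.
candidates = {"ridge": Ridge(), "random_forest": RandomForestRegressor(n_estimators=100)}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    print(f"{name}: validation MSE = {np.mean((preds - y_val) ** 2):.4f}")
```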

Benedict Evans updates his AI eats the world presentation and writes about GenAI's adoption puzzle.

Some of my favourite takeaways:

On GenAI adoption:

If GenAI is the next big thing (and if you're into tech, it certainly feels like it), how come daily active users (5% to 15% of people) are so much lower than weekly active users?

Perhaps new technologies take time to adapt to.

What people call “the cloud” is still only ~30% of software workloads.

On AI eating the world:

Jevons' paradox results in more usage of something because it's cheaper, but not necessarily more revenue or profit.

What does this mean?

As token prices fall for foundation models (and many companies now have fairly similarly performing top-tier models), usage increases, but that doesn't mean profits do.

Perhaps it's a builder's world: GenAI services become like databases, a commodity tool upon which entire businesses can be built (sell the service on top of the model rather than the model itself).

llm-models-over-time

Performance of top-tier models is catching up to each other. Notice how close all of the 2025+ models are.

5 Steps to Analyze LLM Application Failures by Alex Strick van Linschoten.

A great write-up discussing an iterative loop for analysing the performance of an LLM in your application.

From bootstrapping an initial dataset → reading and labelling sample types (manually or with an LLM) → clustering failure modes → labelling traces (inputs and outputs of LLM) → quantify and iterate.
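As a tiny illustration of the clustering step, here's a sketch that embeds free-text failure notes and groups them, using scikit-learn's TF-IDF and k-means rather than anything from the article.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Short free-text notes written while reading traces (made-up examples).
failure_notes = [
    "model ignored the user's date range filter",
    "hallucinated a product SKU that does not exist",
    "response ignored the requested date range",
    "made up a nonexistent order number",
    "refused a harmless request about refunds",
    "unnecessary refusal for a benign policy question",
]

# Embed the notes and let categories emerge from the data rather than pre-defining them.
vectors = TfidfVectorizer().fit_transform(failure_notes)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(vectors)

for cluster_id in sorted(set(labels)):
    print(f"Cluster {cluster_id}:")
    for note, label in zip(failure_notes, labels):
        if label == cluster_id:
            print("  -", note)
```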

llm-eval-loop

Steps in an LLM (or any other kind of ML model) evaluation loop. Source: Alex Strick van Linschoten's blog.

My favourite quote from the article:

“Importantly, you let the categories emerge from the data rather than coming in with pre-conceived ideas of what the categories already are.”

Daniel’s Open-Source AI of the month

1. FastVLM: Efficient Vision Encoding for Vision Language Models.

FastVLM is an open-source vision-language model capable of running on your iPhone.

It combines a FastViTHD vision encoder with various sized Qwen2 LLMs (0.5B, 1.6B, 7B). It has been designed with speed in mind and achieves equal or better results than other vision encoders with best-in-class time to first token (TTFT).

All variants of FastVLM are available to download and try locally. It even comes with an example app. I tried it on my iPhone 15 Pro and was impressed by its speed (it even works offline!).

You can find more details in the research paper.

fast-vlm-iphone-demo

The FastVLM architecture as well as a demo version of the app running on my iPhone 15 Pro. The model was able to infer the text in the image and output it as a response. It did miss the typo in my handwriting though, see if you can find it.

2. nanoVLM is a fully functioning VLM (Vision Language Model) pipeline in ~750 lines of code.

It combines SigLIP-B/16-224-85M as the vision encoder and SmolLM2-135M as the language model for a total of 222M parameters. The repo includes training code to train the model as well as fine-tune it on your own dataset. If you're looking to get hands-on building your own VLM at a smaller scale, the nanoVLM repo and the accompanying blog post are probably the best place on the internet to start.

nanovlm-architecture

nanoVLM architecture. The model follows the vision encoder + language encoder → fusion layer → LLM paradigm.

3. SmolVLM demo in the browser.

It turns out SmolVLM-500M is such a small model that it can run directly in the browser, performing inference on images from your webcam.

smolvlm-browser

4. D-FINE, a series of state-of-the-art Apache 2.0 real-time object detection models, gets added to Transformers.

I’m loving the trend of new Apache 2.0 real-time object detection models coming out.

One of the latest is D-FINE (Redefine Regression Task in DETRs as Fine-grained Distribution Refinement).

There are several variants now available on Hugging Face.

The models currently outperform or are on par with all variants of YOLO.

See the GitHub repo for more, including an example notebook on how to fine-tune D-FINE on a custom dataset.
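If you want to try real-time detection in code, the transformers object-detection pipeline is the quickest path. The sketch below uses DETR as a stand-in checkpoint because I haven't double-checked the exact D-FINE model IDs on the Hub; assuming D-FINE is registered with the pipeline, swapping in one of its checkpoints should be all that changes.

```python
from transformers import pipeline

# DETR used as a stand-in checkpoint; swap in one of the D-FINE checkpoints
# listed on the Hugging Face Hub once you've confirmed the exact model ID.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

results = detector("path/to/your/image.jpg", threshold=0.5)
for detection in results:
    print(f"{detection['label']} ({detection['score']:.2f}): {detection['box']}")
```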

5. OpenVision is a series of 20+ open-source CLIP-style models.

From data to training code to model variants ranging from 5M parameters at 160-pixel image resolution to 632M parameters at 224-pixel image resolution, researchers from the University of California have open-sourced all of the components required to produce highly effective CLIP-style models.

The OpenVision models perform on par with less open models such as OpenAI's CLIP (where only the weights are open) and also train 2-3x faster.

See the code on GitHub, the dataset used to train the models (Recap-DataComp-1B) and all of the model variants.
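CLIP-style models are typically used for zero-shot image classification by comparing image and text embeddings. Here's a minimal sketch of that pattern using the transformers zero-shot pipeline with OpenAI's CLIP as a stand-in (I haven't verified whether the OpenVision checkpoints load directly through transformers, so check their repo for the intended loading code).

```python
from transformers import pipeline

# OpenAI's CLIP used as a stand-in to show the CLIP-style zero-shot pattern;
# swap in an OpenVision checkpoint following the loading instructions in its repo.
classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")

predictions = classifier(
    "path/to/your/image.jpg",
    candidate_labels=["a photo of a dog", "a photo of a cat", "a photo of a pizza"],
)
print(predictions)  # list of {"label": ..., "score": ...} sorted by score
```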

6. Track anything in a video on-device with EdgeTAM.

Meta's new EdgeTAM (Track Anything Model) enables you to select an object in a video and have it tracked across frames. The model performs similarly to SAM 2 but is 22x faster on devices such as the iPhone 15 Pro.

You can see the demo on Hugging Face and get the code and models on GitHub.

edgetam-demo-raw

EdgeTAM demo tracking a dog in a video using several points as a reference (the model accepts both positive and negative points as input).

7. trackers is an open-source library for tracking multiple objects in video.

Roboflow recently published trackers, which implements reusable model pipelines and helper functions to accurately track many objects in video at once. You can use a detection model to detect items and then use a tracking algorithm to track each item across video frames.

See the code and demos on GitHub.
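The general loop is: run a detector on each frame, then hand detections to a tracker that assigns persistent IDs. I haven't verified the exact API of trackers yet, so the sketch below uses Roboflow's supervision library (ByteTrack) and a hypothetical detector to show the shape of the loop.

```python
import numpy as np
import supervision as sv

def detect_objects(frame) -> np.ndarray:
    """Hypothetical per-frame detector returning boxes as [x_min, y_min, x_max, y_max, confidence]."""
    return np.array([[50.0, 60.0, 200.0, 220.0, 0.92]])

tracker = sv.ByteTrack()  # assigns persistent IDs to detections across frames

for frame_index in range(3):
    frame = None  # in practice: read the next video frame here (e.g. with OpenCV)
    boxes = detect_objects(frame)
    detections = sv.Detections(
        xyxy=boxes[:, :4],
        confidence=boxes[:, 4],
        class_id=np.zeros(len(boxes), dtype=int),
    )
    tracked = tracker.update_with_detections(detections)
    print(f"frame {frame_index}: tracker IDs {tracked.tracker_id}")
```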

8. Osmosis is a fine-tuned version of Qwen3-0.6B designed to extract structured data from text.

One of my most common use cases for LLMs and VLMs is to turn unstructured data (e.g. images and natural text) into structured data.

The Osmosis team trained a model to do just that. Osmosis-Structure-0.6B can extract structured data in the form of JSON matching a specific schema.

On the listed benchmarks, the model outperforms larger closed models such as Claude 4, GPT 4.1 and OpenAI o3.

Very cool to see a model with less than 1B parameters outperforming larger models.

Shows how much potential there is for specific fine-tuning.
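This structured-extraction pattern is easy to sketch: give the model the text plus a target schema, then validate the JSON it returns. The call_llm function below is a hypothetical stand-in (with a hard-coded response so the example runs); you'd swap in Osmosis-Structure-0.6B, or any other model, via your preferred inference stack.

```python
import json
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    total_amount: float
    currency: str

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in; swap in Osmosis-Structure-0.6B (or any model) via your inference stack."""
    return '{"vendor": "Acme Corp", "total_amount": 1299.5, "currency": "USD"}'

unstructured_text = "Invoice from Acme Corp. Amount due: $1,299.50 USD."
prompt = (
    "Extract the following fields as JSON matching this schema "
    f"{json.dumps(Invoice.model_json_schema())}:\n{unstructured_text}"
)
invoice = Invoice.model_validate_json(call_llm(prompt))
print(invoice)
```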

9. NVIDIA Parakeetv2 is a 600M parameter model capable of 3380x real-time factor for automatic speech recognition.

The best open-source automatic speech recognition (ASR) model available at the time of writing.

Includes automatic punctuation and timestamps in the output, performs really well on spoken numbers and song lyrics and can handle audio inputs of up to 3 hours long.

The model is available for commercial and non-commercial use.

Try the online demo with your own audio.
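If you'd rather run it locally, transcription via the NeMo toolkit looks roughly like the sketch below. The checkpoint name and output format are quoted from memory, so double-check them against the model card before relying on this.

```python
# Requires NVIDIA NeMo: pip install -U "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# Checkpoint name quoted from memory; verify the exact ID on the model card.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")

# Transcribe a local audio file (punctuation and capitalisation are included in the output).
output = asr_model.transcribe(["path/to/your_audio.wav"])
print(output[0].text)
```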

10. LLMDet is an open-vocabulary object detection model capable of detecting objects in images using natural language.

LLMDet outperforms similar models such as Grounding-DINO and MM-Grounding-DINO on open-vocabulary object detection.

The workflow is to pass a list of words, such as ["apple", "banana", "watermelon"], as well as an image, and have the model return bounding boxes for the input words if they appear in the image.

You can get the models on Hugging Face and the full code on GitHub.
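Here's what that words-plus-image workflow looks like in code, sketched with the transformers zero-shot object-detection pipeline and an OWL-ViT checkpoint as a stand-in (I haven't verified LLMDet's exact checkpoint names or whether it plugs into this pipeline, so treat this as the general pattern rather than LLMDet-specific code).

```python
from transformers import pipeline

# OWL-ViT used as a stand-in for the open-vocabulary pattern; swap in an LLMDet
# checkpoint from Hugging Face once you've confirmed its intended loading code.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

results = detector(
    "path/to/your/image.jpg",
    candidate_labels=["apple", "banana", "watermelon"],
)

for detection in results:
    print(f"{detection['label']} ({detection['score']:.2f}): {detection['box']}")
```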

11. DeepSeek release two new reasoning models.

DeepSeek-R1-0528 is a 685B parameter model which performs close to OpenAI's o3 and Gemini 2.5 Pro, and well above the original DeepSeek-R1.

DeepSeek-R1-0528-Qwen3-8B is an 8B parameter model which involves distilling DeepSeek-R1-0528 into Qwen3-8B. The result of the distillation is a model that is on par with Gemini 2.5 Flash and o3-mini on several benchmarks.

Both models are available under the MIT license and can be used for distillation and synthetic data creation.

12. Xiaomi releases open-source MiMo LLMs and VLMs.

Xiaomi, the mobile phone maker, has dipped its toes into the open-source AI world with a collection of open-source base and RL LLMs.

MiMo is a suite of 7B LLMs with variants such as MiMo-7B-Base and MiMo-7B-RL.

MiMo-VL is a pair of VLMs (Vision Language Models) with a SFT version (supervised fine-tuning) MiMo-VL-7B-SFT and RL version MiMo-VL-7B-RL.

The MiMo-VL models outperform all similarly sized VLMs and even get close to GPT-4o and Claude 3.7 levels on several benchmarks.

13. ByteDance release Dolphin for high-quality document layout and element parsing.

ByteDance's new open-source and MIT-licensed Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) model parses documents in two stages:

  1. Page-level analysis generates element sequences (e.g. text, titles, tables, figures) in natural reading order.
  2. Document elements get parsed with task-specific prompts (e.g. extracting information from a table or turning a formula into LaTeX).

The model is available on Hugging Face and, at 398M parameters, it's small enough to run on local hardware.

On several document-based benchmarks, Dolphin outperforms other open-source models 10x its size and even outperforms proprietary models such as GPT-4o and Gemini 1.5 Pro.

dolphin-demo

Dolphin demo turning a PDF page into raw text. Source: Dolphin GitHub.

Papers

  • CountGD is an open-world counting model capable of counting objects in an image based on text or image input references. Paper, code, demo.
  • Seed1.5-VL model paper describes ByteDance's new 20B active parameter model which achieves state of the art on 38 out of 60 benchmarks. I particularly like the section in the paper about experimenting with different data mixtures for different levels of sparsity (e.g. using a large dataset, how should you balance different classes if some are more common than others). The model is also capable of detecting objects in images. Paper, GitHub, grounding demo notebook.

Company releases

Media, podcasts and more


See you next month!

What a massive month for the ML world in May!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.
