65th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Apache 2.0 object detection models — I wrote a short blog post collecting different Apache 2.0 object detection models. These are models I've tested and found to perform well on various object detection tasks, on par with or better than YOLO-based equivalents.
This is part of an upcoming object detection project I’m working on for ZTM.
Traditionally Netflix uses many recommendation models to produce "what to watch next" items on their apps.
However, using and updating many models can lead to many challenges.
If one breaks, chances are, its errors will cascade.
It also means that many different features often have to be engineered to fit the models' requirements rather than the model learning the required features directly from the data.
In a recent blog post, they share how they adopted LLM-like strategies to create a foundation model for recommendations.
From data filtering to structuring to weighting to encoding strategies to model training and scaling.
It turns out, just like LLMs, foundation models for recommendation systems improve with scale (e.g. more parameters = better performance).
The benefit of having a foundation model for recommendations is that there's now a single source which can serve various use cases: it can be fine-tuned for specific tasks, used directly for predictions, or leveraged for embeddings.
Netflix is moving from a model-centric approach to a data-centric approach. In other words, freeze the model and iterate on the data.
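To make the "LLM-like" framing concrete, here's a minimal sketch of the idea (my own simplification, not Netflix's actual architecture, and all names, sizes and data are made up): treat a user's interaction history as a sequence of item "tokens" and train a transformer to predict the next item, the same way an LLM predicts the next word.

```python
# Minimal sketch: next-item prediction over interaction histories,
# analogous to next-token prediction in an LLM.
# Hypothetical shapes/names — not Netflix's actual model.
import torch
import torch.nn as nn

NUM_ITEMS = 10_000   # size of the item "vocabulary" (e.g. shows/movies)
EMBED_DIM = 256
CONTEXT_LEN = 50     # how many past interactions the model sees

class NextItemRecommender(nn.Module):
    def __init__(self):
        super().__init__()
        self.item_embed = nn.Embedding(NUM_ITEMS, EMBED_DIM)
        block = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=2)
        self.head = nn.Linear(EMBED_DIM, NUM_ITEMS)  # scores for every item

    def forward(self, item_ids):           # item_ids: (batch, seq_len)
        x = self.item_embed(item_ids)
        x = self.encoder(x)
        return self.head(x[:, -1, :])       # predict the *next* item

model = NextItemRecommender()
history = torch.randint(0, NUM_ITEMS, (8, CONTEXT_LEN))   # fake interaction histories
next_item = torch.randint(0, NUM_ITEMS, (8,))             # the item actually watched next
loss = nn.functional.cross_entropy(model(history), next_item)
loss.backward()
```

Once a model like this is trained at scale, the same backbone can be fine-tuned, queried directly or used purely for its embeddings, which is the "single source" benefit described above.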
In essence, use AI workflows for predictable tasks where the required steps are known, and use AI agents when flexibility and model-driven decision-making are required.
Phil breaks down each of these patterns in more detail in the full article.
Bonus: See agentrecipes.com (free) for an excellent collection of code examples for different kinds of AI patterns.
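A rough sketch of the distinction in code (the `call_llm` helper and tools below are hypothetical placeholders, not from Phil's article): a workflow hard-codes the steps, an agent lets the model decide which tool to call next.

```python
# Rough sketch of "workflow vs agent" — `call_llm` and the tools are placeholders.

def call_llm(prompt: str) -> str:
    """Placeholder for your favourite LLM API call."""
    raise NotImplementedError("plug in your LLM client here")

# Workflow: the steps are fixed and known in advance (prompt chaining).
def summarise_and_translate(document: str) -> str:
    summary = call_llm(f"Summarise this document:\n{document}")
    return call_llm(f"Translate this summary into French:\n{summary}")

# Agent: the model decides which tool to use next, in a loop, until it's done.
def research_agent(question: str, tools: dict, max_steps: int = 5) -> str:
    context = question
    for _ in range(max_steps):
        decision = call_llm(
            f"Context: {context}\n"
            f"Available tools: {list(tools)}\n"
            "Reply with 'TOOL <name> <input>' or 'ANSWER <final answer>'."
        )
        if decision.startswith("ANSWER"):
            return decision.removeprefix("ANSWER").strip()
        _, name, tool_input = decision.split(" ", 2)
        context += f"\n{name} returned: {tools[name](tool_input)}"
    return call_llm(f"Give your best final answer.\nContext: {context}")
```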
It’s Christmas and you want “Christmas songs” but there’s a lot in that search term.
Do you want the tracks with “Christmas” in their name? Or do you want the popular Christmas songs you hear when visiting the shops at Christmas time?
Or how about “Sunday afternoon BBQ chill”?
It’s hard to match songs based on their title to a query like that.
To build Text2Track, Spotify fine-tuned an LLM to generate Track IDs based on a query rather than generate straight Artist Name and Song Title pairs.
Spotify’s Text2Tracks overview: going from input prompt → recommended track IDs.
They found that generating “Artist Track Name” style outputs could take too long (when titles were long), and song titles often don't capture the overall feel of a song.
Instead, they found the two best options were “Artist Track Integer” IDs (for example, “artist_001_track_001”), a surprisingly simple yet effective approach, and Semantic IDs, which learn a vector representation from collaborative filtering embeddings (built by mining patterns of songs appearing together in playlists).
This Semantic ID technique turned out to work 2.7x better than BM25, a popular technique for keyword indexing.
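For a feel of what the ID-based approach might look like as fine-tuning data, here's a hedged sketch (the queries, IDs and formatting below are made up by me, not Spotify's actual format): the model learns to map free-text queries straight to track IDs instead of artist/title strings.

```python
# Hypothetical training pairs for a Text2Tracks-style model:
# free-text query -> track ID tokens (instead of artist/title text).
training_pairs = [
    {"query": "christmas songs you hear in shops",
     "target": "artist_017_track_003 artist_052_track_011"},
    {"query": "sunday afternoon bbq chill",
     "target": "artist_204_track_001 artist_091_track_007"},
]

# Formatted as plain text-to-text examples for LLM fine-tuning.
def to_training_example(pair: dict) -> dict:
    return {
        "input": f"Recommend tracks for: {pair['query']}",
        "output": pair["target"],  # generated IDs get looked up in the catalogue afterwards
    }

print([to_training_example(p) for p in training_pairs])
```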
LinkedIn’s JUDE stands for Job Understanding Data Expert.
The goal is to match job postings with candidates and vice versa.
This is no easy task.
Job postings can have large amounts of text, different locations, different requirements, slightly different titles.
And applicants can have almost an infinite variety of attributes.
A naive way to match would be straight text matching.
But at LinkedIn's scale of 1 billion users and 10 million+ job postings, embeddings often come into play to find the right match.
JUDE uses a LoRA fine-tuned Mistral LLM (as well as others if needed) to generate job posting and candidate embeddings.
When someone posts a job posting, the model is able to create embeddings in ~300ms. There’s also a smart caching system which detects significant changes in a posting before making calls to the LLM. If the change is significant enough, new embeddings are generated; if not, the existing embeddings are kept. This caching system reduces inference volume by up to 6x compared to updating them every time.
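A minimal sketch of the caching idea (my own simplification, not LinkedIn's implementation; the threshold and `embed_with_llm` call are placeholders): only call the embedding LLM when a posting's text has changed enough to matter.

```python
# Sketch of change-aware embedding caching — not LinkedIn's actual code.
import difflib

embedding_cache: dict[str, tuple[str, list[float]]] = {}  # job_id -> (last_text, embedding)

def embed_with_llm(text: str) -> list[float]:
    """Placeholder for the (expensive) LLM embedding call."""
    raise NotImplementedError("plug in your embedding model here")

def get_job_embedding(job_id: str, text: str, threshold: float = 0.95) -> list[float]:
    if job_id in embedding_cache:
        last_text, cached_embedding = embedding_cache[job_id]
        similarity = difflib.SequenceMatcher(None, last_text, text).ratio()
        if similarity >= threshold:           # change isn't significant -> reuse
            return cached_embedding
    embedding = embed_with_llm(text)           # new job or significant change -> re-embed
    embedding_cache[job_id] = (text, embedding)
    return embedding
```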
Results from the JUDE LLM-based embedding system led to +2.07% qualified applications, -5.13% dismiss to apply and +1.91% total job applications.
Retrieval Augmented Generation (RAG) systems are useful across a wide range of industries.
Essentially, if you’ve got custom documentation or documents, chances are, RAG can be implemented somewhere.
However, the quality of your RAG system depends on the quality of both your retrieval and your generation.
Because it is a cascading system, a poor result in retrieval will almost certainly lead to poorer generation.
What if you had the perfect conditions?
Perfect retrieval and perfect text as input to a generation model?
Well Mixedbread started there and worked backwards.
They found that even in the case of having perfect text (from human labellers), text-only retrieval systems fall short when trying to find the right document.
The solution?
Embed the document and the page in a multimodal embedding (Mixedbread offers this via their mxbai-omni-v0.1 model).
Embedding the whole page captures nuances such as diagrams, pictures, figures and handwriting that traditional OCR and text-only methods miss out on.
Even compared against perfect retrieval over perfect text, multimodal embeddings perform only slightly worse on a corpus of 8,000+ diverse documents.
TL;DR: Visual embedding retrieval = best retrieval results, but perfect text is still best for extraction (versus directly extracting information from an image).
Example of where pure OCR-based RAG can fail: handwriting. VLMs tend to perform quite well here. Source: Mixedbread blog.
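A rough sketch of the page-as-image retrieval setup (the `embed_page_image`/`embed_query` helpers stand in for whatever multimodal embedding model you use, such as Mixedbread's, whose exact API I'm not reproducing here): embed whole page images once, then retrieve by cosine similarity against an embedded text query.

```python
# Sketch of multimodal page retrieval — the embedding helpers are placeholders
# for a multimodal embedding model where text and page images share one vector space.
import numpy as np

def embed_page_image(image_path: str) -> np.ndarray:
    raise NotImplementedError("plug in a multimodal embedding model here")

def embed_query(text: str) -> np.ndarray:
    raise NotImplementedError("plug in the same model's text embedder here")

def build_index(page_paths: list[str]) -> np.ndarray:
    vectors = np.stack([embed_page_image(p) for p in page_paths])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)  # normalise once

def retrieve(query: str, index: np.ndarray, page_paths: list[str], k: int = 5) -> list[str]:
    q = embed_query(query)
    q = q / np.linalg.norm(q)
    scores = index @ q                          # cosine similarity via dot product
    top_k = np.argsort(scores)[::-1][:k]
    return [page_paths[i] for i in top_k]
```

Because the index stores whole-page embeddings, diagrams, figures and handwriting contribute to the similarity score rather than being lost at an OCR step.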
The NVIDIA NeMo framework helps optimize models for performance on NVIDIA GPUs and is now integrating with Hugging Face text generation and vision language models. The update brings speed benefits to Hugging Face models, letting you go from Model ID → NeMo Framework AutoModel → NVIDIA NeMo.
Kaggle Grandmaster Chris Deotte shares the strategy for winning a recent tabular competition to predict podcast listening time. By leveraging cuML (a framework for speeding up machine learning models on GPUs), Chris was able to test 500 different model types before stacking together 75 models across three different tiers.
Using cuML (discussed in AI + ML Monthly April 2025) meant that more models could be tried in a faster time.
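If you haven't tried it, cuML mirrors the familiar scikit-learn fit/predict API but runs on the GPU, so rapidly iterating over models looks something like the sketch below (the dataset and hyperparameters here are illustrative only, not Chris's setup).

```python
# Sketch: cuML keeps the scikit-learn-style API but runs on the GPU.
# Dataset and hyperparameters are illustrative only.
from cuml.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=50_000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=100)
model.fit(X_train.astype("float32"), y_train.astype("float32"))  # cuML prefers float32
preds = model.predict(X_test.astype("float32"))
print(mean_squared_error(y_test, preds))
```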
Some of my favourite takeaways:
On GenAI adoption:
If GenAI is the next big thing (and if you’re into tech, it certainly feels like it), how come the share of daily active users (5% to 15% of people) is much lower than the share of weekly active users?
Perhaps new technologies take time to adapt to.
What people call “the cloud” is still only ~30% of software workloads.
On AI eating the world:
Jevons paradox results in more usage of something because it’s cheaper, but not necessarily more revenue or profits.
What does this mean?
As token prices fall for foundation models (and many companies now have fairly similarly performing top-tier models), usage increases, but that doesn’t mean profits do.
Perhaps it’s a builder's world: GenAI services become like databases, a commodity tool which entire businesses can be built upon (sell the service on top of the model rather than the model itself).
Performance of top-tier models is catching up to each other. Notice how close all of the 2025+ models are.
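A quick back-of-the-napkin example of the dynamic (numbers entirely made up): if token prices fall faster than usage grows, revenue shrinks even as usage booms.

```python
# Made-up numbers to illustrate Jevons paradox applied to token pricing.
old_price_per_million_tokens = 10.00   # dollars
new_price_per_million_tokens = 1.00    # 10x cheaper

old_usage_millions = 100               # million tokens served
new_usage_millions = 500               # 5x more usage after the price drop

old_revenue = old_price_per_million_tokens * old_usage_millions   # $1,000
new_revenue = new_price_per_million_tokens * new_usage_millions   # $500

print(f"Usage up {new_usage_millions / old_usage_millions:.0f}x, "
      f"revenue down to {new_revenue / old_revenue:.0%} of before")
```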
A great write-up discussing an iterative loop process to analyse the performance of an LLM in your application.
From bootstrapping an initial dataset → reading and labelling sample types (manually or with an LLM) → clustering failure modes → labelling traces (inputs and outputs of LLM) → quantify and iterate.
Steps in an LLM (or any other kind of ML model) evaluation loop. Source: Alex Strick van Linschoten blog.
My favourite quote from the article:
“Importantly, you let the categories emerge from the data rather than coming in with pre-conceived ideas of what the categories already are.”
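Here's a sketch of the "cluster the failure notes" step using sentence embeddings plus k-means (the embedding model, example notes and cluster count are choices I've made for illustration, not from the article):

```python
# Sketch: embed free-text failure notes, then cluster them so failure
# modes emerge from the data rather than from pre-conceived categories.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

failure_notes = [
    "hallucinated a refund policy that doesn't exist",
    "ignored the user's requested date range",
    "made up a product SKU",
    "answered in the wrong language",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(failure_notes)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(embeddings)
for note, label in zip(failure_notes, kmeans.labels_):
    print(label, note)
```

Reading a handful of notes per cluster is usually enough to name each failure mode, which then becomes a label you can quantify and track over iterations.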
FastVLM is an open-source vision-language model capable of running on your iPhone.
It combines a FastViTHD vision encoder with Qwen2 LLMs of various sizes (0.5B, 1.6B, 7B). It has been designed with speed in mind and is capable of achieving equal or better results than other vision encoders with best-in-class time to first token (TTFT).
All variants of FastVLM are available to download and try locally. It even comes with an example app. I tried it on my iPhone 15 Pro and was impressed by its speed (it even works offline!).
You can find more details in the research paper.
The FastVLM architecture as well as a demo version of the app running on my iPhone 15 Pro. The model was able to infer the text in the image and output it as a response. It did miss the typo in my handwriting though, see if you can find it.
nanoVLM combines SigLIP-B/16-224-85M as the vision encoder and SmolLM2-135M as the language model for a total of 222M parameters. The repo includes training code to train the model as well as fine-tune it on your own dataset. If you’re looking to get hands-on building your own VLM at a smaller scale, the nanoVLM repo as well as the accompanying blog post is probably the best place on the internet to start.
nanoVLM architecture. The model follows the vision encoder + language encoder → fusion layer → LLM paradigm.
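The paradigm in the caption above, as a minimal PyTorch sketch (modules and dimensions are stand-ins of my own, not nanoVLM's real config): image patches go through a vision encoder, a small projection layer maps them into the LLM's embedding space, and the LLM then attends over image tokens and text tokens together.

```python
# Minimal sketch of the "vision encoder -> projection -> LLM" VLM pattern.
# Dimensions and module choices are illustrative, not nanoVLM's actual config.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_dim=384, llm_dim=576, vocab_size=32_000):
        super().__init__()
        self.vision_encoder = nn.Sequential(              # stand-in for a SigLIP-style encoder
            nn.Linear(3 * 16 * 16, vision_dim), nn.GELU()
        )
        self.projection = nn.Linear(vision_dim, llm_dim)   # the "fusion"/modality projection
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        block = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(block, num_layers=2)  # stand-in for a SmolLM2-style LLM
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        image_tokens = self.projection(self.vision_encoder(image_patches))
        text_tokens = self.text_embed(text_ids)
        sequence = torch.cat([image_tokens, text_tokens], dim=1)  # image tokens prepended
        return self.lm_head(self.llm(sequence))

model = TinyVLM()
patches = torch.randn(1, 196, 3 * 16 * 16)     # 14x14 grid of 16x16 RGB patches
prompt = torch.randint(0, 32_000, (1, 12))      # tokenised text prompt
logits = model(patches, prompt)                 # (1, 196 + 12, vocab_size)
```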
It turns out SmolVLM-500M is such a small model that it can be run directly in the browser, running inference on images from your webcam.
I’m loving the trend of new Apache 2.0 real-time object detection models coming out.
One of the latest is D-FINE (Redefine Regression Task in DETRs as Fine-grained Distribution Refinement).
There are several variants now available on Hugging Face.
The models currently outperform or are on par with all variants of YOLO.
See the GitHub for more and an example notebook on how to fine-tune D-FINE on a custom dataset.
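If you just want to try a checkpoint, inference should look roughly like the sketch below via Hugging Face Transformers (assuming a recent version with D-FINE support; the checkpoint name is a placeholder, so swap in a real model ID from the Hugging Face collection).

```python
# Hedged sketch of running a D-FINE checkpoint with Hugging Face Transformers.
# The checkpoint name and image path below are placeholders.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

checkpoint = "your-org/d-fine-checkpoint"   # placeholder model ID
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForObjectDetection.from_pretrained(checkpoint)

image = Image.open("street_scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into boxes/labels/scores at the original image size.
results = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=[image.size[::-1]]
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```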
From data to training code to model variants ranging from 5M parameters at 160 pixel image resolution to 632M parameters at 224 pixel image resolution, researchers from the University of California have open-sourced all of the components required to produce highly effective CLIP-style models.
The OpenVision models perform on-par with less open models such as OpenAI’s CLIP (only the weights are open) but also train 2-3x faster.
See the code on GitHub, the dataset used to train the models (Recap-DataComp-1B) and all of the model variants.
Meta's new EdgeTAM (Track Anything Model) enables you to select an object in a video and have it tracked across frames. The model performs similar to SAM2 but is 22x faster on devices such as the iPhone 15 Pro.
You can see the demo on Hugging Face and get the code and models on GitHub.
EdgeTAM demo tracking a dog in a video using several points as a reference (the model accepts both positive and negative points as input).
Roboflow recently published trackers, a library which implements reusable model pipelines and helper functions to accurately track many objects in video at once. You can use a detection model to detect items and then use a tracking algorithm to track those items across video frames.
See the code and demos on GitHub.
One of my most common use cases for LLMs and VLMs is to turn unstructured data (e.g. images and natural text) into structured data.
The Osmosis team trained a model to do just that. Osmosis-Structure-0.6B can extract structured data in the form of JSON following a specific schema.
On the listed benchmarks, the model outperforms larger closed models such as Claude 4, GPT 4.1 and OpenAI o3.
Very cool to see a model with less than 1B parameters outperforming larger models.
Shows how much potential there is for specific fine-tuning.
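For context, here's the general shape of the structured-extraction task in code (the `call_model` helper and the schema are placeholders of mine, not the Osmosis API): given raw text and a JSON schema, ask the model for JSON and validate it against the schema.

```python
# Sketch of the structured-extraction pattern: unstructured text in,
# schema-validated JSON out. `call_model` is a placeholder, not the Osmosis API.
import json
from jsonschema import validate  # pip install jsonschema

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "date": {"type": "string"},
    },
    "required": ["vendor", "total", "date"],
}

def call_model(prompt: str) -> str:
    """Placeholder for a call to a structure-extraction model."""
    raise NotImplementedError("plug in your model call here")

def extract_invoice(raw_text: str) -> dict:
    prompt = (
        "Extract the following fields as JSON matching this schema:\n"
        f"{json.dumps(invoice_schema)}\n\nText:\n{raw_text}"
    )
    data = json.loads(call_model(prompt))
    validate(instance=data, schema=invoice_schema)  # raises if the output doesn't match
    return data
```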
Currently the best open-source automatic speech recognition (ASR) model available at the time of writing.
Includes automatic punctuation and timestamps in the output, performs really well on spoken numbers and song lyrics and can handle audio inputs of up to 3 hours long.
The model is available for commercial and non-commercial use.
Try the online demo with your own audio.
LLMDet outperforms similar models such as Grounding-DINO and MM-Grounding-DINO on open-vocabulary object detection.
The workflow is to pass a list of words, such as ["apple", "banana", "watermelon"], as well as an image, and have the model return bounding boxes for the input words if they appear in the image.
You can get the models on Hugging Face and the full code on GitHub.
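In code, the open-vocabulary workflow looks something like the sketch below using the Transformers zero-shot object detection pipeline (the checkpoint ID is a placeholder, and whether LLMDet plugs straight into this pipeline is an assumption on my part; the GitHub repo has the canonical usage).

```python
# Hedged sketch of open-vocabulary detection: pass candidate labels + an image,
# get bounding boxes back. Checkpoint ID and image path are placeholders.
from PIL import Image
from transformers import pipeline

detector = pipeline(
    task="zero-shot-object-detection",
    model="your-org/open-vocab-detector",   # placeholder checkpoint ID
)

image = Image.open("fruit_bowl.jpg").convert("RGB")
results = detector(image, candidate_labels=["apple", "banana", "watermelon"])

for result in results:
    print(result["label"], round(result["score"], 3), result["box"])
```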
DeepSeek-R1-0528 is a 685B parameter model which performs close to that of OpenAI-o3 as well as Gemini 2.5 Pro and well above the original DeepSeek-R1.
DeepSeek-R1-0528-Qwen3-8B is an 8B parameter model which involves distilling DeepSeek-R1-0528 into Qwen3-8B. The result of the distillation is a model that is on par with Gemini 2.5 Flash and o3-mini on several benchmarks.
Both models are available under the MIT license and can be used for distillation and synthetic data creation.
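As a rough sketch of the "use it for distillation/synthetic data" idea (the endpoint, model name and prompts are placeholders of mine): prompt the larger teacher model, save its responses, and use them as supervised fine-tuning data for a smaller model.

```python
# Sketch of building a synthetic SFT dataset from a larger "teacher" model.
# The client setup and model name are placeholders — point them at whichever
# OpenAI-compatible endpoint serves your teacher model.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder endpoint

prompts = [
    "Explain the difference between precision and recall.",
    "Write a Python function to reverse a linked list.",
]

with open("distillation_data.jsonl", "w") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="teacher-model",                       # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content
        # Standard chat-format SFT example: prompt + teacher answer.
        f.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}) + "\n")
```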
Xiaomi, the mobile phone maker, has dipped its toes into the open-source AI world with a collection of open-source base and RL LLMs.
MiMo is a suite of 7B LLMs with variants such as MiMo-7B-Base and MiMo-7B-RL.
MiMo-VL is a pair of VLMs (Vision Language Models) with an SFT (supervised fine-tuning) version, MiMo-VL-7B-SFT, and an RL version, MiMo-VL-7B-RL.
The MiMo-VL models outperform all similarly sized VLMs and even get close to GPT-4o and Claude 3.7 levels on several benchmarks.
ByteDance’s new open-source and MIT-licensed Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) model parses documents in two stages: it first analyses the page layout to produce the elements in reading order, then parses each element (text, tables, formulas) in parallel.
The model is available on Hugging Face and, at 398M parameters, it easily fits on local hardware.
On several document-based benchmarks, Dolphin outperforms other open-source models 10x its size and even outperforms proprietary models such as GPT-4o and Gemini 1.5 Pro.
Dolphin demo turning a PDF page into raw text. Source: Dolphin GitHub.
What a massive month for the ML world in May!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.