64th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
- Complete A.I. Machine Learning and Data Science Bootcamp: Zero to Mastery
- TensorFlow for Deep Learning: Zero to Mastery
- PyTorch for Deep Learning: Zero to Mastery
- [NEW] Project: Build a custom text classifier and demo with Hugging Face Transformers
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Here's what you might have missed in April 2025 as an A.I. & Machine Learning Engineer... let's get you caught up!
My work
- A hands-on guide to IoU (Intersection over Union) — Intersection over Union, or IoU, is a way to measure how much two bounding boxes overlap. If two boxes perfectly overlap, they have an IoU score of 1.0. If they don’t overlap at all, they have an IoU score of 0.0. This IoU score is an important step in evaluating object detection models: it’s part of the mAP (Mean Average Precision) metric calculation. So every time you see a mAP metric for a new YOLO or detection model, IoU is likely being used under the hood. See the related previous post on different bounding box formats, and the short IoU code sketch after this list.

- [Coming Soon] Project: Build a custom object detection model with Hugging Face Transformers — I’m working on a new ZTM project to build Trashify 🚮, a custom object detection model to incentivise picking up trash in a local area. The code is complete and I’m in the process of making supplementary materials (tutorial text, slides, videos, evaluation breakdowns). Stay tuned for the completed release!
- Video version of ML Monthly March 2025 — If you like seeing video walkthroughs of these kinds of materials (videos tend to be better for demos), check out the video walkthrough of last month’s ML Monthly. The video walkthrough for this issue (April 2025) should be live a couple of days after the text version gets posted!
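As referenced in the IoU item above, here’s a minimal IoU sketch in plain Python. The box coordinates are made up for illustration, and real pipelines typically use a vectorized implementation such as `torchvision.ops.box_iou`.

```python
# Minimal IoU sketch for two boxes in (x_min, y_min, x_max, y_max) format.
# Box coordinates below are made up for illustration.

def intersection_over_union(box_a, box_b):
    """Compute IoU between two boxes in (x_min, y_min, x_max, y_max) format."""
    # Coordinates of the intersection rectangle
    x_min = max(box_a[0], box_b[0])
    y_min = max(box_a[1], box_b[1])
    x_max = min(box_a[2], box_b[2])
    y_max = min(box_a[3], box_b[3])

    # Intersection area is zero if the boxes don't overlap
    inter_area = max(0, x_max - x_min) * max(0, y_max - y_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    # Union = sum of both areas minus the double-counted intersection
    return inter_area / (area_a + area_b - inter_area)

print(intersection_over_union((10, 10, 50, 50), (30, 30, 70, 70)))  # partial overlap ≈ 0.14
print(intersection_over_union((10, 10, 50, 50), (10, 10, 50, 50)))  # perfect overlap = 1.0
```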
From the Internet
NVIDIA’s new cuML (CUDA ML) framework enables Scikit-Learn to run on NVIDIA GPUs. It comes built into Google Colab and offers a “no code change” setup. See the graphic below for an example of a 92x speedup using the RandomForestClassifier model; all of the code is available in an example notebook.

NVIDIA’s cuML speeds up Scikit-Learn on various tasks by up to 92x (potentially higher depending on the task).
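As a rough sketch of the “no code change” workflow (assuming a recent cuML build where the `cuml.accel` accelerator is available, as in the example notebook): load the accelerator first, then run your usual Scikit-Learn code unchanged.

```python
# In a Colab/Jupyter notebook with cuML installed, load the zero-code-change
# accelerator BEFORE importing scikit-learn (assumption: a cuML version that ships cuml.accel):
#   %load_ext cuml.accel
# For a script, the equivalent is roughly: python -m cuml.accel your_script.py

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Ordinary scikit-learn code; with cuml.accel loaded, supported estimators
# are dispatched to the GPU without changing the code below.
X, y = make_classification(n_samples=100_000, n_features=25, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
print(model.score(X, y))
```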
Intel’s AutoRound framework helps to quantize (make smaller) LLMs and VLMs without losing large amounts of accuracy. Models quantized with AutoRound can retain 99%+ of their original accuracy whilst requiring 2-3x less memory. AutoRound can run on a single GPU (e.g. an A100 80GB), taking anywhere from a couple of minutes (for smaller models) to a couple of hours (for larger models).
Retained performance (average on 13 tasks) of quantized models versus their original 16-bit implementations, as well as how long each model took to quantize under AutoRound’s “best”, “default” and “light” tuning settings.
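For a sense of what using AutoRound looks like in practice, here’s a rough sketch of 4-bit quantization. The model name is just an example and the exact arguments may differ between auto-round versions, so check the AutoRound docs.

```python
# Sketch of 4-bit quantization with Intel's AutoRound (pip install auto-round).
# Model name and settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-7B-Instruct"  # example model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bits=4, group_size=128 is a common setting; the "best"/"default"/"light"
# style trade-offs are controlled via the tuning arguments.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./Qwen2.5-7B-Instruct-int4")
```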
12 factor agents is the AI Agent version of the twelve-factor app. If you’re looking to build AI Agents, you should read through these twelve factors. The author, Dex, breaks Agents down into bite-sized components in the spirit of maximizing experimentation whilst retaining control.
My favourite is Factor 2: own your prompts. If prompts are your main entry point to an LLM, why wouldn’t you treat them like first-class code?

Factor 2 of 12: Own your prompts. If prompts are one of the main ways your application interacts with an LLM or AI model, they should be treated as first-class code. Source: 12 factor agents.
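In practice, “owning your prompts” can be as simple as keeping them as version-controlled, testable functions rather than strings buried inside framework calls. A minimal sketch (the function and prompt content below are made up for illustration):

```python
# Sketch: prompts as first-class code. The template lives in your repo,
# gets code-reviewed, and can be unit tested like any other function.
SUPPORT_AGENT_PROMPT = """You are a support agent for {product_name}.
Answer the user's question using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question:
{question}
"""

def build_support_prompt(product_name: str, context: str, question: str) -> str:
    """Render the support agent prompt. Raises if any field is empty."""
    if not all([product_name, context, question]):
        raise ValueError("All prompt fields must be non-empty")
    return SUPPORT_AGENT_PROMPT.format(
        product_name=product_name, context=context, question=question
    )

# A unit test can pin down the prompt's behaviour just like regular code.
assert "don't know" in build_support_prompt("Trashify", "ctx", "q")
```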
Apple share how they use LLMs for App Store Review Summarization. The system combines three LLMs to create summaries tailored to a specific app:
- Insight extraction LLM fine-tuned with LoRA adapters distills each review into a set of distinct insights.
- Dynamic Topic Modeling LLM distills each insight into a topic name in a standardized fashion while avoiding a fixed taxonomy. Topics are grouped and deduplicated based on embeddings and similar names. Priority is given to topics relating to “App Experience” rather than “Out-of-App Experience” (e.g. the quality of food for a food delivery app rather than the app itself).
- Summary Generation LLM fine-tuned with LoRA adapters on a large, diverse dataset of reference summaries written by humans is used to generate a summary from the selected insights. This model is aligned with DPO (Direct Preference Optimization) to match human preferences. The DPO dataset was created from summary pairs, with one item being the model’s initially generated output and the other the subsequent human-edited version.
A good case study in how multiple LLMs, each with a specialized function, can be combined to perform a task at scale.

Example of an LLM-generated review summary based on existing user reviews in the App Store. Source: Apple ML Research blog.
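The exact models and prompts are Apple’s own, but the shape of the three-stage flow is easy to sketch. Everything below (function names, topic selection heuristics) is hypothetical and only illustrates the structure described above, not Apple’s implementation.

```python
# Hypothetical sketch of a three-stage review summarization flow.
# extract_insights / assign_topic / generate_summary stand in for the three
# fine-tuned LLMs described in Apple's post; they are not real APIs.
from collections import defaultdict

def summarize_app_reviews(reviews, extract_insights, assign_topic, generate_summary):
    # Stage 1: an insight-extraction LLM distills each review into distinct insights.
    insights = [i for review in reviews for i in extract_insights(review)]

    # Stage 2: a dynamic topic-modeling LLM names a topic per insight; topics are
    # then grouped/deduplicated (Apple do this with embeddings and name similarity).
    topics = defaultdict(list)
    for insight in insights:
        topics[assign_topic(insight)].append(insight)

    # Select insights from the most common topics (Apple also prioritize
    # "App Experience" topics over "Out-of-App Experience" ones).
    top_topics = sorted(topics, key=lambda t: len(topics[t]), reverse=True)[:5]
    selected = [i for t in top_topics for i in topics[t]]

    # Stage 3: a DPO-aligned summary-generation LLM writes the final summary.
    return generate_summary(selected)
```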
vLLM adds support for the Transformers backend and shares best practices for accelerating RLHF. vLLM is a serving and inference engine for LLMs. Meaning, if you want to serve and deploy your LLMs as well as have them run faster, vLLM is one of the best tools on the market. I’ve personally noticed speedups of 20x running LLMs such as Phi-4 with vLLM versus the native Hugging Face Transformers implementation. The good news is vLLM is expanding support to many more Transformers models via the --model-impl transformers flag. The vLLM team also share best practices for generating data (which uses a large amount of inference) for Reinforcement Learning from Human Feedback techniques.
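For example, switching to the Transformers backend is a one-flag change on the CLI (the flag comes from the vLLM announcement). The Python sketch below mirrors that flag with a `model_impl` argument, which is my reading of the same option, so double-check the keyword against the vLLM docs for your version; the Phi-4 model id is an example.

```python
# CLI serving with the Transformers backend:
#   vllm serve microsoft/phi-4 --model-impl transformers
#
# Minimal offline inference sketch with the Python API.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/phi-4", model_impl="transformers")  # model_impl mirrors the CLI flag
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what vLLM does in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)
```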
Bespoke Labs show how to use Reinforcement Learning to improve Qwen2.5-7B-Instruct’s tool use by 23% with only 100 samples. If you’re building an AI Agent, one of the most important requirements is for the underlying LLM to be able to use tools (think of tools as structured outputs that call a certain function, e.g. “what is the weather?” → get_weather_tool).
The article contains good tidbits and training recipe steps, such as filtering out responses with ultra-long outputs to prevent the model from getting stuck in recursive loops.
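To make the “tools as structured outputs” idea concrete, here’s what a get_weather tool definition typically looks like in the widely used JSON-schema function-calling format. The schema and toy implementation are illustrative, not from the Bespoke post.

```python
# Illustrative tool definition in the common JSON-schema function-calling format.
# The model learns to emit a structured call like:
#   {"name": "get_weather", "arguments": {"location": "Brisbane"}}
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name, e.g. 'Brisbane'"}
            },
            "required": ["location"],
        },
    },
}

def get_weather(location: str) -> str:
    """Toy implementation the agent runs when the model emits a get_weather call."""
    return f"It's 25°C and sunny in {location}."  # placeholder response
```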
Bespoke Labs share a case study on how to scale up synthetic data to build a high-quality chart extraction model. Bespoke’s MiniChart-7B is a fine-tuned version of Qwen2.5-VL-7B-Instruct which is capable of performing on par with or better than models such as Gemini 1.5 Pro and Claude 3.5 at chart information extraction.
They achieved this thanks to a four-stage synthetic data curation pipeline starting from 40k real-world chart images: extract facts, generate questions about the charts, answer the questions, then augment the questions and regenerate answers.
Their final dataset contains 270k chart-question-CoT-answer tuples (CoT = Chain of Thought, in other words, the model’s thinking steps outlined line by line), built from 13k images and 91k curated unique QA pairs with 3 CoT traces each. This is a really cool example of how targeted synthetic data generation can get you outstanding results with a smaller, open model.

Example of a training question-and-answer pair for creating Bespoke-MiniChart-7B. The chart image has a question related to the information contained within it and the associated answer. The text output of the model shows thinking steps as well as the final answer. These samples (chart, question and thinking trace) are used to fine-tune the model. Source: Bespoke Labs blog.
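For a sense of what one of those chart-question-CoT-answer tuples might look like, here’s a hypothetical record. The field names and values are made up for illustration; see Bespoke’s blog for the real data format.

```python
# Hypothetical example of a single chart-question-CoT-answer training record.
# Field names and contents are illustrative only.
sample = {
    "image": "charts/quarterly_revenue_0421.png",
    "question": "Which quarter had the largest revenue increase over the previous quarter?",
    "cot": [
        "Read the revenue values: Q1=12M, Q2=15M, Q3=21M, Q4=23M.",
        "Compute the quarter-over-quarter increases: +3M, +6M, +2M.",
        "The largest increase (+6M) happens from Q2 to Q3.",
    ],
    "answer": "Q3",
}
```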
Brakes on an intelligence explosion. Nathan Lambert, one of my favourite writers in the AI space and the post-training lead at the Allen Institute for AI, writes a series of inquiries about why he thinks AI 2027 (some form of Superhuman AI Researcher by 2027) won’t happen.
Points include: labs make progress on evaluations by bootstrapping similar problems; current AI is broad, not narrow, intelligence; data research is the foundation of algorithmic AI progress; and there is over-optimism around RL training (the real world doesn’t have as many narrow objectives as most RL systems tend to optimize for).
There are no new ideas in AI… only new datasets. ImageNet, the Web, human preferences and verifiers, what do all of these have in common?
Jack Morris writes how they’re each a dataset that led to a major AI innovation. ImageNet led to superhuman computer vision systems, the Web (the whole internet of text) led to pretraining LLMs, human preferences led to ChatGPT (Reinforcement Learning from Human Feedback) and verifiers (such as calculators and problems with verifiable answers) led to reasoning models such as DeepSeek R1. The trend seems to be that various tricks and tips for ML models all end up at similar results (e.g. Transformers and CNNs perform on par on computer vision tasks); the main thing that drives significant progress is a high-quality dataset.
Daniel’s open-source of the month
- Meta releases Llama 4 Scout (17B active parameters, 109B total parameters) and Maverick (17B active parameters, 400B total), two highly performant and open-weight multi-modal LLMs (they can handle text and images), as well as a suite of other releases at LlamaCon such as the Llama API, Llama Guard 4 12B (a safety check for LLMs and VLMs), Llama Prompt Guard 2 86M (a classifier for prompt attacks) and SAM 3 (coming soon). These large models are incredibly impressive. But if you ask me, I’d like to see smaller versions of Llama 4, same style but perhaps in the 1B to 32B range. These are models people can run on their own devices much more easily. See the LlamaCon 2025 livestream replay.

SAM 3 (Segment Anything 3) announced as ‘coming soon’ at LlamaCon 2025. Source: LlamaCon 2025 livestream.
Perception Encoder: A state-of-the-art language-aligned vision encoder which performs on par with or better than SigLIP2. Perception Encoder comes in three flavours: Core, Language-aligned (e.g. for use with a VLM) and Spatial-aligned (e.g. for use with object detection/segmentation).
Perception LM: An open data, open training VLM which performs on par with Qwen2.5-VL-7B, perfect for those looking to see how modern VLMs are trained.
Locate 3D: An open-vocabulary detection model capable of detecting items in 3D space. For example, you can query “bicycle” and the model will find where in a 3D scene a bicycle is detected.
Byte Latent Transformer: A tokenizer-free language model that operates on raw bytes rather than tokens. The first model of its kind to match the performance of tokenizer-based language models while training on raw bytes.

Example of Locate 3D working on 3D point clouds in a room detecting a natural language query of “bicycle”. Source: Locate 3D website.
- Open-Qwen2-VL reproduces or betters Qwen2-VL with 0.36% of the training tokens (5 billion tokens vs. 1.4 trillion tokens). All data, models and training code are available.
- Qwen3 models get released everywhere. Qwen3 is the latest instalment of foundation language models from the Alibaba Cloud team. Their flagship models Qwen3-235B-A22B (235B total parameters, 22B active parameters), Qwen3-30B-A3B and Qwen3-32B perform in the range of Gemini 2.5 Pro, GPT-4o and OpenAI o1 respectively. The models also come in 0.6/1.7/4/8/14B parameter variants, all of which perform on par or better than Qwen2.5 models with double the parameter counts (e.g. Qwen3-14B performs on par with Qwen2.5-32B). The release is a masterclass in how to get your open-source models out there. Qwen3 variants are available everywhere: Transformers, Ollama, MLX LM, vLLM and more in the GitHub repo.
- Sentence Transformers v4.0 is out and it allows you to train your own custom rerankers. A reranker is tasked with answering “given these top-k documents, which suits the query best?”. If you are working on a RAG pipeline, you might use an embedding model to retrieve the top 50 documents given a query. A reranker is then dedicated to sorting those top 50 into the best order to satisfy the query. There are many open-source general-purpose rerankers out there, but as with most ML systems, if you have your own custom data, it’s often best to train your own custom model (see the reranking sketch after this list).
- Microsoft release BitNet-b1.58-2B-4T, a native 1-bit LLM. Less memory, faster inference, less energy. That’s the potential of using 1-bit LLMs. Instead of full-precision or float16 weights, BitNet uses weights that are either -1, 0 or 1. Modern hardware doesn’t cater to 1.58-bit matrix multiplications, so the authors had to develop their own custom CUDA kernel. BitNet is able to perform on par with similar-sized models despite being trained from scratch in 1-bit format. See the GitHub or technical report for more.
- Dia-1.6B is an open-weight dialogue model capable of generating highly realistic dialogue from a transcript. For example, you can use [S1] “Some text here” and [S2] “Some more text here” to create dialogue between two speakers. You can even add tags such as (laughs) and (sighs) in between the speech and the model will take those into account.
- Three new VLMs to try out: ByteDance’s SAIL-VL-1.6B, Kimi-VL-A3B (3B active parameters) and InternVL3 variants (1/2/8/9/14/38/78B parameters). All score favourably on the OpenCompass VLM leaderboard, with InternVL3-38B and InternVL3-78B in 2nd and 3rd position behind Gemini 2.5 Pro.
- NVIDIA’s Describe Anything Model (DAM) is capable of describing specific regions of an image with text, in either low or high detail. See the demo online.

Example of selecting an object in an image (using a segmentation model) and then having the DAM model describe what’s there. Source: DAM demo.
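As mentioned in the Sentence Transformers item above, here’s a minimal reranking sketch using an off-the-shelf cross-encoder. The model name is one public example reranker (swap in your own fine-tuned model), and the query/documents are made up for illustration.

```python
# Minimal reranking sketch with Sentence Transformers' CrossEncoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example public reranker

query = "How do I evaluate an object detection model?"
candidate_docs = [
    "IoU measures the overlap between predicted and ground truth boxes.",
    "Sourdough bread needs a long, slow fermentation.",
    "mAP averages precision across classes and IoU thresholds.",
]

# Score each (query, document) pair, then sort the documents by relevance.
scores = reranker.predict([(query, doc) for doc in candidate_docs])
ranked = sorted(zip(candidate_docs, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```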
Papers of note
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? — This paper asks whether RL adds new reasoning capability or simply improves the sampling of capabilities already present in the base model. For example, a base LLM might be able to solve a hard problem given enough tries, and RL makes this process more efficient (e.g. fewer tries to solve the same problem).
- Multi-label Cluster Discrimination for Visual Representation Learning — Shows a highly performant training scheme for learning a visual representation from a large dataset. First, you cluster images into a large number of clusters (e.g. 100k to 1M clusters) and then you assign each image multiple labels based on the clusters it is near. For example, a food image might be near several clusters (e.g. avocado, banana, apple), so it gets all of those labels. Doing this at scale provides a strong baseline visual representation, which powers models such as MLCD-Seg (an open-world segmentation model) and MLCD-ViT (an extremely good vision backbone for VLMs). See the toy clustering sketch after this list.
- SmolVLM: Redefining small and efficient multimodal models — SmolVLM-256M and SmolVLM-500M redefined what’s possible with a smaller number of parameters. Now the recipe for how they were made is out. There’s also a demo iOS app, HuggingSnap, that runs SmolVLM on-device, available on the App Store for free. The app’s code is also on GitHub so you can see how to get such a model working on-device.
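As referenced in the MLCD paper note above, the multi-label cluster labelling idea is easy to sketch: cluster the embeddings, then label each image with its several nearest clusters rather than just one. This is a toy version with scikit-learn and random vectors, not the paper’s code; the real pipeline works on image embeddings at a vastly larger scale.

```python
# Toy sketch of multi-label cluster discrimination labelling (not the paper's code).
# Real MLCD uses image embeddings and ~100k-1M clusters; here we use random vectors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(1_000, 64))  # stand-in for image embeddings

# Step 1: cluster the embedding space (the paper uses a very large number of clusters).
kmeans = KMeans(n_clusters=50, random_state=42).fit(embeddings)

# Step 2: label each image with its k nearest cluster centroids (multi-label),
# instead of only its single assigned cluster.
k = 3
distances = kmeans.transform(embeddings)              # (n_images, n_clusters) distances
multi_labels = np.argsort(distances, axis=1)[:, :k]   # k closest cluster ids per image

print(multi_labels[0])  # e.g. the 3 cluster ids used as labels for image 0
```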
Videos and Talks
- [Video] OpenAI engineers talk about GPT-4.5 pre-training from a technical standpoint, including timeline, hardware and data preparation. As an engineer, it’s really cool to see some of the behind-the-scenes details.
- [Video] Building and evaluating AI Agents by Sayash Kapoor from AI Snake Oil — A talk discussing how AI Agents are already here: ChatGPT is much more than just a model now, and the same goes for other systems such as Claude and Gemini. It’s perhaps the job of the AI Engineer to get a system from 99% reliable to 99.99% reliable; much like computers became more reliable over their early years, AI systems will be required to do the same.
See you next month!
What a massive month for the ML world in April!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.