[April 2025] AI & Machine Learning Monthly Newsletter

Daniel Bourke

64th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I’m an A.I. & Machine Learning Engineer who also teaches beginner-friendly machine learning courses.

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in April 2025 as an A.I. & Machine Learning Engineer... let's get you caught up!

My work

  • A hands-on guide to IoU (Intersection over Union) — Intersection over Union (IoU) is a way to measure how much two bounding boxes overlap. If two boxes perfectly overlap, they have an IoU score of 1.0. If they don’t overlap at all, they have an IoU score of 0.0. This IoU score is an important step in evaluating object detection models. It’s part of the mAP (Mean Average Precision) metric calculation, so every time you see a mAP metric for a new YOLO or detection model, IoU is likely being used under the hood (a minimal IoU calculation is sketched just after this list). See the related previous post on different bounding box formats.


  • [Coming Soon] Project: Build a custom object detection model with Hugging Face Transformers — I’m working on a new ZTM project to build Trashify 🚮, a custom object detection model to incentivise picking up trash in a local area. The code is complete and I’m in the process of making supplementary materials (tutorial text, slides, videos, evaluation breakdowns). Stay tuned for the completed release!
  • Video version of ML Monthly March 2025 — If you like seeing video walkthroughs of these kinds of materials (videos tend to be better for demos), check out the video walkthrough of last month’s ML Monthly. The video walkthrough for this issue (April 2025) should be live a couple of days after the text version gets posted!
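
To make the IoU idea concrete, here’s a minimal sketch of computing it for two axis-aligned boxes in (x_min, y_min, x_max, y_max) format (the box coordinates below are made up for the example):

```python
def iou(box_a, box_b):
    """Compute Intersection over Union for two boxes in (x_min, y_min, x_max, y_max) format."""
    # Coordinates of the intersection rectangle.
    x_min = max(box_a[0], box_b[0])
    y_min = max(box_a[1], box_b[1])
    x_max = min(box_a[2], box_b[2])
    y_max = min(box_a[3], box_b[3])

    # Intersection area is zero if the boxes don't overlap at all.
    intersection = max(0, x_max - x_min) * max(0, y_max - y_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (10, 10, 50, 50)))  # 1.0 -> perfect overlap
print(iou((10, 10, 50, 50), (60, 60, 90, 90)))  # 0.0 -> no overlap
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14 -> partial overlap
```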

From the Internet

NVIDIA’s new cuML (CUDA ML) framework enables Scikit-Learn to be run on NVIDIA GPUs. It comes built into Google Colab and offers a “no code change” setup. See the graphic below for an example of a 92x speedup using the RandomForestClassifier model, with all the code available in an example notebook.


NVIDIA’s cuML speeds up Scikit-Learn on various tasks by up to 92x (potentially higher depending on the task).
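
If you want to try the “no code change” setup yourself, here’s a minimal sketch of what it looks like in a Colab notebook with a GPU runtime (the dataset and model below are placeholders; exactly which estimators get accelerated depends on your cuML version):

```python
# Load the cuML accelerator *before* importing scikit-learn so supported estimators
# are transparently dispatched to the GPU (unsupported ones fall back to the CPU).
%load_ext cuml.accel

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100_000, n_features=50, random_state=42)

# Standard scikit-learn code, no changes required.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
print(model.score(X, y))
```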

Intel’s AutoRound framework helps to quantize (make smaller) LLMs and VLMs without losing large amounts of accuracy. Models quantized with AutoRound can retain 99%+ of their original accuracy whilst requiring 2-3x less memory. AutoRound is capable of running on a single GPU (e.g. an A100 80GB) in a couple of minutes (for smaller models) to a couple of hours (for larger models).

Retained performance (average across 13 tasks) of quantized models versus their original 16-bit implementations, along with how long each model took to quantize under AutoRound’s “best”, “default” and “light” tuning settings.
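
As a rough idea of the workflow, quantizing a model with AutoRound’s Python API looks something like the sketch below (based on the project’s documented usage; the model name is just an example and the exact interface may differ between versions, so check the AutoRound repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-7B-Instruct"  # example model, swap for your own
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantize the weights to 4 bits with AutoRound's default tuning recipe.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# Save the quantized model so it can be loaded for inference later.
autoround.save_quantized("./qwen2.5-7b-instruct-autoround-4bit")
```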

12 factor agents is the AI Agent version of the twelve-factor app. If you’re looking to build AI Agents, you should read through these twelve steps. The author, Dex, breaks Agents down into bite-sized components in the spirit of maximizing experimentation whilst retaining control.

My favourite is Factor 2: own your prompts. If prompts are your main entry point to an LLM, why wouldn’t you treat them like first-class code?


Factor 2 of 12: Own your prompts. If prompts are one of the main interactions your applications have with an LLM or AI model, they should be treated as first-class code. Source: 12 factor agents.
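
In practice, “owning your prompts” can be as simple as keeping them as plain, version-controlled functions in your codebase rather than hidden inside a framework, so they can be reviewed, diffed and tested like any other code. A minimal sketch (all names here are illustrative):

```python
# prompts.py -- prompts live in the repo, get code-reviewed and can be unit tested.

def deploy_assistant_system_prompt(environments: list[str]) -> str:
    """System prompt for a hypothetical deployment agent."""
    return (
        "You are a careful deployment assistant.\n"
        f"You may only deploy to these environments: {', '.join(environments)}.\n"
        "If a request is ambiguous, ask a clarifying question instead of acting."
    )

def test_prompt_mentions_all_environments():
    prompt = deploy_assistant_system_prompt(["staging", "production"])
    assert "staging" in prompt and "production" in prompt

if __name__ == "__main__":
    test_prompt_mentions_all_environments()
    print(deploy_assistant_system_prompt(["staging", "production"]))
```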

Apple share how they use LLMs for App Store Review Summarization. The system combines three LLMs to create summaries tailored to a specific app:

  1. Insight extraction LLM fine-tuned with LoRA adapters distills each review into a set of distinct insights.
  2. Dynamic Topic Modeling LLM distills each insight into a topic name in a standardized fashion while avoiding a fixed taxonomy. Topics are grouped and deduplicated based on embeddings and similar names. Priority is given to topics relating to “App Experience” rather than “Out-of-App Experience” (e.g. the quality of food for a food delivery app rather than the app itself).
  3. Summary Generation LLM fine-tuned with LoRA adapters on a large, diverse dataset of reference summaries written by humans is used to generate a summary from the selected insights. This model is aligned with DPO (Direct Preference Optimization) to match human preferences. The DPO dataset was created from pairs of summaries, with one item being the model’s initially generated output and the other the subsequent human-edited version.

A good case study in how multiple LLMs, each with a specialized function, can be combined to perform a task at scale. A rough sketch of the pipeline shape is below.
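
At a high level, the pipeline is three model calls chained together. Here’s what that shape looks like (my own illustration, not Apple’s code; the call_insight_llm, call_topic_llm and call_summary_llm functions are hypothetical stubs standing in for the three fine-tuned models):

```python
# Hypothetical stand-ins for the three fine-tuned LLMs, stubbed so the sketch runs.
def call_insight_llm(review: str) -> list[str]:
    return [review.strip()]  # real model: distill the review into distinct insights

def call_topic_llm(insight: str) -> str:
    return "App Experience: general"  # real model: a standardized topic name

def call_summary_llm(insights: list[str]) -> str:
    return "Users mention: " + "; ".join(insights)  # real model: LoRA + DPO-aligned summary

def summarize_app_reviews(reviews: list[str]) -> str:
    # 1. Insight extraction: each review -> a set of distinct insights.
    insights = [i for review in reviews for i in call_insight_llm(review)]

    # 2. Dynamic topic modelling: insight -> topic, then group insights and prioritise
    #    "App Experience" topics over "Out-of-App Experience" ones.
    topics: dict[str, list[str]] = {}
    for insight in insights:
        topics.setdefault(call_topic_llm(insight), []).append(insight)
    selected = [i for topic, items in topics.items()
                if topic.startswith("App Experience") for i in items]

    # 3. Summary generation: selected insights -> a short natural-language summary.
    return call_summary_llm(selected)

print(summarize_app_reviews(["Great interface, easy to order.", "Delivery was late twice."]))
```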


Example of an LLM-generated review summary based on existing user reviews in the App Store. Source: Apple ML Research blog.

vLLM adds in support for a Transformers backend and shares best practices for accelerating RLHF. vLLM is a serving and inference engine for LLMs, meaning if you want to serve and deploy your LLMs as well as have them run faster, vLLM is one of the best tools on the market. I’ve personally noticed speedups of 20x running LLMs such as Phi-4 with vLLM versus the native Hugging Face Transformers implementation. The good news is vLLM is expanding support to many more Transformers models by adding the --model-impl transformers flag. The vLLM team also share best practices for generating data (using a large amount of inference) for Reinforcement Learning from Human Feedback techniques.
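
For reference, using the Transformers backend from Python looks roughly like the sketch below (assuming a recent vLLM release that includes the Transformers backend; the model name is just an example, and model_impl mirrors the --model-impl flag mentioned above):

```python
from vllm import LLM, SamplingParams

# Ask vLLM to load the model via the Hugging Face Transformers implementation
# rather than a native vLLM one (equivalent to the --model-impl transformers flag).
llm = LLM(model="microsoft/phi-4", model_impl="transformers")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain in one sentence why inference engines like vLLM are fast."], params)
print(outputs[0].outputs[0].text)
```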

Bespoke Labs show how to use Reinforcement Learning to improve Qwen2.5-7B-Instruct’s tool use by 23% with only 100 samples. If you’re building an AI Agent, one of the most important requirements is for the underlying LLM to be able to use tools (consider tools as being structured outputs to call a certain function, e.g. “what is the weather?” → get_weather_tool).

The article contains good tidbits and training recipe steps, such as filtering out responses with ultra-long outputs to prevent the model from getting stuck in recursive loops.
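
If tool use is new to you, the core idea is that the model emits a structured call (a function name plus arguments) rather than free text, and your code executes it. A minimal, framework-free sketch of the weather example (everything below is illustrative):

```python
import json

# A tool definition in the JSON-schema style most LLM APIs use for function calling.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    return f"It's 22°C and sunny in {city}."  # stub: a real tool would call a weather API

# What a tool-using LLM would be trained to output for "what is the weather in Brisbane?"
model_output = '{"tool": "get_weather", "arguments": {"city": "Brisbane"}}'

call = json.loads(model_output)
if call["tool"] == "get_weather":
    print(get_weather(**call["arguments"]))
```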

Bespoke Labs share a case study on how to scale up synthetic data to build a high-quality chart extraction model. Bespoke’s MiniChart-7B is a fine-tuned version of Qwen2.5-VL-7B-Instruct which is capable of performing on par with or better than models such as Gemini 1.5 Pro and Claude 3.5 at chart information extraction.

They achieved this thanks to a four-stage synthetic data curation pipeline that starts with 40k real-world images of charts and does the following: extract facts, generate questions about the charts, answer the questions, then augment the questions and regenerate answers.

Their final dataset ends up with 270k chart-question-CoT-answer tuples (CoT = Chain of Thought, in other words, the model’s thinking steps outlined line by line), made up of 13k images and 91k curated unique QA pairs with 3 CoT traces each. This is a really cool example of how targeted synthetic data generation can get you outstanding results with a smaller, open model.


Example of a training data question and answer pair for creating Bespoke-MiniChart-7B. The chart image has a question related to the information contained within it and the associated answer. The text output of the model shows thinking steps as well as the final answer. These samples of chart, question and thinking trace are used to fine-tune the model. Source: Bespoke Labs blog.

Brakes on an intelligence explosion. Nathan Lambert, one of my favourite writers in the AI space and the post-training lead at the Allen Institute for AI, writes a series of inquiries about why he thinks AI 2027 (some form of Superhuman AI Researcher by 2027) won’t happen.

Points include: labs making progress on evaluations by bootstrapping similar problems; current AI being broad, not narrow, intelligence; data research being the foundation of algorithmic AI progress; and the over-optimism of RL training (the real world doesn’t have as many narrow objectives as most RL systems tend to optimize for).

There are no new ideas in AI… only new datasets. ImageNet, the Web, human preferences and verifiers: what do all of these have in common?

Jack Morris writes that they’re all forms of dataset which led to the latest AI innovations. ImageNet led to superhuman computer vision systems, the Web (the whole internet of text) led to pretraining LLMs, human preferences led to ChatGPT (via Reinforcement Learning from Human Feedback) and verifiers (such as calculators and problems with verifiable answers) led to reasoning models such as DeepSeek R1. The trend seems to be that various tricks and tips for ML models all end up at similar results (e.g. Transformers and CNNs perform on par on computer vision tasks); the main thing that seems to drive significant progress is a high-quality dataset.

Daniel’s open-source of the month


SAM 3 (Segment Anything 3) announced as ‘coming soon’ at LlamaCon 2025. Source: LlamaCon 2025 livestream.

Perception Encoder: A state-of-the-art language-aligned vision encoder which performs better than or on par with SigLIP2. Perception Encoder comes in three flavours: Core, Language-aligned (e.g. for use with a VLM) and Spatial-aligned (e.g. for use with object detection/segmentation).

Perception LM: An open data, open training VLM which performs on par with Qwen2.5-VL-7B, perfect for those looking to see how modern VLMs are trained.

Locate 3D: An open-vocabulary detection model capable of detecting items in 3D space. For example, you can search for “bicycle” and the model will find where in a 3D scene a bicycle is located.

Byte Latent Transformer: A tokenizer-free language model that operates on raw bytes rather than tokens. It’s the first model of its kind to match the performance of token-based language models at scale while training on raw bytes.


Example of Locate 3D working on a 3D point cloud of a room, detecting the natural language query “bicycle”. Source: Locate 3D website.


Example of selecting an object in an image (using a segmentation model) and then having the DAM model describe what’s there. Source: DAM demo.

Papers of note

  • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs beyond the base model? — This paper asks whether RL adds new reasoning capability or just improves the sampling abilities already present in a base model. For example, a base LLM might be able to solve a hard problem given enough tries, and RL makes this process more efficient (e.g. fewer tries to solve the same problem). A quick sketch of pass@k, the metric typically used to measure “solved within k tries”, follows this list.
  • Multi-label Cluster Discrimination for Visual Representation Learning — Shows a highly performant training scheme for learning a visual representation from a large dataset. First, you cluster images into a large number of clusters (e.g. 100k to 1M clusters) and then you assign each image multiple labels based on the clusters it is near. For example, a food image might be near several clusters (e.g. avocado, banana, apple); using these nearby clusters, you assign multiple labels to the image. Doing this at scale provides a strong baseline visual representation, which underpins models such as MLCD-Seg (an open-world segmentation model) and MLCD-ViT (an extremely good vision backbone for VLMs).
  • SmolVLM: Redefining small and efficient multimodal models — SmolVLM-256M and SmolVLM-500M redefined what’s possible with a smaller number of parameters, and now the recipe for how they were made is out. There’s also a demo iOS app, HuggingSnap, that runs SmolVLM on-device, available on the App Store as a free download. The app’s code is also on GitHub, so you can see how to get such a model working on-device.
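
To make the “given enough tries” idea concrete, pass@k (the probability of solving a problem within k samples) can be estimated with the standard unbiased formula over n attempts of which c are correct. A quick sketch with made-up numbers:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n attempts (of which c are correct) solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. a base model that solves a hard problem 3 times out of 256 attempts:
print(pass_at_k(n=256, c=3, k=1))    # ~0.012 -> rarely solves it in a single try
print(pass_at_k(n=256, c=3, k=128))  # ~0.88  -> but usually solves it within 128 tries
```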

Videos and Talks

  • [Video] OpenAI engineers talk about GPT-4.5 pre-training from a technical standpoint, including timeline, hardware and data preparation. As an engineer, I found it really cool to see some of the behind-the-scenes details.
  • [Video] Building and evaluating AI Agents by Sayash Kapoor from AI Snake Oil — A talk discussing how AI Agents are already here in the sense that ChatGPT is much more than just a model now (the same goes for other systems such as Claude and Gemini), and how it’s perhaps the job of the AI Engineer to get a system from 99% reliable to 99.99% reliable, much like computers had to become more dependable over their initial years.

See you next month!

What a massive month for the ML world in April!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

More from Zero To Mastery

The No BS Way To Getting A Machine Learning Job
19 min read

Looking to get hired in Machine Learning? Our ML expert tells you how. If you follow his 5 steps, we guarantee you'll land a Machine Learning job. No BS.

6-Step Framework To Tackle Machine Learning Projects (Full Pipeline)
30 min read

Want to apply Machine Learning to your business problems but not sure if it will work or where to start? This 6-step guide makes it easy to get started today.

[April 2025] Python Monthly Newsletter 🐍
7 min read

65th issue of Andrei Neagoie's must-read monthly Python Newsletter: Python's New t-strings, The Best Programmers I Know, and much more. Read the full newsletter to get up-to-date with everything you need to know from last month.