[September 2025] AI & Machine Learning Monthly Newsletter

Daniel Bourke

69th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I’m an A.I. & Machine Learning Engineer who also teaches beginner-friendly machine learning courses.

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in September 2025 as an A.I. & Machine Learning Engineer... let's get you caught up!

My work

  • Birthday article: I turned 32 at the start of September and wrote an article on some things I’ve learned (I do a similar article every year).
  • Work in progress: I’m working on an LLM fine-tuning tutorial. Inside, we’ll take a small LLM and fine-tune it to perform a specific task. I’ve got the outline ready, now I just need to put it together in a fun way.
  • Coming very soon: My Hugging Face Object Detection Project Course is veryyyy close to being launched on ZTM. A quote from the editor: “lessons should be done in next couple days!”

From the Internet

Post Training 101 by Han Fang and Karthik Sankararaman

Ever wonder how an LLM reads the internet but then provides you with useful responses?

Pre-training deals with the art of next token prediction (e.g. “The fox jumped over the ___”).

But post-training takes the learned representation of token ordering and helps steer it towards something humans want or find useful.

The Post-training 101 guide walks you through examples of how this happens, including:

  • From next-token prediction to instruction following
  • Supervised Fine-Tuning (SFT) fundamentals
  • Reinforcement learning (RL) techniques such as RLHF, RLAIF, RLVR
  • Evaluation techniques for assessing model quality

I loved the flow of this blog post.

It’s an excellent guide with plenty of data examples for each technique.

sft-data-example

Example of a Supervised Fine-Tuning (SFT) sample for post-training an LLM. SFT involves showing the model examples of inputs and their desired outputs. Source: Post-training 101 blog post.
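
To make that even more concrete, here’s a minimal sketch I put together (not from the guide itself) of what a single SFT example typically looks like in the common chat-messages format, the kind of schema trainers like TRL’s SFTTrainer can consume:

# A minimal sketch of a single SFT example in the common "messages" format.
# The exact schema varies by dataset and trainer; this mirrors the widely used
# chat-messages convention.
sft_example = {
    "messages": [
        {"role": "user", "content": "Summarise this sentence: The fox jumped over the lazy dog."},
        {"role": "assistant", "content": "A fox jumped over a lazy dog."},
    ]
}

# During SFT, the model is trained to reproduce the assistant turn given the user turn.
print(sft_example["messages"][-1]["content"])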

One of my favourite quotes from the post was a reference to the Gemini 2.5 Pro paper:

The Gemini 2.5 Pro paper specifically emphasized that “Since the initial announcement of Gemini 1.5, significant advancements have been made in our post-training methodologies, driven by a consistent focus on data quality across the Supervised Fine-Tuning (SFT), Reward Modeling (RM), and Reinforcement Learning (RL) stages.”

After reading this guide, you’ll have a great reference point to many of the techniques which go into creating a world-class model.

Hugging Face TRL now supports native vision fine-tuning

If you need to fine-tune an LLM (or SLM - Small Language Model) or VLM (Vision Language Model), Hugging Face’s TRL (Transformers Reinforcement Learning) is one of your best options.

And with a recent update, TRL now natively supports vision as well as language fine-tuning.

Support includes the following RL (Reinforcement Learning) methods:

  • Mixed Preference Optimization (MPO)
  • Group Relative Policy Optimization (GRPO)
  • Group Sequence Policy Optimization (GSPO)
  • Direct Preference Optimization (DPO)

TRL also supports Supervised Fine-Tuning (SFT).

Here’s a short example of fine-tuning a VLM with TRL:

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None), # To avoid truncation that may remove image tokens during training
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()

Now if you have your own data, you can easily customize an open-source model to be even more aligned with your own unique problem space.

You can then deploy this model at your own discretion without having to worry about third-party APIs.

See the example sft_vlm.py script for more.

Bonus: Speaking of fine-tuning, here’s a Tweet from Nathan Lambert, author of The RLHF Book and researcher at Ai2, on using RL-based fine-tuning versus SFT. In my experience, small models fine-tuned on specific data can equal or exceed the performance of those much much larger (which haven’t been fine-tuned).

tweet-for-small-models-using-sft

Rule of thumb: If you have a model under 15B parameters, try supervised fine-tuning. If you’ve got a big model, go RL. Of course, actual experience may vary. Best to experiment, experiment, experiment! Source: Nathan Lambert on X.

Ethan Ding writes about agents and clouds

Do you build an AI agent and then a cloud?

Or do you build a cloud and then an agent?

OpenAI has an agent but no cloud (yet).

Google has both.

A few startups sit in between, like Bolt and Lovable as users of agents.

And Netlify and Supabase as providers of cloud services.

Anthropic is trying to go vertical with their Claude model as well as all the tools to go alongside it.

As powerful as models get, you still need a place to store data (a database) and a place to run application code (web services).

agents-vs-clouds

Agents (yellow) vs Clouds (green) and everything in between. Source: Ethan Ding blog.

Modal releases notebooks — GPU-powered notebooks in seconds

Speaking of agents and clouds, Modal is a fast compute provider that can scale from 1 GPU to many in seconds.

And they just released support for Jupyter Notebooks hosted on Modal.

That means you can start up a Jupyter Notebook instance with multiple GPUs on the backend in a few seconds and have it automatically power down when you’re not using it (the default is 30 minutes of idle = shutdown).

I was able to go from zero to running Qwen3-4B on an H100 GPU in ~2 minutes.

This is really helpful to experiment with a smaller GPU locally or even on Google Colab and then easily upgrade the hardware using Modal when necessary.

Modal bills per second so you only get charged when the notebook is running.

modal-notebooks-example

Example interface of using Modal notebooks with an H100 on the backend. Source: Modal documentation + my own use case.

Case study: Hugging Face Is All You Need

A cool read from the Finegrain team (they make high-quality image editing tools) on how sometimes simple is best when it comes to improving model quality.

They share how they use the Hugging Face ecosystem to create:

  • A simple web app (using Hugging Face Spaces) that lets human testers play with the Eraser model — pick an image, brush over an object to erase, and carefully inspect the result.
  • A way to report any issues (using Hugging Face Datasets) to record the inputs/outputs and describe what went wrong from a quality perspective.

Doing this means they can test a new model and use the discovered issues to improve it over time.

This is a similar workflow to the one I’ve recently been using for an object detection project with a client: train a model, upload to Hugging Face Spaces, try it out, improve it with better samples in the next training run, upload, test, repeat.

An experiment loop is all you need.

Case study: TensorZero finds out you can save 30x on costs and 4x on inference time by fine-tuning models

If you’ve got a specific use case with plenty of data, chances are, fine-tuning a smaller model will save you on time and money.

Another workflow is to use the best model you can via API, then fine-tune a smaller model (or the exact same model, if the provider allows it) to repeat the same task.

They tested Gemini 2.0 Flash, Gemini 2.0 Flash Lite, Qwen3-8B, GPT-4.1 nano and GPT-4.1 mini (both zero-shot and fine-tuned versions).

In many zero-shot settings, GPT-4.1 was the best performing.

However, after fine-tuning, almost all of the models performed on par with or better than GPT-4.1 on several benchmarks, resulting in cost savings and faster inference times.

fine-tuning-model-cost-savings

Every model tested achieved significant cost savings after fine-tuning compared to the original GPT-4.1 model. Source: TensorZero blog.

Thinking Machines discovers and publishes how to make LLMs deterministic

If you want to learn more about GPU programming, read everything Horace He has published.

See previous works such as gpt-fast and Making Deep Learning Go Brrrr From First Principles.

And his recent post, Defeating Nondeterminism in LLM Inference, in collaboration with Thinking Machines is sensational.

If you’ve used an LLM, you might’ve found that given the exact same input, it produces a different output.

Even when running locally and setting temperature to 0, this can still happen.

This can be helpful sometimes but when you’re trying to run repeatable tests, determinism is your friend.

It turns out, this isn’t a trivial problem to solve.

Until…

From the write up:

In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.

When you ping an LLM endpoint, chances are your inputs are in the queue (and batch) with someone else’s.

So while your requests might be deterministic, you can’t guarantee all of the others (along with yours) are.

llm-determinism

Source: Thinking Machines blog.

How do you fix it?

You make three operations batch-invariant (so the result doesn’t depend on what else is in the batch): RMSNorm, matrix multiplication and attention.

See the rest of the blog post for how this works in practice, as well as example code on GitHub to make it work yourself.
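
If you want a tiny taste of the batch-invariance problem, here’s a minimal sketch (assuming PyTorch, not code from the blog post) comparing the same row multiplied on its own versus as part of a larger batch:

import torch

torch.manual_seed(42)
x = torch.randn(2048, 2048)
w = torch.randn(2048, 2048)

out_batched = x @ w     # the row computed as part of a full batch
out_single = x[:1] @ w  # the exact same row computed on its own

# Mathematically identical, but different batch sizes can dispatch to different
# kernels/reduction orders, so the floating point results may not match exactly
# (especially on GPU).
print(torch.equal(out_batched[:1], out_single))
print((out_batched[:1] - out_single).abs().max())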

Stephan Tulkens’ blog is an excellent read for more on tokenizers

Tokenizers turn sequences of data such as text, audio or images into numbers so machine learning models can find patterns in them.

They are the first step and second last step in most interactions with LLMs.

For example, “Hello my name is Daniel” might get turned into [hello, my, name, Dan, iel] (I made this example up) and then would get mapped to numbers such as [43, 5556, 339, 46, 119] (again, made up).

It turns out that some tokenizers are cased (they take into account capitals, “Dan” is different to “dan”) and some are uncased (“Dan” is the same as “dan”).

Depending on how you input text into your tokenizer, this can influence performance.

The good news is that Stephan shows ways to mitigate this.
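
If you want to see the cased vs. uncased difference for yourself, here’s a quick sketch using Hugging Face tokenizers (assuming the classic bert-base-cased and bert-base-uncased checkpoints as examples):

from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained("bert-base-cased")
uncased = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello my name is Daniel"
print(cased.tokenize(text))    # keeps capitals, e.g. ['Hello', 'my', 'name', 'is', 'Daniel']
print(uncased.tokenize(text))  # lowercases first, e.g. ['hello', 'my', 'name', 'is', 'daniel']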

Anthropic release a guide on prompt engineering vs context engineering

In last month’s AI & Machine Learning Monthly (August 2025), we covered prompt engineering vs. context engineering.

The short version is:

  • Prompt engineering = usually 1 step, for example, prompt in, text out
  • Context engineering = can be multiple steps, previous steps influence future steps, add in documents, tools and more

Anthropic’s guide nails down some of these terms.

I especially like the question it asks right at the start:

What configuration of context is most likely to generate our model’s desired behaviour?

And then narrowing down on what good context engineering is (bold is mine):

Good context engineering means finding the smallest possible set of high-signal tokens that maximise the likelihood of some desired outcome. Implementing this practice is much easier said than done.

Some practical tips:

Providing examples is a well-known best practice and something we continue to advise.

And finally, a simple definition for agents:

LLMs autonomously using tools in a loop.

prompt-engineering-vs-context-engineering

Prompt engineering usually involves a simple input and output workflow, whereas context engineering can be thought of as a system which combines 1 to N steps and tools. Source: Anthropic blog.
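
To make that last definition concrete, here’s a toy, self-contained sketch of “LLMs autonomously using tools in a loop” (fake_llm and calculator are stand-ins I made up, not any provider’s API):

def calculator(expression: str) -> str:
    """A single 'tool' the agent can call."""
    return str(eval(expression))  # toy only, never eval untrusted input

def fake_llm(messages):
    """Stand-in for a real model: requests the tool once, then answers with its result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": "2 + 2"}               # step 1: ask for a tool call
    return {"answer": messages[-1]["content"]}      # step 2: final answer from the tool result

def agent(task: str, max_steps: int = 5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = fake_llm(messages)
        if "answer" in reply:                       # the model decided it's done
            return reply["answer"]
        result = calculator(reply["tool_call"])     # run the requested tool
        messages.append({"role": "tool", "content": result})  # feed the result back in

print(agent("What is 2 + 2?"))  # -> 4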

Daniel’s Open Source AI of the month

Qwen causes an avalanche

The Qwen team are well in front for the unofficial “Open Source AI team of the year” trophy 🏆.

It seems almost every month they’re releasing frontier models with open-source licenses across almost every domain.

This month is no different:

  • Qwen3-Omni is a series of models which can operate in all modalities: audio, video, images and text. Currently achieves the best results on 32/36 speech recognition tasks.
  • Qwen3-Next is a series of models which balances performance with number of parameters. Using 80B parameters with 3B active, the models are able to perform on par with or better than Qwen3-235B-A22B, a model with ~3x more parameters and ~7x more active parameters. It also sees incredible inference speed compared to Qwen3-32B-Base due to having ~10x fewer active parameters. I can confirm… In my few trials using the Hugging Face Inference API… they are fast!
  • Qwen3Guard is a series of models which help protect against unsafe prompts. For example, you could put a Qwen3Guard model in between your users’ inputs and your production system.
  • Qwen3-VL is a series of models which marries Qwen’s flagship text-based LLM with a vision encoder for incredible performance in the vision modality. The model is on par with or better than Gemini 2.5 Pro’s and GPT-5’s vision capabilities. In my brief hands-on experience, I’ve found it fantastic at localization and OCR (see image below). Read the Qwen3-VL blog post for more or see the Qwen3-VL cookbooks series on GitHub for ideas.

Hooooweeee! If they can keep this momentum up, it’s going to be a big finish to the year.

qwen3-vl-example

Example of using Qwen3-VL for structured data extraction. The model is capable of simultaneous localization (detecting boxes, on the left), as well as detecting text values. I then used the same model to turn the recognized text into markdown (image on the right). Looks like it got it all correct! Source: Author created.

Google DeepMind Release EmbeddingGemma, a 308M parameter embedding model

Embedding models are designed to turn text into learned representations which can be compared and used for tasks such as retrieval.

For example “largest countries in the world” might get turned into [0.234, 0.934, 0.004…].

EmbeddingGemma does this with sensational performance at a very manageable size.

Some quick stats:

  • 100 languages
  • Highest ranking embedding model under 500M parameters
  • Small enough to run on device
  • 2k context window (embed sequences up to 2k tokens long)
  • Can leverage Matryoshka Representation Learning (MRL) to convert embeddings to size 768, 512, 256 or 128
  • Quantization-Aware Training (QAT) applied during fine-tuning so the models can be quantized (made even smaller) without large performance loss

Note: You should be aware of the prompt instructions. Each task has a specific prompt instruction. For example, to embed the query “largest countries in the world”, you should prefix it with “task: search result | query: {content}”, giving “task: search result | query: largest countries in the world”. See the EmbeddingGemma model page for more.
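
Here’s a minimal sketch of what that looks like with sentence-transformers (I’m assuming the google/embeddinggemma-300m checkpoint name, so double-check the model page):

from sentence_transformers import SentenceTransformer

# Model ID assumed from the EmbeddingGemma release, check the model page to confirm.
model = SentenceTransformer("google/embeddinggemma-300m")

# Task-specific prefix as described above.
query = "task: search result | query: largest countries in the world"
embedding = model.encode(query)

print(embedding.shape)  # should be (768,) by default; MRL lets you truncate to 512, 256 or 128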

If you need to build a RAG (Retrieval Augmented Generation) pipeline, you should check out EmbeddingGemma and see how it performs.

See the Google Developer breakdown or read the research paper for more.

Apple release MobileCLIP2

MobileCLIP2 is a model designed to match images and text.

CLIP stands for “Contrastive Language-Image Pretraining”, so images and texts get encoded into the same embedding space.

The MobileCLIP2 models have been designed to run on mobile devices such as iPhones (hence “mobile” in their name).

This means they’re lightweight and run fassssssssst: they’re capable of running in between 3ms and 30ms on an iPhone 12 Pro.

It’s highly likely that the MobileCLIP2 models are what power the search feature on Apple’s Photos app, allowing you to search for things like “Georgia standing in front of a book shelf”.

Get the MobileCLIP2 models on Hugging Face, try out a MobileCLIP2 demo on your own images and read the research paper for more.
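
If you want to play with the image-text matching idea in code, here’s a minimal sketch. It uses the standard CLIP checkpoint in transformers as a stand-in because MobileCLIP2 has its own loading code (see its model pages), but the matching logic is the same:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("your_photo.jpg")  # replace with your own image
texts = ["a person standing in front of a book shelf",
         "a dog playing on the beach"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # similarity of the image to each text

print(logits.softmax(dim=-1))  # the better-matching description gets the higher score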

mobileclip2-demo-image

Since CLIP-like models are trained to match texts and images, the better your text matches an image (and vice versa), the higher score it will get. And since MobileCLIP2 has seen many different kinds of images and texts, it can even match its own results graph quite well. Notice the more descriptive texts get a higher score. Source: MobileCLIP2 demo.

5TB of high-quality vision data, ready for training via FineVision

One of the most underrated skills in the world of AI is being able to curate a dataset.

Two things seem to be true: scale is important (more samples are generally better) and task-specific samples are important.

That’s reflected in the new FineVision dataset from Hugging Face.

Some quick details:

  • 24 million samples
  • 200 datasets combined into a similar interface
  • 17M images
  • 89M question-answer turns
  • 10B answer tokens
  • 5TB of high-quality data

The researchers quote their efforts as:

FineVision was a giant act of data curation. We started by collecting publicly available datasets, and augmenting underrepresented categories. We then evaluated all datasets for duplicated data internally and benchmark contamination. This data is then cleaned and rated, before being added to the final mixture.

It turns out it was worth it: a nanoVLM model trained on the FineVision dataset outperforms nanoVLM models trained on similar but smaller datasets:

finevision-results

Rankings of different nanoVLM instances trained on various image + text datasets. FineVision-trained models start out slower but end up being the best ranked overall with longer training. Source: FineVision blog post.

Finally, another interesting insight based on whether it’s important to only train on “high quality” samples (samples with an average rating of X or above) or to simply train on everything:

Simply training on the most diverse data, that one containing all samples, outperforms in benchmarks (Fig. 6) (Fig. 7). This could mean multiple things. Firstly, we can see almost the same distribution in the ranks across all filters: from best to worst with an increase in the rating threshold. For example the visual dependency and the image correspondence rating both result in exactly the same distribution of rankings, corresponding to the natural order of options, 1 through 5. This could indicate that with a sufficiently large dataset that you train on long enough, it hurts more to remove samples, even if they were judged to be of low quality, than to train on them.

It seems the combination of both scale and high quality samples is what results in the best performing model.

Ettin encoder and decoder pairs of models outperform ModernBERT

Recently LLMs have favoured decoder models.

But encoder models are still incredibly useful.

When building the Ettin series of models, the researchers found:

The results show clear patterns:

Encoders dominate classification and retrieval: On MNLI classification, even a 150M encoder (89.2) outperforms a 400M decoder (88.2). For retrieval tasks, the gap is smaller but still noticeable - especially when decoders are not trained with MNTP.

Decoders excel at generation: On generative tasks, decoders maintain consistent advantages, with the performance gap actually widening at larger model sizes.

Size doesn't always matter: A 400M encoder beats a 1B decoder on classification tasks, while a 400M decoder beats a 1B encoder on generation tasks.

The good news is, for every encoder in the Ettin series, there’s a matching decoder.

The only difference is the training objective.

Encoder models use bidirectional attention, allowing each token to “see” all other tokens in the sequence.

Decoder models use causal attention, where only previous tokens are visible to enable autoregressive generation.
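
As a quick illustration of that difference, here’s a minimal sketch (assuming PyTorch) of the two attention masks:

import torch

seq_len = 5

# Encoder-style (bidirectional): every token can attend to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

# Decoder-style (causal): each token can only attend to itself and earlier tokens.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])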

The Ettin models come in the following sizes: 17M parameters, 32M, 68M, 150M, 400M and 1B.

A perfect collection of sizes to run on smaller devices without compromising on performance, as each performs on par with or better than an equivalent model of its size.

See the Hugging Face collection, research paper and GitHub for more.

TinyLettuce Encoders show how powerful small specific models can be

Ranging from 17M-68M parameters (that’s million rather than billion), the TinyLettuce models punch well above their weight, performing on par with or better than much larger models such as gpt-oss-120b and Qwen3-235B on a hallucination detection task.

Given a context, question and answer, the models predict at a token level which tokens in the answer might be hallucinated.

They achieve these results by creating task-specific synthetic data and then fine-tuning a set of base Ettin encoders (see above) on it.

The models are even small enough to run on CPU.

If you’ve got a large scale repeatable specific task, I’d highly recommend checking out the TinyLettuce blog post for ideas you can reproduce in your own workflow.

Christmas for OCR models!

  • MinerU2.5 is a 1.2B VLM focused on OCR, capable of parsing tables and mathematical formulas. It can operate at 2.12 fps on an A100. AGPL-3.0 licensed.
  • POINTS-Reader is a fine-tuned version of Qwen2.5-VL-3B capable of outputting markdown and HTML from documents. The paper shows an excellent methodology for bootstrapping documents for pretraining. Apache 2.0 license.
  • PP-OCRv5 and PP-StructureV3 are the latest versions of the PaddlePaddle document recognition models. These models are far smaller than most (in the range of less than 100M parameters). The library is quite fleshed out on GitHub. Available under Apache 2.0 license.
  • Granite-Docling-258M is a combination of SigLIP2-Base-512 as well as an LLM (Granite 156M). The model is designed to take in documents and output the docling format, which provides items in specific tags, such as <chart>, <formula>, <code> as well as their locations in <loc> tags. Trained using the nanoVLM framework and available under Apache 2.0, try the demo on your own documents.

docling-extract-example

Example of the granite-docling-258m model extracting docling style format from an input. The model automatically creates bounding boxes as well as extracts the target items. Source: granite-docling-258m demo.

Z.ai releases GLM-4.6 hot on the heels of Claude Sonnet 4.5

GLM-4.6 is an MIT-licensed model with a longer context window (128k → 200k) and incredible coding performance (on par with Claude Sonnet 4 and 4.5).

A sensational open-source alternative to other code-focused models.

Mistral updates Magistral with vision

Magistral is Mistral’s open-source reasoning model with 24B parameters.

The September 2025 update equips the model with:

  • Multimodality — the model can now handle vision and text inputs
  • Better performance — all round better performance than the previous version on several benchmarks
  • Less over-generation — the model stops generating when it should rather than getting caught in a loop
  • Think tokens — generates thinking traces between [THINK] and [/THINK] for easy extraction

An excellent example of how post-training (see the Post Training 101 article above) can influence a model!

Releases

  • Google DeepMind releases Gemini Robotics 1.5 with state of the art planning and spatial pointing capabilities. The pointing capabilities really impressed me. The model is able to accurately point at many different objects in a given scene. It then uses these point coordinates to help it move and complete actions in physical space. See the example notebook on GitHub.

gemini-1.5-robotics

Gemini Robotics 1.5 visual capabilities can be used for spatial reasoning in images. The model has been trained for pointing at generic objects for robotics use cases, however, I also see a potential here for data annotation or general item interaction in the real world. Source: Author created from Gemini Robotics 1.5 blog and custom image.

Research

Two somewhat related papers this month on time series foundation models and video foundation models being few-shot (and zero-shot) learners (similar to the recent trend of LLMs being few-shot learners for text):

  • Video models are zero-shot learners and reasoners — Research showing video foundation models such as Veo 3 showcase zero-shot capabilities such as perception and modelling.
  • Time series foundation model can be few shot learners — Research showing how a time series foundation model can be given a few examples for In-Context Fine-tuning (ICF) and have its performance improve 6.8% above baseline. ICF was also shown to be on par with a fully fine-tuned (FT) model on a specific dataset (0.776 MASE for the FT model vs 0.777 MASE for the ICF model).


See you next month!

What a massive month for the ML world in September!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can check out all Zero To Mastery courses to learn more.
