69th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Ever wonder how an LLM reads the internet but then provides you useful responses?
Pre-training deals with the art of next token prediction (e.g. “The fox jumped over the ___”).
But post-training takes that learned representation of token ordering and steers it towards outputs humans actually want or find useful.
The Post-training 101 guide walks you through examples of how this happens, including:
I loved the flow of this blog post.
It’s an excellent guide with plenty of data examples for each technique.
An example of Supervised Fine-Tuning (SFT) data for post-training an LLM. SFT involves showing the model example inputs paired with the desired outputs. Source: Post-training 101 blog post.
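To make that concrete, here's a tiny, made-up SFT example in the common chat-messages format (the exact schema varies between frameworks, so treat this as a sketch):
# A single made-up SFT training example: an input paired with the desired output.
sft_example = {
    "messages": [
        {"role": "user", "content": "Summarise the plot of Romeo and Juliet in one sentence."},
        {"role": "assistant", "content": "Two young lovers from feuding families secretly marry, and a series of misunderstandings ends in both of their deaths."},
    ]
}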
One of my favourite quotes from the post was a reference to the Gemini 2.5 Pro paper:
The Gemini 2.5 Pro paper specifically emphasized that “Since the initial announcement of Gemini 1.5, significant advancements have been made in our post-training methodologies, driven by a consistent focus on data quality across the Supervised Fine-Tuning (SFT), Reward Modeling (RM), and Reinforcement Learning (RL) stages.”
After reading this guide, you’ll have a great reference point to many of the techniques which go into creating a world-class model.
If you need to fine-tune an LLM (or SML - Small Language Model) or VLM, Hugging Face’s TRL (Transformers Reinforcement Learning) is one of your best options.
And with a recent update, TRL now natively supports vision as well as language fine-tuning.
Support includes the following RL (Reinforcement Learning) methods:
TRL also supports Supervised Fine-Tuning (SFT).
Here’s a short example of fine-tuning a VLM with TRL:
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
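# Fine-tune a small open vision-language model on an image + instruction dataset with supervised fine-tuning (SFT)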
trainer = SFTTrainer(
model="Qwen/Qwen2.5-VL-3B-Instruct",
args=SFTConfig(max_length=None), # To avoid truncation that may remove image tokens during training
train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()
Now if you have your own data, you can easily customize an open-source model to be even more aligned with your own unique problem space.
You can then deploy this model at your own discretion without having to worry about third-party APIs.
See the example sft_vlm.py script for more.
Bonus: Speaking of fine-tuning, here’s a Tweet from Nathan Lambert, author of The RLHF Book and researcher at Ai2, on using RL-based fine-tuning versus SFT. In my experience, small models fine-tuned on specific data can equal or exceed the performance of much, much larger models (which haven’t been fine-tuned).
Rule of thumb: If you have a model under 15B parameters, try supervised fine-tuning. If you’ve got a big model, go RL. Of course, actual results may vary. Best to experiment, experiment, experiment! Source: Nathan Lambert on X.
Do you build an AI agent and then a cloud?
Or do you build a cloud and then an agent?
OpenAI has an agent but no cloud (yet).
Google has both.
A few startups are in between, like Bolt and Lovable as users of agents.
And Netlify and Supabase as providers of cloud services.
Anthropic is trying to go vertical with their Claude model as well as all the tools to go alongside it.
As powerful as models get, you still need a place to store data (a database) and a place to run application code (web services).
Agents (yellow) vs Clouds (green) and everything in between. Source: Ethan Ding blog.
Speaking of agents and clouds, Modal is a fast compute provider that can scale from 1 GPU to many in seconds.
And they just released support for Jupyter Notebooks hosted on Modal.
That means you can start up a Jupyter Notebook instance with multiple GPUs on the backend in a few seconds and have it automatically power down when you’re not using it (the default is 30 minutes of idle = shutdown).
I was able to go from zero to running Qwen3-4B on an H100 GPU in ~2 minutes.
This is really helpful to experiment with a smaller GPU locally or even on Google Colab and then easily upgrade the hardware using Modal when necessary.
Modal bills per second so you only get charged when the notebook is running.
Example interface of using Modal notebooks with an H100 on the backend. Source: Modal documentation + my own use case.
A cool read from the Finegrain team (they make high-quality image editing tools) on how sometimes simple is best when it comes to improving model quality.
They share how they use the Hugging Face ecosystem to create:
Doing this means they can test a new model and use the discovered issues to improve it over time.
This is a similar workflow to what I’ve recently been using for an object detection project with a client: train a model, upload to Hugging Face Spaces, try it out, improve it with better samples in the next training run, upload, test, repeat.
An experiment loop is all you need.
If you’ve got a specific use case with plenty of data, chances are, fine-tuning a smaller model will save you on time and money.
Another workflow is to use the best model you can via API, then fine-tune a smaller model (or the exact same model, if it allows) to repeat the same task.
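As a rough sketch of that workflow (the function names here are placeholders, not any particular provider's API):
# Hypothetical distillation-style loop: label your data with a large API model,
# then fine-tune a smaller model on the results (e.g. with TRL's SFTTrainer above).
def build_finetuning_set(raw_inputs, call_big_model):
    examples = []
    for text in raw_inputs:
        label = call_big_model(f"Extract the key information from: {text}")
        examples.append({"prompt": text, "completion": label})
    return examples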
They tested Gemini 2.0 Flash, Gemini 2.0 Flash Lite, Qwen3-8B, GPT-4.1 nano and GPT-4.1 mini (both zero-shot and fine-tuned versions).
In many zero-shot settings, GPT-4.1 was the best performing.
However, after fine-tuning, almost all of the models performed on par with or better than GPT-4.1 on several benchmarks, resulting in cost savings and faster inference times.
Every model tested achieved significant cost savings after fine-tuning compared to the original GPT-4.1 model. Source: TensorZero blog.
If you want to learn more about GPU programming, read everything Horace He has published.
See previous works such as gpt-fast and Making Deep Learning Go Brrrr From First Principles.
And his recent post, Defeating Nondeterminism in LLM Inference, in collaboration with Thinking Machines is sensational.
If you’ve used an LLM, you might’ve found that given the exact same input, it produces a different output.
Even when running locally and setting temperature to 0, this can still happen.
This can be helpful sometimes but when you’re trying to run repeatable tests, determinism is your friend.
It turns out, this isn’t a trivial problem to solve.
Until…
From the write up:
In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.
When you ping an LLM endpoint, chances are your inputs are in the queue with someone else.
So while your requests might be deterministic, you can’t guarantee all of the others (along with yours) are.
Source: Thinking Machines blog.
How do you fix it?
You make three operations batch-invariant (so it doesn’t matter what else is in the batch): RMSNorm, matrix multiplication and attention.
See the rest of the blog post for how this works in practice, as well as example code on GitHub for how to make it work.
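If you want to see the underlying floating-point effect for yourself, here's a minimal sketch (whether the values actually differ depends on your hardware and the kernels chosen, so results may vary):
import torch

torch.manual_seed(0)
W = torch.randn(1024, 1024)
x = torch.randn(1, 1024)

# The same row, processed alone vs inside a larger batch. A different batch size can pick
# a different kernel/reduction order, so results may differ by a few bits of precision.
alone = x @ W
in_batch = (x.repeat(8, 1) @ W)[0:1]

print(torch.equal(alone, in_batch))    # may be False, especially on a GPU
print((alone - in_batch).abs().max())  # tiny (but potentially nonzero) difference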
Tokenizers turn sequences of data such as text, audio or images into numbers so machine learning models can find patterns in them.
They are the first step and second last step in most interactions with LLMs.
For example, “Hello my name is Daniel” might get turned into [hello, my, name, Dan, iel]
(I made this example up) and then would get mapped to numbers such as [43, 5556, 339, 46, 119]
(again, made up).
It turns out that some tokenizers are cased (they take into account capitals, “Dan” is different to “dan”) and some are uncased (“Dan” is the same as “dan”).
Depending on how you input text into your tokenizer, this can influence performance.
The good news is that Stephen shows ways to mitigate this.
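For example, here's a quick way to see the cased vs. uncased difference with Hugging Face tokenizers (BERT checkpoints used purely for illustration):
from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained("bert-base-cased")
uncased = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello my name is Daniel"

# The cased tokenizer keeps capital letters, the uncased one lowercases everything first,
# so the same text can map to different tokens (and therefore different token IDs).
print(cased.tokenize(text))    # e.g. tokens that keep "Hello" and "Daniel" capitalised
print(uncased.tokenize(text))  # e.g. everything lowercased before splitting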
In last month’s AI & Machine Learning Monthly (August 2025), we covered prompt engineering vs. context engineering.
The short version is:
Anthropic’s guide nails down some of these terms.
I especially like the question it asks right at the start:
What configuration of context is most likely to generate our model’s desired behaviour?
And then narrowing down on what good context engineering is (bold is mine):
Good context engineering means finding the smallest possible set of high-signal tokens that maximise the likelihood of some desired outcome. Implementing this practice is much easier said than done.
Some practical tips:
Providing examples is a well-known best practice and something we continue to advise.
And finally, a simple definition for agents:
LLMs autonomously using tools in a loop.
Prompt engineering usually involves a simple input and output workflow, whereas context engineering can be thought of as a system which combines 1 to N steps and tools. Source: Anthropic blog.
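To make that definition concrete, here's a minimal, hypothetical sketch of an agent loop (the llm and tools objects are placeholders, not Anthropic's API):
# A minimal hypothetical agent loop: an LLM autonomously using tools until it's done.
def run_agent(task, llm, tools, max_steps=10):
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = llm(context)                  # the model decides: final answer or tool call
        if response.tool_call is None:
            return response.text                 # no tool needed, we're done
        result = tools[response.tool_call.name](**response.tool_call.args)
        context.append({"role": "tool", "content": str(result)})  # feed the tool output back in
    return "Stopped after max_steps without a final answer."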
The Qwen team are well in front for the unofficial “Open Source AI team of the year” trophy 🏆.
It seems almost every month they’re releasing frontier models with open-source licenses across almost every domain.
This month is no different:
Hooooweeee! If they can keep this momentum up, it’s going to be a big finish to the year.
Example of using Qwen3-VL for structured data extraction. The model is capable of simultaneous localization (detecting boxes, on the left), as well as detecting text values. I then used the same model to turn the recognized text into markdown (image on the right). Looks like it got it all correct! Source: Author created.
Embedding models are designed to turn text into learned representations which can be compared and used for tasks such as retrieval.
For example “largest countries in the world” might get turned into [0.234, 0.934, 0.004…].
EmbeddingGemma does this with sensational performance at a very manageable size.
Some quick stats:
Note: You should be aware of the prompt instructions. Each task has a specific prompt instruction, for example, to embed a query of “largest countries in the world”, you should prefix “task: search result | query: {content}”, for example, “task: search result | query: largest countries in the world”. See the EmbeddingGemma model page for more.
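Here's a minimal sketch with Sentence Transformers (the model ID and document formatting are assumptions based on the model page, so double-check them there):
from sentence_transformers import SentenceTransformer

# Assumed checkpoint name; confirm the exact ID on the EmbeddingGemma model page.
model = SentenceTransformer("google/embeddinggemma-300m")

query = "task: search result | query: largest countries in the world"
documents = [
    "Russia is the largest country in the world by land area.",
    "The recipe calls for two cups of flour.",
]  # documents use their own prefix format too, see the model page

query_embedding = model.encode(query)
document_embeddings = model.encode(documents)
print(model.similarity(query_embedding, document_embeddings))  # higher score = more relevant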
If you need to build a RAG (Retrieval Augmented Generation) pipeline, you should check out EmbeddingGemma and see how it performs.
See the Google Developer breakdown or read the research paper for more.
MobileCLIP2 is a model designed to match images and text.
CLIP stands for “Contrastive Language-Image Pretraining”, so images and texts get encoded into the same embedding space.
The MobileCLIP2 models have been designed to run on mobile devices such as iPhones (hence “mobile” in their name).
This means they’re lightweight and run fassssssssst: they’re capable of running in between 3ms and 30ms on an iPhone 12 Pro.
It’s highly likely that the MobileCLIP2 models are what power the search feature on Apple’s Photos app, allowing you to search for things like “Georgia standing in front of a bookshelf”.
Get the MobileCLIP2 models on Hugging Face, try out a MobileCLIP2 demo on your own images and read the research paper for more.
Since CLIP-like models are trained to match texts and images, the better your text matches an image (and vice versa), the higher score it will get. And since MobileCLIP2 has seen many different kinds of images and texts, it can even match its own results graph quite well. Notice the more descriptive texts get a higher score. Source: MobileCLIP2 demo.
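If you want to experiment with the idea, here's a sketch using a generic CLIP checkpoint via transformers as a stand-in (MobileCLIP2 itself has its own loading instructions on its model page):
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP checkpoint as a stand-in for the same text-image matching idea.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["two cats sleeping on a couch", "a dog running on the beach"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))  # higher probability for the better-matching caption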
One of the most underrated skills in the world of AI is being able to curate a dataset.
Two things seem to be true: scale is important (more samples are generally better) and task-specific samples are important.
That’s reflected in the new FineVision dataset from Hugging Face.
Some quick details:
The researchers describe their efforts as follows:
FineVision was a giant act of data curation. We started by collecting publicly available datasets, and augmenting underrepresented categories. We then evaluated all datasets for duplicated data internally and benchmark contamination. This data is then cleaned and rated, before being added to the final mixture.
It turns out it was worth it: a nanoVLM model trained on the FineVision dataset outperforms nanoVLM models trained on other similar but smaller datasets:
Rankings of different nanoVLM instances trained on various image + text datasets. FineVision-trained models start out slower but end up being the best ranked overall with longer training. Source: FineVision blog post.
Finally, another interesting insight based on whether it’s important to only train on “high quality” samples (samples with an average rating of X or above) or to simply train on everything:
Simply training on the most diverse data, that one containing all samples, outperforms in benchmarks (Fig. 6) (Fig. 7). This could mean multiple things. Firstly, we can see almost the same distribution in the ranks across all filters: from best to worst with an increase in the rating threshold. For example the visual dependency and the image correspondence rating both result in exactly the same distribution of rankings, corresponding to the natural order of options, 1 through 5. This could indicate that with a sufficiently large dataset that you train on long enough, it hurts more to remove samples, even if they were judged to be of low quality, than to train on them.
It seems the combination of both scale and high quality samples is what results in the best performing model.
Recently, LLMs have favoured decoder-only models.
But encoder models are still incredibly useful.
When building the Ettin series of models, the researchers found:
The results show clear patterns:
Encoders dominate classification and retrieval: On MNLI classification, even a 150M encoder (89.2) outperforms a 400M decoder (88.2). For retrieval tasks, the gap is smaller but still noticeable - especially when decoders are not trained with MNTP.
Decoders excel at generation: On generative tasks, decoders maintain consistent advantages, with the performance gap actually widening at larger model sizes.
Size doesn't always matter: A 400M encoder beats a 1B decoder on classification tasks, while a 400M decoder beats a 1B encoder on generation tasks.
The good news is, for every encoder in the Ettin series, there’s a matching decoder.
The only difference is the training objective.
Encoder models use bidirectional attention, allowing each token to “see” all other tokens in the sequence.
Decoder models use causal attention, where only previous tokens are visible to enable autoregressive generation.
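As a rough illustration of the difference (a minimal sketch, not the Ettin training code):
import torch

seq_len = 5

# Encoder-style (bidirectional) attention: every token can attend to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder-style (causal) attention: token i can only attend to tokens 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(bidirectional_mask.int())
print(causal_mask.int())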
The Ettin models come in the following sizes: 17M parameters, 32M, 68M, 150M, 400M and 1B.
A perfect collection of sizes to run on smaller devices without compromising on performance as each performs on par or better than an equivalent model of its size.
See the Hugging Face collection, research paper and GitHub for more.
Ranging from 17M-68M parameters (that’s million rather than billion), the TinyLettuce models punch well above their weight, performing on par or better than much larger models such as gpt-oss-120b and Qwen3-235B on a hallucination detection task.
Given a context, question and answer, the models predict at a token level which tokens in the answer might be hallucinated.
They achieve these results by creating task-specific synthetic data and then fine-tuning a set of base Ettin encoders (see above) on it.
The models are even small enough to run on CPU.
If you’ve got a large scale repeatable specific task, I’d highly recommend checking out the TinyLettuce blog post for ideas you can reproduce in your own workflow.
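As a rough sketch of how calling such an encoder might look (the model ID and label names below are placeholders, see the TinyLettuce blog post for the real checkpoints):
from transformers import pipeline

# Placeholder model ID, swap in an actual TinyLettuce checkpoint from the blog post.
detector = pipeline("token-classification", model="your-org/tinylettuce-hallucination-detector")

text = (
    "Context: The Eiffel Tower is 330 metres tall.\n"
    "Question: How tall is the Eiffel Tower?\n"
    "Answer: The Eiffel Tower is 450 metres tall."
)

# Each prediction flags a token as supported or hallucinated (label names vary by model).
for prediction in detector(text):
    print(prediction["word"], prediction["entity"], round(prediction["score"], 3))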
granite-docling-258M converts documents into a structured, docling-style format, outputting elements such as <chart>, <formula> and <code>, as well as their locations in <loc> tags. Trained using the nanoVLM framework and available under Apache 2.0, you can try the demo on your own documents.
Example of the granite-docling-258m model extracting docling-style format from an input. The model automatically creates bounding boxes as well as extracts the target items. Source: granite-docling-258m demo.
GLM-4.6 is an MIT-licensed model with a longer context window (128k → 200k) and incredible coding performance (on par with Claude Sonnet 4 and 4.5).
A sensational open-source alternative to other code-focused models.
Magistral is Mistral’s open-source reasoning model with 24B parameters.
The September 2025 update equips the model with:
An excellent example of how post-training (see the Post Training 101 article above) can influence a model!
Gemini Robotics 1.5’s visual capabilities can be used for spatial reasoning in images. The model has been trained to point at generic objects for robotics use cases, however, I also see potential here for data annotation or general item interaction in the real world. Source: Author created from Gemini Robotics 1.5 blog and custom image.
trackio is an open-source experiment tracking library which is a drop-in replacement for Weights & Biases. I tried it recently on a client project to stick with open-source libraries and it worked well. It does experiment tracking very well, however, I’d say it’s not as full-featured as a service such as Weights & Biases (but that makes sense since it was just released… and it’s free!). If you need a lightweight experiment tracker for your ML experiments, give it a try. Get the code on GitHub.
Two somewhat related papers this month on the topic of time series foundation models and video foundation models being few-shot learners (similar to the recent trend of LLMs being few-shot learners for text):
What a massive month for the ML world in September!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.