[October 2025] AI & Machine Learning Monthly Newsletter 💻🤖

70th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in October 2025 as an A.I. & Machine Learning Engineer... let's get you caught up!

My work

My next course is ~~going~~ NOW live!!! I’m very excited for this release, as Hugging Face is the modern homepage for AI and this course combines the best of Machine Learning and Hugging Face. In my own AI work, I use the platform every day. And the course is designed to teach you how to do the same in a project-focused manner.
A note on AI/ML Monthly for November 2025 (next month): I’m going to be skipping next month’s AI/ML monthly issue as I’m getting married on November 25th 💍. So despite how much I love reading and writing about AI/ML, I’m going to spend a few weeks hanging out with my beautiful new wife and away from a keyboard. I’ll see you all for the final edition of 2025!

From the Internet

A textbook on LLM training: Ever wonder what goes into training a state of the art large language model? It’s just internet text and a Transformer architecture right? Well… turns out there’s quite a bit more. To try and capture all of it from small experiments to model architecture to how to create a dataset to how to maintain a cluster of 384 H100 GPUs, Hugging Face researchers and engineers wrote a 200+ page guide called The Smol Training Playbook:The Secrets to Building World-Class LLMs. Inside they document almost everything that led them to building SmolLM3.
After five years of development, the Hugging Face Hub (huggingface_hub) library hits v1.0 with several upgrades and a handful of breaking changes.
Hugging Face upgrades streaming datasets with 100x fewer requests, 10x faster data resolution, 2x sample/sec, 0 worker crashes at 256 concurrent workers. Streaming datasets enables you to use a dataset without downloading it. These updates were enough to outperform local SSDs when training on 64xH100 GPUs and 256 workers. Get started streaming datasets with a few lines of code:

from datasets import load_dataset

# Stream a dataset instead of downloading it
dataset = load_dataset("HuggingFaceM4/FineVisionMax", 
											 split="train", 
											 streaming=True)

# Get the first example
print(next(iter(dataset)))

Exo share how they connected a NVIDIA DGX Spark and a Mac Studio M3 Ultra to improve LLM latency. The NVIDIA DGX Spark has 4x the compute but the Mac Studio has 3x the memory bandwidth. Compute power helps with TTFT (time-to-first-token), this the delay from sending a prompt to seeing the first response token, this is known as prefill. Memory bandwidth helps with TPS (tokens-per-second) which is the speed at which tokens appear step by step after the first one, this is often referred to as decoding. Combing the NVIDIA DGX Spark’s prefill capacity with the M3 Ultra’s decoding capacity resulted in a 2.8x speedup over a baseline M3 Ultra using Llama-3.1-8b-Instruct.

exo-m3-ultra-plus-dgx-spark-overview

Combing the memory-bandwidth from the M3 Ultra in the Mac Studio as well as the compute power from the NVIDIA DGX Spark results in a best of both worlds scenario for LLM inference. Due to asynchronous data transfers, the transfer time between the two devices is negligible in the final speedup results. Images from Exo blog.

Hafedh Hichri and Ed Daniels break down how VLMs (Vision-Language Models) work with some nice visualizations.

They break a modern VLM (HuggingFaceTB/SmolVLM-256M-Instruct) into five main parts:

Processor (prepares and aligns raw text and image inputs).
Vision module (converts pixel data into high-dimensional patch embeddings).
Connector (compresses and projects visual features into the same embedding space as text tokens).
Input merger (replaces placeholder tokens with visual embeddings to form a unified multimodal sequence).
Decoder (generates context-aware text by attending to both visual and textual information).

visualizing-how-vlms-work

Example of a VLM architecture which splits an image into patches, encodes it with a vision encoder, compresses the input pixels then connects it with language input tokens to a decoder LLM model. Source: Visualizing how VLMs work blog post.

Case study: Piotr Skalski shows how to build an end-to-end system to detect, track, classify and segment basketball players on a court using computer vision. A brilliant insight into how stacking together several open-source models as well as a several specific datasets can make for quite the project. If you wanted to practice your computer vision skills, this is the kind of system I’d recommend trying to replicate for yourself. The project strings together RF-DETR for detection, SAM 2 for segmentation and tracking, SmolVLM2/ResNet for jersey number classification, SigLIP for player embedding capturing and K-means for clustering together player embeddings (to separate them into teams).

basketball-tracking-raw

Basketball player detection, recognition, segmentation and tracking using a combination of open-source computer vision models. Source: Roboflow blog.

Thinking Machines runs a series of experiments to see when LoRA (low-rank adaptation) can match the performance of full model fine-tuning.

LoRA is a form of PEFT (Parameter Efficient Fine-tuning) which is a technique where a smaller number of model parameters are trained to adapt a base model to a specific task rather than training the full model.

thinking-machines-lora-vs-fullft

Results from several different runs of LoRA settings versus the full fine-tuning (orange line). At lower ranks, LoRA underperforms full fine-tuning but as the ranks get higher, results. get closer and closer to full fine-tuning. Source: Thinking Machines blog.

It is generally much more compute efficient than full fine-tuning and as Thinking Machines found, it can equal full fine-tuning under two main conditions:

LoRA is applied to all layers of the network, especially the MLP/MoE layers which house most of the parameters.
LoRA works well when not capacity constrained, i.e., the number of trainable parameters exceeds the amount of information to be learned, which can be estimated in terms of dataset size.

LoRA’s are what’s used in Apple’s Adapters framework, a framework which enables you to create lightweight Adapters to augment the performance of Apple’s on-device Foundation models for a specific task.

Something that stood out to me was the use of the Hugging Face peft library:

In our experiments, we used the standard parametrization used in the Hugging Face peft library.

It’s cool to see how much surface area different Hugging Face libraries are starting to cover.

UPTOHERE: more resources…
- Add the Smol Large Language Model training booklet

Daniel’s Open-source AI of the month

Roboflow releases RF-DETR Seg, a state-of-the-art segmentation model which achieves 170 FPS on a NVIDIA T4 GPU. This is a follow up to the recently released RF-DETR models for detection which are also best-in-class for their size.
ModernVBERT is a 250M parameter vision-language retriever (it’s designed to retrieve similar documents based on visual/text inputs). ?Based on the Ettin archietecture (mentioned in ML Monthly September 2025) now with an added vision component. It performs on par with models 10x the size except it’s much faster. All model checkpoints, datasets and training recipes are available under the MIT license.
Rex-Omni is a VLM (Vision Language Model or MLLM, Multimodal Large Language Model) which turns object detection, object referring, visual prompting, pointing, OCR and more into a next-token prediction problem. Rex-Omni is fine-tuned from Qwen2.5-VL-3B and combines 10 computer vision tasks into one model and is able to perform on par with specialist computer vision models. The modal can handle text-based or visual-based inputs for several kinds of detection tasks. Something of note is the two-stage training pipeline, the first is SFT (Supervised Fine-tuning) to get the model used to predicting next-token like detection coordinates, the second is reinforcement learning via GPRO to get the model to adhere to geometrically correct outputs (e.g. the model is rewarded for predicting the right number of boxes). Try the demo on your own images.

rex-omni-demo-overview

Rex-Omni is able to detect boxes, points, keypoints and many other vision-based outputs due to to its training data and training schedule. You can use natural language inputs such as “bagel” or “coffee” and even vision-based inputs such as existing bounding boxes of target items to detect.

A paradigm shift for OCR

Over the last month we’ve been continued to be blessed with high-quality open-source OCR models.

A few trends have appeared:

Use synthetic data to create artificial documents which allows a model to learn with perfect ground truth.
Build layout detection and recognition into the same pipeline (for example, where do headers go and what is the reading order?)
Take an existing VLM (vision-language model) and fine-tune it directly for OCR-like tasks

All of the following have either all of or at least one of the above qualities:

olmOCR-2 from Allen AI (Ai2) — Performs exceptionally well on olmOCR-Bench. Some notable changes include switching to YAML instead of JSON outputs (YAML turned out to have fewer errors). Comes with open data as well as open training pipeline. See the paper for more.

olmOCR-2-overview

olmOCR-2 uses real documents to extract layout and content and then turns those into HTML renderings for training an OCR model. olmOCR-2 also open-sources model weights, training data, training code, inference code and comes with an open license. Source: olmOCR-2 paper.

LightOnOCR-1B focuses on end-to-end pipelines and speed. The model is capable of handing 5.71 pages per second on a single H100 GPU which is equal to 493,000 pages per day for less than ~$0.01 per 1000 pages.
PaddleOCR-VL-0.9B combines a NaViT (native resolution ViT) vision encoder with a lightweight ERNIE-4.5-0.3B language decoder. The model supports 109 different languages.

paddleocr-vl

Architecture of the PaddleOCR-VL-0.9B model which combines a Vision Encoder (400M parameters), a MLP connector and a LLM decoder model (300M parameters). Source PaddleOCR-VL paper.

Chandra is a 9B parameter OCR model which can output HTML, JSON and markdown. It supports form reconstruction with checkboxes and also supports 40+ languages. On olmOCR-Bench Chandra is on par with olmOCR-2.
Nanonets-OCR2-3B is an OCR model capable of extracting LaTeX formulas, extracting images (with description text) between <img> tags, signature detection, watermark detection, flow charts (as mermaid code) and more. For a very comprehensive extraction or if your documents are image, checkbox and flow chart heavy, you might want to try Nanonets-OCR2.
DeepSeek-OCR is a powerful OCR model but also very efficient. The paper explores using vision as compression. As in, how good of results can you get when you continually lower the vision token usage? The DeepSeek researchers found that even with a compression ratio of 10x (10x lower vision tokens compared to pure text tokens), you can get a 97% recovery rate. And with a compression rate of 20x, the OCR accuracy is still ~60%. Sam Witteeven has a great video breakdown of the paper.

deepseek-ocr-compression

The DeepSeek-OCR paper found you can get ~97% precision with 10x less vision to text tokens. This enables DeepSeek-OCR to get excellent results despite using less tokens than other models. Source: DeepSeek-OCR paper.

With all of these OCR models coming out, naturally, you might ask, which OCR model should you use?

Based on benchmarks alone, each performs quite well on various tasks.

However, as always, if you have a specific task in mind, best to try out these models on your own datasets and see which is best for your use case.

Open-source VLMs

A fleet of open-source VLMs hit the market over the past month. Many of which are starting to include their open-source data.

Apriel-1.5-15B-Thinker is a multimodal reasoning model which has undergone text-only supervised fine-tuning (no reinforcement learning). It performs on par with Gemini 2.5 Flash on the Artificial Analysis Intelligence Index. They start with the Pixtral-12B-Base-2409 model and upscale the model to 15B parameter by depth scaling. See the paper for more.
The Qwen3-VL family expands to 2B, 4B, 8B and 32B Instruct and Thinking variants.
LlaVA-OneVision-1.5 is a collection of open-weight, open data (a combination of ImageNet-21k, LAION-CN, DataComp-1B, COYO700M, SA-1B and more) and open training VLMs. The 8B model is on par or better than Qwen2.5-VL-7B (a very strong open-source VLM). Get the code on GitHub, read the paper, try the demo. The end-to-end training cost is about $16,000 on A100 GPUs at roughly $0.60 per GPU-hour.
Bee-8B-SFT and Bee-8B-RL are VLMs trained on open data (though be sure to check the licenses, as some of it is non-commercial) and are competitive with models such as Qwen2.5-VL-7B.
ByteDance update the Sa2VA (these models combine SAM2 with a VLM for enabling segmentation and object detection with language) series of models with, InternVL3, Qwen2.5-VL and Qwen3-VL backbones.
SANSA allows adapting the SAM2 model to perform few-shot segmentation thanks to an adaptation module (small number of parameters). This enables you to give a reference image with a segmentation mask and then have the same segmentation mask labelled in a new image (e.g. give a single image with 1x mask of a cat and have the cat in the next image labelled with a mask). See the demo notebook to try it on your own images.

Small LLMs are getting better

IBM-Granite release Granite 4.0 (3B, 7B, 32B parameters) and Granite 4.0 Nano language models (1B and 350M parameters). All models are under the Apache 2.0 license. Models in the Nano series perform on the Pareto frontier for their size, outperforming Gemma3-270M (for the Granite 4.0 350M model) as well as Qwen3-1.7B (for the Granite 4.0 1B model). See the blog post for more.

granite-4-nano-language-models

Granite 4.0 Nano models perform the best across aggregated benchmarks for their size. Source: Granite 4.0 Nano blog post.

Facebook release MobileLLM-Pro (non-commercial license) which is a 1B parameter model with INT4 quantization (1.3% performance loss from the base model) which allows it to be run fast on small devices and even CPUs.

A couple of cool things

Open Code is an open-source version of Claude Code allowing you to bring any model or provider right into the terminal or your editor of choice.
handy.computer is a free and open-source speech to text app which can run speech to text in any text field. Under the hood it runs the Whisper models by OpenAI. The code is fully accessible so if you need to extend it for your own preferences, you can.
Emu-3.5 is an open-source image generation/editing model on par with Gemini 2.5 Flash Image (Nano Banana).
OpenAI release gpt-oss-safeguard-120b and gpt-oss-safeguard-20b for safety-focused text classification (e.g. given a policy, they can classify whether an input is safe to use for your system or not). A workflow here could be to use these models to label a corpus of samples and then fine-tune a smaller text classification model like Ettin to repeat the task at scale.
RICE-ViT (Region-Aware Cluster Discrimination) is a vision encoder which encodes region-level information and OCR into the weights of a vision encoder. For example, the model captures object and OCR semantics in the same representation. When used as a vision encoder for VLM model training, the RICE-ViT models perform favourably against other vision encoders such as SigLIP2. See the paper for more.

Research

The SAM 3 (Segment Anything 3) paper gets posted to OpenReview in time for ICLR 2026. It’s titled SAM 3: Segment Anything with Concepts. Now, it’s not 100% confirmed this is Meta’s upgrade to SAM 2 (as the paper is still in review) but it definitely fits the criteria of being the follow up. The new version of SAM will allow segmentation with text-based concepts, e.g. “dog” or “cars” and the model will segment items in the image related to those concepts. Some highlights of the paper for me were the data engine and improvements using humans in the loop (a very common trend among large foundation models). Once the model weights are released (if they are), I’ll be sure to include them in a future ML monthly issue.

sam3-overview

SAM 3 allows using text-based or visual-based input prompts to a model to obtain segmentation masks. In training, the data is automatically labelled and then verified by either AI or human verifiers in a loop fashion. Source: SAM 3 paper.

Talks

Andrei Karpathy went on the Dwarkesh Patel podcast and talked about everything from AGI to self-driving cars to what teaching will look like in future. An excellent conversation from one of the best voices and practitioners in AI.

See you next month!

What a massive month for the ML world in October!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.