[January 2026] AI & Machine Learning Monthly Newsletter 🤖

Daniel Bourke
Want to become an AI/ML Engineer?

Our AI/ML Career path takes you from complete beginner (at any age!) to getting hired as a Machine Learning and/or AI Engineer 👇

Get The Full Career Path

73rd issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey everyone!

Daniel here, I’m a machine learning engineer who teaches beginner-friendly machine learning courses with Zero To Mastery.

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Here's what you might have missed in January 2026 as an A.I. & Machine Learning Engineer... let's get you caught up!

My work

Hello everyone! What an excellent first month of the year for the world of ML and AI.

I’ll start off by sharing some of my recent work and then get into some of my favourite resources from the past month.

In January, I released three new YouTube videos, all focused on using Hugging Face Transformers for different tasks with open-source models:

  1. Learn to fine-tune an LLM (Gemma-3-270M) — A step-by-step guide to full fine-tuning an LLM using Hugging Face tools, see code.
  2. Learn to fine-tune a VLM (SmolVLM2) — Fine-tuning a Vision Language Model (VLM) for custom image understanding tasks, see code.
  3. Build a multimodal RAG system with NVIDIA Nemotron VL embedding models — Combine text and image retrieval for more powerful RAG applications, see code.
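
If you want a feel for what the LLM fine-tuning workflow looks like in code, here is a minimal sketch using Hugging Face's trl library. The dataset, hyperparameters and output paths are illustrative placeholders rather than the exact setup from the videos (the linked code notebooks have the real thing).

```python
# Minimal sketch: full fine-tuning a small LLM with Hugging Face TRL.
# Dataset, model id and hyperparameters are placeholders, not the video's exact setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any chat-style dataset with a "messages" column works here.
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="gemma-3-270m-finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="google/gemma-3-270m-it",  # small enough to full fine-tune on a single GPU
    train_dataset=dataset,
    args=training_args,
)

trainer.train()
trainer.save_model()
```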

From the Internet

  • Anthropic releases a study on how AI coding assistance comes with pros and cons. In a controlled study, developers using AI assistance scored 17% lower on comprehension tests than those who coded manually, equivalent to nearly two letter grades (e.g. A grade comprehension → C grade comprehension). The researchers identified six distinct interaction patterns, finding that AI Delegation (complete reliance on code generation) led to the worst outcomes while Conceptual Inquiry (using AI as a teacher rather than a coder) preserved learning. The key insight: how you use AI determines whether you learn or lose skills.


Anthropic's study design for measuring how AI tools influence developer skill, knowledge and speed.


Example input/output pairs for a training dataset to customize Qwen-Image-Edit for turning real-life images into an isometric style. Source: Andy Coenen blog.

  • Case Study: GPT-OSS-20B vs BERT on Consumer Hardware for Multi-Label Text Classification by Ben Toussaint. Ben puts his RTX 4090 GPU to the test to see if a fine-tuned classification model can beat GPT-OSS-20B. Not only did he find that encoder models such as mDeBERTa-v3-base can perform on par with or better than a fine-tuned GPT-OSS-20B, he was also able to train the encoder model in 90 seconds and achieve 174x faster inference (235 samples per second for the mDeBERTa-v3 model vs 1.35 samples per second for the GPT-OSS-20B model). If you’d like to see what it’s like training your own text classification model, see the text classification tutorial on learnhuggingface.com (a minimal sketch also follows this list).
  • The rise of tabular foundation models. A thought-provoking piece on how foundation models are starting to come to tabular data, potentially changing how we approach traditional ML problems. The article explores what this might mean for practitioners who've relied on XGBoost and random forests. One cool feature that tabular foundation models seem to inherit (similar to LLMs) is the ability to do in-context learning (e.g. feed the model a couple of samples of your target data and it can adjust itself to your task). See the second sketch after this list for what this looks like in code.
  • Reminder: Data is the foundation of Language Models by Cameron Wolfe. I revisited this article after starting to fine-tune LLMs and VLMs of my own. It’s a deep-dive into how training data shapes LLM capabilities. Notably, it references the LIMA paper, where approximately 1,000 aligned, high-quality samples were enough to steer an LLM into producing ideal outputs. Because LLMs have already seen so much data, starting with just 100-1,000 quality samples when fine-tuning your own models can get excellent results.
  • VLM from scratch series. A hands-on, beginner-friendly introduction to Vision Language Models. Note: The content appears to be authored with Claude assistance, so watch for rough edges, but from the first few minutes of reading the code, it looks like a good, accessible interactive format for learning. See the first notebook to get started.
  • Blog post: Use agents or be left behind by Tim Dettmers. Tim Dettmers (who helped create QLoRA and bitsandbytes, and wrote one of the best guides to buying GPUs for deep learning) writes about how AI agents have helped his writing process and also helped him try new experiments he might not have otherwise done. Drawing from his background in factory automation, he introduces process optimization concepts to evaluate when agents actually help. I especially liked the parts where he mentions workflows where AI agents did not help him, the main one being email.
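
To make the encoder fine-tuning approach from Ben's case study more concrete, here is a minimal sketch of multi-label text classification with mDeBERTa-v3 and Hugging Face Transformers. The label names and example text are made up, and the model would still need fine-tuning (e.g. with the Trainer API) before the predictions mean anything.

```python
# Minimal sketch: multi-label text classification with an encoder model.
# Labels and example text are hypothetical placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/mdeberta-v3-base"
labels = ["billing", "technical_issue", "feature_request"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    problem_type="multi_label_classification",  # sigmoid + BCE loss, one score per label
)

# After fine-tuning (e.g. with the transformers Trainer), inference looks like this:
text = "The app crashes when I try to update my payment details."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)[0]  # independent probability per label
print([label for label, p in zip(labels, probs) if p > 0.5])
```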
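
And to show what in-context learning on tabular data looks like in practice, here is a short sketch using the open-source TabPFN package as one example of a tabular foundation model (the article may cover different models, so treat this as illustrative).

```python
# Minimal sketch: in-context learning with a tabular foundation model (TabPFN).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # pip install tabpfn

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No gradient updates happen here: fit() stores the labelled rows as "context"
# and the pretrained transformer conditions on them at prediction time.
clf = TabPFNClassifier()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.3f}")
```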

Hugging Face Updates

From the Internet (continued)


An Agent Skill is a markdown file which gives an Agent a set of instructions, docs, executable code and more as a potential tool to use when exploring a problem space.
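
To make the idea concrete, here is a hypothetical sketch of how a skill's markdown instructions might be loaded and handed to a model. The file layout, field names and prompt wiring are assumptions for illustration, not a specific framework's API.

```python
# Hypothetical sketch: loading Agent Skill markdown files into a system prompt.
# Folder names and the prompt format are illustrative assumptions.
from pathlib import Path

def load_skill(skill_dir: str) -> str:
    """Read a skill's markdown instructions so they can be added to a prompt."""
    return Path(skill_dir, "SKILL.md").read_text()

def build_system_prompt(base_prompt: str, skill_dirs: list[str]) -> str:
    # Append each skill's instructions so the agent can decide when to apply them.
    skills = "\n\n".join(load_skill(d) for d in skill_dirs)
    return f"{base_prompt}\n\nAvailable skills:\n\n{skills}"

system_prompt = build_system_prompt(
    "You are a helpful coding agent.",
    ["skills/pdf-extraction"],  # hypothetical skill folder containing SKILL.md
)
print(system_prompt)
```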

Daniel's Open-source AI of the Month

OCR models continue to get better and better

The OCR space continues to heat up with several powerful new releases:

  • LightOnOCR-2, an efficient 1B VLM for OCR — Continues to push the boundaries of efficient OCR with impressive speed benchmarks, achieving 5.71 pages per second on a single NVIDIA H100 80GB GPU. It gets the best results on the OlmOCR-Bench benchmark, even outperforming the Mistral OCR 3 API. In the paper they state that the architecture is a Mistral Small 3.1 vision encoder (400M parameters) + Qwen3 language model (600M parameters), which they then fine-tune on a custom dataset for text extraction. Cool idea: can you take just the vision encoder from a larger model and use it as the base for a smaller model? For example, take the vision encoder from Kimi-K2.5 (a very large model) and use it for downstream tasks? A minimal inference sketch for open OCR VLMs follows this list.


Demo of LightOnOCR-2 running on the Attention Is All You Need paper.

  • PaddleOCR-VL-1.5 — An updated version of the popular PaddleOCR series with improved vision-language capabilities and support for 109 languages. It also now supports cross-page table merging as well as cross-page paragraph heading recognition, which helps with long-document parsing. It achieves 94.5% on OmniDocBench v1.5 with just 0.9B parameters.
  • DeepSeek OCR 2 introduces DeepEncoder V2, which fundamentally changes how AI "sees" documents. Instead of the traditional raster scanning approach, it uses "Visual Causal Flow" to read documents in logical order just like humans do. The 3B parameter model achieves 91.09% on OmniDocBench v1.5, with notably improved handling of complex layouts, tables, and multi-column documents. The breakthrough: replacing CLIP with Qwen2-0.5B as the vision encoder enables semantic reasoning about reading order.
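
If you want to try one of these OCR models locally, here is a hedged sketch using the Hugging Face Transformers image-text-to-text pipeline. The model id and image URL are placeholders; swap in the actual checkpoint from the model card (prompt formats differ between OCR models, so follow the card's example).

```python
# Hedged sketch: running an open-source OCR VLM with the transformers pipeline.
# "your-org/your-ocr-vlm" and the image URL are placeholders to swap out.
from transformers import pipeline

ocr = pipeline("image-text-to-text", model="your-org/your-ocr-vlm")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/page_1.png"},
            {"type": "text", "text": "Transcribe this page to markdown."},
        ],
    }
]

result = ocr(text=messages, max_new_tokens=1024)
print(result[0]["generated_text"])
```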

Open-source VLMs and LLMs

  • Kimi K2.5 — Moonshot AI releases their most powerful open-source model yet: a 1 trillion parameter MoE model with 32 billion active parameters. Built through continual pretraining on approximately 15 trillion mixed visual and text tokens, K2.5 excels at visual coding (generating code from UI designs and video workflows) and introduces "Agent Swarm"—the ability to self-direct up to 100 AI sub-agents working in parallel. The model claims to outperform GPT-5.2, Claude 4.5 and Gemini 3 Pro on several benchmarks (the results are mixed in terms of which model is best for a certain benchmark but Kimi K2.5 is no slouch on any of them). Blog post.


Kimi K2.5 performance on various benchmarks compared to frontier models such as Gemini 3 Pro, Claude 4.5 Opus and OpenAI’s GPT 5.2.

  • SimpleSeg for VLM-based segmentation — A straightforward approach to adding segmentation capabilities to VLMs. I really enjoy their simple data annotation pipeline: go from image → box labels → segmentation labels → turn into outline → train the VLM to reproduce the outline (a sketch of the mask → outline step appears just below).

The SimpleSeg data annotation pipeline.


Example of Youtu-VL detecting coordinates in a natural image with text-based prompting. Youtu-VL is able to output coordinates for bounding boxes which can be plotted on an image.
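
As a rough illustration of the mask → outline step in a SimpleSeg-style pipeline (my own sketch with OpenCV, not the project's actual code), the idea is to turn a binary segmentation mask into a short polygon of (x, y) points that a VLM can be trained to output as text.

```python
# Sketch: convert a binary segmentation mask into a simplified polygon outline.
import cv2
import numpy as np

def mask_to_outline(mask: np.ndarray, max_points: int = 32) -> list[tuple[int, int]]:
    """Turn a binary mask (H, W) into a short list of outline points."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    largest = max(contours, key=cv2.contourArea)
    # Simplify the contour so the outline is short enough to serialise as text.
    epsilon = 0.01 * cv2.arcLength(largest, closed=True)
    approx = cv2.approxPolyDP(largest, epsilon, closed=True).reshape(-1, 2)
    return [tuple(int(v) for v in point) for point in approx[:max_points]]

# Toy example: a filled rectangle standing in for a segmentation mask.
mask = np.zeros((100, 100), dtype=np.uint8)
mask[20:80, 30:70] = 1
print(mask_to_outline(mask))  # roughly the four corners of the rectangle
```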

  • LlamaBarn — A macOS menu bar app for running local LLMs with llama.cpp. Great for quick access to AI assistance without leaving your workflow.
  • RADIOv4 — NVIDIA combines SigLIP2, DINOv3, and SAM3 into the same feature space. Models available under a commercially friendly license.
  • Arcee AI releases open-weight large models — Another entrant in the open-weight space. See the podcast discussion.
  • GLM-4.7-Flash (30B total parameters, 3B active parameters) — Z.ai's latest efficient model release is the best in class for the 30B MoE (Mixture of Experts) space.
  • iFlyBot-VLM for spatial reasoning — A VLM specifically tuned for spatial understanding tasks, able to ground itself on 2D and 3D detections.
  • Flux.2 Klein 4B and 9B — Black Forest Labs releases smaller, faster image generation models. The smaller 4B model is available under the Apache 2.0 license. See the blog post for more.
  • Google releases a series of translation models with TranslateGemma — Ranging from 4B parameters to 12B and 27B parameters and fine-tuned for translation tasks across 55 different languages. The TranslateGemma models maintain the multimodal capabilities of the original Gemma 3 models. Blog post.

Speech and Audio

  • NVIDIA Magpie for Speech Generation — A 357M parameter multilingual TTS model.
  • Qwen3-TTS model series — The Qwen team enters the text-to-speech space with 600M parameter and 1.7B parameter models, each capable of creating custom voices, cloning voices and producing custom speakers.
  • VibeVoice ASR — Microsoft's ASR model supporting 60-minute speech transcription with diarization, timestamping, and transcription in one pass.

Embeddings and Retrieval

Medical and Specialized Models

  • MedGemma 1.5 4B — Google's next-generation medical text and image interpretation model, now with medical speech-to-text via MedASR (a speech to text model focused on the medical domain). There's a Kaggle competition to explore medical applications with the new MedGemma-1.5 models as well.

Computer Vision

  • RF-DETR and Segmentation upgrades — Roboflow extends their excellent RF-DETR detection models to segmentation, with new checkpoints ranging from Nano to 2XLarge in size. The models outperform YOLO26 models on the COCO benchmark at similar inference speeds.
  • Zcore — Find the most efficient and useful subset of a data corpus to train on. Zero-shot scoring for large-scale image datasets based on how much each sample will add to the training process.
  • YOLO26 — The latest iteration of YOLO improves detection throughput with a new NMS-free inference setup.

Papers

Vision encoder. All Ministral 3 models use a 410M parameter ViT as a vision encoder for image understanding that is copied from Mistral Small 3.1 Base and kept frozen, with the same architecture described in Pixtral [Agrawal et al., 2024]. We discard the pretrained projection layer from the ViT to language model’s space and train a new projection for every model.

  • Google Introduces GIST, the next stage in smart sampling. GIST (Greedy Independent Set Thresholding) is a novel algorithm that provides provable guarantees for selecting high-quality data subsets that maximize both diversity and utility. When training on massive datasets, you need a representative subset that's not redundant; GIST solves this by balancing the diversity-utility tradeoff with mathematical guarantees. Especially useful for pre-training workflows at scale.
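
To make the diversity-utility tradeoff concrete, here is a toy sketch in the spirit of greedy, diversity-aware subset selection. It is not the GIST algorithm from the paper, just an illustration of the general idea: pick high-utility samples while skipping anything too close to what you have already selected.

```python
# Toy sketch of greedy, diversity-aware subset selection (not the exact GIST algorithm).
import numpy as np

def greedy_diverse_subset(embeddings, utilities, k, min_dist):
    """Select up to k indices greedily by utility, subject to a minimum pairwise distance."""
    order = np.argsort(utilities)[::-1]  # highest utility first
    selected = []
    for idx in order:
        if len(selected) == k:
            break
        if all(np.linalg.norm(embeddings[idx] - embeddings[j]) >= min_dist for j in selected):
            selected.append(idx)
    return selected

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))  # stand-in for sample embeddings
utilities = rng.random(1000)              # stand-in for per-sample utility scores
subset = greedy_diverse_subset(embeddings, utilities, k=100, min_dist=9.0)
print(f"Selected {len(subset)} samples")
```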

Releases

  • Google releases Agentic Vision with Gemini 3 Flash. Instead of just parsing images in a single pass, Gemini 3 Flash can now use a Think, Act, Observe loop with code execution. The model can zoom in, inspect, and manipulate images step-by-step to ground answers in visual evidence. This delivers a consistent 5-10% quality boost across vision benchmarks by replacing probabilistic guessing with verifiable, step-by-step inference execution. Developer docs.


Demo of how Gemini 3 Flash uses agentic steps to break down a vision problem.
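
Here is a hedged sketch of what a Think, Act, Observe vision loop can look like in code. The ask_model() function is a stand-in for whatever VLM API you use (it is not the Gemini API), and the action format is an assumption; the point is the control flow of zooming in before committing to an answer.

```python
# Hedged sketch of a Think, Act, Observe vision loop. ask_model() is a placeholder.
from PIL import Image

def ask_model(prompt: str, image: Image.Image) -> dict:
    # Placeholder: replace with a real VLM call that returns either
    # {"action": "zoom", "box": (left, top, right, bottom)} or
    # {"action": "answer", "text": "..."}.
    return {"action": "answer", "text": "stub answer"}

def agentic_vision_answer(question: str, image: Image.Image, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        prompt = f"Question: {question}\nObservations so far: {observations}"  # Think
        reply = ask_model(prompt, image)
        if reply["action"] == "zoom":  # Act: crop to the requested region
            image = image.crop(reply["box"])
            observations.append(f"Zoomed into {reply['box']}")  # Observe
        elif reply["action"] == "answer":
            return reply["text"]
    return "No confident answer after max_steps"

print(agentic_vision_answer("What is the total on this receipt?", Image.new("RGB", (512, 512))))
```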

Videos

  • AI is able to extract books verbatim — A concerning demonstration of how AI models can reproduce copyrighted content from training data. Several flagship models including Gemini, Claude and GPT were able to reproduce Harry Potter (and other copyrighted materials) with upwards of 90% accuracy. So it seems that it’s okay for large companies to syphon the internet for training data but when it comes to using the outputs to train other models it’s “against the terms of service”. A little pot calling the kettle black, no?

See you next month!

What a massive month for the ML world in January!

As always, let me know if there's anything you think should be included in a future post.

Liked something here? Share it with someone.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

You might like these courses

More from Zero To Mastery

The No BS Way To Getting A Machine Learning Job
19 min read

Looking to get hired in Machine Learning? Our ML expert tells you how. If you follow his 5 steps, we guarantee you'll land a Machine Learning job. No BS.

6-Step Framework To Tackle Machine Learning Projects (Full Pipeline)
30 min read

Want to apply Machine Learning to your business problems but not sure if it will work or where to start? This 6-step guide makes it easy to get started today.

How to Convince Your Boss to Pay for Your Upskilling
10 min read

Get your company to pay for your tech upskilling. Use this training request email and strategy to make it happen.