[January 2026] AI & Machine Learning Monthly Newsletter 🤖

Daniel Bourke
Want to become an AI/ML Engineer?

Our AI/ML Career path takes you from complete beginner (at any age!) to getting hired as a Machine Learning and/or AI Engineer 👇

Get The Full Career Path

73rd issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey everyone!

Daniel here, I’m a machine learning engineer who teaches beginner-friendly machine learning courses with Zero To Mastery.

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Here's what you might have missed in January 2026 as an A.I. & Machine Learning Engineer... let's get you caught up!

My work

Hello everyone! What an excellent first month of the year for the world of ML and AI.

I’ll start off by sharing some of my recent work and then get into some of my favourite resources from the past month.

In January, I released three new YouTube videos, all focused on using Hugging Face Transformers for different tasks with open-source models:

  1. Learn to fine-tune an LLM (Gemma-3-270M) — A step-by-step guide to full fine-tuning an LLM using Hugging Face tools, see code.
  2. Learn to fine-tune a VLM (SmolVLM2) — Fine-tuning a Vision Language Model (VLM) for custom image understanding tasks, see code.
  3. Build a multimodal RAG system with NVIDIA Nemotron VL embedding models — Combine text and image retrieval for more powerful RAG applications, see code.
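
If you want a feel for what the LLM fine-tuning workflow looks like in code, here is a minimal sketch using Hugging Face's trl library. The dataset, hyperparameters and output paths are illustrative placeholders rather than the exact setup from the videos (the linked code notebooks have the real thing).

```python
# Minimal sketch: full fine-tuning a small LLM with Hugging Face TRL.
# Dataset, model id and hyperparameters are placeholders, not the video's exact setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any chat-style dataset with a "messages" column works here.
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="gemma-3-270m-finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="google/gemma-3-270m-it",  # small enough to full fine-tune on a single GPU
    train_dataset=dataset,
    args=training_args,
)

trainer.train()
trainer.save_model()
```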

From the Internet

  • Anthropic releases a study on how AI coding assistance comes with pros and cons. In a controlled study, developers using AI assistance scored 17% lower on comprehension tests than those who coded manually, equivalent to nearly two letter grades (e.g. A grade comprehension → C grade comprehension). The researchers identified six distinct interaction patterns, finding that AI Delegation (complete reliance on code generation) led to the worst outcomes while Conceptual Inquiry (using AI as a teacher rather than a coder) preserved learning. The key insight: how you use AI determines whether you learn or lose skills.


Anthropic's study design for measuring how AI tools influence developer skill, knowledge and speed.


Example input/output pairs for a training dataset to customize Qwen-Image-Edit for turning real-life images into an isometric style. Source: Andy Coenen blog.

  • Case Study: GPT-OSS-20B vs BERT on Consumer Hardware for Multi-Label Text Classification by Ben Toussaint. Ben puts his RTX 4090 GPU to the test to see if a fine-tuned classification model can beat GPT-OSS-20B. Not only did he find that encoder models such as mDeBERTa-v3-base can perform on par with or better than a fine-tuned GPT-OSS-20B, he was also able to train the encoder model in 90 seconds and achieve 174x faster inference (235 samples per second for the mDeBERTa-v3 model vs 1.35 samples per second for the GPT-OSS-20B model). If you’d like to see what it’s like training your own text classification model, see the text classification tutorial on learnhuggingface.com (a minimal sketch also follows this list).
  • The rise of tabular foundation models. A thought-provoking piece on how foundation models are starting to come to tabular data, potentially changing how we approach traditional ML problems. The article explores what this might mean for practitioners who've relied on XGBoost and random forests. One cool feature that tabular foundation models seem to inherit (similar to LLMs) is the ability to do in-context learning (e.g. feed the model a couple of samples of your target data and it can adjust itself to your task). See the second sketch after this list for what this looks like in code.
  • Reminder: Data is the foundation of Language Models by Cameron Wolfe. I revisited this article after starting to fine-tune LLMs and VLMs of my own. It’s a deep-dive into how training data shapes LLM capabilities. Notably, it references the LIMA paper, where approximately 1,000 aligned, high-quality samples were enough to steer an LLM into producing ideal outputs. Because LLMs have already seen so much data, starting with just 100-1,000 quality samples when fine-tuning your own models can get excellent results.
  • VLM from scratch series. A hands-on, beginner-friendly introduction to Vision Language Models. Note: The content appears to be authored with Claude assistance, so watch for rough edges, but from the first few minutes of reading the code, it looks like a good, accessible interactive format for learning. See the first notebook to get started.
  • Blog post: Use agents or be left behind by Tim Dettmers. Tim Dettmers (who helped create QLoRA and bitsandbytes, and wrote one of the best guides to buying GPUs for deep learning) writes about how AI agents have helped his writing process and also helped him try new experiments he might not have otherwise done. Drawing from his background in factory automation, he introduces process optimization concepts to evaluate when agents actually help. I especially liked the parts where he mentions workflows where AI agents did not help him, the main one being email.
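
To make the encoder fine-tuning approach from Ben's case study more concrete, here is a minimal sketch of multi-label text classification with mDeBERTa-v3 and Hugging Face Transformers. The label names and example text are made up, and the model would still need fine-tuning (e.g. with the Trainer API) before the predictions mean anything.

```python
# Minimal sketch: multi-label text classification with an encoder model.
# Labels and example text are hypothetical placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/mdeberta-v3-base"
labels = ["billing", "technical_issue", "feature_request"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    problem_type="multi_label_classification",  # sigmoid + BCE loss, one score per label
)

# After fine-tuning (e.g. with the transformers Trainer), inference looks like this:
text = "The app crashes when I try to update my payment details."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)[0]  # independent probability per label
print([label for label, p in zip(labels, probs) if p > 0.5])
```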
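
And to show what in-context learning on tabular data looks like in practice, here is a short sketch using the open-source TabPFN package as one example of a tabular foundation model (the article may cover different models, so treat this as illustrative).

```python
# Minimal sketch: in-context learning with a tabular foundation model (TabPFN).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # pip install tabpfn

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No gradient updates happen here: fit() stores the labelled rows as "context"
# and the pretrained transformer conditions on them at prediction time.
clf = TabPFNClassifier()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.3f}")
```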

Hugging Face Updates

From the Internet (continued)


An Agent Skill is a markdown file which gives an Agent a set of instructions, docs, executable code and more as a potential tool to use when exploring a problem space.
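
To make the idea concrete, here is a hypothetical sketch of how a skill's markdown instructions might be loaded and handed to a model. The file layout, field names and prompt wiring are assumptions for illustration, not a specific framework's API.

```python
# Hypothetical sketch: loading Agent Skill markdown files into a system prompt.
# Folder names and the prompt format are illustrative assumptions.
from pathlib import Path

def load_skill(skill_dir: str) -> str:
    """Read a skill's markdown instructions so they can be added to a prompt."""
    return Path(skill_dir, "SKILL.md").read_text()

def build_system_prompt(base_prompt: str, skill_dirs: list[str]) -> str:
    # Append each skill's instructions so the agent can decide when to apply them.
    skills = "\n\n".join(load_skill(d) for d in skill_dirs)
    return f"{base_prompt}\n\nAvailable skills:\n\n{skills}"

system_prompt = build_system_prompt(
    "You are a helpful coding agent.",
    ["skills/pdf-extraction"],  # hypothetical skill folder containing SKILL.md
)
print(system_prompt)
```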

Daniel's Open-source AI of the Month

OCR models continue to get better and better

The OCR space continues to heat up with several powerful new releases:

  • LightOnOCR-2, an efficient 1B VLM for OCR — Continues to push the boundaries of efficient OCR with impressive speed benchmarks, achieving 5.71 pages per second on a single NVIDIA H100 80GB GPU. It gets the best results on the OlmOCR-Bench benchmark, even outperforming the Mistral OCR 3 API. In the paper they state that the architecture is a Mistral Small 3.1 vision encoder (400M parameters) + Qwen3 language model (600M parameters), which they then fine-tune on a custom dataset for text extraction. Cool idea: can you take just the vision encoder from a larger model and use it as the base for a smaller model? For example, take the vision encoder from Kimi-K2.5 (a very large model) and use it for downstream tasks? A minimal inference sketch for open OCR VLMs follows this list.


Demo of LightOnOCR-2 running on the Attention Is All You Need paper.

  • PaddleOCR-VL-1.5 — An updated version of the popular PaddleOCR series with improved vision-language capabilities and support for 109 languages. It also now supports cross-page table merging as well as cross-page paragraph heading recognition, which helps with long-document parsing. It achieves 94.5% on OmniDocBench v1.5 with just 0.9B parameters.
  • DeepSeek OCR 2 introduces DeepEncoder V2, which fundamentally changes how AI "sees" documents. Instead of the traditional raster scanning approach, it uses "Visual Causal Flow" to read documents in logical order just like humans do. The 3B parameter model achieves 91.09% on OmniDocBench v1.5, with notably improved handling of complex layouts, tables, and multi-column documents. The breakthrough: replacing CLIP with Qwen2-0.5B as the vision encoder enables semantic reasoning about reading order.
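
If you want to try one of these OCR models locally, here is a hedged sketch using the Hugging Face Transformers image-text-to-text pipeline. The model id and image URL are placeholders; swap in the actual checkpoint from the model card (prompt formats differ between OCR models, so follow the card's example).

```python
# Hedged sketch: running an open-source OCR VLM with the transformers pipeline.
# "your-org/your-ocr-vlm" and the image URL are placeholders to swap out.
from transformers import pipeline

ocr = pipeline("image-text-to-text", model="your-org/your-ocr-vlm")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/page_1.png"},
            {"type": "text", "text": "Transcribe this page to markdown."},
        ],
    }
]

result = ocr(text=messages, max_new_tokens=1024)
print(result[0]["generated_text"])
```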

Open-source VLMs and LLMs

  • Kimi K2.5 — Moonshot AI releases their most powerful open-source model yet: a 1 trillion parameter MoE model with 32 billion active parameters. Built through continual pretraining on approximately 15 trillion mixed visual and text tokens, K2.5 excels at visual coding (generating code from UI designs and video workflows) and introduces "Agent Swarm"—the ability to self-direct up to 100 AI sub-agents working in parallel. The model claims to outperform GPT-5.2, Claude 4.5 and Gemini 3 Pro on several benchmarks (the results are mixed in terms of which model is best for a certain benchmark but Kimi K2.5 is no slouch on any of them). Blog post.


Kimi K2.5 performance on various benchmarks compared to frontier models such as Gemini 3 Pro, Claude 4.5 Opus and OpenAI’s GPT 5.2.

  • SimpleSeg for VLM-based segmentation — A straightforward approach to adding segmentation capabilities to VLMs. I really enjoy their simple data annotation pipeline: go from image → box labels → segmentation labels → turn into outline → train the VLM to reproduce the outline (a sketch of the mask → outline step appears just below).

The SimpleSeg data annotation pipeline.


Example of Youtu-VL detecting coordinates in a natural image with text-based prompting. Youtu-VL is able to output coordinates for bounding boxes which can be plotted on an image.
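
As a rough illustration of the mask → outline step in a SimpleSeg-style pipeline (my own sketch with OpenCV, not the project's actual code), the idea is to turn a binary segmentation mask into a short polygon of (x, y) points that a VLM can be trained to output as text.

```python
# Sketch: convert a binary segmentation mask into a simplified polygon outline.
import cv2
import numpy as np

def mask_to_outline(mask: np.ndarray, max_points: int = 32) -> list[tuple[int, int]]:
    """Turn a binary mask (H, W) into a short list of outline points."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    largest = max(contours, key=cv2.contourArea)
    # Simplify the contour so the outline is short enough to serialise as text.
    epsilon = 0.01 * cv2.arcLength(largest, closed=True)
    approx = cv2.approxPolyDP(largest, epsilon, closed=True).reshape(-1, 2)
    return [tuple(int(v) for v in point) for point in approx[:max_points]]

# Toy example: a filled rectangle standing in for a segmentation mask.
mask = np.zeros((100, 100), dtype=np.uint8)
mask[20:80, 30:70] = 1
print(mask_to_outline(mask))  # roughly the four corners of the rectangle
```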

  • LlamaBarn — A macOS menu bar app for running local LLMs with llama.cpp. Great for quick access to AI assistance without leaving your workflow.
  • RADIOv4 — NVIDIA combines SigLIP2, DINOv3, and SAM3 into the same feature space. Models available under a commercially friendly license.
  • Arcee AI releases open-weight large models — Another entrant in the open-weight space. See the podcast discussion.
  • GLM-4.7-Flash (30B total parameters, 3B active parameters) — Z.ai's latest efficient model release is the best in class for the 30B MoE (Mixture of Experts) space.
  • iFlyBot-VLM for spatial reasoning — A VLM specifically tuned for spatial understanding tasks, able to ground itself on 2D and 3D detections.
  • Flux.2 Klein 4B and 9B — Black Forest Labs releases smaller, faster image generation models. The smaller 4B model is available under the Apache 2.0 license. See the blog post for more.
  • Google releases a series of translation models with TranslateGemma — Ranging from 4B parameters to 12B and 27B parameters and fine-tuned for translation tasks across 55 different languages. The TranslateGemma models maintain the multimodal capabilities of the original Gemma 3 models. Blog post.

Speech and Audio

  • NVIDIA Magpie for Speech Generation — A 357M parameter multilingual TTS model.
  • Qwen3-TTS model series — The Qwen team enters the text-to-speech space with 600M parameter and 1.7B parameter models, each capable of creating custom voices, cloning voices and producing custom speakers.
  • VibeVoice ASR — Microsoft's ASR model supporting 60-minute speech transcription with diarization, timestamping, and transcription in one pass.

Embeddings and Retrieval

Medical and Specialized Models

  • MedGemma 1.5 4B — Google's next-generation medical text and image interpretation model, now with medical speech-to-text via MedASR (a speech to text model focused on the medical domain). There's a Kaggle competition to explore medical applications with the new MedGemma-1.5 models as well.

Computer Vision

  • RF-DETR and Segmentation upgrades — Roboflow extends their excellent RF-DETR detection models to segmentation, with new checkpoints ranging from Nano to 2XLarge in size. The models outperform YOLO26 models on the COCO benchmark at similar inference speeds.
  • Zcore — Find the most efficient and useful subset of a data corpus to train on. Zero-shot scoring for large-scale image datasets based on how much each sample will add to the training process.
  • YOLO26 — The latest iteration of YOLO improves detection throughput with a new NMS-free inference setup.

Papers

Vision encoder. All Ministral 3 models use a 410M parameter ViT as a vision encoder for image understanding that is copied from Mistral Small 3.1 Base and kept frozen, with the same architecture described in Pixtral [Agrawal et al., 2024]. We discard the pretrained projection layer from the ViT to language model’s space and train a new projection for every model.

  • Google Introduces GIST, the next stage in smart sampling. GIST (Greedy Independent Set Thresholding) is a novel algorithm that provides provable guarantees for selecting high-quality data subsets that maximize both diversity and utility. When training on massive datasets, you need a representative subset that's not redundant; GIST solves this by balancing the diversity-utility tradeoff with mathematical guarantees. Especially useful for pre-training workflows at scale.
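
To make the diversity-utility tradeoff concrete, here is a toy sketch in the spirit of greedy, diversity-aware subset selection. It is not the GIST algorithm from the paper, just an illustration of the general idea: pick high-utility samples while skipping anything too close to what you have already selected.

```python
# Toy sketch of greedy, diversity-aware subset selection (not the exact GIST algorithm).
import numpy as np

def greedy_diverse_subset(embeddings, utilities, k, min_dist):
    """Select up to k indices greedily by utility, subject to a minimum pairwise distance."""
    order = np.argsort(utilities)[::-1]  # highest utility first
    selected = []
    for idx in order:
        if len(selected) == k:
            break
        if all(np.linalg.norm(embeddings[idx] - embeddings[j]) >= min_dist for j in selected):
            selected.append(idx)
    return selected

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))  # stand-in for sample embeddings
utilities = rng.random(1000)              # stand-in for per-sample utility scores
subset = greedy_diverse_subset(embeddings, utilities, k=100, min_dist=9.0)
print(f"Selected {len(subset)} samples")
```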

Releases

  • Google releases Agentic Vision with Gemini 3 Flash. Instead of just parsing images in a single pass, Gemini 3 Flash can now use a Think, Act, Observe loop with code execution. The model can zoom in, inspect, and manipulate images step-by-step to ground answers in visual evidence. This delivers a consistent 5-10% quality boost across vision benchmarks by replacing probabilistic guessing with verifiable, step-by-step inference execution. Developer docs.


Demo of how Gemini 3 Flash uses agentic steps to break down a vision problem.
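
Here is a hedged sketch of what a Think, Act, Observe vision loop can look like in code. The ask_model() function is a stand-in for whatever VLM API you use (it is not the Gemini API), and the action format is an assumption; the point is the control flow of zooming in before committing to an answer.

```python
# Hedged sketch of a Think, Act, Observe vision loop. ask_model() is a placeholder.
from PIL import Image

def ask_model(prompt: str, image: Image.Image) -> dict:
    # Placeholder: replace with a real VLM call that returns either
    # {"action": "zoom", "box": (left, top, right, bottom)} or
    # {"action": "answer", "text": "..."}.
    return {"action": "answer", "text": "stub answer"}

def agentic_vision_answer(question: str, image: Image.Image, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        prompt = f"Question: {question}\nObservations so far: {observations}"  # Think
        reply = ask_model(prompt, image)
        if reply["action"] == "zoom":  # Act: crop to the requested region
            image = image.crop(reply["box"])
            observations.append(f"Zoomed into {reply['box']}")  # Observe
        elif reply["action"] == "answer":
            return reply["text"]
    return "No confident answer after max_steps"

print(agentic_vision_answer("What is the total on this receipt?", Image.new("RGB", (512, 512))))
```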

Videos

  • AI is able to extract books verbatim — A concerning demonstration of how AI models can reproduce copyrighted content from training data. Several flagship models including Gemini, Claude and GPT were able to reproduce Harry Potter (and other copyrighted materials) with upwards of 90% accuracy. So it seems that it’s okay for large companies to syphon the internet for training data but when it comes to using the outputs to train other models it’s “against the terms of service”. A little pot calling the kettle black, no?

See you next month!

What a massive month for the ML world in January!

As always, let me know if there's anything you think should be included in a future post.

Liked something here? Share it with someone.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

You might like these courses

More from Zero To Mastery

The No BS Way To Getting A Machine Learning Job
19 min read

Looking to get hired in Machine Learning? Our ML expert tells you how. If you follow his 5 steps, we guarantee you'll land a Machine Learning job. No BS.

6-Step Framework To Tackle Machine Learning Projects (Full Pipeline)
30 min read

Want to apply Machine Learning to your business problems but not sure if it will work or where to start? This 6-step guide makes it easy to get started today.

How to Convince Your Boss to Pay for Your Upskilling
10 min read

Get your company to pay for your tech upskilling. Use this training request email and strategy to make it happen.