[May 2026] AI & Machine Learning Monthly Newsletter 🤖

77th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey everyone!

Daniel here, I’m a machine learning engineer who teaches the following beginner-friendly machine learning courses:

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Here's what you might have missed in May 2026 as an A.I. & Machine Learning Engineer... let's get you caught up!

My Work

Fine-tuning Small Language Models with Hugging Face Transformers - The Small Language Model (or SLM) fine-tuning tutorial series is nearly finished being edited and should be live on ZTM soon. In the meantime, you can read through all of the materials online.
I’ve started a new role at Artificial Analysis evaluating the world’s best AI models - I’ll be helping directly with building evals and testing frontier models, as well as making technical content around our products. My first video walks through the Artificial Analysis Coding Agent Index, a benchmark for testing different coding model and harness combinations. Watch on YouTube.

From The Internet

We should be more tired than the model by Vicki Boykis. “When I finish an agentic session, I get all the outward signs of having written code, but none of the internal processes that happen when we write code by hand.” Vicki writes about feeling like she’s losing control over the code she writes with agentic code generation. I’ve had similar feelings. I like her thoughts and practices on adding friction back into the problem to facilitate familiarity.
Gemini Managed Agents Guide by Philipp Schmid. A guide on using Gemini’s new managed agents feature. You call the model with a task, and each interaction then runs inside a Linux sandbox (Ubuntu, Python 3.12, Node.js 22) with 4 CPU cores and 16 GB RAM, isolated at the OS level so the agent can install packages, run code, and write files without affecting your machine.
Harness, Scaffold, and the AI Agent Terms Worth Getting Right on the Hugging Face blog. A helpful glossary for common AI Agent terms such as Harness, Context Engineering, Skills and Tool Use. For example, Agent = Model + Harness.
Common Crawl is now available via a Hugging Face Bucket. The April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3. That’s 380 TiB of uncompressed content across some 2.2 billion web pages. Buckets are backed by Xet (chunk-based, deduplicated storage) and reachable with the same hf CLI and huggingface_hub client used for Models and Datasets. Cool to see Hugging Face Storage Buckets being used for such large-scale data storage. I’ve personally started building a HF Storage Bucket for Nutrify assets.
[Case Study] Vector search at Notion: 10x the scale, 90% cheaper. Case study of how Notion moved from dedicated “pod” clusters to a serverless architecture, then migrated their multi-billion object workload to turbopuffer, and is now moving embeddings generation and serving onto Ray running on Anyscale.
Why benchmarking is hard by Epoch AI. “Running benchmarks involves many moving parts, each of which can influence the final score. The two most impactful components are scaffolds and API providers.” They use the example of SWE-bench Verified showing how simply switching the agent scaffold makes up to an 11% difference for GPT-5 and a 15% difference for Kimi K2. They even show how the same model can perform differently on different providers.

GLM 4.6 performing differently on different API providers. Source: Epoch.ai blog.

paperswithcode is back. When I started learning machine learning, paperswithcode was one of my favourite websites for learning new ideas and seeing which models performed the best on certain tasks. Now it’s back! The website collects research papers, model artifacts and code in a central place and categorizes them into vision, language, robotics and more. See this example page for the recently launched Surya OCR 2 model.

[Benchmark] Tau-knowledge: benchmarking agents on realistic knowledge by Sierra. A new benchmark for measuring how agents perform on messy, changing knowledge bases to complete multi-step tasks. It extends Sierra’s 𝜏-bench with a new 𝜏-Banking domain built around a realistic knowledge base of 698 documents, and today GPT-5.5 with xhigh reasoning leads the leaderboard at 37.4% Pass^1, meaning even the leading model still fails roughly 60% of these tasks at maximum reasoning effort. Leaderboard, Code, Paper.
What do vision tokens cost with frontier models? by Roboflow. A breakdown of how GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro each turn pixels into tokens, with a worked grid across five image sizes. An iPhone photo is about 2,451 image tokens on GPT-5.5, 6,636 on Claude, and 6,192 on Gemini.
[Case Study] Dropbox: Using LLMs to amplify human labeling and improve Dash search relevance. How Dropbox trains Dash’s search ranking models with a mix of human and LLM-assisted labeling, starting with a small amount of internal, human-labeled data and then amplifying it with LLMs to produce relevance labels at scale. Excellent example of humans-in-the-loop for data enrichment.
hfviewer for viewing the architecture of any HF model. You can also replace huggingface.co with hfviewer.com in any model URL to view it. See the example of Qwen3.5-4B below.
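To make the Pass^1 number in the Tau-knowledge item above more concrete, here is a minimal sketch of how a pass^k score can be computed. It assumes the leaderboard follows the pass^k definition from the original 𝜏-bench work (the chance that k trials sampled from n runs of a task all succeed, averaged over tasks). The function name and the per-task results are hypothetical and only for illustration.

```python
from math import comb


def pass_hat_k(successes_per_task, trials, k):
    """Average over tasks of C(c, k) / C(n, k), where c is the number of
    successful runs out of n trials for a task."""
    total = 0.0
    for c in successes_per_task:
        # math.comb returns 0 when k > c, so tasks with too few successes add 0.
        total += comb(c, k) / comb(trials, k)
    return total / len(successes_per_task)


# Hypothetical results: number of successes out of 4 trials for five tasks.
successes = [4, 2, 0, 1, 3]

pass_1 = pass_hat_k(successes, trials=4, k=1)  # plain average success rate
pass_2 = pass_hat_k(successes, trials=4, k=2)  # both of 2 sampled trials succeed

print(f"pass^1 = {pass_1:.3f}, pass^2 = {pass_2:.3f}")
```

With k=1 the metric reduces to the average success rate, so a 37.4% Pass^1 would mean a bit over 60% of task attempts fail. Higher k values reward consistency, which is why pass^k drops as k grows.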

Francois Chollet’s Deep Learning with Python is now free to read online. Chapters range from What is Deep Learning? to Language Models and the Transformer.
Speed up inference with Gemma 4 and MTP (multi-token prediction). MTP drafter models for the Gemma 4 family are available under the same Apache 2.0 license as Gemma 4, and the draft models seamlessly use the target model’s activations and share its KV cache. See more in the Documentation, Gemma 4 object detection, Maarten Grootendorst’s visual guide to Gemma 4.
[Benchmark] ProgramBench is a new benchmark which tests whether language models can rebuild programs from scratch. A model is given only a compiled binary and its documentation and agents must architect and implement a complete codebase that reproduces the original program’s behavior. GPT 5.5 on xhigh with mini-SWE-agent scores 0.5% and Claude Opus 4.7 (xhigh) with mini-SWE-agent scores 0%. See more in the Paper, GitHub.

Open Source

OCR and document extraction

Surya OCR 2 from Datalab. Surya 2 is a single VLM that handles layout analysis, OCR (full page or per block), and table recognition in one model, at roughly 650M parameters. It scores 83.3% on olmOCR-bench and is the best in its class under 3B parameters.
Chandra OCR 2 from Datalab. Chandra 2 is a 4B parameter model that reaches an 85.9% olmOCR bench score and a 77.8% multilingual bench score (a 12% improvement over Chandra 1), with support for 90+ languages.
LlamaIndex LiteParse v2.0. LiteParse extracts structured text from PDFs and office documents without LLMs, and small docs see a 5-100x speedup with larger docs around 3x.
NuExtract v3 from NuMind. NuExtract3 is a 4B multimodal document-understanding model released under Apache 2.0, described as the first reasoning VLM specialized in both OCR and structured extraction. It extracts structured JSON from scans, receipts, and forms, converts documents to Markdown, and beats Qwen3.5-9B by 17 points on the structured-extraction task while being less than half the size.
MiniCPM-V 4.6 for small deployable VLM. A pocket-sized multimodal LLM for efficient image and video understanding on mobile devices, built on SigLIP2-400M and the Qwen3.5-0.8B LLM with 1.3B total parameters. It introduces mixed 4x/16x visual token compression and is optimized for on-device performance across iOS, Android, and HarmonyOS. PDF extract cookbook, MLX collection.

Image, audio, and 3D generation

[Dataset] Monet dataset from jasperai. MONET is a large-scale dataset of 104.9M image-text pairs released under the Apache 2.0 license and designed for training large text-to-image models. It was collected from 2.9B raw pairs across heterogeneous open sources through safety filtering, duplicate removal, and re-captioning with multiple vision-language models, and each image ships with pre-computed embeddings, structured annotations, and pre-encoded VAE latents. Paper, Retrieval demo.
Qwen-Image-Bench for automated text to image rating. Q-Judger is a vision-language model fine-tuned for automated evaluation of text-to-image generated images, built on top of Qwen3.6-27B. Given a text prompt and a generated image, it scores the image across 5 hierarchical dimensions using structured checklists and outputs JSON-formatted evaluation results. Lines up quite nicely with human expert raters with a Spearman score of 0.92. The current Q-Judger leaderboard has GPT Image 2 ahead by quite a large margin over Nano Banana 2.
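For a feel of what a Spearman score like the 0.92 above measures, here is a minimal pure-Python sketch of Spearman rank correlation between human scores and automated judge scores. It assumes the reported number is a standard Spearman rank correlation. The scores below are made up for illustration and are not from the Q-Judger evaluation.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for five generated images.
human_scores = [9.0, 7.5, 6.0, 8.0, 3.0]
judge_scores = [8.5, 7.0, 6.5, 8.8, 2.0]

rho = spearman(human_scores, judge_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because only the ordering matters, the judge can use a different scale from the human raters and still score highly, as long as it ranks images in a similar order.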

Q-Judger leaderboard of text-to-image models as of May 2026. Source: Q-Judger Hugging Face page.

Bonsai Image 4B from PrismML. Built from FLUX.2 Klein 4B, the 1-bit and Ternary variants reduce the diffusion transformer from 7.75 GB down to 0.93 GB and 1.21 GB respectively, and to PrismML’s knowledge it’s the first image model in its parameter class to run directly on an iPhone.
Microsoft Lens, an open-source 3.8B parameter MIT licensed image generation model. A relatively small image generation model trained on 800M image-text pairs. Weights are available on Hugging Face under the MIT license, with a distilled Lens-Turbo variant also available.
HiDream-O1-Image model. An 8-billion-parameter, pixel-native unified image generative model open-sourced under the MIT License. Built on a Pixel-level Unified Transformer, it encodes raw pixels, text prompts, and task-specific conditions in a single shared token space, eliminating external VAEs and disjoint text encoders, and natively handles text-to-image generation, instruction-based editing, and subject-driven personalization at up to 2,048 x 2,048 resolution.
Stability AI releases Stable Audio 3. A family of open-weight audio generation models trained on fully licensed data, covering music composition, sound effects, and audio editing in a single unified architecture. The release includes three models (small SFX, small, medium) and a handful of extras. The open-weight resources can be downloaded on Hugging Face.
Tencent releases Pixal3D for image to 3D. Unlike previous methods that loosely inject image features via attention, Pixal3D explicitly lifts pixel features into 3D through back-projection, establishing direct pixel-to-3D correspondences.

VLMs, detection, and robotics

Nvidia release LocateAnything3B for detecting anything in an image. For faster inference, the model uses a very cool method, Parallel Box Decoding, which predicts complete bounding box coordinates in a single parallel step rather than autoregressive token-by-token decoding, enabling up to 2.5x higher throughput, and it’s trained on 12M images, 138M+ queries, and 785M bounding boxes.

Marlin is a tiny VLM to extract structured information from videos. A 2B-parameter video VLM from NemoStation, released under Apache 2.0 and tuned for two questions developers actually ask their videos: what is happening, and when. The model produces structured scene and event captions with second-precise timestamps and resolves natural-language queries to span-grounded (start, end) ranges, and at 2B params is competitive with Gemini-2.5-flash at a fraction of the cost.
RF-DETR models come to Hugging Face. Roboflow DETR models are now integrated in Transformers and available directly on Hugging Face under the Apache 2.0 license.
Brave open-source their model for faithful webpage summarization. Ocelot is an open-source vision-language model designed from the ground up to summarize web content quickly and reliably, and it now powers Leo, Brave’s browser AI assistant. It’s a LoRA adapter trained on top of Qwen3-VL-4B-Instruct and it gives Qwen3-VL-4B-Instruct the ability to produce web summaries faster than much larger LLMs at the same quality.
Ai2 Introduces MolmoAct2, VLA models for the real world. The model outperforms capable proprietary robotics models on industry benchmarks, runs up to 37x faster than its predecessor, and ships alongside the MolmoAct 2-Bimanual YAM dataset with over 720 hours of training demonstrations. Models, Tech Report, Code.

LLMs and on-device efficiency

Cohere releases Command A+. A mixture-of-experts model with 218B total and 25B active parameters, released under Apache 2.0, supporting 48 languages and available in 16-bit, 8-bit, and 4-bit quantizations. Blog post.
needle = 26m function calling model. Cactus distilled Gemini 3.1 into a 26M parameter Simple Attention Network that you can finetune locally on your Mac or PC, running at 6000 tokens/sec prefill and 1200 decode in production, and it beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function calling.
Rapid-MLX for faster MLX model workflows on device. Rapid-MLX is a local AI engine for Apple Silicon. It includes 17 tool parsers, prompt cache, reasoning separation, and cloud routing, works as a drop-in OpenAI replacement, and works with Claude Code, Cursor, and Aider.

Embeddings and rerankers

Ettin Reranker Family. Six new Sentence Transformers CrossEncoder rerankers, state-of-the-art at their respective sizes, built on top of the Ettin ModernBERT encoders and ranging from 17M to 1B parameters, all under Apache 2.0.
Granite small embedding models. A 97M parameter dense embedding model from IBM that produces 384-dimensional vectors with a context length up to 32,768 tokens. It scores 60.3 on Multilingual MTEB Retrieval, the highest retrieval score of any open multilingual embedding model under 100M parameters. See also Granite vision 4.1 for document extraction, Demo, Blog.

Tooling and other cool things

[Tool] Google Genkit for developing agentic apps. Genkit Middleware is a new system of composable hooks that intercept generation calls, including the tool execution loop, and inject custom behaviors. Hooks can intercept the cycle at three levels (generation, model calls, and tool execution), with built-in capabilities like retries, fallbacks, and human-in-the-loop approvals, available in TypeScript, Go, and Dart with Python coming soon.
Hugging Face release Carbon 3B for nucleotide modelling. A 3B-parameter decoder-only autoregressive genomic foundation model trained on DNA and RNA sequences with a primary focus on eukaryotes. It has a native context length of 32,768 6-mer tokens (about 197k DNA base pairs), can generate new DNA sequences and score the functional impact of mutations zero-shot, and matches Evo2-7B while running ~250x faster at inference.
Datadog releases time series foundation model under Apache 2.0. Toto 2.0 is a family of time series foundation models ranging from 4M to 2.5B parameters, designed to answer whether time series foundation models can improve as they scale. Toto-2.0-22m beats or matches Toto-1.0 quality with 7x fewer parameters.
fastino introduce GliGuard for LLM guardrails protection. A 300M parameter encoder-based safety moderation model that handles safety classification, jailbreak detection, harm categorization, and refusal detection in a single forward pass. Across nine established safety benchmarks, its accuracy matches or exceeds decoder-based models 23 to 90 times its size, including Meta’s LlamaGuard4 (12B) and Google’s ShieldGemma (27B), while running up to 20 times faster. See the paper for more.
Privacy Filter Nemotron for Personally Identifiable Information extraction from text. A fine-tuned version of OpenAI’s privacy-filter model with 5x more granular PII classification, tagging each token with a BIOES label across 55 PII span classes. Categories include first_name, last_name, medical_record_number, credit_debit_card, ssn, and many others designed to match what downstream redaction and masking pipelines need.
pandas v3 is out. Highlights include a dedicated string data type by default (inferred as str instead of object), consistent copy/view behaviour with Copy-on-Write as the default and only mode (which removes the SettingWithCopyWarning), a new default microsecond resolution for datetime-like data, and new pd.col syntax.
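On the BIOES labels mentioned in the Privacy Filter Nemotron item above: BIOES tags each token as Beginning, Inside, Outside, End, or Single, and a redaction pipeline then groups those tags into spans. Here is a minimal sketch of that grouping step. The `PREFIX-class` tag string format, the example sentence, and the `decode_bioes` helper are assumptions for illustration; the model's actual output format may differ.

```python
def decode_bioes(tokens, tags):
    """Group BIOES-tagged tokens into (label, text) spans."""
    spans = []
    current, label = [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Outside any span: reset whatever was in progress.
            current, label = [], None
            continue
        prefix, _, cls = tag.partition("-")
        if prefix == "S":
            # Single-token span.
            spans.append((cls, token))
            current, label = [], None
        elif prefix == "B":
            # Start a new multi-token span.
            current, label = [token], cls
        elif prefix in ("I", "E") and label == cls:
            current.append(token)
            if prefix == "E":
                # End tag closes the span.
                spans.append((cls, " ".join(current)))
                current, label = [], None
        else:
            # Malformed sequence (e.g. I- without a matching B-): drop it.
            current, label = [], None
    return spans


# Hypothetical tagged sentence using class names from the model card.
tokens = ["Contact", "Jane", "Doe", "about", "record", "MRN", "12345"]
tags = [
    "O",
    "S-first_name",
    "S-last_name",
    "O",
    "O",
    "B-medical_record_number",
    "E-medical_record_number",
]

spans = decode_bioes(tokens, tags)
print(spans)
```

The explicit End and Single tags are what make span boundaries unambiguous, which helps when two entities of the same class sit next to each other.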

Releases

MiniMax releases MiniMax-M3. MiniMax-M3 is a multimodal foundation model that supports text, image, and video inputs with text output, has a 1M-token context window, and is suited for long-horizon agentic work, coding, and tool use. At the time of writing the model is brand new (May 31st 2026 release on OpenRouter). Model weights may be released soon on Hugging Face.
Gemini API now provides managed agents. You call the Interactions API and Gemini 3.5 Flash handles reasoning, code execution, package installation, file management, and web browsing inside an isolated Linux sandbox, with no servers to manage and no orchestration code to write. You can use the Antigravity agent out of the box for coding, research, and data analysis, or build your own with custom instructions, skills, and data.
Firecrawl releases Question and Answer Highlights. Two new formats for /scrape: question (pass a URL and a question, get a grounded answer back) and highlights (pass a URL and a query, get the exact sentences, code blocks, and table rows that match). Both are up to 100x more token-efficient than a full scrape, answer strictly from page content with zero hallucinations, and ship with prompt-injection hardening built in.
OpenAI releases Codex locked use. Codex can now use your Mac in a narrow scope, even when locked (as long as you allow it).
Google releases Gemini 3.5 Flash. Gemini 3.5 Flash is Google’s latest flagship model (before the incoming Gemini 3.5 Pro comes out). It outperforms Gemini 3.1 Pro on benchmarks like Terminal-Bench 2.1 (76.2%), GDPval-AA (1656 Elo), and MCP Atlas (83.6%), and when looking at output tokens per second it is up to 4 times faster than other frontier models.
Cursor releases Composer 2.5. Cursor really cooked with Composer 2.5. In my experience, it’s fast and high quality. And on benchmarks, such as the Artificial Analysis Coding Agent Index, it’s right up there with GPT 5.5 and Claude Opus 4.7. On CursorBench v3.1, it scores 63.2% (vs 64.8% for Claude Opus 4.7 and 64.3% for GPT 5.5) at a fraction of the cost.

Comparing model performance on CursorBench 3.1 against the average cost per task. Source: Cursor blog.

OpenAI releases Codex for iOS ChatGPT app. Now you can start work on one machine and use ChatGPT on iOS or Android, or Codex on Mac, to check progress, continue the thread, respond to prompts, and steer work while away from the desk.
Thinking Machines releases Interaction Models. Thinking Machines’ new Interaction Model uses a time-aligned micro-turn design, interleaving the processing of 200ms of input with the generation of 200ms of output, paired with an asynchronous background model that handles sustained reasoning and tool use.

Thinking Machines Interaction Model workflow. Source: Thinking Machines blog.

Research

SkillOpt for automatically improving skills. SkillOpt is described as the first systematic controllable text-space optimizer for agent skills, where a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document. Rather than modifying a model’s weights, it trains a Markdown document as an external parameter of the frozen model, and on GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, +24.8 inside the Codex agentic loop, and +19.1 inside Claude Code.
Sakana AI introduces DiffusionBlocks for training neural networks one block at a time. Sakana AI develops a way to train neural networks in blocks by treating each block as a small diffusion model. This reinterpretation drastically reduces the memory needed to train deep models, and they matched end-to-end performance across ViTs, DiTs, and LLMs while training just one isolated block at a time. Paper, GitHub.

Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings. The project aggregates 4.14M recipes from 11 sources spanning seven languages and normalises raw ingredient strings to 1,790 canonical entries, then trains three Metapath2Vec variants that differ only in the random-walk schema across the chemistry-versus-recipe-context spectrum. You can find the embedding models on Hugging Face.
ParaRNN: Large-Scale Nonlinear RNNs, Trainable in Parallel by Apple. A new framework for parallelized RNN training that achieves a 665x speedup over the traditional sequential approach, enabling the training of the first 7-billion-parameter classical RNNs that achieve language modeling performance competitive with transformers. The key insight is adapting Newton’s method so that solving a nonlinear RNN becomes the iterative solution of linear state space models, each of which can be solved efficiently in parallel. See the code.
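The Newton's method idea in the ParaRNN item can be sketched on a toy problem. The snippet below is not Apple's implementation: it assumes a scalar RNN `h_t = tanh(w * h_{t-1} + x_t)` and plain Python, with all names chosen for illustration. It treats every hidden state as an unknown, and each Newton step reduces to a linear recurrence. That recurrence is solved with an ordinary loop here, but it is the kind of problem a parallel scan can handle.

```python
import math


def sequential_rnn(w, xs, h0=0.0):
    """Reference: unroll the nonlinear RNN one step at a time."""
    hs, h = [], h0
    for x in xs:
        h = math.tanh(w * h + x)
        hs.append(h)
    return hs


def newton_rnn(w, xs, h0=0.0, iterations=20, tol=1e-12):
    """Solve h_t = tanh(w * h_{t-1} + x_t) for all t at once with Newton's method."""
    hs = [0.0] * len(xs)  # initial guess for every hidden state
    for _ in range(iterations):
        prev = [h0] + hs[:-1]
        pre = [w * p + x for p, x in zip(prev, xs)]
        # Residual F_t = h_t - tanh(w * h_{t-1} + x_t), using the current guess only.
        residual = [h - math.tanh(z) for h, z in zip(hs, pre)]
        # Jacobian entries a_t = d tanh(w * h_{t-1} + x_t) / d h_{t-1}.
        slopes = [w * (1.0 - math.tanh(z) ** 2) for z in pre]
        # Newton update solves the LINEAR recurrence d_t = a_t * d_{t-1} - F_t, d_0 = 0.
        # Sequential loop here; a parallel scan could evaluate the same recurrence.
        deltas, d = [], 0.0
        for a, f in zip(slopes, residual):
            d = a * d - f
            deltas.append(d)
        hs = [h + step for h, step in zip(hs, deltas)]
        if max((abs(step) for step in deltas), default=0.0) < tol:
            break
    return hs


# Hypothetical inputs and weight.
inputs = [0.5, -1.0, 0.3, 0.8, -0.2, 0.1, 0.9, -0.7]
reference = sequential_rnn(0.9, inputs)
solved = newton_rnn(0.9, inputs)

print(max(abs(a - b) for a, b in zip(reference, solved)))
```

In this toy setup the Newton iterates match the step-by-step unroll after at most as many iterations as there are time steps, and typically far fewer are needed in practice. The speedup comes from each iteration being parallelisable across the sequence.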

Videos

Philipp Schmid on Why (Senior) Engineers Struggle to Build AI Agents. See also the blog post version of the talk.

How We Built Zeta2: Training an Edit Prediction Model in Production by Ben Kunkle, Zed. Ben Kunkle explains how Zed trained Zeta2, a small specialized edit prediction model, through knowledge distillation, using a frontier teacher model to generate the training signal.
Nick Nisi from WorkOS on how he deleted 95% of his agent skills and got better results. Deleting 95% of auto-generated skills (from around 10,000 to 553 lines of targeted gotchas) improved agent performance from 77% to 97% accuracy on key tasks. A key finding was that evals were critical to discovering that more context was actively hurting results (always be eval-ing!).
Yann LeCun on Welch Labs speaking about JEPA Architectures Part 2. Part 2 of Welch Labs’ deep dive into Joint Embedding Predictive Architectures (JEPA), Yann LeCun’s framework for self-supervised learning that aims to make AI learn more like humans. See Part 1.
Andrew Kelley on Zig Language and why he doesn’t use AI. Andrew Kelley, creator of Zig, explains the language’s strict no-AI policy when it comes to contributions and why AI-generated pull requests become a bottleneck for the team, alongside topics like the $670K foundation, leaving GitHub, and why Zig isn’t 1.0 yet.

See you next month!

What a massive month for the ML world in May!

As always, let me know if there's anything you think should be included in a future post.

Liked something here? Share it with someone.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.