76th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey everyone!
Daniel here, I’m a machine learning engineer who teaches the following beginner-friendly machine learning courses:
- Complete Machine Learning and Data Science Bootcamp: Zero to Mastery
- TensorFlow for Deep Learning: Zero to Mastery
- PyTorch for Deep Learning: Zero to Mastery
- [NEW] 🤗 Machine Learning with Hugging Face Bootcamp: Zero to Mastery
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, I've done my best to keep things short and to the point.
Here's what you might have missed in April 2026 as an A.I. & Machine Learning Engineer... let's get you caught up!
My Work
- Hugging Face Bootcamp. The small LLM fine-tuning video series for the Hugging Face Bootcamp is filmed and currently being edited, and it’s coming to ZTM soon. In the meantime, feel free to read through all the source materials.
From The Internet
- Mitchell Hashimoto on My AI Adoption Journey. Mitchell Hashimoto walks through his shift from AI skeptic to productive user, emphasising that LLM agents add the most value when they can read files, execute programs, and verify their own work.
Three tips from the post:
- Break down sessions into separate clear, actionable tasks. Don’t try to “draw the owl” in one mega session.
- For vague requests, split the work into separate planning vs. execution sessions.
- If you give an agent a way to verify its work, it more often than not fixes its own mistakes and prevents regressions.
- How to write agent skills (Phil Schmid). Phil Schmid lays out 8 practical tips for writing better agent skills, covering specific descriptions, layered information organization, avoiding overfitting, and testing with 10 to 20 prompts before publishing. If there’s anything I took from this, it’s that my own prompts for previous models can probably be more direct.
- Composing a search engine with Exa (a search API). AI Search APIs are growing. Every time your agent makes a search tool call, it could be using Exa. In this technical post, Exa describes Canon, a search pipeline orchestrator that compiles independent nodes into a serializable graph for debuggable, maintainable infrastructure at scale.
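To make the "independent nodes compiled into a serializable graph" idea concrete, here's a minimal Python sketch. It is not Canon's actual API (the `Node` class, `run_graph` function, and node names are all hypothetical), just an illustration of declaring pipeline steps as independent nodes and executing them as a dependency graph.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a node-based search pipeline (not Canon's real API).
@dataclass
class Node:
    name: str
    fn: Callable[[dict], dict]      # each node reads and returns a shared state dict
    depends_on: list[str]

def run_graph(nodes: list[Node], state: dict) -> dict:
    done: set[str] = set()
    remaining = list(nodes)
    while remaining:
        for node in list(remaining):
            if set(node.depends_on) <= done:   # all dependencies finished
                state = node.fn(state)
                done.add(node.name)
                remaining.remove(node)
    return state

# Example: rewrite query -> retrieve -> rerank, each step independently testable.
nodes = [
    Node("rewrite", lambda s: {**s, "query": s["query"].lower()}, []),
    Node("retrieve", lambda s: {**s, "docs": ["doc_b", "doc_a"]}, ["rewrite"]),
    Node("rerank", lambda s: {**s, "docs": sorted(s["docs"])}, ["retrieve"]),
]
print(run_graph(nodes, {"query": "Machine Learning"}))
```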
- LLM coding style tips from Matthew Honnibal (founder of spaCy). Honnibal shares tactics for keeping LLM-assisted code maintainable, with a companion piece on auditing try/except blocks.
My favourite quotes:
“Most of software engineering art is to stop complexity creeping in exponentially.”
“Slow and steady wins the race.”
“Use types, as strictly as possible.”
“Use Pyright to check for types in your IDE.”
“Always ask: ‘should the types have caught this?’”
“Make sure attributes are being annotated”, the right syntax is:

    class Foo:
        bar: str

        def __init__(self, bar: str) -> None:
            self.bar = bar

“Use data classes and Pydantic models.” Instead of `record["name"]`, use `record.name` when `Record` is a `dataclass` or Pydantic model.
“Scrutinise unions and optionals”, whenever an LLM says a value can be `None`, ask “Can it though? Why?”
“Control complexity at the interface”, make it so if a particular file comes into your program, it gets manipulated in the way it needs to be straight away (e.g. formatting a JSON in a certain way before propagating it through the rest of your program).
“Crack down on exception handlers”, you generally want exactly one statement under the `try` block with `except` handlers for specific conditions.
“Try to avoid workflow loops where you go off and do things, paste the error back into the LLM, and then go do whatever it tells you. I’ve seen this anti-pattern called a ‘reverse centaur’: you want to be the head of the human/horse hybrid, not the ass. Another way to put it is just, probably don’t let the LLM make you its bitch. If it has, notice that. Sometimes it’s fine, but you know, check in with yourself. It’s not a good long-term arrangement.”
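A quick sketch of two of those tips together (dataclasses over dicts, and exactly one statement under `try` with specific handlers). The names here are made up purely for illustration:

```python
import json
from dataclasses import dataclass

@dataclass
class Record:
    name: str
    age: int

def load_record(raw: str) -> Record:
    # Exactly one statement under try, with a handler for a specific exception.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"Record is not valid JSON: {err}") from err
    # Attribute access (record.name) instead of dict lookups (record["name"]).
    return Record(name=payload["name"], age=payload["age"])

record = load_record('{"name": "Ada", "age": 36}')
print(record.name)  # "Ada"
```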
- Hugging Face LLM Evaluation Handbook. This is an excellent read if you’re looking into how to evaluate LLMs, since almost all AI problems eventually become an evaluation problem. Covers what LLM evaluation can and cannot do, how to select benchmarks, and how to design your own evaluations.
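As a toy illustration of “design your own evaluations” (this is not from the handbook itself), the simplest custom eval is a loop over prompt/reference pairs with a scoring function:

```python
# Minimal exact-match evaluation sketch (illustrative only, not the handbook's code).
def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(model_fn, eval_set: list[dict]) -> float:
    scores = [exact_match(model_fn(ex["prompt"]), ex["answer"]) for ex in eval_set]
    return sum(scores) / len(scores)

# model_fn would call your LLM; here a stub stands in for it.
dummy_model = lambda prompt: "Paris"
eval_set = [{"prompt": "Capital of France?", "answer": "Paris"},
            {"prompt": "Capital of Japan?", "answer": "Tokyo"}]
print(evaluate(dummy_model, eval_set))  # 0.5
```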
- Hugging Face release a Skill to port models from Transformers to MLX. The Skill can be used to instruct an agent on how to discover models in Hugging Face `transformers`, write code to convert them to MLX implementations, and then run a battery of tests to validate them end-to-end so models run efficiently on Apple Silicon.
- Why aren’t we using uv yet? (Python package manager). By Q2 2026, uv was used in roughly 30% of new Python repos despite a 74% “admired” rating in the Stack Overflow developer survey, with the article noting LLM agents still default to pip and suggesting library authors include uv install instructions alongside pip. Most of this is likely due to momentum. In my experience, new tools are hard to adopt fully. I’m slowly moving over to `uv`.
- Are benchmarks worth it? (Berkeley RDI). Berkeley researchers built an automated agent that exploited eight AI benchmarks including SWE-bench and WebArena, achieving near-perfect scores through legitimate vulnerabilities like file path disclosure and fake curl wrappers.
- Hugging Face guide to distilling 100B+ models 40x faster with TRL. The TRL `DistillationTrainer` enables knowledge distillation from 100B+ teacher models through a generation buffer that decouples batch size from generation, external teacher servers, and binary-encoded logprob payloads that shrink transfer sizes by 5x.
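I haven’t used the trainer myself, but the core idea behind logprob distillation is straightforward. Here’s a generic PyTorch sketch (not TRL’s implementation) of a student matching a teacher’s softened distribution via KL divergence:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions, then push the student towards the teacher with KL.
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * temperature**2

# Toy shapes: (batch, vocab). In practice the teacher logits/logprobs would come
# from an external teacher server rather than a local forward pass.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```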
- Mario Zechner on slowing the fuck down. Zechner argues coding agents introduce unsustainable error accumulation at scale and recommends hand-written architecture and pair programming to maintain discipline. He goes deeper on his minimal coding agent in the Pi video (Pi is an agent Mario built because Claude Code became unpredictable).
“Tiny booboos compound at a rate that’s unsustainable.”
- SynthVision, 110k synthetic medical images for fine-tuning smaller models. SynthVision uses two frontier VLMs to create a 110K-image medical dataset by annotating images with multi-turn conversations and cross-validating outputs to filter hallucinations, improving small-model performance across three model families by 8 to 15%.
- RL Environment Field Guide. A practical guide to RL environments by the same people who created LLM post-training 101 (an excellent guide with data examples on how language models are post-trained), covering observation and action spaces, graders, state transitions, termination, and sandboxing, using Pokémon Red as a case study with real code. See also Unsloth’s What are RL environments and how to build them, which explores how RL is evolving for agentic AI and how a decoupled framework lets teams swap optimizers and hardware backends without modifying environment logic.
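The building blocks the guide covers (observation/action spaces, a grader, state transitions, termination) map onto a very small interface. A hand-rolled sketch, not the guide’s actual code:

```python
import random

class GuessNumberEnv:
    """Tiny RL-style environment: observation, action, reward (grader), termination."""

    def reset(self) -> dict:
        self.target = random.randint(1, 10)
        self.turns = 0
        return {"hint": "guess a number between 1 and 10"}    # initial observation

    def step(self, action: int) -> tuple[dict, float, bool]:
        self.turns += 1
        done = action == self.target or self.turns >= 5        # termination condition
        reward = 1.0 if action == self.target else 0.0         # grader
        obs = {"hint": "higher" if action < self.target else "lower"}  # state transition
        return obs, reward, done

env = GuessNumberEnv()
obs = env.reset()
done = False
while not done:
    obs, reward, done = env.step(random.randint(1, 10))        # random policy
print("final reward:", reward)
```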
- NVIDIA share how to train your own custom embedding model in a day. NVIDIA’s recipe fine-tunes a domain-specific embedding model in under a day on a single GPU using synthetic data, lifting Recall@60 on Jira data from 0.751 to 0.951. The companion NeMo Retriever agentic pipeline hit #1 on the ViDoRe v3 leaderboard with a score of 69.22 by using a ReAct agent that dynamically picks search and reasoning strategies.
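The general recipe of fine-tuning an embedding model on (query, relevant passage) pairs, synthetic or otherwise, looks roughly like this with sentence-transformers. This is a generic sketch, not NVIDIA’s actual pipeline; the base model and the toy pairs are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder base model and toy synthetic (query, positive passage) pairs.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "Go to Settings > Account > Reset password."]),
    InputExample(texts=["Jira ticket stuck in review",
                        "Move the ticket back to 'In Progress' and reassign it."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: other passages in the batch act as negatives for each query.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```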
Open Source
- trafilatura. A pure-Python tool and CLI for gathering text and metadata from web pages, with support for crawling, scraping, and output in CSV, JSON, HTML, Markdown, and XML.
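Basic usage is only a couple of lines (from memory, so double-check against the docs):

```python
import trafilatura

# Download a page and extract the main text (and optionally metadata).
downloaded = trafilatura.fetch_url("https://example.com/some-article")
text = trafilatura.extract(downloaded)                            # plain text
as_json = trafilatura.extract(downloaded, output_format="json")   # text + metadata
print(text)
```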
- IBM release Granite 4.1 (Apache 2.0). Granite 4.1 is a family of dense models (3B, 8B, 30B parameters) trained on 15T tokens with long-context extension up to 512K, where the 8B instruct model matches or surpasses previous 32B MoE models.
“Carefully trained dense 8B models can rival much larger MoE architectures.”
- Talkie LM, an LM trained only on text from before 1931. Talkie-1930-13B is a 13B parameter model trained on 260B tokens of pre-1931 English text with a knowledge cutoff of December 31, 1930, useful for testing whether LLMs can generalise to ideas/inventions that came after the cutoff.
- PleIAs release CommonLingua, a tiny byte-level language ID model. A 2M parameter byte-level language identification model covering 334 languages including 61 African languages, using trigram hash embeddings with conv1D and attention layers and remaining script-agnostic across Latin, Arabic, Ethiopic, N’Ko, Tifinagh, Devanagari, and CJK.
- Meta release Sapiens v2 for human segmentation. A family of high-resolution vision transformers (0.4B to 5B parameters) at native 1K resolution, pretrained on 1 billion human images for human-centric tasks like pose estimation and body-part segmentation.
- DeepSeek release DeepSeek-V4-Flash and DeepSeek-V4-Pro. DeepSeek-V4-Flash is a 284B parameter (13B activated) MoE (Mixture of Experts) model and DeepSeek-V4-Pro is a 1.6T parameter (49B activated) MoE supporting 1M token context. Both are designed to be efficient, requiring 27% of the single-token inference cost and 10% of the KV cache compared with DeepSeek-V3.2.
- Hugging Face release ml-intern. `ml-intern` is an open-source agent built on smolagents that autonomously reads research papers, discovers datasets, and runs/evaluates training scripts, pushing Qwen3-1.7B’s GPQA score from 8.5% to 32% in under 10 hours. Try it in the demo Space.
- Moonshot release Kimi K2.6. A 1 trillion parameter (32B activated) multimodal MoE focused on long-horizon coding, agentic workflows, and autonomous task execution, with thinking mode enabled by default.
- Google DeepMind release TIPSv2, a zero-shot image classification model with spatial awareness. TIPSv2-b14 combines iBOT++, head-only EMA, and multi-granularity text captions, displaying excellent performance across 9 tasks and 20 datasets. See the GitHub repo and project page for more.
- Baidu release ERNIE-Image and ERNIE-Image-Turbo. ERNIE-Image is an 8B Diffusion Transformer for text-to-image generation that excels at dense text rendering, posters, and comics, while ERNIE-Image-Turbo is a distilled variant generating in 8 inference steps and deployable on consumer 24GB GPUs.
- MiniMax release MiniMax M2.7. A 230B parameter (10B activated) MoE for software engineering and agentic tool use with a 200K context window, scoring 56.22% on SWE-Pro. The MiniMax announcement frames it as the first model deeply participating in its own evolution, with an 88% win-rate vs M2.5.
- NVIDIA release Efficient Grounding Models (EGM). EGM-8B reaches 91.4 average IoU on RefCOCO with 737ms latency by training small VLMs with reinforcement learning, closing accuracy gaps without increasing model size.
- Tencent release small embodied VLMs (HY-Embodied-0.5). A suite of embodied foundation models (2B and 32B variants) using a Mixture-of-Transformers architecture for spatial-temporal perception in robotics, with the 32B variant claimed comparable to Gemini 3.0 Pro on embodied tasks.
- SteerViT, steering a vision encoder with text inputs. SteerViT injects text into ViT layers via cross-attention with only 21M extra trainable parameters, enabling zero-shot generalization to anomaly detection and personalized object discrimination. Code on GitHub and weights on Hugging Face.
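The mechanism of injecting text into vision layers via cross-attention is easy to sketch in plain PyTorch. This is an illustrative module, not SteerViT’s actual architecture:

```python
import torch
import torch.nn as nn

class TextSteeringBlock(nn.Module):
    """Illustrative cross-attention block: ViT patch tokens attend to text tokens."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the vision stream, keys/values from the text prompt,
        # added back residually so the ViT features are "steered" by the text.
        steered, _ = self.cross_attn(query=image_tokens, key=text_tokens, value=text_tokens)
        return self.norm(image_tokens + steered)

block = TextSteeringBlock()
image_tokens = torch.randn(1, 196, 768)   # e.g. 14x14 ViT patches
text_tokens = torch.randn(1, 12, 768)     # encoded text prompt
out = block(image_tokens, text_tokens)
print(out.shape)  # torch.Size([1, 196, 768])
```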
- Meta release TRIBE v2. A state-of-the-art brain encoding model, trained on 500+ hours of fMRI from 700+ subjects, that predicts fMRI brain responses to naturalistic stimuli (text, audio, visual) and maps multimodal representations onto the cortical surface with a 70x resolution improvement over prior models.
- Allen AI release WildDet3D. A 3D object detection model with a paired iPhone app to detect 3D objects through an iPhone camera (the model runs on a remote server). The blog post covers the 1M+ image WildDet3D-Data dataset spanning 13.5K categories with 22.6+ AP3D performance.
- How Hugging Face engineer Niels Rogge OCR’d 30k papers. Niels Rogge writes up converting 30,000 arXiv papers to Markdown using Chandra-OCR 2 with vLLM for GPU inference and Hugging Face Jobs for serverless compute scaling.
- Jackrong LLM fine-tuning guide PDF. A detailed guide to fine-tuning your own LLMs (specifically Qwopus 3.5 27B on Google Colab) with mostly open-source software, covering environment setup, data processing from 24 curated reasoning datasets, and GGUF quantization for local deployment.
- OpenBMB release VoxCPM2. A 2B parameter tokenizer-free TTS model supporting 30 languages with voice design from natural language descriptions, controllable voice cloning, and 48kHz studio-quality audio at RTF 0.3 on an RTX 4090, Apache-2.0 licensed.
- Google release Gemma 4. Gemma 4 ships in four sizes (E2B, E4B, 26B MoE, 31B) with multimodal text and image input (audio on small models), context windows up to 256K tokens, and Apache 2.0 licensing. The Google blog post covers day-one support for Transformers, vLLM, llama.cpp, and MLX, and there are docs, a cookbook, and a Hugging Face writeup plus a multimodal demo notebook. Maarten Grootendorst published a visual guide to Gemma 4 covering MoE, the vision encoder, per-layer embeddings, and the audio encoder, and a video shows Gemma 4 on Raspberry Pi 5 hitting 3 to 8 tokens per second. Red Hat shipped quantized variants (NVFP4 and FP8-block).
- Netflix release VOID for video object removal. VOID (Video Object and Interaction Deletion) removes objects from videos along with all the items they interact with, preferred over Runway 64.8% vs 18.4% in human tests, built on CogVideoX-Fun with interaction-aware quadmask conditioning.
- Falcon release SigLINO, a SigLIP2 + DINOv3 fusion. SigLINO distills both SigLIP2 and DINOv3 into one feature space (MoE and Dense variants), with the paper introducing Agglomerative MoE knowledge distillation and code on GitHub.
- Falcon Perception, SAM3-like and OCR models with speed in mind. Falcon Perception is a 0.6B early-fusion vision-language model for open-vocabulary grounding with zero-shot localization under free-form text queries. There’s a smaller 300M variant, a perception agent demo notebook, and an Apple Silicon MLX demo notebook. I really liked this quote from the release blog post (a focus on data quality):
Falcon Perception is intentionally minimal: one backbone, one objective family, and small heads only where outputs are continuous and dense. The working assumption is that most gains should come from data, compute, and training signals, rather than continually expanding the pipeline with specialized modules.
Releases
- Google “Learn Mode” in Colab. Google Colab introduces Learn Mode, turning Gemini into a coding tutor that breaks down complex topics step by step rather than generating code, with Custom Instructions for tailoring assistance per notebook.
“Learn Mode shifts from being a code generator to a coding tutor, breaking down complex topics into manageable, step-by-step instructions.”
Research
- Vision Banana by Google DeepMind. Image Generators are Generalist Vision Learners: Vision Banana is instruction-tuned from Nano Banana Pro and parameterizes vision tasks as image generation, beating Segment Anything on segmentation and Depth Anything on metric depth estimation.

With instruction fine-tuning, Nano Banana Pro (an image generation model) can be customized to do other computer vision tasks such as semantic segmentation, instance segmentation, and depth estimation on par with or better than specialist models. Source: Vision Banana project page.
- Google DeepMind release research on AI co-clinician, an agent designed to function as a member of the care team that interacts with patients under clinical supervision. The co-clinician system outperformed primary care physicians and GPT-5.4-thinking-with-search on the medical reasoning benchmark RxQA (multiple-choice questions), and scored higher than GPT-5.4-thinking-with-search (95.0% vs 90.9%) on RxQA open-ended questions.
Case studies
- Harness engineering: leveraging Codex in an agent-first world. OpenAI details how a five-month internal experiment shipped a beta product with roughly a million lines of code without any manually written source code, with the engineering team’s primary job becoming enabling the agents to do useful work. The system uses Codex agents to write code, generate tests, and manage observability based on declarative prompts, with documentation treated as the table of contents rather than the encyclopedia.
Videos
- Is RAG still needed? Short answer: yes. Medium answer: it depends. Longer answer: think about your specific problem. If it often needs up-to-date or external information, RAG (Retrieval Augmented Generation) is a potential tool you could use. Fantastic video from Martin Keen of IBM Technology.
See you next month!
What a massive month for the ML world in April!
As always, let me know if there's anything you think should be included in a future post.
Liked something here? Share it with someone.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.