[March 2025] AI & Machine Learning Monthly Newsletter 💻🤖

63rd issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in March 2025 as an A.I. & Machine Learning Engineer... let's get you caught up!

My work

[Coming Soon] Project: Build a custom object detection model with Hugging Face Transformers — I’m working on a new ZTM project to build Trashify 🚮, a custom object detection model to incentivise picking up trash in a local area. The code is complete and I’m in the process of making supplementary materials (tutorial text, slides, videos, evaluation breakdowns). Stay tuned for the completed release!
Video version of ML Monthly February 2025 — If you like seeing video walkthrough’s of these kinds of materials (videos tend to be better for demos), check out the video walkthrough of last month’s ML Monthly. The video walkthrough for this issue (March 2025) should be live a couple of days after the text version gets posted!

From the Internet

Blog posts

Hamel Husain writes A Field Guide to Rapidly Improving AI Products — From how error analysis consistently reveals the highest-ROI improvements to why a simple data viewer is your most important AI investment to why your AI roadmap should count experiments, not features this guide is a must read for anyone building AI products or systems.

hamel-field-guide-post-summary

A summary of points from Hamel’s field guide to improving AI products. My favourite point is the last one. Creating anything with AI requires a relentless spirit of experimentation, so prioritise experimenting to improve your models and system and the features will come. Source: Hamel’s blog.

Airbnb share two case studies of how they introduced embedding-based search on the platform (a good write up on how they created a dataset for this too) which led to a good boost in bookings as well as another where they leveraged LLMs to help rewrite testing code from Enzyme to React Testing Language in 6 weeks rather than an estimated 1.5 years.

airbnb-retry-loop-to-rewrite-tests-with-llms

Airbnb’s workflow diagram for using foundation LLMs to help rewrite test cases from one language to another. The article shares an extra breakdown of the prompt inputs they used where they found context inputs to be most important for more complex rewrites. Source: Airbnb tech blog.

Alex Strick van Linschoten writes about the experience of building for a week with local LLMs, one of my favourite takeaways is the “reflect, iterate and improve” loop as well as the tip for breaking tasks into smaller pieces to help the smaller models along. Also a good mention of using FastHTML + llms.txt to create small applications to go along with your model experiments. Alex also writes about using MCP (Model Context Protocol, a standard for connecting AI models to tools and data) to connect Claude to a personal habit tracking database.
Emerging Patterns in Building GenAI Products by Martin Fowler by Martin Fowler — Now that LLMs and other forms of GenAI models are making their way into more and more products, several building patterns are starting to emerge. In this article, Martin Fowler, a software developer with three decades of experiences breaks down the patterns he’s seen in practice. From direct prompting to embeddings to evals to query rewriting and reranking. A highly recommended read for those looking to build GenAI applications.

putting-together-a-realistic-rag-by-martin-fowler

Example of the parts of a system involved in a realistic RAG (Retrieval Augmented Generation) setup. When it comes to building a production system, there are often a few more parts involved compared to the demo. Source: Martin Fowler's blog.

A breakdown of LLM benchmarks, evals and tests by Thoughtworks explores the different ways to evaluate generative AI models such as LLMs. It’s one thing for researchers and companies to claim their models perform the best on various benchmarks but how do these compare to your own evaluations (evals for short)? Best practice is creating an evaluation set for your own use case so when a new model gets released, you can evaluate it on your own data.

genai-llm-evaluations-table

Comparison of different kinds of GenAI and LLM evaluations. It is often best practice to evaluate any form of GenAI or LLM model on all three criteria: benchmarks, evals and tests. Image by the author.

Neural Magic shows to get faster VLM models through quantization — Modern LLMs and VLMs have so many parameters which enable them to learn incredible patterns in data. However, many of these parameters are redundant or aren’t required to be in full precision. In turn, they can be quantized (e.g. reduced precision from Float16 to FP8, Int8 or Int4) and thus retain performance whilst having a much lower memory footprint and 3.5x faster throughput using vLLM. Get a collection of quantized Qwen2.5-VL (3B, 7B, 72B) as well as Pixtral (12B, Large) models on Hugging Face.

My top open-source AI resources of the month

olmOCR is a powerful 7B model focused on OCR which rivals GPT-4o — A fine-tuned version of Qwen2-VL-7B on 250,000 pages of PDF documents, olmOCR is a model and pipelined focused on creating high-quality text extractions from documents and images of documents.

These are my favourite of model.

A smaller model that’s been specifically tuned for a certain task which performs almost as good as a much larger model.

The paper contains a series of nice tidbits about the creation of olmOCR model, including:

32x cheaper than GPT-4o (extract ~1 million pages of documents for $190USD) and can run on your own hardware.
Outputs structured data reliably. Because the model was extensively fine-tuned on structured outputs, it outputs structured outputs naturally.
LoRA (Low Rank Adaptation) model had a higher loss than a fully fine-tuned model.
Researcher’s note: The order of the outputs in the JSON generation schema helps the model to examine the whole page first before outputting specific information. For example, the schema starts with metadata outputs which require whole page examination.
Fine-tuned using Hugging Face Transformers (fine-tuning code + data is available).

olmOCR-example-output

Example input and output of olmOCR. The model even works for non-PDF style images with text and handles tables quite well.

See the code on GitHub, model on Hugging Face, blog post write up, read the paper, try demo.

Teapot LLM is a small (~800M parameters) model trained designed to run on low-resource devices such as CPUs and smartphones. Really cool training techniques here to customize a small model for specific purposes. The model was trained on synthetic data created by DeepSeekV3 and human verified on a single A100 GPU on Google Colab.
MoshiVis is a speech-vision model capable of discussing images with natural voice and language — MoshiVis adds the vision modality to Moshi (an already performant speech/text model) by adding a PaliGemma2 vision encoder and cross attention. The result is a model that is capable of conversationally interacting with images in real-time on local hardware such as a Mac Mini M4 Pro.
DeepSeek release DeepSeek-V3-0324 a base model with significant improvements over its predecessor DeepSeek-V3, notably outperforming GPT-4.5 and Claude-Sonnet-3.7 on several benchmarks. Available under MIT license.
Mistral release Mistral-Small-3.1, a 24B parameter model with vision capabilities. With a large context window of 128k, and native JSON outputting, it’s capable of local inference on devices such as RTX 4090 or a 32GB RAM MacBook after quantization. Performs incredibly well for its size and is available under Apache 2.0 license. Read the release blog post for more information.
Qwen release Qwen2.5-VL-32B, a VLM capable of extracting information out of images and text with incredible performance (similar to the larger Qwen2.5-VL-72B but with less than half the parameters). They also release Qwen2.5-Omni, a model which can process inputs across video, text, audio and images as well as output text and audio. So now you can use Qwen2.5-Omni to go from text to audio or image to audio or video to text + more. Read the blog post announcement for more details. Both models are available under Apache 2.0 license.

qwen omni

The Qwen2.5-Omni architecture which allows a model to interact with multiple modalities. The model is able to take in audio and produce audio as well as take in images and produce text/audio. Source: Qwen blog.

Hugging Face and IBM research release SmolDocling-256M, a small model focused on efficient information extraction from documents — At 256M parameters, this model shines in small VLM category. It has been trained to output a new format called “DocTags” which gives a clear structure to documents allowing them to be parsed easily for conversion. The ideal workflow is to go from a document or image to DocTags (or another format) and then to markdown. I tried running the MLX version on my MacBook Pro M1 Pro and it took about 7-8s per page with streaming output (I could probably improve the speed here but I just tried the default settings), see below for the results. The researchers found that it can perform at about 0.35s per page on an A100 GPU (though it depends how much is on the page). The paper also contains many good tidbits and details about how they trained the model, particularly around synthetic data creation. You can also try the demo online.

docling-ocr-workflow

Example document workflow with Docling which works on images and digital files such as PDFs. The model extracts the text as well as layout details which can easily be converted to markdown and displayed/further analysed.

Roboflow release RF-DETR, an Apache 2.0 real-time object detection model — YOLO-like models are often the most referenced when it comes to real-time object detection. However, the license of YOLO models can sometimes be prohibitive to developers. The good news is Roboflow’s RF-DETR performs on par or better than the best YOLO models in terms of both mAP (mean average precision) and speed and is available under Apache 2.0 meaning you can “do what you want” with the model. The model comes in two variants, a base variant with 28M parameters and a large variant (better performance but slower) with 128M parameters. There’s also a demo Google Colab notebook for fine-tuning RF-DETR on your own custom dataset.
Google Introduce Gemma 3 an open-source series of VLMs — Ranging from 1B parameters (text-only) to 27B parameters the Gemma 3 models perform on par with some of the best models in market, all whilst still being able to fit on a single GPU (albeit you’ll need a larger GPU with the 27B model). One of my favourite things is that the 12B and 27B models are on par with Gemini 1.5 Flash and Pro (see the Table 6 in the release paper) meaning you can now deploy close to you own version of Gemini locally. There is also a ShieldGemma-2 model which is designed to be a filter for undesired images (e.g. sexual, NSFW, violence) before they go into your model. Get the Gemma 3 models on Hugging Face, read the blog post, see the dedicated Gemma library on GitHub for fine-tuning and inference.
NVIDIA release Canary 1B Flash and Canary 180M Flash for super faster automatic speech recognition — If you need to transcribe lots of audio to text at 1000x real-time speed, you should checkout the latest models from NVIDIA. Ranking both in the current top 10 of the open ASR (Automatic Speech Recognition) leaderboard both models are available under Creative Commons license. Try out the demo on Hugging Face Spaces for yourself.
Starvector is an Apache 2.0 foundation model for generating SVG code from images and text — Input an image of an icon and get SVG code back. The StarVector models come in two variants: 1B and 8B parameters. You can try out the demo on Hugging Face as well as get the code on GitHub to run the models locally.
SpatialLM is an LLM which can process 3D point cloud data and generate structured 3D scene understanding outputs — Using a RGB (red, green, blue) video, a 3D point cloud is generated using MASt3R-SLAM, this 3D point cloud is then feed to an LLM (e.g. Llama-3-1B) to create structured outputs such as where the walls, doors and other objects are. See the website for more details, try the code for yourself, get the models on Hugging Face.

spatial-llm-gif

Example of SpatialLM outputs being visualized on a home walkthrough video. Video is sped up 5x for this post, see the original video on the SpatialLM website.

Releases and notables

Google releases Gemini 2.5 Pro in experimental mode, a model which performs significantly better than other flagship models across several benchmarks (though always take benchmarks with a grain of salt and test on your own use case) and Gemini 2.0 native image output enabling conversational image editing.
OpenAI announces they’ll release an open-weight language model in the coming months (you can be sure we’ll have this covered in a future issue of ML monthly!) and release GPT-4o native image generation which is capable of creating images which align very well with the input prompt, these images also maintain their styling throughout several prompt steps.

openai-want-to-release-an-open-model

Sam Altman Twitter announcing that OpenAI will soon release an open-weight language model.

See you next month!

What a massive month for the ML world in March!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.