62nd issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, I've done my best to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Here's what you might have missed in February 2025 as an A.I. & Machine Learning Engineer... let's get you caught up!
My work
- A Guide to Bounding Box Formats and How to Draw Them — I’ve been working on an object detection project lately (see below) and one of the most confusing parts I’ve found is dealing with the different bounding box formats (e.g. XYXY, XYWH, CXCYWH). So I wrote a guide to all of the major formats, where they come from and how to draw them on images (see the small conversion sketch after this list for a taste).
- [Coming Soon] Project: Build a custom object detection model with Hugging Face Transformers — I’m working on a new ZTM project to build Trashify 🚮, a custom object detection model to incentivise picking up trash in a local area. The code is complete and I’m in the process of making supplementary materials (tutorial text, slides, videos). Stay tuned for the completed release!
- ML Model Memory Calculator — With the help of Claude, I built a simple ML model memory calculator to figure out how much GPU VRAM you’d need to run various-sized models at different levels of precision. For example, a 7B model requires about 13GB of memory at float16 precision or 6.5GB at int8 precision (see the sketch of the underlying arithmetic after this list).
- Beginner’s Guide To Embedded Machine Learning — Your smartwatch isn’t magic — it’s embedded ML! Come discover how AI runs on tiny devices and learn to build smarter, faster, real-world tech (with examples).
- Video version of ML Monthly January 2025 — If you like seeing video walkthroughs of these kinds of materials (videos tend to be better for demos), check out the video walkthrough of last month’s ML Monthly. The video walkthrough for this issue (February 2025) should be live a couple of days after the text version gets posted!
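Speaking of bounding box formats, here's a minimal sketch (not taken from the guide itself, just the standard coordinate conventions) of how the three common formats relate and convert between each other:

```python
# Three common bounding box formats (pixel coordinates):
#   XYXY   = (x_min, y_min, x_max, y_max)
#   XYWH   = (x_min, y_min, width, height)
#   CXCYWH = (center_x, center_y, width, height)

def xyxy_to_xywh(box):
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)

def xyxy_to_cxcywh(box):
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min)

box_xyxy = (50, 100, 250, 300)   # example box
print(xyxy_to_xywh(box_xyxy))    # (50, 100, 200, 200)
print(xyxy_to_cxcywh(box_xyxy))  # (150.0, 200.0, 200, 200)
```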
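And as for the memory calculator, the core arithmetic is roughly parameters × bytes per parameter. The calculator itself adds more nuance, this is just a back-of-the-envelope sketch:

```python
def model_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough estimate of memory (GiB) needed just to hold the model weights.

    bytes_per_param: 4 for float32, 2 for float16/bfloat16, 1 for int8, 0.5 for int4.
    Real-world usage also needs extra room for activations and the KV cache.
    """
    return num_params * bytes_per_param / 1024**3

print(f"7B @ float16: {model_memory_gb(7e9, 2):.1f} GB")  # ~13.0 GB
print(f"7B @ int8:    {model_memory_gb(7e9, 1):.1f} GB")  # ~6.5 GB
```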
From the Internet
Daniel’s open-source AI resources of the month
- Segment Anything with text using Sa2VA — Sa2VA from ByteDance combines SAM-2 (from Meta) and LLaVA (Large Language and Vision Assistant) enabling you to segment any object or item in an image or video with text. You can also get grounded information from the image or video input based on a segmented input. Trying out the models myself, I noticed they get incredible results with the right prompts (see image below). I also really liked the section in the paper where the authors detailed their automated data engine. Get the models and code on Hugging Face, read the paper.

Sa2VA is capable of segmenting items in images based on natural language, even able to use concepts such as “main subject” with no specific details. It’s also powered by a multi-stage automated data annotation pipeline which creates object-level, scene-level and video-level annotations. Source: Images by author and graphic from the Sa2VA paper.
- Google Release SigLIP2 — SigLIP1 (Sigmoid Language-Image Pretraining) is one of my most used computer vision models. It’s been trained on billions of image-text pairs so it can match images to text with incredible accuracy. SigLIP1 is the default vision encoder for many open-source VLMs (Vision-Language Models) and SigLIP2 is a drop-in replacement with improvements across the board. SigLIP2 comes in four size variants (86M, 303M, 400M and 1B parameters) for use on a wide range of devices. It also comes with a NaFlex variant which is capable of using native image resolutions (helpful for images such as documents which require high resolution for text legibility). Read the blog post on Hugging Face (includes links to many different model variants in Transformers and JAX), read the paper, try the demo and get the weights in OpenCLIP/timm. For a quick way to try this kind of image-text matching yourself, see the code sketch below the example image.

Example use case comparing SigLIP1 and SigLIP2 for matching an image to text. The more detailed or better matched the text is to the image, the higher the score. This is seen by the model giving the highest score to the text with the most specific details. Other, simpler texts still match the image but get a lower score. Both SigLIP variants score better-matching texts higher. Source: SigLIP2 demo + author’s own image.
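Here's a hedged sketch of that kind of image-text matching using the transformers zero-shot image classification pipeline. The checkpoint name, candidate texts and image path are assumptions (any SigLIP2 variant from the release collection should work):

```python
from transformers import pipeline

# Zero-shot image-text matching with a SigLIP2 checkpoint (checkpoint name assumed)
matcher = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)

candidate_texts = [
    "a photo of a plate of food",
    "a photo of a burger and chips on a wooden table",
    "a photo of a cat",
]

# Replace with your own image path or URL
results = matcher("my_food_photo.jpg", candidate_labels=candidate_texts)
for result in results:
    print(f"{result['score']:.4f} | {result['label']}")
```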
- Ovis2 lands in 6 different variants — Ovis stands for Open Vision. And the Ovis series are quickly becoming some of my favourite open-source VLMs. The Ovis2 models combine Apple’s AIMv2 image encoder and Qwen2.5 LLMs into a single model capable of handling text and images. For example, you can input an image and ask a question about it, or ask it to turn the image into structured data. The Ovis2 series comes in six sizes (1B, 2B, 4B, 8B, 16B and 34B parameters), all of which perform better than other VLMs in the same size category. The largest variant, Ovis2-34B, even performs on par with models twice its size. All are licensed under Apache 2.0. Get the code on GitHub, get the models on Hugging Face, try the demo (upload your own image + text questions).
- Microsoft launch Phi-4-Multimodal and Phi-4-mini — Mentioned in last month’s AI/ML Monthly, Phi-4 is a smaller LLM with a focus on high quality data inputs. However, Phi-4 focused on text-only inputs. This month, Microsoft is back with Phi-4-multimodal-instruct and Phi-4-mini. Phi-4-multimodal-instruct has 5.57B parameters and is built on Phi-4-mini with added speech and image input capabilities, whereas Phi-4-mini is text-only and has 3.84B parameters. Despite their small size (compared to some larger models), both models perform incredibly well on many benchmarks, with the audio capabilities of Phi-4-multimodal-instruct surpassing even specialist models such as OpenAI’s Whisper V3 on ASR (Automatic Speech Recognition) tasks. Another really cool feature of Phi-4-multimodal-instruct is that you can transcribe and translate in one prompt (see the code example below for transcribing an audio file I recorded myself; if you speak French, you can critique the output!). Phi-4 models are capable of handling 23 different languages. See the Phi-4 model collection on Hugging Face, read the blog post, read the paper, see an example vision-based fine-tuning script.
input_prompt = """Transcribe the audio to text, and then translate the audio to
French. Use <sep> as a separator between the original transcript
and the translation.
"""
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
audio, samplerate = sf.read(PATH_TO_AUDIO_FILE)
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
**inputs,
max_new_tokens=1000,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
Output:
>>> """"Hello, Daniel Bourke here. I am testing the new Phi 4
multimodal model from Microsoft, and it is capable of taking
in audio inputs and turning them into text. It's capable
of taking image inputs and turning or returning text from
the image input, and it is also capable of just
straight text to text. It's also quite small so it
can run on many devices locally, and it competes with
models that are much larger than it and are also
only available through a closed source API. So a really
cool open source release from Microsoft here."
<sep>
"Bonjour, je suis Daniel Bourke. Je teste le nouveau modèle
multimodal Phi 4 de Microsoft, et il est capable d'accepter
des entrées audio et de les transformer en texte. Il
est capable d'accepter des entrées d'image et de transformer
ou de retourner du texte à partir de l'entrée d'image,
et il est également capable de texte simple à texte.
Il est également assez petit pour qu'il puisse fonctionner
sur de nombreux appareils localement, et il rivalise avec
des modèles qui sont beaucoup plus grands que lui
et qui sont également uniquement disponibles à travers une
API de source fermée. Donc une sortie vraiment cool et
open source de Microsoft ici."
"""
- Opencompass OpenVLM leaderboard — There’s a plethora of fantastic open-source (and closed-source) VLMs coming out these days. One of the best ways to find out how various models are performing is to check out the OpenVLM leaderboard. You can sort by results across 31 different benchmarks, size of model (e.g. models under 10B parameters) and when they were evaluated (e.g. models in the last month). For example, the Ovis2 models mentioned above are currently ranked the highest for their model size.
- Hugging Face introduce SmolVLM2 — The smallest video language model ever made! SmolVLM2 can now handle video inputs and has better visual understanding. For example, you could input a video and ask the model to generate highlight chapters based on what’s in the frames. The model comes in three sizes: 2.2B, 500M and 256M parameters. Coming in such small sizes means the models are capable of running on-device (e.g. on an iPhone). See the release blog post, the collection of different SmolVLM2 models as well as the online demo with the 2.2B version.
- IBM Release Granite 3.2 models under Apache 2.0 — The Granite series of models are a group of LLMs with a focus on business use cases. Granite 3.2 comes in two sizes (8B and 2B parameters) and both models excel at document-related tasks as well as tasks which require multiple thinking steps. The Granite 3.2 Vision model is 2.9B parameters in size and has been trained specifically on visual document understanding tasks such as data extraction from tables, charts, infographics, plots, diagrams and more. All Granite models perform very favourably against other models in similar size brackets. The IBM Granite documentation is also a very high quality source for recipes and tutorials focused on enterprise use cases. Get the code on GitHub, models on Hugging Face and read the paper. A minimal usage sketch for the text models follows the architecture image below.

IBM Granite Vision’s model architecture follows the common VLM pattern: a vision encoder encodes the image and the result is combined with the encoded text input before both are passed to a language model to create an output. Source: Granite Vision paper.
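If you want to poke at the Granite text models, they load like any other chat LLM in transformers. A minimal sketch, assuming the `ibm-granite/granite-3.2-2b-instruct` checkpoint name and a document-extraction style prompt (both are assumptions, check the IBM Granite collection on Hugging Face for exact names):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name assumed, see the IBM Granite collection on Hugging Face for the full list
model_name = "ibm-granite/granite-3.2-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

# A document-style task: pull structured fields out of a short piece of text
messages = [
    {"role": "user", "content": "Extract the invoice number, total and due date from this text as JSON: "
                                "'Invoice #INV-2042, total $1,250.00, due 15 March 2025.'"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly generated tokens
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0, inputs.shape[1]:], skip_special_tokens=True))
```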
- Arc Institute release Evo2, a language model trained on 9.3 trillion DNA nucleotides — Evo2 is trained on the DNA of over 100,000 species across the entire tree of life. This means it can find patterns such as disease-causing mutations in various genomes, including human ones. Evo2 can handle sequences of up to 1 million nucleotides at once (single genes can vary from tens of nucleotides to millions in length). When predicting benign versus potentially pathogenic mutations in the BRCA1 gene (a gene associated with breast cancer), the model achieved over 90% accuracy. The model is released as a foundation model so it can be used as a base for building future, more specialised models on top of. Get the code on GitHub, read the paper.
- Hugging Face release Open Deep Research — Google and OpenAI have their own versions of Deep Research where you can put in a search query such as “Please find me all of the nutrition information for Dominos Australia” and get back a detailed report with links back to sources and results for that particular topic. The Hugging Face team decided to replicate this workflow with open-source tooling: the smolagents library (covered in last month’s ML Monthly) plus Qwen2.5-Coder-32B-Instruct to write code that takes steps to perform the research (see the image below, and a minimal code sketch after it). The demo agent failed on my request due to not being able to save a file locally. However, this could be fixed in a future custom agent workflow. See the code for Hugging Face’s Open Deep Research workflow on GitHub, try out the demo.

Example of using Hugging Face’s Open Deep Research demo to search a query. The Agent takes the query, plans the steps it can take and writes Python code to execute those steps. For example, the first step starts with a web_search and leads to finding a PDF which can be extracted for more information. Source: Hugging Face Open Deep Research demo.
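The full Open Deep Research stack adds browsing, file inspection and many more tools, but a minimal agent in the same spirit can be sketched with smolagents. The model choice and query here are assumptions, and the library API may have moved since this was written:

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# A code-writing agent with a web search tool, in the spirit of Open Deep Research
model = HfApiModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=model)

report = agent.run(
    "Find publicly available nutrition information for a large pizza chain in Australia "
    "and summarise it with links to your sources."
)
print(report)
```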
- Hugging Face release an open-source AI Agents course — To go along with the smolagents library, Hugging Face have created an AI Agents course designed to introduce Agents, explain how they work and guide you through creating your own Agent you can share on the Hugging Face Hub.
- Zyphra release Zonos-v0.1, an Apache 2.0 Text-to-Speech model — There are two variants: a 1.6B parameter Transformer-based model and a 1.6B hybrid model which is faster at generating longer sequences and has a lower time to first token. Text-to-speech models are hard to evaluate quantitatively, however, qualitatively these models sound sensational. They can also run above real-time (e.g. generating a stretch of speech in around the same time as, or less than, it takes to play back) on a single Nvidia RTX 4090. Get the models on Hugging Face and try the demo to produce your own text-to-speech examples.
- Ai2 Release OLMoE for iOS, a free-to-use on-device LLM — On-device AI is becoming more and more prevalent. And Ai2’s OLMoE app shows that. OLMoE stands for “Open Language Mixture of Experts”, and under the hood the iOS app runs allenai/OLMoE-1B-7B-0125-Instruct completely on device (no API calls or data storage other than on your own device). This means the model handles all of your requests using your iPhone’s (or iPad’s) local processor. It can even work with Aeroplane Mode switched on, as seen in the demo below. Ai2 have even open-sourced the app’s code so you can reproduce similar-style apps. Get the app on the App Store, get the app code on GitHub.

Example of the OLMoE iOS app running on-device. Notice the Aeroplane Mode setting is turned on, meaning the model is running offline. The device is an iPhone 15 Pro and the video is sped up 3x for brevity, however, real-life speed is well and truly fast enough for everyday use.
Case studies

Example of Finegrain’s box prompt segmentation model before and after the release. Source: Finegrain blog.
- The makers of the Zed code editor share how they created Zeta (an open-source code editing model) — Zeta is a fine-tuned version of Qwen2.5-Coder-7B for code prediction. In this case study the Zed team share how they created a customized code model which optimizes for speed without sacrificing performance. There’s good discussion about getting good results with supervised fine-tuning but even better results with reinforcement learning, specifically DPO (Direct Preference Optimization). There are also good points in the discussion on using tools such as Cloudflare Workers to route requests to the nearest data centre to improve latency.
- LMStudio introduces speculative decoding for faster local LLM inference — LMStudio is a tool for running local LLMs such as Llama 3.2, Mistral, Gemma and more on your own device. In their recent update they introduced speculative decoding, a technique for speeding up LLM generation by first creating a draft with a smaller model (e.g. Llama 1B) and then verifying that draft with a larger model (e.g. Llama 8B). This update makes LLMs running on-device up to 1.5x-2.5x faster. The blog post also has a good overview of how LMStudio leverages frameworks such as MLX (one of Apple’s frameworks for machine learning on Apple devices) to improve speeds on local devices. Bonus: For more on speculative decoding in practice, see Google’s blog posts on Speculative RAG and Looking back at speculative decoding.
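LMStudio handles this for you behind the scenes, but if you want to see the idea in code, Hugging Face transformers exposes a similar technique as assisted generation, where a small draft model proposes tokens and the larger model verifies them. A minimal sketch, assuming a Llama 3.1 8B / Llama 3.2 1B pair (any two models sharing a tokenizer should work):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Larger "target" model and smaller "draft" model from the same family (names are assumptions)
target_name = "meta-llama/Llama-3.1-8B-Instruct"
draft_name = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(target_name)
target_model = AutoModelForCausalLM.from_pretrained(target_name, torch_dtype=torch.bfloat16, device_map="auto")
draft_model = AutoModelForCausalLM.from_pretrained(draft_name, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt").to(target_model.device)

# assistant_model turns on assisted generation: the draft model proposes several tokens at a time
# and the target model accepts or rejects them, so the output matches what the target model alone would produce
outputs = target_model.generate(**inputs, assistant_model=draft_model, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```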
Releases

Example of using Gemini 2.0 to create more data. Input image is a real image of food, Gemini 2.0 Flash captions the image with detailed text and then Imagen 3 generates an image based on the generated text.
- Anthropic releases Claude 3.7 Sonnet with improved reasoning and coding capabilities. They also release Claude Code, an agent-like coding tool that lives in your terminal (see the video example).
- X.ai introduce Grok 3, the most performant LLM in the world (across the benchmarks they show in the blog post). What’s most impressive is that the X.ai team were formed ~18 months ago and in that time have created one of the best LLMs in the world.
- OpenAI release GPT-4.5, the most scaled-up version of GPT to date. One of its most important improvements is a lower rate of hallucination. Where GPT-4o hallucinates on 61.8% of the SimpleQA benchmark, GPT-4.5 hallucinates on 37.1%, close to a 2x reduction. According to OpenAI’s evaluations with human testers, the model also has a better “vibe”, as in, it sounds more human in various responses. One potential downside is that it’s 30x more expensive than GPT-4o on input tokens and 15x more expensive on output tokens.
- Ai2 release Scholar QA for searching across 8M weekly updated research papers with references and citations. You can ask a query and get back an answer with links back to the original research paper where the information came from. Code for the system will be open-sourced in the future.

Example of using Scholar QA to search for the query “How does magnesium assist in sleep”. This is 1 of 5 different sections breaking down information from the literature across 22 different papers. Source: Scholar QA webpage.
See you next month!
What a massive month for the ML world in February!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
www.mrdbourke.com | YouTube
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.