[July 2025] AI & Machine Learning Monthly Newsletter

Daniel Bourke

67th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I’m an A.I. & Machine Learning Engineer who also teaches beginner-friendly machine learning courses.

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in July 2025 as an A.I. & Machine Learning Engineer... let's get you caught up!

My work

ZTM Object Detection with Hugging Face Transformers Project

Code finished, notes finished, recordings finished!

~64 total videos are currently being edited for publishing to the ZTM platform.

Inside the project we’ll customize an object detection model (RT-DETRv2) to identify “bin”, “hand”, “trash” in images whilst building a model for Trashify 🚮, an app to help incentivise people to clean up their local areas.

This is an end-to-end project.

We’ll go from raw dataset to preprocessed data to custom model to shareable demo.
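
To give a feel for the starting point, here's a minimal sketch of loading a pretrained RT-DETRv2 checkpoint from the Hugging Face Hub and swapping in the Trashify classes. The checkpoint id and the fine-tuning workflow below are illustrative assumptions (and assume a recent version of transformers with RT-DETRv2 support), not the project code itself:

```python
from transformers import AutoImageProcessor, AutoModelForObjectDetection

# Assumed RT-DETRv2 checkpoint id -- check the Hub for the variant you want to start from.
checkpoint = "PekingU/rtdetr_v2_r18vd"

# The Trashify classes we want the detection head to predict.
id2label = {0: "bin", 1: "hand", 2: "trash"}
label2id = {label: idx for idx, label in id2label.items()}

image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForObjectDetection.from_pretrained(
    checkpoint,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # re-initialise the detection head for our 3 classes
)

# From here: preprocess a COCO-style dataset with `image_processor`,
# then fine-tune with the Hugging Face `Trainer` or a plain PyTorch loop.
```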

Stay tuned for the full project release on ZTM!


In the ZTM Object Detection project we’ll build Trashify, using various tools from the Hugging Face ecosystem including Datasets, Transformers, the Hugging Face Hub and Spaces. We’ll finish the project with a custom demo anyone can try out.

From the Internet

My Favourite Use Case for AI is Writing Logs by Vicky Boykis

PyCharm, a Python IDE by JetBrains, includes a small LLM for single-line code completion.

It takes into account the previous 384 characters of context and then provides a fast and locally running option to complete the line of code.

That’s it.

But what makes the product beautiful is how well it integrates with the IDE.

Rather than trying to write many lines of code, it focuses on writing one good line of code at a time.

JetBrains even shared a paper on how they created the model and shipped it.

They used a Llama2-like architecture with 100M parameters, trained on a curated dataset of 45GB of Python code, with a custom tokenizer that has a vocabulary size of 16,384.
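
For a sense of scale, here's a rough sketch of what a model in that ballpark looks like using the transformers library. The hyperparameters below are illustrative guesses chosen to land near 100M parameters, not JetBrains' actual configuration (see their paper for the real settings):

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative config in the ~100M parameter range (not the JetBrains settings).
config = LlamaConfig(
    vocab_size=16_384,         # matches the custom tokenizer vocabulary mentioned above
    hidden_size=768,
    intermediate_size=2_048,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=1_024,
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```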

My favourite takeaways echo Vicky’s:


Screenshot of requirements for the JetBrains code completion model to be able to run locally and in real-time. Source: Vicky Boykis blog.

Vicky uses the single line completion to help write log statements where it might’ve been annoying to do so otherwise:


Single-line code completion running locally based on the previous context. Source: Vicky Boykis blog.

Finally, I really like this quote:

In LLM land, there’s both a place for large, generalist models and there’s a place for small models, and while much of the rest of the world writes about the former, I’m excited to also find more applications built with the latter.

Cloudflare helps creators to block AI from scraping their sites

Most of the data for training LLMs comes from scraping the internet.

Some of this data is completely open-source.

But much of it isn't.

In response, Cloudflare released a series of blog posts and features to block AI bots, scrapers and crawlers with a single click.

They're also starting to release features to help creators get paid if an AI bot scrapes their website for content with a new "pay per crawl" offering.

Finally, there's an article which investigates how crawler traffic has changed over the past 12 months.

In short, AI crawler traffic is up a lot but so is general Google crawler traffic.

Personally, many of my searches now go through some form of LLM, but there are still times I go directly to Google for products, directions and short searches.

If you want to block AI traffic and scrapers from your website, check out Cloudflare's offerings.

Google DeepMind embeds the entire Earth

Global scale satellite embeddings.

That's what Google DeepMind's AlphaEarth Foundations model does.

It combines information from multiple sources: optical satellite images, radar, 3D laser mapping, climate simulations and more.

The final model was trained on over 3 billion individual frames from over 5 million locations.


AlphaEarth Foundations brief overview. Source: Google DeepMind blog.

It produces 64-dimensional geospatial embeddings which can be used with clustering and classifier models to extract targeted information.

Each embedding is produced at 10x10-meter pixel resolution, which, at satellite image scale, is quite fine-grained.

What else can you do with the embeddings?

From the Google Earth blog:

  • Similarity Search: You can pick a point anywhere on Earth — say, in a specific type of farmland or forest — and instantly find and map all other locations with similar surface and environmental conditions anywhere in the world.
  • Change Detection: By comparing the embedding vectors for the same pixel from different years, you can easily spot changes and track processes like urban sprawl, wildfire impacts and recovery, and fluctuating reservoir water levels.
  • Automatic Clustering: Without any pre-existing labels, you can use clustering algorithms to automatically group pixels into distinct categories. This spatio-temporal segmentation can reveal hidden patterns in the landscape, differentiating various types of forests, soils, or urban development.
  • Smarter Classification: You can create accurate maps with far less training data. For example, instead of needing tens of thousands of labeled points to map crop types with more conventional inputs, you might only need a few hundred per class, saving time and compute.

For more on AlphaEarth Foundations check out Google Earth's blog post on AI powered pixels, view and download the dataset on Earth Engine and read the AlphaEarth Foundations paper.
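
As a toy illustration of the similarity search idea above, here's what working with 64-dimensional embeddings looks like in plain NumPy. The embeddings below are random placeholders rather than real AlphaEarth Foundations outputs (those come via Earth Engine):

```python
import numpy as np

# Random stand-ins for 64-dimensional embeddings, one row per 10x10 m pixel.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100_000, 64)).astype(np.float32)

# Normalise rows so a dot product equals cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Pick a query pixel (e.g. a patch of farmland) and find the most similar pixels.
query = embeddings[0]
similarities = embeddings @ query
top_k = np.argsort(-similarities)[:10]
print("Most similar pixel indices:", top_k)
```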

Apple discuss how they built FastVLM

FastVLM (mentioned in ML Monthly May 2025) is a VLM that runs locally on your device (e.g. an iPhone).

In a technical blog post, Apple shares how the FastViT-HD vision encoder enables high-resolution image inputs whilst significantly reducing TTFT (time to first token).


FastVLM architecture using a FastViT-HD vision backbone as well as a text tokenizer which feeds inputs into a large language model. Source: Apple Machine Learning blog.

Google’s SensorLM trained on 60M hours of sensor data

SensorLM is a sensor-language found model.

60 million hours of multimodal sensor data across 103,000 individuals.

The sensor data was retrieved from Pixel Smartwatches and includes heart rate, steps, temperature and more.

Trained in two ways:

  • Contrastive Learning: The model learns to match a segment of sensor data with its corresponding text description from a set of options. This teaches it to discriminate between different activities and states (e.g., distinguishing a "light swim" from a "strength workout").
  • Generative Pre-training: The model learns to generate text captions directly from the sensor data. This equips it with the ability to produce rich, context-aware descriptions from understanding the high-dimensional sensor signals.
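
To make the contrastive objective concrete, here's a toy PyTorch sketch in the spirit of CLIP-style training: sensor-segment embeddings and text-description embeddings from the same pair are pulled together, mismatched pairs are pushed apart. The encoders, shapes and temperature are placeholders, not SensorLM's actual setup:

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings, as if produced by a sensor encoder and a text encoder.
batch_size, embed_dim = 8, 256
sensor_embeddings = F.normalize(torch.randn(batch_size, embed_dim), dim=-1)
text_embeddings = F.normalize(torch.randn(batch_size, embed_dim), dim=-1)

# Pairwise similarity matrix: entry (i, j) compares sensor segment i with caption j.
temperature = 0.07
logits = sensor_embeddings @ text_embeddings.T / temperature
targets = torch.arange(batch_size)  # the i-th segment matches the i-th caption

# Symmetric InfoNCE loss: pick the right caption for each segment and vice versa.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```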

This led SensorLM to significantly outperform generalist models such as Gemma 3 27B and Gemini 2.0 Flash on sensor-related prediction tasks, such as whether a person was doing a cardio or strength workout.


Thanks to its rich sensor-related training data, SensorLM significantly outperforms generalist LLMs on human activity recognition. Source: Google Research blog.

For more technical details, see the SensorLM paper.

Daniel’s Open-Source AI of the Month

Mistral release Voxtral, two public models for speech recognition, translation and understanding

Voxtral Mini has 3B parameters and is best suited for local deployment on smaller devices, whereas Voxtral Small has 24B parameters and is meant for production-level deployment.

Each model performs close to or better than state-of-the-art when compared on several benchmarks versus models such as GPT-4o mini Transcribe and Gemini 2.5 Flash.

A 32K context window means the models are able to handle up to 40 minutes of audio.
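
As a rough back-of-envelope (my own estimate, not a figure from the paper): 32,000 tokens spread over 40 minutes works out to roughly 800 tokens per minute, or about 13 audio tokens per second, which gives a feel for how heavily the audio encoder compresses the raw signal.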

Along with the extra audio capabilities, the models retain the text understanding of Mistral's text-only models.


Voxtral architecture couples an audio encoder with a language decoder to turn the audio into text. Source: Voxtral paper.

See the paper for technical details, and try out the demo on your own audio data.

Allen AI introduce FlexOlmo, a collaborative LLM training paradigm

FlexOlmo enables each participant to train an expert model on their own private data and then contribute that expert back to a larger combined model.

The data from each individual model can remain private whilst still providing benefits to the overall system.

Experiments show that the combined models (a mixture of experts) perform better than the individual experts.
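
To illustrate the mixture of experts idea in general (not FlexOlmo's specific merging recipe), here's a minimal PyTorch sketch where a router weights the outputs of several independently defined expert MLPs:

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    """A toy mixture-of-experts layer: a router softly weights each expert's output."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(x), dim=-1)                               # (batch, num_experts)
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=1)   # (batch, num_experts, dim)
        return (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)                    # weighted combination

moe = SimpleMoE(dim=64, num_experts=3)
print(moe(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```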


FlexOlmo enables the training of multiple separate expert models before fusing them together into a Mixture of Experts model. Source: Allen AI blog.

The FlexOlmo series of models are available for download on Hugging Face.

Franca is a collection of high-performing vision models trained on only public data

Open model, open code and open data: the variants are vision backbones trained only on public data (ImageNet-22k or LAION-600M) and are on par with or better than DINOv2.

Models are loadable via torch.hub (Hugging Face likely coming soon).
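
Here's a minimal sketch of what that loading step might look like. The repo path and entrypoint name below ("valeoai/Franca", "franca_vitb14") are assumptions for illustration, so check the Franca repository for the exact identifiers:

```python
import torch

# Assumed torch.hub repo and entrypoint -- verify against the Franca README.
model = torch.hub.load("valeoai/Franca", "franca_vitb14")
model.eval()

# Extract features for a dummy 3-channel 224x224 image batch.
with torch.no_grad():
    features = model(torch.randn(1, 3, 224, 224))
print(features.shape)
```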

See the paper for more technical details.


Example of the learned image features from Franca versus DINOv2. Franca clusters more cleanly around areas of importance and produces fewer fragments. Source: Franca research paper.

Hugging Face releases SmolLM3 with open training recipe, data and model weights

Alongside a highly performant model, Hugging Face released a high-quality blog post with a detailed training blueprint describing how SmolLM3-3B came to be.

All the way from architecture choices to data choices to training choices (pre-, mid- and post-training).

I liked the mentions of using synthetic supervised fine-tuning (SFT) data to get reasoning and non-reasoning models to work reliably.

SmolLM3-3B performs the best for its size on several benchmarks and even reaches close to the performance of larger models such as Qwen3-4B and Gemma-3-4B.
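
If you want to poke at the model yourself, here's a quick sketch using the transformers text-generation pipeline. The checkpoint id "HuggingFaceTB/SmolLM3-3B" is taken from the release, and the prompt is just an example:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM3-3B")

messages = [
    {"role": "user", "content": "Explain what a mixture of experts model is in two sentences."}
]
output = generator(messages, max_new_tokens=128)
print(output[0]["generated_text"])
```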


SmolLM3 blueprint and synthetic data setup for improving reasoning. See the SmolLM3 blog post for full resolution images.

Bonus: There’s a quantized version of the SmolLM3 model designed to run on a mobile device on the PyTorch Hugging Face organization (cool to see PyTorch creating its own repos!).

MedGemma goes multimodal

Google recently released MedGemma-27B, a medical-focused LLM.

At the time, the model could only handle text inputs.

However, their latest release introduces MedGemma-4B and MedGemma-27B multimodal.

This means the models can now take text and images as inputs.

For example, you could take a photo of a rash on your arm and send it along with questions to MedGemma-27B and see what it responds with (of course, remember the results may be incorrect).

To make the MedGemma text-only models multimodal, Google created a medical-focused version of SigLIP, MedSigLIP, which retains the base capabilities of SigLIP but adds capabilities for diverse medical imaging data such as chest X-rays, histopathology patches, dermatology images and fundus images (images of the rear of the eye).
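
Since MedSigLIP keeps the SigLIP interface, zero-shot classification along the lines of the sketch below should work via the transformers pipeline. The checkpoint id "google/medsiglip-448", the image path and the candidate labels are assumptions for illustration, so check the MedGemma collection on Hugging Face for the exact names:

```python
from transformers import pipeline

# Assumed MedSigLIP checkpoint id -- verify on the Hugging Face Hub.
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/medsiglip-448",
)

results = classifier(
    "chest_xray.png",  # hypothetical local image path
    candidate_labels=["normal chest X-ray", "chest X-ray with pleural effusion"],
)
print(results)
```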


Example MedGemma use cases. Because of the new image and text input modalities, you can now ask questions related to medical images. And because the models are open-source, they can be run offline in a private environment. Note: Always beware that models may make mistakes. Source: Google MedGemma blog.

Roboflow upgrades their already epic RF-DETR detection models

Roboflow releases upgraded RF-DETR checkpoints.

The new DEtection TRansformer models are best-in-class for their size for real-time detection performance on the COCO benchmark.

Released variants include Nano (30.5M parameters), Small (32.1M parameters) and Base (32.2M parameters).

Strange sizes compared to other models in their class (as in, they all seem to be quite close to each other) but I'm sure there's a reason for it.

Bonus: Results for these models and more can be seen on the Roboflow Computer Vision Model Leaderboard.

MM-GroundingDINO gets added to Hugging Face for easy zero-shot object detection

Several new zero-shot object detection models from the OpenMMLab team are now available on Hugging Face. These models are able to detect items in images given a text-based prompt such as "bird".

See the demo I created for LLMDet for an example of a similar performing model.
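
For reference, zero-shot object detection with a Grounding-DINO-style model in transformers looks roughly like the sketch below. The checkpoint shown is the earlier "IDEA-Research/grounding-dino-tiny"; I'm assuming the new MM-GroundingDINO checkpoints are a drop-in swap of the model id, so check the Hub for their exact names:

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

model_id = "IDEA-Research/grounding-dino-tiny"  # swap for an MM-GroundingDINO checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image of cats
image = Image.open(requests.get(url, stream=True).raw)

# Text prompts are lowercase phrases separated by full stops.
text = "a cat. a remote control."
inputs = processor(images=image, text=text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)
print(results)
```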

Meta adds 8 new variants of Perception Encoder models

Meta's new collection of Perception Encoders (vision/language models for computer vision and vision understanding) has been updated with 8x new sets of weights at varying sizes.

These models can be accessed through open_clip or the perception_models repositories.

Apple uses a pixel-level fallback network to expand LLM vocabulary

What do you do when your English-centric LLM runs out of tokens in the tokenizer to perform translation?

As in, if your model only has English-level tokens, how can you adapt it to unseen languages?

Apple tried three ways: expanding the tokenizer, byte-level encodings and pixel-level encodings.

The results?

Pixel-level encodings of unseen tokens turned out to perform best on several translation and classification benchmarks.

Pixel-level encodings plus an English-only BERT model even started to perform close to a multilingual trained BERT model.


Example of using a pixel-based (vision-based) model for capturing information where there are no tokens. Source: Apple machine learning blog.

From the Overcoming Vocabulary Constraints with Pixel-level Fallback paper:

The fallback network is designed to function similarly to a vocabulary lookup, providing non-contextual embeddings which the language model can later contextualize.

Specifically, we (1) pretokenize inputs into words, (2) encode words independently of one another, and (3) apply average pooling over the patch encodings corresponding to a word to obtain a single word-level representation yᵢ ∈ ℝᵈ.

Two key adjustments enable the efficient handling of multiple rendered words in a single forward pass: we concatenate the patches of individual words into a single sequence, resetting positional embeddings at each word boundary; and we restrict attention so that patches only attend to other patches within the same word.

A very cool and interesting way to augment an LLM's default vocabulary with vision-only encodings.
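
To make the pooling step concrete, here's a minimal PyTorch sketch of averaging patch encodings into word-level embeddings. It's purely illustrative (random patch features, no rendering, and it omits the within-word attention restriction described in the quote):

```python
import torch

# Stand-in patch encodings for a rendered text span: 10 patches, 512-dim each.
patch_embeddings = torch.randn(10, 512)
# Which word each patch belongs to (4 words in total).
word_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 3])

num_words = int(word_ids.max()) + 1
word_sums = torch.zeros(num_words, patch_embeddings.shape[1])
word_sums.index_add_(0, word_ids, patch_embeddings)              # sum patch encodings per word
counts = torch.bincount(word_ids, minlength=num_words).unsqueeze(1)
word_embeddings = word_sums / counts                             # average pool -> one vector per word

print(word_embeddings.shape)  # torch.Size([4, 512])
```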

Z.ai releases a series of open weight models

Text only:

  • GLM-4.5 is a 355 billion parameter model with 32 billion active parameters, on par with models such as Grok 4, OpenAI o3 and Claude 4 Opus.
  • GLM-4.5-Air is a 106 billion parameter model with 12 billion active parameters on par with OpenAI o4-mini, Claude 4 Sonnet and Gemini 2.5 Pro.

Read the technical blog post, see the demo.

Text and vision:

  • GLM-4.1V-9B-Thinking is a vision-language model with reasoning/thinking capabilities which lands it as the best performing model for its size while being competitive with much larger models such as Qwen2.5-VL-72B as well as GPT-4o. Try the demo, read the paper.

Really incredible to see the collection of world-class open weight models grow like this.

We now have at least half a dozen or so open-weight models which perform at GPT-4o level or better.

Most of these come from Chinese companies such as Qwen, DeepSeek and now Z.ai.

Qwen3 models get substantial updates and a dedicated coding model

Speaking of model updates...

The Qwen3 flagship models got a good boost for July 2025.

Notable updates include:

  • Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage.
  • Substantial gains in long-tail knowledge coverage across multiple languages.
  • Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
  • Enhanced capabilities in 256K long-context understanding.

They also released a 480B parameter (35B active) coding model, Qwen3-Coder-480B-A35B-Instruct.

The technical blog post shows the coding model performs close to Claude 4 Sonnet and outperforms other models such as GPT-4.1 and Kimi-K2 on SWE-Bench, a suite of tests for real-world software engineering problems.

rednote-hilab releases dots.ocr

dots.ocr is a 1.7B vision-language model capable of high-quality OCR (optical character recognition) that performs on par with or exceeds much larger models such as Gemini 2.5 Pro.

The model not only extracts text from documents, it also produces bounding boxes and reading order for the different items.

This makes for a highly flexible OCR input and output pipeline.

See the blog post for technical details, download the model on Hugging Face and try out the demo on your own documents.


Example of dots.ocr model parsing the front page of the Attention Is All You Need paper. From my inspection, the results on reading order and section breakdown are flawless. And as for text, I read through the OCR’d version and it seems to read exactly like the paper. If there are mistakes, it is my fault for missing them. Source: dots.ocr demo.

Releases

  • Google DeepMind releases the genai-processors library for getting data in and out of generative models. The library has several pre-built Processor parts for image, text and audio and allows you to create your own custom processors. I understand it so far as being similar to how a neural network stacks layers: future LLM applications may stack various forms of data via different processors. Read the release blog post for more.
  • OpenAI's ChatGPT study mode, a mode focused on helping students learn about a topic with exploratory teaching and question asking. One of my favourite insights is from Simon Willison investigating the system prompt for study mode. His quote:

I'm still fascinated by how much leverage AI labs like OpenAI and Anthropic get just from careful application of system prompts - in this case using them to create an entirely new feature of the platform. - Simon Willison

Videos


See you next month!

What a massive month for the ML & AI world in July!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.
