68th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I'm an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
ZTM object detection project with Hugging Face Transformers
The code and tutorial are done! And the videos are being edited! Stay tuned for them to go live on ZTM soon.
In progress: ZTM LLM fine-tuning with Hugging Face Transformers
I've begun work on the next tutorial/project for ZTM. Inside, we'll fine-tune a small LLM to do a specific task. That way you'll be able to run it on your own hardware without needing to call out to APIs.
A few vibe coding related pieces to begin.
Many of them are coloured by my own bias: vibe code when it doesn't matter or when you need a quick demo, write the code yourself when it does.
We already have a phrase for code that nobody understands: legacy code.
Bonus talk to go along with the article, The Role of the Human Brain in Programming.
An eloquent post discussing how more code was never really the problem.
Most code is thinking, reviewing and designing. All of those steps take time. And if they're done right, writing code becomes the easy part. Source: ordep.dev blog.
Understanding what the problem was and how to best solve it (a moving target) is the problem.
Having worked with ML models in production for a while now, I liked this line:
The goal isn't perfection: by definition you can't nor should aim for it. The goal is to manage the uncertainty.
How do you build a product where the outputs are probabilistic?
Tip: You try as best you can to tip the probability in your favor.
One of the levers to do so in machine learning is data.
Or in LLM world, it's having good context (more on that soon).
I also liked this passage about how an improvement in underlying model capabilities can change your whole product:
When Replit moved from Sonnet 3.5 to 3.7, Replit's President Michele had the company rewrite the entire product in less than 3 weeks. We called it Replit v2, but that was quite an understatement. In reality, it was a brand new product.
The architecture, data model, prompting technique, context management, streaming design… it was all new. 3.7 was highly agentic in an entirely novel way, Michele understood it, and decided to lean into it instead of trying to control it.
But of course…
Just because a new model comes out doesn't mean you have to switch to it.
Nor should you expect such large changes each time:
Every model update doesn't necessarily mean a complete rewrite every time, but it does force you to fundamentally rethink your assumptions each time, making a rewrite a perfectly plausible hypothesis.
You have to follow an empirical method, where the only valid assumption is "I don't know". Being an empiricist first is diametrically opposed to being an engineer first.
The motto I say in all of my courses rings true: experiment, experiment, experiment!
Context Engineering is a new term arising to add a bit more finesse to Prompt Engineering.
Where prompts can be throwaway things, context for an LLM is more of a carefully engineered substance.
Call it semantics.
But I like it.
Prompt to explore.
Context to get repeatable (as possible) workflows.
Drew's series of blog posts (every so often you stumble upon someone's blog and devour all of their recent posts, I live for these moments) had me enthralled this week.
I like his definition of context engineering:
Context engineering = systematically engineer contexts in pursuit of an outcome
And this table breaks it down well:
Comparing prompts versus contexts. Source: dbreunig.com
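To make the distinction concrete, here's a minimal sketch of my own (not from Drew's posts): a throwaway prompt is a single string, whereas an engineered context is assembled from explicit, versionable parts. The helper names below (build_context, retrieve_docs) are hypothetical.

```python
# A throwaway prompt: fine for exploring.
prompt = "Summarise last month's ML news in 3 bullet points."

# An engineered context: every component is explicit, testable and versionable.
def retrieve_docs(query: str, k: int = 3) -> list[str]:
    """Stand-in for a retrieval step (vector search, keyword search, etc.)."""
    return [f"[doc {i} relevant to: {query}]" for i in range(k)]

def build_context(system_instructions: str, query: str, history: list[str]) -> str:
    """Assemble a repeatable, inspectable context from its parts."""
    docs = retrieve_docs(query)
    return "\n\n".join([
        f"SYSTEM:\n{system_instructions}",
        "RETRIEVED DOCUMENTS:\n" + "\n".join(docs),
        "CONVERSATION HISTORY:\n" + "\n".join(history),
        f"USER QUERY:\n{query}",
    ])

context = build_context(
    system_instructions="You are a concise ML newsletter assistant. Cite sources.",
    query=prompt,
    history=["Earlier, the user asked for sources to be cited."],
)
print(context)
```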
For more on this, I'd recommend the following of Drew's posts:
Models are getting cheaper.
But theyāre getting used more.
Wayyyyyyy more.
A typical query to an LLM used to be:
Simple, right?
Now there's an unknown middle:
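A rough sketch of that unknown middle: an agent loop keeps calling the model (and tools) until it decides it's done, so the token bill for a single request is unbounded up front. The call_llm and run_tool functions below are hypothetical placeholders.

```python
# Old world: one prompt in, one response out -> a predictable token cost.
# response = call_llm(prompt)

# New world: an agent loop. Nobody knows up front how many model/tool calls
# (and therefore tokens) a single user request will end up consuming.
def run_agent(task: str, call_llm, run_tool, max_steps: int = 50):
    history = [task]
    total_tokens = 0
    for _ in range(max_steps):
        action, tokens_used = call_llm(history)  # every call costs tokens
        total_tokens += tokens_used
        if action["type"] == "final_answer":
            return action["content"], total_tokens
        history.append(run_tool(action))  # feed tool results back in and loop again
    return "stopped after max_steps", total_tokens
```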
Ethan writes an excellent article on how, because the models keep getting better and better, they're getting used more and more.
And the fixed pricing of $20/month isn't going to be enough to cover large-scale consumer usage.
If you charge per million tokens in an API, you're fine.
Price scales with usage.
But flat-fee pricing can only go so far…
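Back-of-the-envelope example of why (all prices and usage numbers below are made-up assumptions for illustration, not figures from Ethan's article):

```python
# Hypothetical numbers purely for illustration.
flat_fee_per_month = 20.0        # $/month subscription
api_price_per_million = 10.0     # assumed blended $ per 1M tokens

casual_tokens = 2_000_000        # ~2M tokens/month of chat
agentic_tokens = 500_000_000     # ~500M tokens/month of agents running all day

for name, tokens in [("casual", casual_tokens), ("agentic", agentic_tokens)]:
    cost_to_serve = tokens / 1_000_000 * api_price_per_million
    print(f"{name}: ~${cost_to_serve:,.0f} to serve vs ${flat_fee_per_month:.0f} collected")

# casual: ~$20 to serve vs $20 collected
# agentic: ~$5,000 to serve vs $20 collected
```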
I like this point on always just using the best model:
When a new model drops, if it's significantly better, you don't hold onto using the previous one. You use the new one. Source: Ethan Ding Substack.
Another takeaway I liked was the breakdown of optimizations Anthropic tried on their Claude Code product to save on cost/token usage:
Even with these, token usage still went off the charts (when you have no limits, people will figure this out). See viberank.app for an active Claude Code token usage leaderboard.
Are these people paying for the tokens?
Or have they found out that even at $200/month Claude Code is a bargain?
Who knows.
Mistral show how they took Pixtral-12B, an open-source vision language model, from 56% accuracy using prompting to 91% accuracy on 30-class satellite imagery classification via fine-tuning (see chart below).
The model started out with okay results on 30 classes.
But fine-tuning on 8000 training images really stepped things up a notch.
We see similar effects at Nutrify.
Off the shelf models start out with okay performance but dramatically improve with fine-tuning.
Note: The authors suggest that similar results could be achieved with a smaller, specialized vision model (I agree with this), but compared to traditional classification models, VLMs also open up more nuanced use cases, such as talking with images.
See the Mistral cookbook for code on how to fine-tune Pixtral-12B.
Accuracy comparison of Pixtral-12B using prompting for classification versus using fine-tuning. Source: Mistral blog.
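The Mistral cookbook has the real fine-tuning code. As a rough sketch of how the prompting baseline it's compared against could be measured (the classify_with_vlm helper is a hypothetical wrapper around whichever VLM you're using), you loop over labelled images, ask for one of the 30 class names and score exact matches:

```python
# Rough sketch of scoring a "classification by prompting" baseline for a VLM.
# classify_with_vlm(image_path, prompt) -> str is a hypothetical helper.

CLASSES = ["airport", "beach", "forest"]  # ...up to the full 30 satellite classes

def evaluate_prompting_baseline(samples, classify_with_vlm):
    """samples: list of (image_path, true_label) pairs."""
    prompt = (
        "Classify this satellite image as exactly one of: "
        + ", ".join(CLASSES)
        + ". Reply with the class name only."
    )
    correct = 0
    for image_path, true_label in samples:
        prediction = classify_with_vlm(image_path, prompt).strip().lower()
        correct += int(prediction == true_label.lower())
    return correct / len(samples)

# If this baseline lands around ~56%, fine-tuning on ~8,000 labelled images is
# the step that pushed Mistral's result to ~91% on the same task.
```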
Written in 2002 but still rings true today.
Joel compares the macroeconomics and microeconomics of software.
With the key takeaway being: Smart companies try to commoditize their products' complements.
Relevant to todayās world of AI when comparing open-source to proprietary models.
OpenAI derives most of their revenue from consumer subscriptions.
Whereas Anthropic gets most of their revenue from API usage.
So what did OpenAI do?
They open-sourced two powerful models, gpt-oss-120b and gpt-oss-20b (more on these below), not as good as Claude 4 but quite good.
And they lowered the prices of GPT-5 in the API.
In essence, OpenAI don't need to make money on the API since they make so much from the consumer app ChatGPT.
On the other hand, Anthropic relies on revenue from their API.
So by open-sourcing high quality models and lowering the price of their API for GPT-5, OpenAI is effectively commoditizing their products' complements.
When you start working with large datasets, uploading and downloading becomes quite a bottleneck.
For example, the image dataset I work with on Nutrify is about 100GB.
Not the largest.
But not the smallest either.
Hugging Face stores 21 Petabytes of data.
So uploads and downloads matter even more to them.
The good news is that with the new Parquet Content-Defined Chunking (CDC) feature available in PyArrow and Pandas, you can now perform efficient data operations on Hugging Face storage repos.
For example, instead of uploading and downloading a whole dataset each time, you can just work with the changes.
Say I make changes to 100 rows out of 100,000: I can upload and download just those 100 rows instead of the full 100,000!
It all starts with the new use_content_defined_chunking parameter:
# Pandas: write a DataFrame straight to a Hugging Face dataset repo
# (the hf:// filesystem requires the huggingface_hub library installed).
import pandas as pd
import pyarrow.parquet as pq

df = ...  # your pandas DataFrame
df.to_parquet(
    "hf://datasets/{user}/{repo}/path.parquet",
    use_content_defined_chunking=True,
)

# PyArrow: the same idea with a pyarrow Table.
table = ...  # your pyarrow Table
pq.write_table(
    table,
    "hf://datasets/{user}/{repo}/path.parquet",
    use_content_defined_chunking=True,
)
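And the payoff: once the file is written with CDC, re-writing it after a small change means only the chunks that actually changed need to be transferred. A minimal sketch of that round trip (same placeholder hf:// path as above; the label column is hypothetical):

```python
import pandas as pd

path = "hf://datasets/{user}/{repo}/path.parquet"

# Pull the dataset down, tweak a handful of rows...
df = pd.read_parquet(path)
df.loc[df.index[:100], "label"] = "corrected_label"  # hypothetical column/fix

# ...and write it back. With CDC, most chunks are byte-identical to what's
# already stored, so only the changed chunks get uploaded.
df.to_parquet(path, use_content_defined_chunking=True)
```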
See more use cases in the Hugging Face Parquet Content-Defined Chunking guide as well as the article *From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub.*
The Ovis (Open Vision) models are some of my favourite VLMs.
And the 2.5 upgrade introduces the following:
Both models use SigLIP-2 as the vision encoder.
And Qwen3-1.7B and Qwen3-8B as the language models.
The 9B version beats out GPT-4o on several benchmarks.
And the 2B version is best in class for its size.
One of my favourite new use cases is the grounding capabilities.
For example, the models are able to extract tables from images as well as detect objects in images based on a prompt.
Ovis2.5 models can extract text and detect objects based on prompt inputs. For example, you can instruct the model to "extract bounding boxes for all unique foods in the image". Source: Ovis2.5 paper and custom image.
The models are currently available in Hugging Face Transformers and will be in vLLM soon.
Perch 2.0 is a bio-acoustics foundation model.
It can create high-quality embeddings of animal and environmental sounds from the wild and classify them into 14,795 classes.
The model uses an EfficientNetB3 backbone and is trained on a large corpus of 1.5M soundbites from public sources.
Researchers can use the Perch 2.0 model backbone to generate embeddings on their own custom data and quickly build classifiers with a small number of examples.
Or you could take an existing recording, say of a bird singing, embed it with Perch 2.0 and then find similar sounds in a dataset.
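A rough sketch of both workflows, assuming a hypothetical embed_with_perch wrapper around the Kaggle model (everything else is plain NumPy/scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# embed_with_perch(clips) -> (n_clips, embedding_dim) array is a hypothetical
# wrapper around the Perch 2.0 model downloaded from Kaggle.

def few_shot_classifier(embed_with_perch, labelled_clips, labels):
    """Train a small classifier on top of frozen Perch embeddings."""
    X = embed_with_perch(labelled_clips)  # a handful of examples per class
    return LogisticRegression(max_iter=1000).fit(X, labels)

def find_similar(embed_with_perch, query_clip, library_clips, top_k=5):
    """Embed a query recording and retrieve the most similar sounds."""
    query = embed_with_perch([query_clip])[0]
    library = embed_with_perch(library_clips)
    # cosine similarity between the query and every clip in the library
    sims = library @ query / (np.linalg.norm(library, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:top_k]
```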
See the demo video on YouTube, read the paper, download the model from Kaggle.
DINOv3 is a self-supervised computer vision model trained on 1.7B curated public images from Instagram (Meta call this dataset LVD-1689M).
Using such a large dataset means the model learns incredibly high quality visual features.
DINOv3 short overview. Image data is from public images on Instagram, results show performance on classification tasks and the output features on the right show more defined characteristics than previous foundation model backbones. Source: DINOv3 paper.
The models come in two variants, ViT and ConvNeXt with various sizes.
Vision Transformer (ViT) models (distilled from the ViT-7B model):
ConvNeXt models (for efficient deployment):
DINOv3 backbones can be used as feature extractors and combined with different output heads (e.g. linear layers or clustering models) for various tasks.
For example, you could use DINOv3 models to embed your large image dataset and then use the embedding database for retrieval (pass in a target image and retrieve similar images from the database).
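A minimal sketch of that retrieval idea. The checkpoint id below is a placeholder and I'm assuming the DINOv3 checkpoints follow the same AutoImageProcessor/AutoModel pattern as DINOv2 in Hugging Face Transformers, so double-check the model cards:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

MODEL_ID = "facebook/dinov3-vit-base"  # placeholder id, check the DINOv3 collection

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed_images(image_paths):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    features = model(**inputs).last_hidden_state.mean(dim=1)  # mean-pool patch tokens
    return torch.nn.functional.normalize(features, dim=-1)

# Build the embedding "database" once, then retrieve by cosine similarity.
database = embed_images(["img_001.jpg", "img_002.jpg", "img_003.jpg"])
query = embed_images(["target.jpg"])
scores = query @ database.T  # cosine similarity (embeddings are L2-normalized)
print(scores.argsort(descending=True))  # most similar images first
```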
The training code and example use cases are available on GitHub, the models are on Hugging Face and the paper is on arXiv.
Gemma 3 270M is the latest addition to the Gemma series of models.
This is the perfect model to fine-tune for a specific task.
From the Google release notes:
Use cases for the Gemma 3 270M model are focused on high-volume, specific tasks. Because of the smaller model size, it can be run on lightweight systems and even on-device. This means you can experiment faster with task-specific fine-tunes. Source: Google Developers Blog.
The model has a large vocabulary of 256K different tokens. This means it can be used in environments where specific language or style matters.
Prior to quantization (making the model weights smaller), the model has a footprint of 536MB, which means it can fit on small devices such as mobile phones.
And after quantization it will likely be 2-3x smaller.
See the following resources for more:
The gemma-3-270m-it version is instruction-tuned, gemma-3-270m is the base model and the gemma-3-270m-qat-q4_0-unquantized models are ready to be quantized.

In the short time I've had hands-on with OpenAI's new gpt-oss models (gpt-oss-120b and gpt-oss-20b), they're good.
Most of the tasks I do with LLMs are structured data extraction/creation.
So I can attest to them being good there.
I had the 20B version running on my Mac Mini M4 Pro quite fast (50+ tokens/second) in LM Studio.
Example of running gpt-oss-20b locally on my Mac Mini M4. The model works quite fast and the output quality is great for small tasks (based on what I've tried so far). Source: Author created image.
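For the structured extraction use case, the simplest path I've found is pointing an OpenAI-compatible client at LM Studio's local server (the port and model name below are assumptions based on my setup, adjust for yours):

```python
# Structured data extraction with a local gpt-oss-20b served by LM Studio.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

text = "Order #1042: 2x flat white, 1x banana bread, pickup at 9:30am."

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # whatever name your local server exposes
    messages=[
        {"role": "system", "content": "Extract the order items as JSON with fields: item, quantity."},
        {"role": "user", "content": text},
    ],
)
print(response.choices[0].message.content)
```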
For a good technical breakdown of the models I'd recommend the introductory post on Hugging Face.
It's where I found the following paragraph:
The model weights are quantized in mxfp4 format, which was originally available on GPUs of the Hopper or Blackwell families, but now works on previous CUDA architectures (including Ada, Ampere, and Tesla).
Installing triton 3.4, together with the kernels library, makes it possible to download optimized mxfp4 kernels on first use, achieving large memory savings. With these components in place, you can run the 20B model on GPUs with 16 GB of RAM. This includes many consumer cards (3090, 4090, 5080) as well as Colab and Kaggle!
The tidbit here is the MXFP4 quantization.
This means even though the model says 20B parameters (usually ~48GB of GPU RAM required), it can be run on GPUs with ~16GB of RAM.
And similar with the 120B model.
Thanks to the quantization, it can also be run on a single 80GB GPU such as an NVIDIA H100.
It also means they run fast!
That's a big deal!
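Rough numbers on why (back-of-the-envelope only: MXFP4 stores 4-bit values plus a shared scale per small block, so call it roughly 4.25 bits per weight):

```python
# Back-of-the-envelope memory maths for gpt-oss-20b (~21B parameters).
params = 21e9

bf16_gb = params * 16 / 8 / 1e9     # 16 bits per weight -> ~42 GB of weights
mxfp4_gb = params * 4.25 / 8 / 1e9  # ~4.25 bits per weight -> ~11 GB of weights

print(f"bf16:  ~{bf16_gb:.0f} GB")   # too big for a single consumer GPU
print(f"mxfp4: ~{mxfp4_gb:.0f} GB")  # weights fit within a 16 GB card
```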
For a visual breakdown of the architecture, I'd recommend The Illustrated GPT-OSS by Jay Alammar.
The models come with three different levels of reasoning: low, medium (default) and high.
The fastest way to get hands-on with the models is at gpt-oss.com (a small web app running the models through Hugging Face's Inference Providers service).
Or you can also download them directly from Hugging Face, LM Studio or Ollama.
MolmoAct is an Action Reasoning Model (ARM), a model which combines vision, language, reasoning and action into one.
The model is able to take in visual perceptions and voice (natural language) commands and then use those for planning actions to take in the real world.
Given instructions, the model draws an action trace in 3D space and then a robot follows those lines. Source: MolmoAct blog.
The model weights and datasets are available online, see the demo video for more.
Two highly performant and open-source models.
Qwen-Image generates images with quality comparable to the best in class.
And Qwen-Image-Edit is capable of taking existing images and applying edits whilst retaining the original details.
canary-1b-v2 expands transcription and translation to 25 different languages (English + 24 European languages).

parakeet-tdt-0.6b-v3 expands support for speech transcription to 25 total languages (English + 24 European languages).

Both models provide timestamps, punctuation and capitalization.

NVIDIA-Nemotron-Nano-9B-v2 provides better performance and up to 6x faster throughput than Qwen3-8B (an equivalent model size) thanks to a hybrid architecture of Mamba and Transformer layers. See the blog post for more.

All models are available for commercial use.
What if you had a large number of email chains with various stakeholders and discussion topics and you wanted to extract structured data from them?
Or a history of call logs?
Or even a whole novel?
Thatās where LangExtract comes in.
You provide instructions, examples of extractions and a model to use and LangExtract handles the rest.
Example showing LangExtract instructed to extract food and drink entries from a passage of text. Source: Author created.
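A sketch of what that looks like in code, mirroring the food and drink example above. The argument names follow the pattern in the LangExtract README; treat the exact API as an assumption and check the repo:

```python
import langextract as lx

examples = [
    lx.data.ExampleData(
        text="I had a flat white and a croissant for breakfast.",
        extractions=[
            lx.data.Extraction(extraction_class="drink", extraction_text="flat white"),
            lx.data.Extraction(extraction_class="food", extraction_text="croissant"),
        ],
    )
]

result = lx.extract(
    text_or_documents="Dinner was grilled salmon, a side salad and a glass of pinot noir.",
    prompt_description="Extract every food and drink mentioned, using exact text spans.",
    examples=examples,
    model_id="gemini-2.5-flash",  # any model LangExtract supports
)

for extraction in result.extractions:
    print(extraction.extraction_class, "->", extraction.extraction_text)
```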
Get the code for LangExtract as well as an example of working with extremely long text on GitHub and see an example of a LangExtract demo on Hugging Face Spaces.
The InternVL3.5 family is a series of VLMs, all achieving close to state-of-the-art results on several VLM benchmarks for their respective sizes.
My favourite is the InternVL3.5-8B model (capable of running on a local GPU), which performs on par with Claude Sonnet 3.7.
See the InternVL3.5 paper for more.
MiniCPM-V-4.5 is an 8.7B parameter VLM which achieves efficient video inference as well as outstanding OCR results, outperforming models such as GPT-4o-latest on OCRBench.
There's even a demo of the model running locally on an iPad with an M4 chip.
Generating multi-speaker (up to 4 speakers) audio up to 90 minutes long is impressive.
I tried it out on a few smaller pieces of audio and it did… okay?
Perhaps I'm not using it right (most likely).
You can prompt the model with turn by turn examples, such as:
Speaker 0: Welcome to Machine Learning monthly!
Speaker 1: This is the video walkthrough of the text-based newsletter that covers the latest and greatest in the world of AI and machine learning.
Speaker 0: But not always the latest...
Speaker 1: It's been a big month so let's see what happened in August 2025.
And then you can choose which voice each speaker gets.
The audio that came out of the example above was about 5/10 for me.
More experimenting required…
To curate a subset of high quality samples, Google researchers first labelled a large set of data with LLMs and then iteratively worked through finding samples which had differing labels but overlapped. Source: Google Research blog.
Gemini 2.5 Flash Image went by the alias nano-banana before dropping. I like the alias. It makes you go, what could that be? And it's a bit easier to remember than gemini-2.5-flash-image-preview. Either way, this is the best image generation model I've tried. Even better than Imagen 4 (Google's other image generation model) in a handful of cases I've tried. The model is capable of editing as well, not just generation. Sam Witteveen has a great overview of it.

Gemini 2.5 Flash enables you to generate and edit images all in the same interface. It also does well at keeping consistency through time. For example, keeping the same subject throughout subsequent edits. Source: Author created via Google AI Studio.
What a massive month for the ML world in August!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.