66th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
ZTM Object Detection with Hugging Face Transformers Project
The code and notes have been completed!
Next up is the video course.
I’ve recorded 18 videos so far and the rest will be done before the next machine learning monthly.
If you prefer going through code and text, check out the notebook.
Inside the project we’ll customize an object detection model (RT-DETRv2) to identify “bin”, “hand”, “trash” in images whilst building a model for Trashify 🚮, an app to help incentivise people to clean up their local areas.
In the ZTM Object Detection project we’ll build Trashify, using various tools from the Hugging Face ecosystem including Datasets, Transformers, the Hugging Face Hub and Spaces. We’ll finish the project with a custom demo anyone can try out.
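If you'd like a taste of the kind of model we're customizing, here's a minimal zero-shot inference sketch using the transformers library (the checkpoint name, image file and threshold below are assumptions for illustration; the course notebook walks through the full fine-tuning workflow on the Trashify data):
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

# Load a pretrained RT-DETRv2 checkpoint (checkpoint name is an assumption, check the Hugging Face Hub)
checkpoint = "PekingU/rtdetr_v2_r50vd"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForObjectDetection.from_pretrained(checkpoint)

# Run inference on a single image (replace with your own, e.g. a photo containing a bin/hand/trash)
image = Image.open("trash_photo.jpg")
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert the raw outputs to boxes/labels/scores scaled to the original image size
results = image_processor.post_process_object_detection(
    outputs,
    target_sizes=torch.tensor([image.size[::-1]]),  # (height, width)
    threshold=0.5,
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{model.config.id2label[label.item()]}: {score:.3f} at {box.tolist()}")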
Has AI science progress accelerated over the past decade?
Or are more people just entering the field? It can be both. Jack Morris argues it's the latter, and that for more progress to happen, people should be more ambitious with their research.
This might result in fewer papers being published.
But it means the research that does get published will potentially be more impactful, since people have had longer to grapple with ideas.
I like the analogy of Einstein imagining he was riding a wave of light to help with his research.
Perhaps more people should pursue the ideas that genuinely interest them rather than only aiming to publish the next paper.
George Mandis hit the limits of GPT-4o's audio transcription function (a 25-minute limit, with a 40-minute video to transcribe), so he decided to extract the audio from the video and speed it up 3x.
It turns out it works.
Not only did it get around the limit, the results were still legible and the transcription cost less (since audio inputs are charged per input token, shorter sped-up audio means fewer tokens).
He found that 2-3x speed was the sweet spot but 4x was too fast.
Just like humans can comprehend audio at faster speeds, it seems models can too.
Costs for transcribing the same audio at different speeds. Source: George Mandis blog.
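Here's a rough sketch of the trick in Python, assuming ffmpeg is installed and using the OpenAI Python SDK (the file names and transcription model name are placeholders, not necessarily what George used):
import subprocess
from openai import OpenAI

# ffmpeg's atempo filter only accepts values between 0.5 and 2.0 per pass,
# so chain two filters to get 3x speed (1.5 * 2.0 = 3.0)
subprocess.run([
    "ffmpeg", "-i", "talk.mp4",            # input video (placeholder filename)
    "-vn",                                 # drop the video stream, keep the audio
    "-filter:a", "atempo=1.5,atempo=2.0",  # speed the audio up 3x
    "talk_3x.mp3",                         # sped-up audio output
], check=True)

# Transcribe the shorter (and therefore cheaper) sped-up audio
client = OpenAI()
with open("talk_3x.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # model name is an assumption, swap for your preferred one
        file=f,
    )
print(transcript.text)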
Using a simple but clever trick, Answer.AI converted text-based LLM benchmarks into reading-based VLM benchmarks by turning the text-based samples into images.
The result is ReadBench, a benchmark for seeing how well VLMs can read from images.
They tested several models such as Gemini-Flash-2.0, GPT-4o and Qwen2.5-VL.
Some interesting findings below.
Shorter inputs experience some degradation in line with task difficulty:
Performance degradation on short context seems to be somewhat correlated with task difficulty. MMLU-Redux is easier than MMLU-Pro which is easier than GPQA-Diamond, and we can see that models seem to be pretty decent across the board at extracting easy answers from images, but less so when things get tougher and require more reasoning.
Longer inputs (multiple pages) experience large performance degradation:
On longer inputs, all models experience very significant performance degradation, meaning that passing multiple pages to your Visual RAG pipeline is not yet a viable solution.
Resolution of the images doesn't seem to matter as much:
It turns out that resolution, for current VLMs, indeed matters very little: Gemini 2.0 Flash performs more or less exactly the same at 72 PPI as it does at 300 PPI.
Performance changes on various LLM benchmarks when testing reading text versus reading images. Source: Answer.AI blog.
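The core idea, rendering the text of a benchmark sample into an image and asking the VLM to answer from the image, is easy to reproduce. Here's a minimal sketch with Pillow (the font, wrapping and layout are arbitrary choices, not what Answer.AI used):
import textwrap
from PIL import Image, ImageDraw, ImageFont

def text_to_image(text, width_chars=80, padding=20):
    """Render a text sample onto a white image so a VLM has to 'read' it."""
    wrapped = textwrap.fill(text, width=width_chars)
    font = ImageFont.load_default()  # swap in a real TTF font to control size/PPI
    # Measure the wrapped text so the canvas fits it
    measurer = ImageDraw.Draw(Image.new("RGB", (1, 1)))
    bbox = measurer.multiline_textbbox((0, 0), wrapped, font=font)
    width, height = bbox[2] + 2 * padding, bbox[3] + 2 * padding
    image = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(image).multiline_text((padding, padding), wrapped, fill="black", font=font)
    return image

sample = "Question: Which planet is known as the Red Planet? Choices: (A) Venus (B) Mars (C) Jupiter"
text_to_image(sample).save("benchmark_sample.png")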
The image generation and editing series from Black Forest Labs expands with FLUX.1 Kontext models (available in dev, pro and max options).
Each of these excels at maintaining context ('Kontext' is German for 'context') across images as well as inserting and editing text. The model weights are available on Hugging Face as well as via several inference providers, and you can also try out the demo on your own images.
Bonus: Hugging Face released a guide on fine-tuning FLUX.1-dev (an open-weight image generation model) on consumer hardware, which enables you to get the image generation model to create images in your own style.
Example of the FLUX.1 Kontext [dev] model changing the text in a text-heavy image whilst maintaining the style. Input prompt: 'Change "AI & MACHINE LEARNING MONTHLY" to "ML IS COOL! MONTHLY"'
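For a quick local experiment, recent versions of diffusers include a pipeline for the open-weight [dev] model. A rough sketch (the pipeline class, checkpoint name and guidance value are assumptions, check the FLUX.1 Kontext [dev] model card for the exact usage):
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

# Pipeline class and checkpoint name are assumptions, check the model card
pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load an existing image and describe the edit whilst keeping the rest of the image intact
input_image = load_image("newsletter_cover.png")  # placeholder filename
edited_image = pipe(
    image=input_image,
    prompt='Change "AI & MACHINE LEARNING MONTHLY" to "ML IS COOL! MONTHLY"',
    guidance_scale=2.5,  # a guess, tune for your image
).images[0]
edited_image.save("edited_cover.png")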
Google's Gemma 3n series gets released in 2B and 4B effective parameter versions (the models are actually 5B and 8B parameters, however due to memory improvements they run with the footprint of 2B and 4B parameter models).
The Gemma 3n models are designed to be multimodal from the ground up, meaning they work with text, images, audio and video.
They're smaller models designed to run in constrained compute environments, such as directly on a desktop or mobile phone (rather than via an API call).
Hugging Face also released a series of Gemma cookbooks for examples of fine-tuning the models on your own data.
Read more of the technical details such as the audio encoder (Universal Speech Model) and new vision encoder (MobileNet-V5) in the Google Developer blog.
Finally, there's a Kaggle competition running for the best Gemma 3n offline use cases.
Gemma 3n’s architecture allows it to ingest images, text, audio and videos. Source: Google Developer blog.
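If you want to try Gemma 3n in Python, here's a minimal sketch using the transformers image-text-to-text pipeline (the model id and image URL are assumptions, check the Gemma 3n collection on Hugging Face for the exact names):
from transformers import pipeline

# Model id is an assumption, check the Gemma 3n collection on Hugging Face
pipe = pipeline(
    task="image-text-to-text",
    model="google/gemma-3n-E4B-it",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/bird.jpg"},  # placeholder image URL
            {"type": "text", "text": "What bird is in this image?"},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=128)
print(outputs[0]["generated_text"][-1]["content"])  # the assistant's reply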
The NuExtract 2.0 models from NuMind (available in 2B, 4B and 8B parameter versions using Qwen2-VL and Qwen2.5-VL models as the base) are capable of extracting structured data from an input (image/text) in a given format.
For example, you decide the data format schema you'd like, pass the schema and the input data (e.g. an image of an invoice) and then the model will output the extracted values in the desired schema.
The models have been trained to extract various different types of data such as string, date-time, number and null (if the desired data type isn't found).
See the example workflow below or try the model on your own data via the online demo.
Example workflow of using the NuExtract-2.0 model(s) to extract data from an image of an invoice in a desired structured format. The input structure can be tailored to the type of data to extract from. Source: NuExtract-2.0 demo and author created.
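To give a flavour of the workflow in code, here's a hedged sketch: you define a JSON template of the fields you want, pass it along with the image, and read back the filled-in values. The model id, field names and prompt format below are assumptions (the models are Qwen-VL based, but see the NuExtract-2.0 model card for the officially supported template format):
import json
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# The output schema you want the model to fill in (field names are hypothetical)
template = {
    "invoice_number": "string",
    "invoice_date": "date-time",
    "total_amount": "number",
    "supplier_name": "string",
}

# Model id and prompt format are assumptions, check the model card for the official usage
model_id = "numind/NuExtract-2.0-4B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

image = Image.open("invoice.png")  # placeholder filename
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": f"Extract the following fields as JSON: {json.dumps(template)}"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])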
Essential AI filtered and labelled 100TB of Common Crawl web data with document-level categories to enable SQL-like filtering for various document types.
For example, you could search for math-related documents using subject == math.
They used Qwen2.5-32B-Instruct to label 104M documents and then fine-tuned a small model, Qwen2.5-0.5B-Instruct, to reproduce those labels.
After training, the smaller model reproduces the larger model's outputs with a Cohen's Kappa (a score for agreement) within 3% of the larger model (high agreement) while performing inference 50x faster.
The trained smaller model, eai-distill-0.5b, is available for use on Hugging Face, a very cool example of training a smaller model to perform a specific task!
When training an LLM on data from the filtered corpus, it can reach results equivalent to or better than state-of-the-art on various benchmarks.
See the GitHub repo for the prompts used to label the initial corpus as well as the prompt used to fine-tune the smaller model.
Essential AI’s workflow for distilling Qwen2.5-32B to Qwen2.5-0.5B to get 50x faster inference with minimal (less than 3%) performance loss.
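To make the agreement metric above concrete, Cohen's Kappa can be computed directly with scikit-learn on the teacher and student labels (toy labels below, not Essential AI's data):
from sklearn.metrics import cohen_kappa_score

# Toy example: document-category labels from the large "teacher" model vs the distilled "student"
teacher_labels = ["math", "science", "web", "math", "code", "science", "math", "web"]
student_labels = ["math", "science", "web", "math", "code", "web", "math", "web"]

# Cohen's Kappa measures agreement while correcting for agreement expected by chance
kappa = cohen_kappa_score(teacher_labels, student_labels)
print(f"Cohen's Kappa: {kappa:.3f}")  # 1.0 = perfect agreement, 0.0 = chance-level agreement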
The Qwen3 rerank and embeddings models are based on the Qwen3 LLM series.
They come in 0.6B, 4B and 8B sizes and all perform at or near the top when compared with other similarly sized models.
Embedding models are used to represent text numerically for tasks such as semantic search and retrieval.
Reranking models reorder retrieved samples in an order that best suits the query.
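As a small example of the embedding side, here's a semantic search sketch using the sentence-transformers library and the 0.6B Qwen3 embedding model (the model id is taken from the Qwen collection on Hugging Face; the documents and query are made up):
from sentence_transformers import SentenceTransformer

# Load the smallest Qwen3 embedding model
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

documents = [
    "How to fine-tune an object detection model with Hugging Face Transformers.",
    "A recipe for slow-cooked lamb shoulder.",
    "Using rerankers to improve retrieval quality in a RAG pipeline.",
]
query = "improving search results for retrieval augmented generation"

# Embed the query and documents, then rank documents by similarity to the query
query_embedding = model.encode([query])
document_embeddings = model.encode(documents)
similarities = model.similarity(query_embedding, document_embeddings)  # shape: (1, num_documents)

for score, doc in sorted(zip(similarities[0].tolist(), documents), reverse=True):
    print(f"{score:.3f} | {doc}")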
BioCLIP 2 is a foundation model for organismal images.
It specializes in matching images of animals to their species name and can be used in a zero-shot setting.
The model outperforms other CLIP-style models such as OpenAI's CLIP and SigLIP.
Alongside the model comes a large dataset of organismal images, TreeOfLife-200M, comprising over 200 million images with extensive biology-related labels.
Try the online demo with your own animal images or see the project website for more.
Example of BioCLIP 2 running on an image of a Kookaburra taken by myself. It returns the right bird with a high prediction probability.
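Since it's a CLIP-style model, zero-shot classification follows the usual pattern of scoring an image against a list of candidate species names. A rough sketch using open_clip (the checkpoint id, image file and candidate labels are assumptions, check the BioCLIP 2 model card):
import torch
import open_clip
from PIL import Image

# Checkpoint id is an assumption, check the BioCLIP 2 model card on Hugging Face
model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:imageomics/bioclip-2")
tokenizer = open_clip.get_tokenizer("hf-hub:imageomics/bioclip-2")

# Candidate species names to score against the image (zero-shot, no extra training needed)
candidate_labels = ["laughing kookaburra", "sulphur-crested cockatoo", "australian magpie"]
image = preprocess(Image.open("kookaburra.jpg")).unsqueeze(0)  # placeholder filename
text = tokenizer(candidate_labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, prob in zip(candidate_labels, probs[0].tolist()):
    print(f"{label}: {prob:.3f}")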
V-JEPA 2 stands for Video Joint Embedding Predictive Architecture.
It is trained in a self-supervised way specifically for video understanding.
One of the main learning objectives for V-JEPA 2 is to predict "what happens next" given a series of input frames in a video.
This "what happens next" learning objective attempts to embed world understanding into the model so it can build an internal representation of many different scenarios which hopefully generalizes to unseen events.
They also released three new benchmarks: IntPhys2 (physics-based understanding), MVPBench (minimal video pairs, measuring small changes in video pairs) and CausalVQA (measuring the ability of models to answer questions related to physical cause-and-effect).
Read more on the V-JEPA 2 website, download the model weights from Hugging Face and see the code on GitHub.
The GLiNER-X collection is a series of named-entity extraction models capable of performing zero-shot entity extraction in 20 different languages.
The models come in small, base and large sizes.
With these models you can input a passage of text as well as a list of desired entities to extract and the model will extract those entities.
For example:
from gliner import GLiNER
# Create a model
model = GLiNER.from_pretrained("knowledgator/gliner-x-large")
# Create a passage of text to extract entities from
text = """A plate of steak, white rice, guindilla peppers, olives,
parmesan cheese, olive oil drizzle.
A pint of Guinness sits in the background with a glass of water next to it."""
# Create a series of target entities (can be almost anything)
labels = ["food item", "drink item"]
# Extract the entities with a given threshold (higher = only most likely entities extracted)
entities = model.predict_entities(text,
labels,
threshold=0.35)
# Print out the extracted entities
for entity in entities:
print_string = f'{entity["text"]} => {entity["label"]} ({round(entity["score"], 3)})'
print(print_string)
Output:
plate => food item (0.388)
steak => food item (0.755)
white rice => food item (0.726)
guindilla peppers => food item (0.657)
olives => food item (0.651)
parmesan cheese => food item (0.694)
olive oil => food item (0.606)
Guinness => drink item (0.723)
water => drink item (0.433)
OCR (Optical Character Recognition) models are going through a transformative period.
They are now able to recognize not only characters but full document structures.
The current practice is to use a strong base model such as Qwen2.5-VL-3B and fine-tune it specifically for document structure recognition and information extraction.
For example, the new Nanonets-OCR-s model (try the online demo, read the blog post) is capable of recognizing:
- LaTeX equations
- Images (it stores image descriptions between <img> tags)
- Signatures
- Watermarks (it stores the watermark in <watermark> tags)
- Checkboxes (including whether they are checked or not)
- Tables (by converting them to markdown/HTML formats)
There's also MonkeyOCR, a model that functions in a similar way but is especially good at Chinese and even outperforms models such as Gemini 2.5 Pro on OmniDocBench.
Example of the Nanonets-OCR-s model in action with an invoice image as input and then correctly extracting the data to an HTML table (the raw output on the right has been slightly edited for brevity). Notice the correct extraction of the checkbox fields. Source: Nanonets blog with author modifications for presentation.
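Since Nanonets-OCR-s is a fine-tuned Qwen2.5-VL-style model, running it locally looks like any other transformers vision-language model. A hedged sketch (the instruction prompt below is simplified, see the model card for the recommended prompt that triggers the tags and table formats above):
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "nanonets/Nanonets-OCR-s"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

image = Image.open("invoice.png")  # placeholder filename
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract the text from this document, converting any tables to HTML."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])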
Pass in an input tensor as well as a model and get a step-by-step interactive graph showing how the tensor moves through your model.
A trend is forming to combine understanding and generation into a single model (also see BAGEL from ByteDance).
These models can not only answer questions and describe images but also change and generate them.
Amongst the release are several sizes of dense and mixture of experts (MoE) models (from 0.3B to 424B parameters) with text-only and text-and-vision (VLM) versions.
All models are available under the Apache 2.0 license.
Read the release blog post, get the code on GitHub and read the paper.
Example of using Gemini 2.5 Flash natively in Google Colab as well as a screenshot of the privacy terms when using it.
Apple hosted their WWDC 2025 event at the start of June and shared a bunch of AI and machine learning specific videos as well as a blog post about updates to the foundation models which power Apple Intelligence.
Apple's foundation models might not be as powerful as others (according to benchmarks and human comparisons) but one advantage they do have is the ability to run directly on device with the new Foundation Models framework.
The Foundation Models framework opens up direct developer access to a 3B parameter foundation model that runs directly on device (there is a larger model that runs on servers but I don't think this is developer accessible).
This means inference is free and requires no internet connection (as a small app developer, this is a big win).
There's also a Foundation Model adapter toolkit which includes a Python training workflow to customize the functionality of the base 3B Foundation Model to your own use case.
As for the base models, the release blog post shares a treasure trove of details and tidbits about how they were created.
Apple’s foundational models training and deployment pipeline.
Some of my favourite takeaways below.
Using LLMs for data extraction:
We also incorporated large language models (LLMs) into our extraction pipeline, particularly for domain-specific documents, as they often outperformed traditional rule-based methods.
Model-based data filtering rather than aggressive heuristic-rules:
We've refined our data filtering pipeline by reducing our reliance on overly aggressive heuristic rules and incorporating more model-based filtering techniques.
A big step up in synthetic data:
We used synthetic image captioning data to provide richer descriptions. We developed an in-house image captioning model capable of providing high-quality captions at different levels of detail, ranging from key words to a paragraph-level comprehensive description, generating over 5B image-caption pairs that we used across the pre-training stages.
And for text-rich visuals:
To improve our models' text-rich visual understanding capabilities, we curated various sets of text-rich data, including PDFs, documents, manuscripts, infographics, tables and charts via licensed data, web crawling, and in-house synthesis. We then extracted the texts and generated both transcriptions and question-answer pairs from the image data.
For visual encoders:
And to enable visual perception, we trained both the on-device and server visual encoders using a CLIP-style contrastive loss to align 6B image-text pairs, resulting in an encoder with good visual grounding.
Adding more tokens to increase language representation:
In order to better support new languages during this stage (pre-training), we extended the text tokenizer from a vocabulary size of 100k to 150k, achieving representation quality for many additional languages with just 25% more tokens.
For tool use annotations (during Supervised Fine-tuning or SFT):
We designed a process-supervision annotation method, where annotators issued a query to a tool-use agent platform, returning the platform's entire trajectory, including the tool invocation details, corresponding execution responses and the final response. This allowed the annotator to inspect the model's predictions and correct errors, yielding a tree-structured dataset to use for teaching.
Benefits of applying RLHF (Reinforcement Learning from Human Feedback):
Our evaluations showed significant gains with RLHF for both human and auto benchmarks. And, while we introduced multilingual data in both the SFT and RLHF stages, we found that RLHF provided a significant lift over SFT, leading to a 16:9 win/loss rate in human evaluations.
On creating an on-device foundation model capable of efficient structured generation:
Next, an OS daemon employs highly optimized, complementary implementations of constrained decoding and speculative decoding to boost inference speed while providing strong guarantees that the model's output conforms to the expected format. Based on these guarantees, the framework is able to reliably create instances of Swift types from the model output.
For customizing the on-device 3B foundation model for special use cases (beyond prompting):
For specialized use cases that require teaching the ~3B model entirely new skills, we also provide a Python toolkit for training rank 32 adapters. Adapters produced by the toolkit are fully compatible with the Foundation Models framework. However, adapters must be retrained with each new version of the base model, so deploying one should be considered for advanced use cases after thoroughly exploring the capabilities of the base model.
On an example of creating and evaluating adaptor-based workflow:
In addition to evaluating the base model for generalist capabilities, feature-specific evaluation on adaptors is also performed. For example, consider the adaptor-based Visual Intelligence feature that creates a calendar event from an image of a flyer. An evaluation set of flyers was collected across a broad range of environmental settings, camera angles and other challenging scenarios.
For more hands-on examples and deep dives, check out the YouTube playlist of AI and machine learning related WWDC 2025 videos.
A great in-depth write up of how Anthropic built their Research feature for Claude.
Holy smokes. Another huuuuuuge month of releases for Google.
[Video] Andrej Karpathy on the next generation of software (Software 3.0, agents and more)
A sensational talk on the eras of Software 1.0 (traditional computer code), Software 2.0 (programming data to make neural nets) and the newer Software 3.0 (programming LLMs/neural nets with natural language prompts). All three blend into each other at different points.
I also loved Andrej's breakdown of designing LLM-powered applications:
Screenshot from Andrej Karpathy’s Software 3.0 talk of the different levers and components in a modern LLM app.
[Videos/Podcasts] Pivot to AI is a great YouTube channel I’ve been enjoying with straightforward analysis of different updates in the AI world (mostly how AI marketing differs from actual AI results).
[Videos] Nate B Jones is another YouTuber who posts straightforward ideas and takes on different AI releases and updates. I particularly liked the recent video discussing how many people might be using AI backwards.
What a massive month for the ML world in June!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.