58th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
I’ve spent the past couple of weeks refining and updating the ZTM Data Science and Machine Learning materials for 2025.
The majority of these have been completed and you can see the 2025 updates discussion thread for more specific details as well as an updated course book containing all of the course materials.
LLMs are very capable.
But are they pattern matching (e.g. reproducing inputs seen in the training data) or are they performing true reasoning?
What is reasoning anyway?
Mitchell defines it as (bold is mine):
The word ‘reasoning’ is an umbrella term that includes abilities for deduction, induction, abduction, analogy, common sense, and other ‘rational’ or systematic methods for solving problems. Reasoning is often a process that involves composing multiple steps of inference.
Reasoning is typically thought to require abstraction—that is, the capacity to reason is not limited to a particular example, but is more general. If I can reason about addition, I can not only solve 23+37, but any addition problem that comes my way. If I learn to add in base 10 and also learn about other number bases, my reasoning abilities allow me to quickly learn to add in any other base.
Several papers have come out lately arguing for and against LLMs' ability to reason.
One paper from Apple showed that with only slight changes to a math word problem (e.g. swapping the number 7 for 9), several LLMs produced significantly worse results, which argues against LLMs' ability to reason.
Read more on Melanie’s Substack.
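To make the "swap the number 7 for 9" idea concrete, here's a minimal sketch of how you could generate perturbed versions of a math word problem. This is not the Apple paper's actual method, just an illustration of the concept. The `perturb_numbers` helper and the example problem are hypothetical.

```python
import random
import re


def perturb_numbers(problem: str, rng: random.Random, max_shift: int = 3) -> str:
    """Return a copy of `problem` with every integer nudged by a small non-zero amount."""

    def shift(match: re.Match) -> str:
        value = int(match.group())
        # Pick a non-zero shift so the number always has a chance to change.
        delta = rng.choice([d for d in range(-max_shift, max_shift + 1) if d != 0])
        # Keep values at 1 or above so the problem still makes sense.
        return str(max(1, value + delta))

    return re.sub(r"\d+", shift, problem)


# Hypothetical word problem (the wording stays fixed, only the numbers change).
template = "Sam picks 7 apples on Monday and 5 apples on Tuesday. How many apples does Sam have?"

rng = random.Random(0)  # Seeded so the variants are reproducible.
variants = [perturb_numbers(template, rng) for _ in range(3)]

for variant in variants:
    # For this simple addition problem, the ground truth is the sum of the numbers.
    expected = sum(int(n) for n in re.findall(r"\d+", variant))
    print(variant, "->", expected)
```

If a model gets the original right but stumbles on the variants, that hints it may be leaning on memorised surface patterns rather than a general procedure for addition.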
Rather than trying to turn the whole world into a different one (e.g. the Metaverse), Google’s latest research project aims to enhance existing objects.
The system combines multiple models and technologies into a single pipeline:
Google’s demonstration of an XR (X-Reality) workflow with objects as the centrepiece. Objects are recognized with object detection and grounded in 3D space using ARCore. The object is then enhanced with metadata thanks to a VLM and menu items are displayed inline with the object metadata. Source: Google AI Blog.
I like this kind of workflow. It brings useful and helpful intelligence right into an existing scenario.
It’s where I’d like to take Nutrify, a similar setup but focused on information for foods.
The OGs of the generative image game are back with a suite of open and permissively licensed models.
Stable Diffusion 3.5 comes in Large (8B parameters), Large-Turbo (distilled version of Large for faster inference) and Medium (2.6B parameters).
The models feature improved text generation (see image below) as well as better prompt adherence for longer prompts.
Get the models on Hugging Face and code on GitHub.
Generated image with prompt to Stable Diffusion 3.5 Large: A photo of Albert Einstein writing "machine learning is cool" on a chalk board in a old styled lecture theatre with wooden furniture, there is a bright red apple on the desk and carved in the apple's skin are the words, "the devil is in the details”.
ExecuTorch helps you run PyTorch models on edge devices. This means running models like Segment Anything and Llama on mobile phones. The benefit of running models on device (rather than an API call) is that you can leverage the compute power you have with you and don’t have to transfer any data over the internet.
Meta’s latest lightweight Llama 3.2 models (Llama 3.2 1B and 3B) are capable of running on mobile devices thanks to ExecuTorch and there are several demos in the GitHub repository provided.
Doug writes an incredible tech blog at softwaredoug.com. He also does search at Reddit.
I love when I stumble upon these kinds of people.
Real world knowledge + shares what they learn.
In the first article, The hidden dangers that kill search products, Doug shares:
Lately product companies focus on RAG (Retrieval Augmented Generation) and target solutions there. Yet we all need to appreciate that search, RAG, etc solution require as much customization as building traditional apps. There is no silver bullet, only hard work.
So true. It can be tempting to think that a newer technology such as RAG can solve all your problems. And it might help, but the reality is that it’ll often take plenty of effort to get it to work really well (in the world of ML, demos are easy, products are hard).
In Generative AI Augmented Retrieval (GAR), Doug shares timeless tips for all machine learning projects as well as newer techniques to use Generative AI to help improve existing systems:
Getting structured outputs from an LLM is not only one of my favourite use cases, it’s one of the most useful.
Structured outputs include JSON format, CSV (comma separated values) and more.
The benefit of having structured outputs from an LLM is that you can parse them into a database or show them in a specific interface.
In a recent article, Brett Young showcases how to use the response_format parameter in the OpenAI API to ensure that GPT-4o outputs structured data. He uses three examples: categorising machine learning research papers, turning restaurant menus into a dish database and generating code from voice commands.
I also really liked the use of the Weights & Biases product Weave (a tool for helping track the performance of Generative AI models).
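Here's a minimal sketch of the "parse it into a database" step, assuming the LLM has already returned a JSON string. The `llm_output` string, the `dishes` schema and the helper functions are all hypothetical (they're not taken from Brett's article) and no API call is made.

```python
import json
import sqlite3

# Hypothetical LLM response. In practice this string would come back from the model.
llm_output = """
{
  "dishes": [
    {"name": "Margherita Pizza", "price": 14.5, "vegetarian": true},
    {"name": "Beef Lasagne", "price": 18.0, "vegetarian": false}
  ]
}
"""

REQUIRED_KEYS = {"name", "price", "vegetarian"}


def parse_dishes(raw: str) -> list[dict]:
    """Parse the JSON string and check every dish has the fields we expect."""
    dishes = json.loads(raw)["dishes"]
    for dish in dishes:
        missing = REQUIRED_KEYS - dish.keys()
        if missing:
            raise ValueError(f"Dish is missing keys: {sorted(missing)}")
    return dishes


def save_dishes(conn: sqlite3.Connection, dishes: list[dict]) -> None:
    """Write the parsed dishes into a simple SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dishes (name TEXT, price REAL, vegetarian INTEGER)"
    )
    conn.executemany(
        "INSERT INTO dishes (name, price, vegetarian) VALUES (?, ?, ?)",
        [(d["name"], float(d["price"]), int(d["vegetarian"])) for d in dishes],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # In-memory database for the example.
dishes = parse_dishes(llm_output)
save_dishes(conn, dishes)

vegetarian_names = [
    row[0] for row in conn.execute("SELECT name FROM dishes WHERE vegetarian = 1")
]
print(vegetarian_names)
```

Because the output follows a known structure, the rest of the program can treat it like any other data source and query it with plain SQL.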
By using the @weave.op decorator, you can have all of your inputs and outputs of a function tracked (e.g. track the inputs and outputs of your LLM). This enables you to go back and examine what’s happening at each step of your program.
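To show the general idea of tracking a function's inputs and outputs with a decorator, here's a minimal stand-in written with only the standard library. It is not how Weave is implemented and it has none of Weave's UI or storage. The `track` decorator, `CALL_LOG` list and `fake_llm` function are hypothetical names for illustration.

```python
import functools
import time

CALL_LOG = []  # Hypothetical in-memory store (a real tool would send this to a server).


def track(fn):
    """Record the inputs, output and duration of every call to `fn`."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        CALL_LOG.append(
            {
                "function": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": output,
                "seconds": time.perf_counter() - start,
            }
        )
        return output

    return wrapper


@track
def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    return prompt.upper()


result = fake_llm("classify this paper")
print(CALL_LOG[0]["function"], CALL_LOG[0]["inputs"], CALL_LOG[0]["output"])
```

With the real library, you'd decorate your own function with `@weave.op` in much the same way and browse the tracked calls in the Weights & Biases interface instead of a Python list.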
In another article, Young shares how to fine-tune Phi-3-Vision (an open-source small VLM from Microsoft) for a specific task based on a custom dataset.
With fine-tuning, the model improves substantially on a relatively small dataset of ~3000 samples containing images and their matching fashion metadata.
This time, weave.init() was used to track the project's metadata and @weave.op was used again to track the model's inputs and outputs.
Example of what happens when you use Weights & Biases Weave to track your generative model’s inputs and outputs. You can see that the model does a good job of predicting an aligned text output given the input image. Source: Brett Young blog on Weights & Biases.
Let’s say you find a large model that performs well.
So you decide to deploy it to your app.
However, you find that it takes far too long to make a prediction (e.g. 1 second per image).
Well, there are often a handful of tricks you can use to improve the performance of your model.
And that’s what Dickson Neoh’s latest article goes through.
From using a GPU instead of a CPU (immediate 10-20x speedup).
To converting your model to ONNX format (Open Neural Network Exchange).
To including the preprocessing steps in the ONNX format.
And finally to using TensorRT (NVIDIA’s framework for speeding up inference on NVIDIA GPUs).
Spoiler: Dickson combines all of these tricks to take one of the best performing models in the timm (Torch Image Models) library, eva02_large_patch14_448.mim_m38m_ft_in22k_in1k (90.05% accuracy on ImageNet) from running at 12.95 FPS on the GPU/0.63 FPS on the CPU to over 77 FPS on the GPU (8x improvement for GPU, 123x improvement for CPU).
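Before trying any of these tricks, it helps to have a repeatable way to measure how fast your model is. Here's a minimal sketch of an FPS benchmark using only the standard library. The `measure_fps` helper and `dummy_predict` function are hypothetical stand-ins (swap in your own model's prediction function) and this isn't the benchmarking code from Dickson's article.

```python
import time


def measure_fps(predict, sample, warmup: int = 3, runs: int = 20) -> float:
    """Return average predictions per second for `predict` on `sample`."""
    # Warm-up calls so one-off setup costs don't skew the timing.
    for _ in range(warmup):
        predict(sample)

    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)
    elapsed = time.perf_counter() - start
    return runs / elapsed


def speedup(fps_before: float, fps_after: float) -> float:
    """How many times faster the new setup is compared to the old one."""
    return fps_after / fps_before


def dummy_predict(image):
    """Placeholder for a real model call (e.g. a PyTorch or ONNX model)."""
    return sum(image) % 1000


sample_image = list(range(10_000))  # Stand-in for image data.
fps = measure_fps(dummy_predict, sample_image)
print(f"{fps:.1f} FPS")

# Note: GPU work is often asynchronous, so when timing a GPU model you'd
# typically synchronize the device before reading the clock.
```

Running the same benchmark before and after each change (GPU, ONNX, TensorRT) and passing the two results to `speedup` tells you which trick is actually paying off on your hardware.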
D-FINE real-time detection model Pareto curves showing outstanding performance at low latency and size. Source: D-FINE GitHub.
Example of using Apple’s Depth Pro model on a custom image. The model is able to go from RGB pixels to a depth map with darker portions being further away and lighter portions being closer. Source: Hugging Face demo of Depth Pro.
Workflow for CtrlSynth to generate vision data based on LLM generated captions or existing alt-text captions. Source: CtrlSynth Paper.
Altogether model example of upscaling an existing alt-text into a more enriched caption over multiple rounds. Source: Altogether paper.
Results for XGBoost beating GPT-4 on text classification tasks (slightly) all the while taking 6 orders of magnitude less memory. Source.
Magika architecture for classifying bytes of files into their different content types. Notice how the whole model consists of only a handful of different layers, including a one-hot encoding layer at the start. This design helps keep the memory requirements of the model low. Source: Magika paper.
EVF-SAM-2 inference example running on a custom image. Notice how only the prompted item ("pizza top left") is segmented in the image on the right. Source: EVF-SAM-2 demo.
What a massive month for the ML world in October!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.