AI & Machine Learning Monthly Newsletter

Daniel Bourke

58th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in October 2024 as an A.I. & Machine Learning Engineer... let's get you caught up!

My Work 👇

I’ve spent the past couple of weeks refining and updating the ZTM Data Science and Machine Learning materials for 2025.

The majority of these have been completed and you can see the 2025 updates discussion thread for more specific details as well as an updated course book containing all of the course materials.

Full Course Here.

From the Internet 🌎

Melanie Mitchell Investigates Whether LLMs Can Reason

LLMs are very capable.

But are they pattern matching (e.g. reproducing inputs seen in the training data) or are they performing true reasoning?

What is reasoning anyway?

Mitchell defines it as (bold is mine):

The word ‘reasoning’ is an umbrella term that includes abilities for deduction, induction, abduction, analogy, common sense, and other ‘rational’ or systematic methods for solving problems. Reasoning is often a process that involves composing multiple steps of inference.

Reasoning is typically thought to require abstraction—that is, the capacity to reason is not limited to a particular example, but is more general. If I can reason about addition, I can not only solve 23+37, but any addition problem that comes my way. If I learn to add in base 10 and also learn about other number bases, my reasoning abilities allow me to quickly learn to add in any other base.

Several papers have come out lately arguing for and against LLMs' ability to reason.

One paper from Apple showed that with only slight changes in a math word problem (e.g. swapping the number 7 for 9), several LLMs produced significantly worse results, which argues against LLMs' ability to reason.
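
To make that kind of test concrete, here's a minimal sketch. It assumes a made-up word-problem template (not one from the Apple paper) and treats the model as any function that takes a question and returns a number. The idea is to change only the numbers, recompute the ground truth and see whether accuracy holds up across variants.

```python
import random

# Hypothetical template for illustration only, not taken from the Apple paper.
TEMPLATE = (
    "Sam picks {a} apples on Monday and {b} apples on Tuesday. "
    "He gives away {c}. How many apples does Sam have left?"
)

def make_variants(n, seed=0):
    """Create n (question, answer) pairs that differ only in their numbers."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        a, b = rng.randint(5, 50), rng.randint(5, 50)
        c = rng.randint(1, a + b)  # never give away more than Sam has
        variants.append((TEMPLATE.format(a=a, b=b, c=c), a + b - c))
    return variants

def accuracy(ask_model, variants):
    """Fraction of variants where the model's answer matches the ground truth."""
    correct = sum(1 for question, answer in variants if ask_model(question) == answer)
    return correct / len(variants)

def perfect_solver(question):
    """Stand-in 'model' that reads the numbers and actually does the arithmetic."""
    a, b, c = [int(tok) for tok in question.replace(".", " ").split() if tok.isdigit()]
    return a + b - c

variants = make_variants(20)
print(accuracy(perfect_solver, variants))
```

A solver that really performs the arithmetic scores the same on every variant. Swapping `perfect_solver` for a real LLM call (an assumption on my part, any client would do) lets you check whether its accuracy drops when only the numbers change.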

Read more on Melanie’s Substack.

Google Showcases an Object-Augmented Reality for Real World Interaction

Rather than trying to turn the whole world into a different one (e.g. the Metaverse), Google’s latest research project aims to enhance existing objects.

The system combines multiple models and technologies into a single pipeline:

  1. Detecting objects with vision (using a combination of MediaPipe and COCO dataset objects).
  2. Detected objects are localized and anchored in the position they were found using ARCore (Google’s framework for augmented reality).
  3. Objects are then enriched with information using a Multimodal Large Language Model (MLLM or Vision-LM) such as PaLI.
  4. Executable actions are displayed directly on the object and can produce object-specific outputs.
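
The four steps above can be sketched as a simple chain of functions. This is only a rough outline of the data flow: every function and field name here is hypothetical, and each stage is a placeholder where a real system would call an object detector, ARCore and a multimodal LLM.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ARObject:
    label: str                       # class name from the detector
    box: tuple                       # 2D bounding box (x, y, width, height) in pixels
    anchor: Optional[tuple] = None   # 3D position once anchored in the scene
    info: str = ""                   # description added by the multimodal LLM
    actions: list = field(default_factory=list)  # menu items shown on the object

def detect_objects(frame):
    """Step 1 (placeholder): a real system would run an object detector here."""
    return [ARObject(label="cup", box=(40, 60, 80, 120))]

def anchor_object(obj, depth_m=1.0):
    """Step 2 (placeholder): pin the box centre at an assumed depth in metres."""
    x, y, w, h = obj.box
    obj.anchor = (x + w / 2, y + h / 2, depth_m)
    return obj

def enrich_object(obj):
    """Step 3 (placeholder): a real system would query a multimodal LLM here."""
    obj.info = f"This looks like a {obj.label}."
    return obj

def attach_actions(obj):
    """Step 4 (placeholder): build the menu items displayed on the object."""
    obj.actions = [f"Ask about this {obj.label}", f"Compare this {obj.label}"]
    return obj

def run_pipeline(frame):
    """Run all four stages in order for every detected object."""
    return [attach_actions(enrich_object(anchor_object(o))) for o in detect_objects(frame)]

for obj in run_pipeline(frame=None):
    print(obj)
```

Keeping each stage as its own function means you could swap the detector or the language model without touching the rest of the chain.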

Google’s demonstration of an XR (X-Reality) workflow with objects as the centrepiece. Objects are recognized with object detection and grounded in 3D space using ARCore. The object is then enhanced with metadata thanks to a VLM and menu items are displayed inline with the object metadata. Source: Google AI Blog.

I like this kind of workflow. It brings useful and helpful intelligence right into an existing scenario.

It’s where I’d like to take Nutrify, a similar setup but focused on information for foods.

Stability AI releases Stable Diffusion 3.5 models under a permissive license

The OGs of the generative image game are back with a suite of open, permissively licensed models.

Stable Diffusion 3.5 comes in Large (8B parameters), Large-Turbo (distilled version of Large for faster inference) and Medium (2.5B parameters).

The models feature improved text generation (see image below) as well as better prompt adherence for longer prompts.

Get the models on Hugging Face and code on GitHub.

Generated image with prompt to Stable Diffusion 3.5 Large: A photo of Albert Einstein writing "machine learning is cool" on a chalk board in a old styled lecture theatre with wooden furniture, there is a bright red apple on the desk and carved in the apple's skin are the words, "the devil is in the details”.

ExecuTorch (PyTorch inference engine for on-device models) hits Beta status

ExecuTorch helps you run PyTorch models on edge devices. This means running models like Segment Anything and Llama on mobile phones. The benefit of running models on device (rather than an API call) is that you can leverage the compute power you have with you and don’t have to transfer any data over the internet.

Meta’s latest lightweight Llama 3.2 models (Llama 3.2 1B and 3B) are capable of running on mobile devices thanks to ExecuTorch and there are several demos provided in the GitHub repository.

Doug Turnbull shares helpful tips for improving search with LLMs rather than replacing it

Doug writes an incredible tech blog at softwaredoug.com. He also does search at Reddit.

I love when I stumble upon these kinds of people.

Real world knowledge + shares what they learn.

In the first article, The hidden dangers that kill search products, Doug shares:

Lately product companies focus on RAG (Retrieval Augmented Generation) and target solutions there. Yet we all need to appreciate that search, RAG, etc solution require as much customization as building traditional apps. There is no silver bullet, only hard work.

So true. It can be tempting to think that a newer technology such as RAG can solve all your problems. And it might help, but the reality is that it’ll often take plenty of effort to get it to work really well (in the world of ML, demos are easy, products are hard).

In Generative AI Augmented Retrieval (GAR), Doug shares timeless tips for all machine learning projects as well as newer techniques to use Generative AI to help improve existing systems:

  • Use LLMs to enhance old school hacks & existing data (e.g. adding descriptions + tags).
  • Look at your data over and over to figure out where the bugs are.
  • Information retrieval research != your own dataset (find in-domain samples and use those for benchmarking).
  • “Many teams can’t prototype” - as in, how quickly can you iterate on your experiments? Can you practice with 10s of queries before moving to 1000s?
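
To show what practising with 10s of queries can look like, here's a minimal sketch of a tiny evaluation harness. The documents, queries and relevance labels are all made up for illustration, and the keyword search is a toy baseline rather than anything from Doug's articles.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def evaluate(search_fn, labelled_queries, k=3):
    """Mean recall@k over a small set of hand-labelled queries."""
    scores = [recall_at_k(search_fn(q), rel, k) for q, rel in labelled_queries.items()]
    return sum(scores) / len(scores)

# Made-up in-domain documents and labels.
DOCS = {
    "d1": "waterproof hiking boots for men",
    "d2": "lightweight trail running shoes",
    "d3": "leather office shoes",
    "d4": "insulated winter hiking jacket",
}

LABELLED = {
    "hiking boots": {"d1"},
    "hiking gear": {"d1", "d4"},
    "work footwear": {"d3"},
}

def keyword_search(query):
    """Toy baseline: rank documents by the number of words shared with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.split())), doc_id) for doc_id, text in DOCS.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked if score > 0]

print(evaluate(keyword_search, LABELLED, k=2))
```

The "work footwear" query finds nothing because no document shares a word with it. That's the kind of gap where LLM-generated descriptions or tags could help, and with a harness like this you can measure whether they do in seconds.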

Brett Young’s articles help you get reliable structured outputs from GPT-4o as well as fine-tune Phi-3 Vision on a custom dataset

Getting structured outputs from an LLM is not only one of my favourite use cases, it’s one of the most useful.

Structured outputs include JSON format, CSV (comma separated values) and more.

The benefit of having structured outputs from an LLM is that you can parse them into a database or show them in a specific interface.

In a recent article, Brett Young showcases how to use the response_format parameter in the OpenAI API to ensure that GPT-4o outputs structured data. He walks through three examples: categorising machine learning research papers, turning restaurant menus into a dish database and generating code from voice commands.
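
Here's a minimal sketch of the idea, not Brett's code. The `response_format` dictionary follows the JSON-schema shape the OpenAI API accepts for structured outputs, while the dish schema, the model reply and the table layout are all made up for illustration. No API call is made, the reply is hard-coded.

```python
import json
import sqlite3

# Shape of the `response_format` argument for JSON-schema structured outputs.
# The schema itself (a restaurant dish) is a made-up example.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "dish",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "vegetarian": {"type": "boolean"},
            },
            "required": ["name", "price", "vegetarian"],
            "additionalProperties": False,
        },
    },
}

# Hypothetical model reply. In practice this would be the message content
# returned by the API when called with the response_format above.
llm_reply = '{"name": "Margherita pizza", "price": 18.5, "vegetarian": true}'

def save_dish(conn, reply_text):
    """Parse the model's JSON reply and insert it as a row in the dishes table."""
    dish = json.loads(reply_text)
    required = response_format["json_schema"]["schema"]["required"]
    missing = [key for key in required if key not in dish]
    if missing:
        raise ValueError(f"Reply is missing fields: {missing}")
    conn.execute(
        "INSERT INTO dishes (name, price, vegetarian) VALUES (?, ?, ?)",
        (dish["name"], dish["price"], int(dish["vegetarian"])),
    )
    conn.commit()
    return dish

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dishes (name TEXT, price REAL, vegetarian INTEGER)")
save_dish(conn, llm_reply)
print(conn.execute("SELECT * FROM dishes").fetchall())
```

Because the reply is guaranteed to match the schema, the parsing step stays a plain `json.loads` with no regex cleanup needed.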

I also really liked the use of the Weights & Biases product Weave (a tool for helping track the performance of Generative AI models).

By using the @weave.op decorator, you can have all of your inputs and outputs of a function tracked (e.g. track the inputs and outputs of your LLM). This enables you to go back and examine what’s happening at each step of your program.
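
If you haven't used a tracking decorator before, the pattern is easy to picture with a plain Python stand-in. To be clear, this is not how Weave is implemented. The `track` decorator, `CALL_LOG` list and `fake_llm` function are all hypothetical, and a real tool would send the records to a dashboard rather than a list.

```python
import functools

CALL_LOG = []  # in-memory stand-in for a tracking backend

def track(fn):
    """Record the inputs and output of every call to `fn`."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        CALL_LOG.append(
            {"function": fn.__name__, "args": args, "kwargs": kwargs, "output": result}
        )
        return result
    return wrapper

@track
def fake_llm(prompt, temperature=0.0):
    """Hypothetical model call that just echoes the prompt in upper case."""
    return prompt.upper()

fake_llm("classify this paper", temperature=0.2)
print(CALL_LOG[-1])
```

The function body doesn't change at all, which is the appeal: you add one line above the function and get a record of every call.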

In another article, Young shares how to fine-tune Phi-3-Vision (an open-source small VLM from Microsoft) for a specific task based on a custom dataset.

With fine-tuning, the model improves substantially on a relatively small dataset of ~3000 samples containing images and their matching fashion metadata.

This time, weave.init() was used to track the project’s metadata and @weave.op was used again to track the model’s inputs and outputs.

Example of what happens when you use Weights & Biases Weave to track your generative model’s inputs and outputs. You can see that the model does a good job of predicting an aligned text output given the input image. Source: Brett Young blog on Weights & Biases.

How to improve your PyTorch image model inference speed by 8x by Dickson Neoh

Let’s say you find a large model that performs well.

So you decide to deploy it to your app.

However, you find that it takes far too long to make a prediction (e.g. 1 second per image).

Well, there are often a handful of tricks you can use to improve the performance of your model.

And that’s what Dickson Neoh’s latest article goes through.

From using a GPU instead of a CPU (immediate 10-20x speedup).

To converting your model to ONNX format (Open Neural Network Exchange).

To including the preprocessing steps in the ONNX format.

And finally to using TensorRT (NVIDIA’s framework for speeding up inference on NVIDIA GPUs).

Spoiler: Dickson combines all of these tricks to take one of the best performing models in the timm (Torch Image Models) library, eva02_large_patch14_448.mim_m38m_ft_in22k_in1k (90.05% accuracy on ImageNet) from running at 12.95 FPS on the GPU/0.63 FPS on the CPU to over 77 FPS on the GPU (8x improvement for GPU, 123x improvement for CPU).
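
Whichever tricks you try, you'll want a consistent way to measure FPS before and after. Here's a minimal sketch of a timing helper. It's not from Dickson's article, and the two "models" are hypothetical functions that just sleep, so swap in your own prediction function.

```python
import time

def measure_fps(predict_fn, sample, n_warmup=3, n_runs=20):
    """Time repeated predictions on one sample and return frames per second."""
    for _ in range(n_warmup):  # warm-up runs are excluded from the timing
        predict_fn(sample)
    start = time.perf_counter()
    for _ in range(n_runs):
        predict_fn(sample)
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

def slow_model(x):
    """Hypothetical baseline that takes about 20 ms per prediction."""
    time.sleep(0.02)
    return x

def fast_model(x):
    """Hypothetical optimised version that takes about 2 ms per prediction."""
    time.sleep(0.002)
    return x

baseline = measure_fps(slow_model, sample=None)
optimised = measure_fps(fast_model, sample=None)
print(f"{baseline:.1f} FPS -> {optimised:.1f} FPS ({optimised / baseline:.1f}x speedup)")
```

One caveat if you time a PyTorch model on a GPU: CUDA calls are asynchronous, so call `torch.cuda.synchronize()` before reading the clock or the numbers will look better than they are.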

Open-source models and datasets

D-FINE real-time detection model Pareto curves showing outstanding performance at low latency and size. Source: D-FINE GitHub.

Example of using Apple’s Depth Pro model on a custom image. The model is able to go from RGB pixels to a depth map, with darker portions being further away and lighter portions being closer. Source: Hugging Face demo of Depth Pro.

Research and papers 📰

  • Apple’s CtrlSynth is a framework for generating synthetic images to help with vision-language models. Synthetic images created with CtrlSynth help to improve both downstream classification and text-to-image and image-to-text retrieval models. Read the paper.

Workflow for CtrlSynth to generate vision data based on LLM generated captions or existing alt-text captions. Source: CtrlSynth Paper.

  • The team behind the MetaCLIP model (open-source replication of OpenAI’s CLIP) release Altogether: Image Captioning via Re-aligning Alt-text. Their method combines LLMs and image grounding to rewrite existing image alt-texts to keep original information whilst adding further enriched details.

Altogether model example of upscaling an existing alt-text into a more enriched caption over multiple rounds. Source: Altogether paper.

  • Older models still matter! Researchers find that XGBoost models can beat GPT-4 models on text classification despite requiring an estimated ~3000x less memory. This shows that if you have a specialised task (e.g. classifying news articles into different topics), it’s definitely worth trying to fine-tune a smaller model to see if it works and only going to larger-scale models if necessary. I would’ve liked to have seen an encoder model such as BERT added to the mix, as I’ve found excellent results with these models (see the ZTM Hugging Face text classification project). Read the paper.

Results for XGBoost beating GPT-4 on text classification tasks (slightly) all the while taking 6 orders of magnitude less memory. Source.

  • Researchers from Google release a technical paper detailing Magika: AI-Powered Content-Type Detection. The model has been available under Apache 2.0 on GitHub for a while; the paper discusses how it came about. Magika takes the bytes of a file and classifies it into a file type, for example, .jpeg, .html, .docx and more. This model is useful for malware detection because it helps verify that a file’s actual content matches the type it claims to be. Because the model has to run at scale (e.g. on every file going through Gmail), it was required to execute on a CPU with just 1MB of memory. Even with this restriction, the model performs at an F1 of 99% or more across over 200 file types. Get the model and example code on GitHub.

Magika architecture for classifying bytes of files into their different content types. Notice how the whole model comprises only a handful of different layers, including a one-hot encoding layer at the start. This design helps keep the memory requirements of the model low. Source: Magika paper.
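
To get a feel for the "does the file match what it claims to be" check, here's a minimal sketch using a handful of well-known file signatures. This is the old-school lookup-table approach, not Magika. Magika learns to classify content from the bytes with a neural network and covers far more file types, while the table below only knows a few. The function name and example files are made up.

```python
import os

# A few well-known file signatures ("magic bytes"), keyed by extension.
SIGNATURES = {
    ".jpeg": b"\xff\xd8\xff",
    ".jpg": b"\xff\xd8\xff",
    ".png": b"\x89PNG\r\n\x1a\n",
    ".pdf": b"%PDF-",
    ".docx": b"PK\x03\x04",  # .docx files are ZIP archives
}

def extension_matches_content(filename, content):
    """Return True or False if the signature is known, or None if we can't tell."""
    ext = os.path.splitext(filename)[1].lower()
    signature = SIGNATURES.get(ext)
    if signature is None:
        return None
    return content.startswith(signature)

# A genuine JPEG header, then a Windows executable ("MZ" header) posing as a PDF.
print(extension_matches_content("holiday.jpeg", b"\xff\xd8\xff\xe0rest-of-file"))
print(extension_matches_content("invoice.pdf", b"MZ\x90\x00rest-of-file"))
```

Text-based formats such as .html have no fixed signature, which is where a fixed table runs out of road and a learned classifier earns its keep.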

EVF-SAM-2 inference example running on a custom image. Notice how only the prompted item (“pizza top left”) is segmented in the image on the right. Source: EVF-SAM-2 demo.

See you next month!

What a massive month for the ML world in October!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.
