58th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
- Complete A.I. Machine Learning and Data Science Bootcamp: Zero to Mastery
- TensorFlow for Deep Learning: Zero to Mastery
- PyTorch for Deep Learning: Zero to Mastery
- [NEW] Project: Build a custom text classifier and demo with Hugging Face Transformers
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Here's what you might have missed in October 2024 as an A.I. & Machine Learning Engineer... let's get you caught up!
My Work 👇
I’ve spent the past couple of weeks refining and updating the ZTM Data Science and Machine Learning materials for 2025.
The majority of these have been completed and you can see the 2025 updates discussion thread for more specific details as well as an updated course book containing all of the course materials.
From the Internet 🌎
Melanie Mitchell Investigates Whether LLMs Can Reason
LLMs are very capable.
But are they pattern matching (e.g. reproducing inputs seen in the training data) or are they performing true reasoning?
What is reasoning anyway?
Mitchell defines it as (bold is mine):
The word ‘reasoning’ is an umbrella term that includes abilities for deduction, induction, abduction, analogy, common sense, and other ‘rational’ or systematic methods for solving problems. Reasoning is often a process that involves composing multiple steps of inference.
Reasoning is typically thought to require abstraction—that is, the capacity to reason is not limited to a particular example, but is more general. If I can reason about addition, I can not only solve 23+37, but any addition problem that comes my way. If I learn to add in base 10 and also learn about other number bases, my reasoning abilities allow me to quickly learn to add in any other base.
Several papers have come out lately arguing for and against LLMs' ability to reason.
One paper from Apple showed that with only slight changes to a math word problem (e.g. swapping the number 7 for 9), several LLMs produced significantly worse results, in turn arguing against LLMs' ability to reason.
Read more on Melanie’s Substack.
Google Showcases Object-Augmented Reality for Real-World Interaction
Rather than trying to turn the whole world into a different one (e.g. the Metaverse), Google’s latest research project aims to enhance existing objects.
The system combines multiple models and technologies into a single pipeline:
- Detecting objects with vision (using a combination of MediaPipe and COCO dataset objects).
- Detected objects are localized and anchored in the position they were found using ARCore (Google’s framework for augmented reality).
- Objects are then enriched with information using a Multimodal Large Language Model (MLLM or Vision-LM) such as PaLI.
- Executable actions are displayed directly on the object and can produce object-specific outputs.
Google’s demonstration of an XR (X-Reality) workflow with objects as the centrepiece. Objects are recognized with object detection and grounded in 3D space using ARCore. The object is then enhanced with metadata thanks to a VLM and menu items are displayed inline with the object metadata. Source: Google AI Blog.
I like this kind of workflow. It brings useful and helpful intelligence right into an existing scenario.
It’s where I’d like to take Nutrify, a similar setup but focused on information for foods.
Stability AI releases Stable Diffusion 3.5 models under a permissive license
The OGs of the generative image game are back with a suite of open and permissive licensed models.
Stable Diffusion 3.5 comes in Large (8B parameters), Large-Turbo (distilled version of Large for faster inference) and Medium (2.6B parameters).
The models feature improved text generation (see image below) as well as better prompt adherence for longer prompts.
Get the models on Hugging Face and code on GitHub.

Generated image with prompt to Stable Diffusion 3.5 Large: A photo of Albert Einstein writing "machine learning is cool" on a chalk board in a old styled lecture theatre with wooden furniture, there is a bright red apple on the desk and carved in the apple's skin are the words, "the devil is in the details”.
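If you'd like to try the models locally, here's a minimal sketch, assuming a recent diffusers release with Stable Diffusion 3 support, a CUDA GPU with enough memory, and access to the gated weights on Hugging Face (the prompt and sampling settings are illustrative only):

```python
# A minimal sketch, assuming a recent diffusers release with Stable Diffusion 3
# support, a CUDA GPU and access to the gated weights on Hugging Face.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

image = pipe(
    prompt="A photo of Albert Einstein writing 'machine learning is cool' on a chalkboard",
    num_inference_steps=28,  # illustrative settings, tune to taste
    guidance_scale=3.5,
).images[0]

image.save("einstein_chalkboard.png")
```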
ExecuTorch (PyTorch inference engine for on-device models) hits Beta status
ExecuTorch helps you run PyTorch models on edge devices. This means running models like Segment Anything and Llama on mobile phones. The benefit of running models on device (rather than an API call) is that you can leverage the compute power you have with you and don’t have to transfer any data over the internet.
Meta’s latest lightweight Llama 3.2 models (Llama 3.2 1B and 3B) are capable of running on mobile devices thanks to ExecuTorch and there are several demos provided in the GitHub repository.
- See an example of exporting to ExecuTorch in the PyTorch documentation for running PyTorch models on device.
- See an example of what’s possible when you export Segment Anything and run it on device to perform image in-painting (removing an object and then filling the pixels it left).
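To get a feel for the export workflow, here's a minimal sketch based on the flow described in the PyTorch/ExecuTorch docs, assuming the executorch package is installed alongside a compatible PyTorch version (the tiny model is hypothetical):

```python
# A minimal sketch of the ExecuTorch export flow: torch.export -> edge dialect
# -> a .pte file the on-device runtime can load. The model here is a toy example.
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

model = TinyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

exported_program = torch.export.export(model, example_inputs)  # capture the graph
edge_program = to_edge(exported_program)                       # lower to edge dialect
executorch_program = edge_program.to_executorch()              # prepare for the runtime

with open("tiny_model.pte", "wb") as f:
    f.write(executorch_program.buffer)  # .pte file consumed by the ExecuTorch runtime
```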
Doug Turnbull shares helpful tips for improving search with LLMs rather than replacing it
Doug writes an incredible tech blog at softwaredoug.com. He also does search at Reddit.
I love when I stumble upon these kinds of people.
Real world knowledge + shares what they learn.
In the first article, The hidden dangers that kill search products, Doug shares:
Lately product companies focus on RAG (Retrieval Augmented Generation) and target solutions there. Yet we all need to appreciate that search, RAG, etc solution require as much customization as building traditional apps. There is no silver bullet, only hard work.
So true. It can be tempting to think that a newer technology such as RAG can solve all your problems. And it might help, but the reality is that it'll often take plenty of effort to get it to work really well (in the world of ML, demos are easy, products are hard).
In, Generative AI Augmented Retrieval (GAR), Doug shares timeless tips for all machine learning projects as well as newer techniques to use Generative AI to help improve existing systems:
- Use LLMs to enhance old school hacks & existing data (e.g. adding descriptions + tags).
- Look at your data over and over to figure out where the bugs are.
- Information retrieval research != your own dataset (find in-domain samples and use those for benchmarking).
- “Many teams can’t prototype” - as in, how quickly can you iterate on your experiments? Can you practice with 10s of queries before moving to 1000s?
Brett Young’s articles help you get reliable structured outputs from GPT-4o as well as fine-tune Phi-3 Vision on a custom dataset
Getting structured outputs from an LLM is not only one of my favourite use cases, it’s one of the most useful.
Structured outputs include JSON format, CSV (comma separated values) and more.
The benefit of having structured outputs from an LLM is that you can parse them into a database or show them in a specific interface.
In a recent article, Brett Young showcases how to use the response_format parameter in the OpenAI API to ensure that GPT-4o outputs structured data. He works through three examples: categorising machine learning research papers, turning restaurant menus into a dish database, and generating code from voice commands.
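As a flavour of what this looks like, here's a minimal sketch, assuming a recent openai Python package and an OPENAI_API_KEY set in your environment (the schema and prompt below are made up for illustration):

```python
# A minimal sketch of structured outputs with the OpenAI API: pass a Pydantic
# model as response_format so the output is guaranteed to match the schema.
from openai import OpenAI
from pydantic import BaseModel

class PaperInfo(BaseModel):
    title: str
    category: str  # e.g. "computer vision", "NLP", "reinforcement learning"

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the paper title and categorise it."},
        {"role": "user", "content": "Segment Anything: a promptable segmentation model..."},
    ],
    response_format=PaperInfo,  # output parsed directly into the Pydantic model
)

paper = completion.choices[0].message.parsed
print(paper.title, "->", paper.category)
```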
I also really liked the use of the Weights & Biases product Weave (a tool for helping track the performance of Generative AI models).
By using the @weave.op decorator, you can have all of your inputs and outputs of a function tracked (e.g. track the inputs and outputs of your LLM). This enables you to go back and examine what’s happening at each step of your program.
In another article, Young shares how to fine-tune Phi-3-Vision (an open-source small VLM from Microsoft) for a specific task based on a custom dataset.
With fine-tuning, the model improves substantially on a relatively small dataset of ~3000 samples containing images and their matching fashion metadata.
This time, weave.init() was used to track the project's metadata and @weave.op was used again to track the model's inputs and outputs.

Example of what happens when you use Weights & Biases Weave to track your generative model’s inputs and outputs. You can see that the model does a good job of predicting an aligned text output given the input image. Source: Brett Young blog on Weights & Biases.
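Here's a minimal sketch of the pattern described above, assuming the weave package is installed and you're logged in to Weights & Biases (the project name and function are hypothetical):

```python
# A minimal sketch of tracking a function's inputs/outputs with W&B Weave.
import weave

weave.init("ml-monthly-demo")  # hypothetical project name, logs will go here

@weave.op()  # every call's inputs and outputs get tracked in Weave
def classify_review(text: str) -> str:
    # Placeholder for an LLM call, so the example runs without an API key.
    return "positive" if "good" in text.lower() else "negative"

print(classify_review("This course is really good!"))
```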
How to improve your PyTorch image model inference speed by 8x by Dickson Neoh
Let’s say you find a large model that performs well.
So you decide to deploy it to your app.
However, you find that it takes far too long to make a prediction (e.g. 1 second per image).
Well, there are often a handful of tricks you can use to improve the performance of your model.
And that’s what Dickson Neoh’s latest article goes through.
From using a GPU instead of a CPU (immediate 10-20x speedup).
To converting your model to ONNX format (Open Neural Network Exchange).
To including the preprocessing steps in the ONNX format.
And finally to using TensorRT (NVIDIA’s framework for speeding up inference on NVIDIA GPUs).
Spoiler: Dickson combines all of these tricks to take one of the best performing models in the timm (Torch Image Models) library, eva02_large_patch14_448.mim_m38m_ft_in22k_in1k (90.05% accuracy on ImageNet) from running at 12.95 FPS on the GPU/0.63 FPS on the CPU to over 77 FPS on the GPU (8x improvement for GPU, 123x improvement for CPU).
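As an example of one of those steps, here's a minimal sketch of exporting a timm model to ONNX, assuming torch, timm and onnx are installed (the EVA-02 model below is the one from the article; it's several GB of weights, so swap in something like "resnet18" to experiment quickly):

```python
# A minimal sketch of exporting a timm image model to ONNX format.
import timm
import torch

model = timm.create_model(
    "eva02_large_patch14_448.mim_m38m_ft_in22k_in1k", pretrained=True
).eval()

dummy_input = torch.randn(1, 3, 448, 448)  # one 448x448 RGB image

torch.onnx.export(
    model,
    dummy_input,
    "eva02_large.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},  # variable batch size
)
```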
Open-source models and datasets
- Spawning releases an open-source dataset of 12.4M image-text pairs with the CC0 license (meaning they are free to use for all kinds of purposes, including training ML models).
- DocLayout-YOLO is a model capable of detecting document layout boxes such as titles, plain texts, figures and more.
- Moonshine is a pair of open-source ASR (automatic speech recognition) models which outperform equivalently sized Whisper (from OpenAI) models. Having speech recognition models this fast and small opens up a wide range of capabilities for the internet of things (IoT) world. Read the blog post from the founder of Useful Sensors (creators of Moonshine), Pete Warden (also a founding member of the TensorFlow team), for more.
- Speaking of ASR, Whisper Large V3 Turbo, a pruned version of Whisper Large V3 with 4 decoder layers instead of 32 (8x smaller), is available in Hugging Face Transformers (try the demo). This model performs speech-to-text transcription at around 25-30x realtime on an NVIDIA A100 GPU with a minimal hit to performance. See the GitHub discussion for more technical details.
- Reverb is another open-source ASR model which outperforms Whisper Large V3, especially on non-English languages. The model is trained on over 200k hours of expertly transcribed English audio, the largest amount ever used in an open-source model. Try the demo out on Hugging Face, get the code on GitHub.
- rerankers is a Python library which provides a unified API to many different reranking methods. A reranker is a model you generally use after an initial retrieval step to rerank the samples in a better order. For example: query → retrieval (top 100 documents) → reranker (reorder the top 100 to be better suited to the query) → output. Read the blog post on Answer AI for more about the library (a minimal usage sketch follows after this list).
- Rhymes AI releases Aria, an open-source VLM which utilizes a mixture of experts (MoE) architecture for efficient inference. The model equals or betters Gemini 1.5 Flash, GPT-4o-mini and Llama 3.2 11B. Get the model weights on Hugging Face and model code on GitHub.
- Pangea is a fully open (open weights and open data) 7B multi-modal language model covering 39 different languages from the team at Carnegie Mellon University. Very impressive that they open-sourced the instruction tuning dataset that they used too. See the instruction dataset, code on GitHub, demo on Hugging Face, paper on arXiv.
- OpenFlux.1 is a fine-tune of Flux Schnell (the Apache 2.0 version of Flux 1) that removes the distillation and enables further fine-tuning.
- Meta releases new open-source models and updates, including MEXMA for better sentence representation (e.g. create better sentence embeddings), SAM 2.1 for improved segmentation scores at the same speed as SAM 2.0, and quantized Llama 3.2 1B and 3B models for faster inference and reduced memory requirements (this enables these LLMs to run on mobile devices).
- The IBM Granite team release their series 3.0 LLMs ranging from 1B to 8B parameters. The 8B model outperforms Llama 3.1 8B on a number of benchmarks. All models are available under the Apache 2.0 license meaning they are available for commercial use. See the GitHub for a bunch of helpful tidbits about the model training and setup, read the paper, get the models on Hugging Face.
- D-FINE is a new group of DETR (Detection Transformer) based real-time object detection models which achieve state-of-the-art results for both inference time and performance on the COCO dataset, outperforming several versions of YOLO models. All versions of D-FINE are available under the Apache 2.0 license. Get the models/code on GitHub, read the blog post discussion, read the paper.

D-FINE real-time detection model Pareto curves showing outstanding performance at low latency and size. Source: D-FINE GitHub.
- Apple’s Depth Pro model enables sharp monocular depth estimation in less than a second. Read the paper, see the code on GitHub, try the demo on Hugging Face.

Example of using Apple’s Depth Pro model on a custom image. The model is able to go from RGB pixels to a depth map, with darker portions being further away and lighter portions being closer. Source: Hugging Face demo of Depth Pro.
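As mentioned in the rerankers item above, here's a minimal usage sketch, assuming the rerankers package is installed (the default "cross-encoder" model choice and the toy documents are for illustration only):

```python
# A minimal sketch of reranking retrieved documents with the rerankers library.
from rerankers import Reranker

ranker = Reranker("cross-encoder")  # loads a default cross-encoder reranking model

results = ranker.rank(
    query="how do I speed up PyTorch inference?",
    docs=[
        "Recipes for baking sourdough bread at home.",
        "Exporting PyTorch models to ONNX and TensorRT for faster inference.",
        "A history of the transformer architecture.",
    ],
    doc_ids=[0, 1, 2],
)

print(results)  # documents reordered by relevance to the query
```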
Research and papers 📰
- Apple’s CtrlSynth is a framework for generating synthetic images to help train vision-language models. Synthetic images created with CtrlSynth help to improve both downstream classification and text-to-image and image-to-text retrieval models. Read the paper.

Workflow for CtrlSynth to generate vision data based on LLM generated captions or existing alt-text captions. Source: CtrlSynth Paper.
- The team behind the MetaCLIP model (open-source replication of OpenAI’s CLIP) release Altogether: Image Captioning via Re-aligning Alt-text. Their method combines LLMs and image grounding to rewrite existing image alt-texts to keep original information whilst adding further enriched details.

Altogether model example of upscaling an existing alt-text into a more enriched caption over multiple rounds. Source: Altogether paper.
- Older models still matter! Researchers find that XGBoost models can beat GPT-4 models on text classification despite requiring an estimated ~3,000x less memory. This shows that if you have a specialised task (e.g. classifying news articles into different topics), it's definitely worth trying to fine-tune a smaller model to see if it works and only moving to larger-scale models if necessary (a toy sketch of this kind of setup follows after this list). I would've liked to have seen an encoder model such as BERT added to the mix, as I've found excellent results with these models (see the ZTM Hugging Face text classification project). Read the paper.

Results for XGBoost beating GPT-4 on text classification tasks (slightly) all the while taking 6 orders of magnitude less memory. Source.
- Researchers from Google release a technical paper detailing Magika: AI-Powered Content-Type Detection. The model has been available under Apache 2.0 on GitHub for a while, however, the paper discusses how it came about. Magika takes the bytes of a file and classifies them into a file type, for example, .jpeg, .html, .docx and more. This model is useful for malware detection in the sense that it helps verify that the file type a file claims to be matches what it actually is. Because the model has to run at scale (e.g. on every file going through Gmail), it was required to execute on a CPU with just 1MB of memory. Even with this restriction, the model performs at an F1 of 99% or more across over 200 file types. Get the model and example code on GitHub.

Magika architecture for classifying bytes of files into their different content types. Notice how the whole model comprises only a handful of different layers, including a one-hot encoding layer at the start; this design helps keep the memory requirements of the model low. Source: Magika paper.
- EVF-SAM (Early Vision-Language Fusion for Text-Prompted Segment Anything Model) shows how you can extend SAM with text-prompting capabilities. This enables you to segment something in an image using just text. For example, “the pizza on the left” in the image below. See the demo on Hugging Face.

EVF-SAM-2 inference example running on a custom image. Notice how only the prompted item ("pizza top left") is segmented in the image on the right. Source: EVF-SAM-2 demo.
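And the toy sketch promised in the XGBoost item above, assuming scikit-learn and xgboost are installed (the tiny dataset is made up purely for illustration; a real project would use thousands of samples): TF-IDF features feeding an XGBoost classifier, the kind of small, specialised setup the paper compares against GPT-4.

```python
# A toy sketch of a small, specialised text classifier: TF-IDF + XGBoost.
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

texts = [
    "stocks rally on strong quarterly earnings",
    "local team wins the championship final",
    "new smartphone with faster chip released",
    "markets fall on inflation fears",
    "striker scores twice in cup final",
    "chip maker unveils next-gen processor",
]
labels = [0, 1, 2, 0, 1, 2]  # 0 = business, 1 = sport, 2 = tech

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # sparse TF-IDF features

model = XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X, labels)

print(model.predict(vectorizer.transform(["quarterly profits beat expectations"])))
```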
See you next month!
What a massive month for the ML world in October!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.