60th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
- Complete A.I. Machine Learning and Data Science Bootcamp: Zero to Mastery
- TensorFlow for Deep Learning: Zero to Mastery
- PyTorch for Deep Learning: Zero to Mastery
- [NEW] Project: Build a custom text classifier and demo with Hugging Face Transformers
- [NEW] Machine Learning with Hugging Face Bootcamp
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Here's what you might have missed in December 2024 as an A.I. & Machine Learning Engineer... let's get you caught up!
My Work 👇
Hello hello!
Firstly, Merry Christmas and Happy New Year!
Secondly, last month, I entered a Kaggle Competition to showcase Google’s Gemini model capabilities (mentioned in last month’s AI/ML monthly).
Well…
I’m excited to announce my entry placed 3rd out of 5,682 entrants 🥉!
This is the highest I’ve ever placed in a Kaggle competition, so I’m quite stoked.
For my entry, I built KeepTrack: A system for keeping track of anything with video.
It enables you to go from video to structured database of items using a few Gemini API calls.
More specifically, I tracked every item in my house for ~$0.07.
For more, you can check out the video on YouTube or get the full code walkthrough on Kaggle.
On the whole, entering a Kaggle competition has been one of the funnest experiences I’ve had in machine learning. And I’d highly recommend anyone looking to practice ML give it a go.
From the Internet
1 — Answer.ai releases ModernBERT, an upgraded replacement for BERT
BERT (Bidirectional Encoder Representations from Transformers) models are some of the most popular models in the open-source world (with over 1 billion downloads on Hugging Face).
However, the original BERT-style models were released in 2018.
The good news is, there's an upgraded version for the modern era (hence ModernBERT).
The model performs in a Pareto-optimal way (faster and better) compared to other similar models and is a drop-in replacement for many existing BERT models.

ModernBERT performance versus speed, up and to the left is better. Source: Hugging Face blog.
For more, check out the following:
- Blog post write up about ModernBERT.
- ModernBERT documentation on Hugging Face.
- ModernBERT paper.
- Bonus: Check out the ZTM Hugging Face text classification project for an example of how to use BERT-like models for your own custom projects. In the project we use DistilBERT, but ModernBERT should be a drop-in replacement (see the quick sketch after this list).
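If you want a quick feel for ModernBERT, here's a minimal sketch using the Hugging Face transformers pipeline. It assumes a recent transformers version with ModernBERT support and uses the answerdotai/ModernBERT-base checkpoint from the release (treat the model ID and version requirement as things to double-check against the blog post above).

```python
# Minimal sketch: masked-word prediction with ModernBERT via the transformers pipeline.
# Assumes a recent transformers release with ModernBERT support (pip install -U transformers).
from transformers import pipeline

fill_mask = pipeline(
    task="fill-mask",
    model="answerdotai/ModernBERT-base",  # base ModernBERT checkpoint on Hugging Face
)

# ModernBERT uses the standard [MASK] token, just like the original BERT.
predictions = fill_mask("The capital of France is [MASK].")
for pred in predictions:
    print(pred["token_str"], round(pred["score"], 3))
```

For a classification project (like the ZTM one above), you'd swap the task for text classification and fine-tune on your own labels, the same way the project does with DistilBERT.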
2 — Anthropic shares an excellent and practical guide to building agents
AI-powered agents are easily one of the hottest topics in tech right now.
But do you need one?
Perhaps it’s better to start with a definition.
Workflows are systems where LLMs and other tools are orchestrated through predefined code paths.
Whereas agents are systems where LLMs take control and make decisions on which paths to traverse (the exact pathways are not predefined in advance).
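To make the distinction concrete, here's a rough Python sketch. The call_llm and run_tool functions are hypothetical placeholders (not any real API), purely to show the difference in control flow.

```python
# Hypothetical helpers for illustration only (not a real API).
def call_llm(prompt: str) -> str: ...
def run_tool(name: str, argument: str) -> str: ...

# Workflow: the code paths are fixed in advance, the LLM just fills in each step.
def summarise_and_translate(document: str) -> str:
    summary = call_llm(f"Summarise this document:\n{document}")
    return call_llm(f"Translate this summary to French:\n{summary}")

# Agent: the LLM decides which path (tool) to take next, in a loop, until it's done.
def agent(goal: str, max_steps: int = 10) -> str:
    context = goal
    for _ in range(max_steps):
        decision = call_llm(f"Goal: {context}\nWhich tool should I use next, or FINISH?")
        if decision.startswith("FINISH"):
            return call_llm(f"Write the final answer for: {context}")
        tool_name, argument = decision.split(":", maxsplit=1)
        context += "\n" + run_tool(tool_name, argument)
    return "Ran out of steps."
```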
My take: build a simple workflow first and if this doesn’t work, introduce more steps as needed.
Anthropic’s take is similar:

Anthropic’s guide to when and when not to use agents. Source: Anthropic blog.
I especially like the opening paragraph (bold is mine):
“Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.”
3 — The problem with reasoners by Aidan McLaughlin
New LLM models such as OpenAI’s o1 are being touted as “reasoning” models.
And they perform incredibly well on many different benchmarks including code and math.
However, Aidan McLaughlin has some good arguments on why reasoning models such as o1 which leverage reinforcement learning might not transfer well to domains without large amounts of traceable paths.

Excerpt from Aidan’s blog post on how o1 may work and why RL (reinforcement learning) may not be as helpful for domains which don’t have verifiably traceable paths to a solution (e.g. a winning chess move can be directly traced back to the exact steps it took to get there; real life often doesn’t come with these kinds of paths). Source: Aidan McLaughlin blog.
For example, with math problems, there are often many examples of steps to solve a problem, so the reinforcement learning model can find these steps and optimize to follow these steps for a future problem.
However, the real-world often lacks these specific steps towards solving a problem.
That’s not to take away from o1’s epic performance; it’s more to offer an insight into how the technique behind o1’s performance may not generalize to other domains.
4 — Write code with your Alphabet Radio on by Vicki Boykis
Vicki Boykis is one of my favourite voices of reason in the tech world.
And her recent essay on “whether LLMs can replace developers” only adds to that view.
While modern LLMs are incredible at writing code to get you started, they’re still not at the stage of echoing the advice of current developers and developers gone by (after all, LLMs are trained on code from actual people).
So like Vicki says, use LLMs to help, sure. But also be sure to “build your own context window”:
“Nothing is black and white. Code is not precious, nor the be-all end-all. The end goal is a functioning product. All code is eventually thrown away. LLMs help with some tasks, if you already know what you want to do and give you shortcuts. But they can’t help with this part. They can’t turn on the radio. We have to build our own context window and make our own playlist.”
5 — Two evergreen articles by Ethan Rosenthal
Every so often you stumble across a tech blog (or regular blog) and immediately consume almost all of the articles.
This is what I did with Ethan Rosenthal’s blog.
Two favourites:
- Data scientists work alone and that’s bad — An argument for how working with others often makes your own work better (just like an editor reviewing your essay helps your writing improve).
- Do you actually need a vector database? — Vector databases have gathered plenty of attention over the last year or two, so much so that you might think you need one. But depending on your problem and data size, NumPy arrays might suffice (see the quick sketch after this list). While the technologies mentioned in this post may change (e.g. newer data formats requiring actual databases), the thinking is something I’m a big fan of: start with the simplest thing that works and improve it if needed.
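As a quick illustration of the "simplest thing that works" idea, brute-force vector search over a few hundred thousand embeddings can be a handful of lines of NumPy, no database required (the data here is random, purely for shape):

```python
import numpy as np

# Pretend these are your document embeddings (e.g. 100k vectors of 384 dims).
rng = np.random.default_rng(42)
doc_embeddings = rng.normal(size=(100_000, 384)).astype(np.float32)
query = rng.normal(size=(384,)).astype(np.float32)

# Normalise so a dot product equals cosine similarity.
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Brute-force cosine similarity search: one matrix-vector product.
scores = doc_embeddings @ query
top_k = np.argsort(-scores)[:5]
print(top_k, scores[top_k])
```

When this gets too slow or your data no longer fits in memory, that's the signal to reach for something heavier.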
6 — Two ways Spotify uses Generative AI + LLMs
We’re now entering the era where generative AI models and LLMs are being used more and more in everyday scenarios.
In the following two case studies, Spotify shares how they use LLMs for generating podcast chapter boundaries and titles (e.g. similar to how some YouTube videos have timestamps for different sections) and how they use a fine-tuned version of Meta’s Llama model to create personalized explanations for various recommendations (e.g. “You will love this song because it’s got a hip and poppy vibe to it”).
- PODTILE: Facilitating Podcast Episode Browsing with Auto-generated Chapters — Fine-tuning LongT5 (an LLM with a 16k context window) for taking in a podcast transcript and outputting chapter timestamps and titles.
- Contextualized Recommendations Through Personalized Narratives using LLMs — Fine-tuning Meta’s Llama for crafting contextual recommendation explanations.

How Spotify goes from long podcast transcript texts to time stamps and chapter titles. Source: Spotify blog.
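Spotify fine-tuned their own model on their own data, but the general shape of the PODTILE approach (long transcript in, chapter titles and timestamps out) looks roughly like the sketch below using an off-the-shelf LongT5 checkpoint from Hugging Face. Treat it as an illustration of the pipeline, not Spotify's actual code.

```python
# Sketch only: a seq2seq model mapping a long transcript to chapter titles/timestamps.
# Spotify fine-tuned LongT5 on their own labelled data; this just shows the inference shape.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "google/long-t5-tglobal-base"  # LongT5 is built for long inputs
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

transcript = "..."  # a full podcast episode transcript would go here

inputs = tokenizer(transcript, return_tensors="pt", truncation=True, max_length=16384)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```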
7 — Google Releases Gemini 2.0 in Experimental Mode
Google’s flagship AI model, Gemini 2.0, is out in experimental mode!
Some quick takeaways:
- Gemini 2.0 Flash is now better than Gemini 1.5 Pro (Google’s previous best performing model) whilst being faster and cheaper.
- Gemini 2.0 can now perform better bounding box generation (improved spatial understanding).
- One of my favourites: Gemini 2.0 can now output across modalities, interleaving text and images together (e.g. you can now input an image with text instructions to edit the image and have an image naturally generated as output).

Gemini 2.0’s image + text input and native image output. Input an image, add some text instructions and get an image as output. Source: Google Developers YouTube Channel.
- The new Multimodal Live API also allows developers to build real-time multimodal applications with audio and video streaming inputs (e.g. imagine having an AI model perform inference over a live video stream).
See more in the Gemini 2.0 documentation.
Bonus: Check out ZTM's course on Building AI Apps with the Gemini API.
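Before moving on, here's a minimal sketch of calling Gemini 2.0 Flash from Python with an image plus text prompt. The experimental model name ("gemini-2.0-flash-exp") and SDK details are my assumptions based on the release and may change, so check the documentation linked above.

```python
# Minimal sketch of calling Gemini 2.0 Flash with the google-generativeai Python SDK.
# The experimental model name is an assumption at the time of writing; check the docs.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # replace with your own key
model = genai.GenerativeModel("gemini-2.0-flash-exp")

image = Image.open("bookshelf.jpg")  # any local image
response = model.generate_content(
    [image, "List the items you can see in this image as bullet points."]
)
print(response.text)
```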
8 — Google Releases Veo 2 (video generation) and Imagen 3 (image generation)
Alongside Gemini 2.0, Google have also released Veo 2 for state-of-the-art video generation and Imagen 3 for best-in-class image generation.
Imagen 3 is available in ImageFX and Veo 2 is available in VideoFX.
Images and videos created with Imagen 3 and Veo 2 are watermarked with SynthID (an invisible identifier) to help ensure they can be discerned from real images/videos.
An excellent potential workflow for creating synthetic images for your own machine learning models could be to input a real image, caption it with Gemini 2.0 and then use Imagen 3 to generate a synthetic version of it (effectively increasing the amount of data you have).

Example workflow of inputting a real image, captioning it with Gemini and then generating a synthetic version of it with Imagen 3. This strategy could be used to enhance a computer vision pipeline with synthetic examples. Source: Animation created by the author, food platter designed and created by my fiancé.
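A rough sketch of that caption-then-regenerate loop is below. The captioning call reuses the same SDK as the Gemini sketch above (model name assumed), while the Imagen 3 step is left as a clearly hypothetical placeholder since the generation API/ImageFX surface is something you'd wire up from Google's current docs.

```python
# Sketch: caption a real image with Gemini, then reuse the caption as an image-generation prompt.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
captioner = genai.GenerativeModel("gemini-2.0-flash-exp")  # assumed experimental model name

real_image = Image.open("food_platter.jpg")
caption = captioner.generate_content(
    [real_image, "Describe this image in one detailed sentence for an image generator."]
).text

def generate_with_imagen(prompt: str) -> bytes:
    """Hypothetical placeholder: swap in your Imagen 3 / ImageFX generation call."""
    raise NotImplementedError

synthetic_image = generate_with_imagen(caption)  # a synthetic variant of the original
```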
9 — OpenAI 12 days of Christmas
OpenAI released a series of product updates over 12 days including:
- Sora for video generation.
- OpenAI o1 (a reasoning model designed for the most complex tasks) in the API, as well as different kinds of fine-tuning for different requirements for GPT-4o models (see the data format sketch after this list).

Different forms of fine-tuning available on the OpenAI platform, supervised fine-tuning requires direct input/output pairs whereas preference fine-tuning requires examples of preferred and non-preferred model outputs. Source: OpenAI release blog.
- ChatGPT Pro mode for $200/month which enables OpenAI’s o1 model to “think” for even longer to generate better answers.
- OpenAI o3 preview (o2 was skipped due to a naming conflict with the telecom company O2), an upgraded version of o1 for even more advanced reasoning and thinking.
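For a sense of how the two fine-tuning styles differ on the data side, here's a rough sketch of what a single training example might look like for each, written as Python dicts. The exact JSON field names (especially for preference fine-tuning) are my assumption from memory, so check OpenAI's fine-tuning guide for the current schema.

```python
# Supervised fine-tuning: direct input/output pairs (the "right answer" is given).
supervised_example = {
    "messages": [
        {"role": "user", "content": "Summarise this support ticket in one sentence."},
        {"role": "assistant", "content": "Customer cannot log in after resetting their password."},
    ]
}

# Preference fine-tuning: the same prompt with a preferred and a non-preferred completion.
# Field names below are an assumption; see OpenAI's fine-tuning docs for the exact schema.
preference_example = {
    "input": {
        "messages": [{"role": "user", "content": "Summarise this support ticket in one sentence."}]
    },
    "preferred_output": [
        {"role": "assistant", "content": "Customer cannot log in after resetting their password."}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "The customer has a problem."}
    ],
}
```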
10 — Open source, research and tools
- [Tool] nbsanity is a helpful and easy way to share Jupyter Notebooks: put in a link to a GitHub-hosted notebook and get a generated site you can share. See an example from the ZTM PyTorch course here: original notebook (on GitHub), nbsanity notebook.

Example of a notebook from GitHub hosted on nbsanity for free.
- [Tool] Fine-tuning your own Flux-powered image generation models on Replicate is now faster.
- [Model + Code] Meta release a series of open-source updates including MetaCLIP 1.2 (an open-source version of CLIP trained on a series of synthetic captions).
- [Model] The Qwen team release an open-weight vision reasoning model with QVQ. See the demo on Hugging Face, read the blog post.
- [Models] The Knowledgator repository on Hugging Face hosts many valuable and open-source NLP models, including the recent modern-gliner-bi-large-v1.0, a model for zero-shot named entity recognition (NER).
- [Models] Google release PaliGemma 2, an open-source vision-language model (VLM) combining the SigLIP vision backbone as well as the Gemma 2 language model for incredible results and excellent fine-tuning potential. Get the model weights on Hugging Face, and see the release blog post for a demo on fine-tuning.
- [Models] The OpenGVLab team release InternVL 2.5, a series of different-sized vision-language models with performance rivalling or exceeding Claude 3.5 Sonnet, Google’s Gemini and OpenAI’s GPT-4o on several VLM benchmarks. Bonus: See the even more improved preference-tuned series of InternVL 2.5 for a nice boost in performance.
- [Models] DEIM is a new series of Detection Transformers (DETRs) with state-of-the-art performance in real-time object detection. The models outperform YOLOv11 at every size in terms of latency and performance on the COCO detection dataset. They also come with an Apache 2.0 licence!
- [Models] ColPali (a model for multi-modal retrieval) is now available in Hugging Face Transformers. You can use ColPali in RAG (retrieval augmented generation) pipelines to embed documents as screenshots (rather than relying on OCR steps) and then use these embeddings as part of a retrieval pipeline (see the loading sketch after this list).
- [Talk] An older talk from 2021, but a very worthwhile and relevant talk by Alec Radford (previously at OpenAI and one of the main authors on the original CLIP paper) about generalization of models at scale (e.g. how do you get a model to generalize to almost anything? Train it on as much data as you can… though this is not a silver bullet, as Alec mentions in the talk, there will still be things out of distribution).
- [Fun] A student by the name of _v3 turned my Unofficial PyTorch Optimization Loop Song into an actual song using Suno (a music generation service) — I’m biased because I still prefer the original (:P) but this was catchy!
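Here's a minimal loading sketch for the ColPali item above. The model ID and class names are my best recollection of the transformers integration, so confirm them against the ColPali docs on Hugging Face before relying on them.

```python
# Sketch: embedding document page screenshots and queries with ColPali via transformers.
import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_id = "vidore/colpali-v1.2-hf"  # assumed Hugging Face checkpoint name
model = ColPaliForRetrieval.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = ColPaliProcessor.from_pretrained(model_id)

pages = [Image.open("invoice_page_1.png")]   # document screenshots (no OCR step)
queries = ["What is the total amount due?"]

image_inputs = processor(images=pages, return_tensors="pt")
query_inputs = processor(text=queries, return_tensors="pt")

with torch.no_grad():
    image_embeddings = model(**image_inputs).embeddings
    query_embeddings = model(**query_inputs).embeddings

# Late-interaction scoring between queries and pages (higher = more relevant).
scores = processor.score_retrieval(query_embeddings, image_embeddings)
print(scores)
```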
See you next month!
What a massive month for the ML world in December!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.