60th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Hello hello!
Firstly, Merry Christmas and Happy New Year!
Secondly, last month, I entered a Kaggle Competition to showcase Google’s Gemini model capabilities (mentioned in last month’s AI/ML monthly).
Well…
I’m excited to announce my entry placed 3rd out of 5,682 entrants 🥉!
This is the highest I’ve ever placed in a Kaggle competition, so I’m quite stoked.
For my entry, I built KeepTrack: A system for keeping track of anything with video.
It enables you to go from video to structured database of items using a few Gemini API calls.
More specifically, I tracked every item in my house for ~$0.07.
For more, you can check out the video on YouTube or get the full code walkthrough on Kaggle.
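If you're curious what that looks like in code, here's a minimal sketch of the general pattern (video in, structured JSON out) using the google-generativeai Python SDK. It's not the full KeepTrack pipeline, and the schema, file name and model name are placeholders:

```python
# Minimal sketch (not the actual KeepTrack code): turn a video into a
# structured list of items with the google-generativeai SDK.
# Assumes GOOGLE_API_KEY is set and a local file "house_tour.mp4" exists.
import os
import time
import typing

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])


class Item(typing.TypedDict):
    name: str
    room: str
    estimated_value_usd: float


# Upload the video and wait for it to finish processing.
video = genai.upload_file(path="house_tour.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-flash")  # swap in your preferred Gemini model
response = model.generate_content(
    [video, "List every distinct household item visible in this video."],
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=list[Item],  # ask for JSON matching our schema
    ),
)

print(response.text)  # JSON you can load straight into a database or DataFrame
```

From there, the JSON can be loaded into whatever database or spreadsheet you like.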
On the whole, entering a Kaggle competition has been one of the funnest experiences I’ve had in machine learning. And I’d highly recommend anyone looking to practice ML give it a go.
BERT (Bidirectional Encoder Representations from Transformers) models are some of the most popular models in the open-source world (with over 1 billion downloads on Hugging Face).
However, the original BERT-style models were released in 2018.
The good news is there’s an upgraded version for the modern era (hence ModernBERT).
The model is Pareto-optimal (faster and better) compared to other similar models and is a drop-in replacement for many existing BERT-style models.
ModernBERT performance versus speed, up and to the left is better. Source: Hugging Face blog.
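If you’ve already got a BERT-style model running with the Hugging Face transformers library, trying ModernBERT looks roughly like this (a minimal sketch, assuming a recent transformers release with ModernBERT support and the answerdotai/ModernBERT-base checkpoint):

```python
# Minimal sketch: ModernBERT as a drop-in masked language model.
# Assumes a recent transformers version that includes ModernBERT support.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill_mask("The capital of France is [MASK]."))
```

The same checkpoint can then be fine-tuned for classification, retrieval/embeddings or NER like any other encoder.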
For more, check out the following:
AI-powered agents are easily one of the hottest topics in tech right now.
But do you need one?
Perhaps it’s better to start with a definition.
Workflows are systems where LLMs and other tools are orchestrated through predefined code paths.
Whereas agents are systems where LLMs take control and make decisions on which paths to traverse (the exact pathways are not predefined in advance).
My take: build a simple workflow first and if this doesn’t work, introduce more steps as needed.
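To make the distinction concrete, here’s a rough sketch of the two shapes. Nothing here is from a real framework; call_llm and the tools dictionary are hypothetical placeholders:

```python
# Sketch only: the difference between a predefined workflow and an agent loop.

def call_llm(prompt: str) -> str:
    """Hypothetical helper: wrap whichever LLM API you use and return its text reply."""
    raise NotImplementedError


def summarise_ticket_workflow(ticket: str) -> str:
    """Workflow: the code path is fixed in advance by the programmer."""
    category = call_llm(f"Classify this ticket as billing/tech/other: {ticket}")
    return call_llm(f"Summarise this {category} ticket in two sentences: {ticket}")


def ticket_agent(ticket: str, tools: dict) -> str:
    """Agent: the LLM decides which tool to call next (the path isn't predefined)."""
    context = ticket
    for _ in range(5):  # always cap the number of steps
        decision = call_llm(
            f"Context: {context}\nAvailable tools: {list(tools)}\n"
            "Reply with a tool name to call next, or 'FINISH: <answer>'."
        )
        if decision.startswith("FINISH:"):
            return decision.removeprefix("FINISH:").strip()
        context += "\n" + tools[decision.strip()](context)  # run the tool the LLM chose
    return context
```

If the workflow version solves your problem, you probably don’t need the agent.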
Anthropic’s take is similar:
Anthropic’s guide to when and when not to use agents. Source: Anthropic blog.
I especially like the opening paragraph (bold is mine):
“Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.”
New LLM models such as OpenAI’s o1 are being touted as “reasoning” models.
And they perform incredibly well on many different benchmarks including code and math.
However, Aidan McLaughlin has some good arguments for why reasoning models such as o1, which leverage reinforcement learning, might not transfer well to domains without large amounts of traceable solution paths.
Excerpt from Aidan’s blog post on how o1 may work and why RL (reinforcement learning) may not be as helpful for domains which don’t have verifiably traceable paths to a solution (e.g. a winning chess move can be directly traced back to the exact steps it took to get there; real life often doesn’t come with these kinds of paths). Source: Aidan McLaughlin blog.
For example, with math problems, there are often many examples of steps to solve a problem, so the reinforcement learning model can find these steps and optimize to follow these steps for a future problem.
However, the real world often lacks these specific, verifiable steps towards solving a problem.
That’s not to take away from o1’s epic performance, it’s more to offer an insight into how the technique to get o1’s performance may not generalize to other domains.
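A toy way to see the argument: reinforcement learning needs a reward it can check. For a maths problem that check is trivial to write; for most real-world goals it isn’t (this is just an illustration, not how o1 is actually trained):

```python
def math_reward(model_answer: str, ground_truth: str) -> float:
    """Verifiable domain: the reward can be computed exactly."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

# There's no equivalent function for "write a winning product strategy" --
# no cheap, reliable checker means no clean signal to optimise against.
```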
Vicki Boykis is one of my favourite voices of reason in the tech world.
And her recent essay on “whether LLMs can replace developers” only adds to that view.
While modern LLMs are incredible at writing code to get you started, they’re still no replacement for the hard-won experience of current developers and the developers who came before them (after all, LLMs are trained on code from actual people).
So like Vicki says, use LLMs to help, sure. But also be sure to “build your own context window”:
“Nothing is black and white. Code is not precious, nor the be-all end-all. The end goal is a functioning product. All code is eventually thrown away. LLMs help with some tasks, if you already know what you want to do and give you shortcuts. But they can’t help with this part. They can’t turn on the radio. We have to build our own context window and make our own playlist.”
Every so often you stumble across a tech blog (or regular blog) and immediately consume almost all of the articles.
This is what I did with Ethan Rosenthal’s blog.
Two favourites:
We’re now entering the era where generative AI models and LLMs are being used more and more in everyday scenarios.
In the following two case studies, Spotify shares how they use LLMs to generate podcast chapter boundaries and titles (similar to how some YouTube videos have timestamps for different sections), and how they use a fine-tuned version of Meta’s Llama model to create personalized explanations for recommendations (e.g. “You will love this song because it’s got a hip and poppy vibe to it”).
How Spotify goes from long podcast transcripts to timestamps and chapter titles. Source: Spotify blog.
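Spotify’s actual pipeline is more involved (and uses their own fine-tuned models), but the core chaptering step can be sketched like this, with call_llm standing in for whichever LLM you’d use:

```python
# Sketch only: timestamped transcript in, chapter list out.
import json


def call_llm(prompt: str) -> str:
    """Hypothetical helper around whichever LLM you're using."""
    raise NotImplementedError


CHAPTER_PROMPT = """You are given a podcast transcript with timestamps.
Split it into chapters and return JSON: a list of objects with
"start_time" (HH:MM:SS) and "title" (a short, descriptive chapter title).

Transcript:
{transcript}
"""


def generate_chapters(timestamped_transcript: str) -> list[dict]:
    raw = call_llm(CHAPTER_PROMPT.format(transcript=timestamped_transcript))
    return json.loads(raw)  # e.g. [{"start_time": "00:00:00", "title": "Intro"}, ...]
```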
Google’s flagship AI model, Gemini 2.0, is out in experimental mode!
Some quick takeaways:
Gemini 2.0’s image + text input and native image output. Input an image, add some text instructions and get an image as output. Source: Google Developers YouTube Channel.
See more in the Gemini 2.0 documentation.
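A minimal image + text call with the google-generativeai Python SDK looks something like this (the model ID and image path are placeholders; check the docs for the current model names, and note that native image output may require extra access):

```python
# Minimal sketch: image + text in, text out, via the google-generativeai SDK.
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash-exp")  # experimental model ID, check the docs

image = PIL.Image.open("food_platter.jpg")
response = model.generate_content([image, "Describe this image in one sentence."])
print(response.text)
```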
Bonus: Check out ZTM's course on Building AI Apps with the Gemini API.
Alongside Gemini 2.0, Google have also released Veo 2 for state-of-the-art video generation and Imagen 3 for best-in-class image generation.
Imagen 3 is available in ImageFX and Veo 2 is available in VideoFX.
Images and videos created with Imagen 3 and Veo 2 are watermarked with SynthID (an invisible identifier) to help ensure they can be discerned from real images and videos.
An excellent potential workflow for creating synthetic images for your own machine learning models could be to input a real image, caption it with Gemini 2.0 and then use Imagen 3 to generate a synthetic version of it (effectively increasing the amount of data you have).
Example workflow of inputting a real image, captioning it with Gemini and then generating a synthetic version of it with Imagen 3. This strategy could be used to enhance a computer vision pipeline with synthetic examples. Source: Animation created by the author, food platter designed and created by my fiancé.
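Here’s a rough sketch of that caption-then-regenerate loop. The captioning call is standard google-generativeai usage; generate_with_imagen3 is a placeholder for whichever Imagen 3 interface you have access to (ImageFX, Vertex AI or the Gemini API):

```python
# Sketch: real image -> caption with Gemini -> synthetic version with Imagen 3.
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
captioner = genai.GenerativeModel("gemini-2.0-flash-exp")  # placeholder model ID


def generate_with_imagen3(prompt: str):
    """Placeholder: call whichever Imagen 3 interface you have access to
    and return the generated image."""
    raise NotImplementedError


real_image = PIL.Image.open("real_food_platter.jpg")
caption = captioner.generate_content(
    [real_image, "Write a detailed caption of this image, suitable as an image-generation prompt."]
).text

synthetic_image = generate_with_imagen3(prompt=caption)
# Pair real_image and synthetic_image as extra training examples for your vision model.
```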
OpenAI released a series of product updates over 12 days including:
Different forms of fine-tuning available on the OpenAI platform: supervised fine-tuning requires direct input/output pairs, whereas preference fine-tuning requires examples of preferred and non-preferred model outputs. Source: OpenAI release blog.
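In practice, the difference mostly shows up in your training data. Roughly (this is a sketch from memory; check OpenAI’s fine-tuning docs for the exact, current schema):

```python
import json

# Supervised fine-tuning: direct input -> output examples.
sft_example = {
    "messages": [
        {"role": "user", "content": "Summarise this support ticket: ..."},
        {"role": "assistant", "content": "The customer can't reset their password."},
    ]
}

# Preference fine-tuning: the same input, plus a preferred and a non-preferred output.
preference_example = {
    "input": {"messages": [{"role": "user", "content": "Summarise this support ticket: ..."}]},
    "preferred_output": [{"role": "assistant", "content": "The customer can't reset their password."}],
    "non_preferred_output": [{"role": "assistant", "content": "Ticket received."}],
}

# Both are uploaded as JSONL files: one JSON object per line.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(sft_example) + "\n")
```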
Example of a notebook from GitHub hosted on nbsanity for free.
Also worth a look: modern-gliner-bi-large-v1.0, a model for zero-shot named entity recognition (NER).
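Zero-shot NER with the gliner library is only a few lines (a minimal sketch; the Hub repo ID below is my best guess, so check the model card for the exact one):

```python
# Minimal sketch of zero-shot NER with the gliner library (pip install gliner).
from gliner import GLiNER

# Repo ID is an assumption -- confirm it on the model card.
model = GLiNER.from_pretrained("knowledgator/modern-gliner-bi-large-v1.0")

text = "Daniel entered a Kaggle competition showcasing Google's Gemini model."
labels = ["person", "company", "competition", "machine learning model"]

for entity in model.predict_entities(text, labels):
    print(entity["text"], "->", entity["label"])
```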
What a massive month for the ML world in December!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.