60th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I'm an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Hello hello!
Firstly, Merry Christmas and Happy New Year!
Secondly, last month, I entered a Kaggle Competition to showcase Google's Gemini model capabilities (mentioned in last month's AI/ML monthly).
Well…
I'm excited to announce my entry placed 3rd out of 5,682 entrants 🥉!
This is the highest I've ever placed in a Kaggle competition, so I'm quite stoked.
For my entry, I built KeepTrack: A system for keeping track of anything with video.
It enables you to go from video to structured database of items using a few Gemini API calls.
More specifically, I tracked every item in my house for ~$0.07.
For more, you can check out the video on YouTube or get the full code walkthrough on Kaggle.
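If you're curious what a few of those Gemini API calls might look like, here's a minimal sketch using the google-genai Python SDK. The model name, prompt, schema and file path are my own illustrative assumptions, not the actual KeepTrack code:

```python
# Minimal sketch: video -> structured list of items via the Gemini API.
# Assumes `pip install google-genai` and an API key set in the environment.
# Model name, prompt, schema and file path are illustrative assumptions,
# not the actual KeepTrack implementation.
from pydantic import BaseModel
from google import genai


class Item(BaseModel):
    name: str
    room: str
    estimated_value_usd: float


client = genai.Client()  # picks up the API key from the environment

# Upload a walkthrough video (in practice, wait for the file to finish processing).
video_file = client.files.upload(file="house_walkthrough.mp4")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[video_file, "List every distinct item you can see in this video."],
    config={
        "response_mime_type": "application/json",
        "response_schema": list[Item],
    },
)

items = response.parsed  # list[Item], ready to write to a database
for item in items[:5]:
    print(item)
```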
On the whole, entering a Kaggle competition has been one of the funnest experiences I've had in machine learning. And I'd highly recommend anyone looking to practice ML give it a go.
BERT (Bidirectional Encoder Representations from Transformers) models are some of the most popular models in the open-source world (with over 1 billion downloads on Hugging Face).
However, the original BERT-style models were released in 2018.
The good news is there's an upgraded version for the modern era (hence ModernBERT).
The model performs in a Pareto optimal way (faster and better) compared to other similar models and is a drop-in replacement for many existing BERT models.
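As a quick taste of that drop-in nature, here's a minimal masked-word prediction sketch with the Hugging Face transformers library (assuming a recent transformers release with ModernBERT support and the answerdotai/ModernBERT-base checkpoint):

```python
# Minimal sketch: masked-word prediction with ModernBERT.
# Assumes a recent transformers version that includes ModernBERT support.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

preds = fill_mask("Machine learning newsletters are the [MASK] way to stay up to date.")
for pred in preds:
    print(f"{pred['token_str']!r} -> {pred['score']:.3f}")
```

For existing BERT-based pipelines (embeddings, classification fine-tuning and so on), swapping the old model ID for the ModernBERT checkpoint is often the only change needed.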
ModernBERT performance versus speed, up and to the left is better. Source: Hugging Face blog.
For more, check out the following:
AI-powered agents are easily one of the hottest topics in tech right now.
But do you need one?
Perhaps it's better to start with a definition.
Workflows are systems where LLMs and other tools are orchestrated through predefined code paths.
Whereas agents are systems where LLMs take control and make decisions on which paths to traverse (the exact pathways are not predefined in advance).
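To make the distinction concrete, here's a toy sketch. The call_llm helper and the tools are hypothetical stand-ins for whichever LLM API and business logic you use:

```python
# Toy sketch of workflow vs. agent (call_llm and the tools are hypothetical stand-ins).

def call_llm(prompt: str) -> str:
    """Stand-in for a call to your LLM API of choice (Gemini, Claude, GPT, etc.)."""
    raise NotImplementedError


# Workflow: the code path is fixed in advance, the LLM just fills in each step.
def summarise_ticket_workflow(ticket_text: str) -> str:
    category = call_llm(f"Classify this support ticket into one category:\n{ticket_text}")
    return call_llm(f"Summarise this {category} ticket in two sentences:\n{ticket_text}")


# Agent: the LLM decides which tool to call next and when to stop.
TOOLS = {
    "search_docs": lambda query: f"(search results for {query!r})",
    "check_order_status": lambda order_id: f"(status of order {order_id})",
}


def support_agent(ticket_text: str, max_steps: int = 5) -> str:
    context = ticket_text
    for _ in range(max_steps):
        decision = call_llm(
            f"Context:\n{context}\n\n"
            f"Reply either 'TOOL <name>: <input>' using one of {list(TOOLS)}, "
            "or 'FINISH: <answer>' if you're done."
        )
        if decision.startswith("FINISH:"):
            return decision.removeprefix("FINISH:").strip()
        header, tool_input = decision.split(":", maxsplit=1)
        tool_name = header.removeprefix("TOOL").strip()
        context += f"\n{tool_name} -> {TOOLS[tool_name](tool_input.strip())}"
    return "Ran out of steps without finishing."
```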
My take: build a simple workflow first and if this doesn't work, introduce more steps as needed.
Anthropic's take is similar:
Anthropic's guide to when and when not to use agents. Source: Anthropic blog.
I especially like the opening paragraph (bold is mine):
"Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns."
New LLM models such as OpenAIβs o1 are being touted as βreasoningβ models.
And they perform incredibly well on many different benchmarks including code and math.
However, Aidan McLaughlin has some good arguments on why reasoning models such as o1, which leverage reinforcement learning, might not transfer well to domains without large amounts of traceable paths.
Excerpt from Aidan's blog post on how o1 may work and why RL (reinforcement learning) may not be as helpful for domains which don't have verifiably traceable paths to solve (e.g. a winning chess move can be directly traced back to the exact steps it took to get there, real life often doesn't come with these kinds of paths). Source: Aidan McLaughlin blog.
For example, with math problems, there are often many examples of steps to solve a problem, so the reinforcement learning model can find these steps and optimize to follow these steps for a future problem.
However, the real-world often lacks these specific steps towards solving a problem.
That's not to take away from o1's epic performance; it's more to offer an insight into how the technique used to get o1's performance may not generalize to other domains.
Vicki Boykis is one of my favourite voices of reason in the tech world.
And her recent essay on "whether LLMs can replace developers" only adds to that view.
While modern LLMs are incredible at writing code to get you started, they're still not at the stage of echoing the advice of current developers and developers gone by (after all, LLMs are trained on code from actual people).
So like Vicki says, use LLMs to help, sure. But also be sure to "build your own context window":
"Nothing is black and white. Code is not precious, nor the be-all end-all. The end goal is a functioning product. All code is eventually thrown away. LLMs help with some tasks, if you already know what you want to do and give you shortcuts. But they can't help with this part. They can't turn on the radio. We have to build our own context window and make our own playlist."
Every so often you stumble across a tech blog (or regular blog) and immediately consume almost all of the articles.
This is what I did with Ethan Rosenthal's blog.
Two favourites:
We're now entering the era where generative AI models and LLMs are being used more and more in everyday scenarios.
In the following two case studies, Spotify shares how they use LLMs to generate podcast chapter boundaries and titles (similar to how some YouTube videos have timestamps for different sections), and how they use a fine-tuned version of Meta's Llama model to create personalized explanations for various recommendations (e.g. "You will love this song because it's got a hip and poppy vibe to it").
How Spotify goes from long podcast transcript texts to timestamps and chapter titles. Source: Spotify blog.
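Spotify's production pipeline is more involved than this (see the blog post for details), but the core transcript-to-chapters idea follows the same structured-output pattern as the KeepTrack sketch earlier. A rough, hypothetical illustration:

```python
# Rough illustration of transcript -> chapter titles with an LLM.
# This is not Spotify's actual code; the model name and prompt are assumptions.
from pydantic import BaseModel
from google import genai


class Chapter(BaseModel):
    start_time: str  # e.g. "00:12:34"
    title: str


client = genai.Client()

transcript = (
    "00:00:05 Welcome back to the show, today we're talking about marathon training...\n"
    "00:14:32 Let's move on to nutrition and what to eat before a long run...\n"
)

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Split this timestamped podcast transcript into chapters with short, "
             "descriptive titles:\n\n" + transcript,
    config={"response_mime_type": "application/json", "response_schema": list[Chapter]},
)
print(response.parsed)  # e.g. [Chapter(start_time='00:00:05', title='Marathon training'), ...]
```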
Google's flagship AI model, Gemini 2.0, is out in experimental mode!
Some quick takeaways:
Gemini 2.0's image + text input and native image output. Input an image, add some text instructions and get an image as output. Source: Google Developers YouTube Channel.
See more in the Gemini 2.0 documentation.
Bonus: Check out ZTM's course on Building AI Apps with the Gemini API.
Alongside Gemini 2.0, Google have also released Veo 2 for state-of-the-art video generation and Imagen 3 for best-in-class image generation.
Imagen 3 is available in ImageFX and Veo 2 is available in VideoFX.
Images and videos created with Imagen 3 and Veo 2 are watermarked with SynthID (an invisible identifier) to help ensure they can be discerned from real images/videos.
An excellent potential workflow for creating synthetic images for your own machine learning models could be to input a real image, caption it with Gemini 2.0 and then use Imagen 3 to generate a synthetic version of it (effectively increasing the amount of data you have).
Example workflow of inputting a real image, captioning it with Gemini and then generating a synthetic version of it with Imagen 3. This strategy could be used to enhance a computer vision pipeline with synthetic examples. Source: Animation created by the author, food platter designed and created by my fiancé.
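Here's a rough sketch of that caption-then-regenerate idea with the google-genai SDK. The model names and file paths are assumptions based on the documentation at the time of writing, so treat it as an illustration rather than a recipe:

```python
# Rough sketch: real image -> Gemini caption -> Imagen 3 synthetic version.
# Model names, file paths and SDK details are assumptions; check the current docs.
from google import genai
from google.genai import types

client = genai.Client()

# 1. Caption a real image with Gemini.
with open("food_platter.jpg", "rb") as f:  # hypothetical local image
    image_bytes = f.read()

caption = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Describe this image in one detailed sentence, suitable as an image generation prompt.",
    ],
).text

# 2. Generate a synthetic version of the image with Imagen 3.
result = client.models.generate_images(
    model="imagen-3.0-generate-001",
    prompt=caption,
    config=types.GenerateImagesConfig(number_of_images=1),
)

with open("food_platter_synthetic.png", "wb") as f:
    f.write(result.generated_images[0].image.image_bytes)
```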
OpenAI released a series of product updates over 12 days including:
Different forms of fine-tuning available on the OpenAI platform, supervised fine-tuning requires direct input/output pairs whereas preference fine-tuning requires examples of preferred and non-preferred model outputs. Source: OpenAI release blog.
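As a rough illustration of the difference in training data (the exact JSONL schema lives in OpenAI's fine-tuning docs, so double-check there before uploading anything):

```python
# Illustrative fine-tuning examples; field names are based on OpenAI's fine-tuning
# docs at the time of writing, so verify against the current documentation.
import json

# Supervised fine-tuning: direct input/output pairs.
sft_example = {
    "messages": [
        {"role": "user", "content": "Write a one-line tagline for a running shoe."},
        {"role": "assistant", "content": "Run lighter, go further."},
    ]
}

# Preference fine-tuning: a preferred and a non-preferred output for the same input.
preference_example = {
    "input": {"messages": [{"role": "user", "content": "Write a one-line tagline for a running shoe."}]},
    "preferred_output": [{"role": "assistant", "content": "Run lighter, go further."}],
    "non_preferred_output": [{"role": "assistant", "content": "This shoe is a shoe for running."}],
}

# Each dataset is uploaded as a JSONL file with one example per line.
with open("train_supervised.jsonl", "w") as f:
    f.write(json.dumps(sft_example) + "\n")

with open("train_preference.jsonl", "w") as f:
    f.write(json.dumps(preference_example) + "\n")
```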
Example of a notebook from GitHub hosted on nbsanity for free.
modern-gliner-bi-large-v1.0, a model for zero-shot named entity recognition (NER).
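A minimal sketch with the gliner Python package (the exact repo ID and install requirements are assumptions, so check the model card on Hugging Face):

```python
# Minimal zero-shot NER sketch with the gliner package (`pip install gliner`).
# The model repo ID is an assumption; check the model card for the exact name.
from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/modern-gliner-bi-large-v1.0")

text = "Daniel entered a Kaggle competition showcasing Google's Gemini model in December 2024."
labels = ["person", "company", "competition", "date"]

for entity in model.predict_entities(text, labels):
    print(entity["text"], "->", entity["label"])
```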
What a massive month for the ML world in December!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.