61st issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Here's what you might have missed in January 2025 as an A.I. & Machine Learning Engineer... let's get you caught up!
Happy New Year all!
It’s already been an incredible start to 2025 in the world of AI.
Particularly in open-source.
Many have been saying 2025 is the year of the AI Agent (we’ll see what that is shortly).
But I’m betting on it being the year open-source AI is well and truly on par with, or even better than, proprietary AI models.
Here’s to a good year!
Let’s get into it.
From the Internet
The team behind the incredible NLP library spaCy have created an extension library called spacy-layout.
This library builds upon Docling (another library for document processing which reads and discovers the layout of documents) to turn PDFs into AI-ready structured data.
For example, say you had a PDF called finance_reports_2024.pdf, you could run the following:
import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Process the PDF and create a spaCy Doc object
doc = layout("./finance_reports_2024.pdf")

# The text-based contents of the document
print(doc.text)
# Document layout, including pages and page sizes
print(doc._.layout)

# Layout spans for the different sections of the document
for span in doc.spans["layout"]:
    # Section text plus token and character offsets
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header"
    print(span.label_)
    # Layout features of the section, including its bounding box
    print(span._.layout)

# Tables in the document and their extracted data
for table in doc._.tables:
    print(table.start, table.end, table._.layout)
    print(table._.data)
All of this information could then be processed with an LLM (either locally or via API).
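For example, here’s a minimal sketch of passing the extracted text (the doc from the code above) to a local LLM via Ollama. The model name and prompt are placeholders of my own choosing, so treat it as a starting point rather than a recipe:

# Minimal sketch: summarize the spacy-layout extracted text with a local LLM.
# Assumes the ollama Python package is installed and the Ollama app is running
# with a model such as "llama3.2" already pulled (model name and prompt are placeholders).
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {
            "role": "user",
            "content": f"Summarize the key figures in this report:\n\n{doc.text[:4000]}",
        }
    ],
)

print(response["message"]["content"])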
A cool tidbit I enjoyed in the blog post was the future research avenue of describing a table in natural language rather than trying to predict its layout.
As table layouts can be quite tricky, it might be a more achievable task for an LLM to first describe the table, then extract information from the natural language description.
See the full article for an example of how to use spacy-layout to extract information whilst also annotating your own PDFs.
Hugging Face release even smaller versions of their already small VLMs (vision-language models)
SmolVLM now comes in two newer sizes: 256M and 500M parameters.
The team found that a smaller SigLIP vision model (93M parameters vs 400M parameters) seemed to work just as well.
And so they were able to lower the total number of parameters for the vision component of the VLM and combine it with the small SmolLM2 language models.
Why smaller?
These models are small enough to run in your browser or on your phone.
And I expect that to be even more of a trend over the next year.
Smaller but more performant models running locally on your own hardware.
See the demo of SmolVLM-256M on Hugging Face Spaces as well as an example of how to fine-tune SmolVLM models on GitHub.
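If you’d rather try it in code, here’s a rough sketch of loading SmolVLM-256M with the transformers library. The model ID and chat format are what I’d expect from the model card, so double-check there before relying on it:

# Rough sketch of running SmolVLM-256M-Instruct locally with transformers.
# The model ID and message format follow the Hugging Face model card at time of writing.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("my_food_photo.jpg")  # placeholder image path

messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image."}]},
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])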
Hugging Face release an open-source library dedicated to AI Agents
Many people have started to say 2025 is the year of AI Agents.
But what is an AI Agent?
An AI Agent is a program where LLM outputs control the workflow.
Rather than every step being programmed in an explicit workflow, the LLM decides which step to take next based on the inputs and the tools it has access to.
The amount of Agency an LLM has is not fixed.
An LLM may have full autonomy over a program (e.g. one LLM can talk to another LLM to continually facilitate a workflow).

Levels of AI Agency, from zero control from the LLM to full control from the LLM. Source: Hugging Face blog.
Or an LLM may simply be a processing step, getting data ready for another step.
To build an agent, you need two ingredients:
- Tools — A series of tools/small programs an LLM has access to. For example, there could be a function called get_current_temperature() which gets the temperature of a given location.
- Model — An LLM that will be the engine of your agent program (this can be almost any LLM, though LLMs which have been explicitly trained for agentic workflows will be best).
To help with this process, Hugging Face have released an open-source library called smolagents (pronounced similar to “small agents”).
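Here’s a minimal sketch of what that looks like with smolagents, reusing the hypothetical get_current_temperature() tool from above. The tool body is a placeholder and HfApiModel calls a default model via the Hugging Face Inference API, so treat it as a starting point:

# Minimal sketch of a smolagents agent (tool body is a placeholder).
from smolagents import CodeAgent, HfApiModel, tool

@tool
def get_current_temperature(location: str) -> float:
    """Gets the current temperature for a given location.

    Args:
        location: The city to get the temperature for.
    """
    # Placeholder: a real tool would call a weather API here.
    return 23.0

# The LLM acts as the engine of the agent, deciding when (and how) to call the tool.
agent = CodeAgent(tools=[get_current_temperature], model=HfApiModel())
agent.run("What's the current temperature in Brisbane?")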
For more on the smolagents library, check out the GitHub repo.
And for a walkthrough of Agents themselves, check out the Hugging Face guided tour.
As Chip Huyen writes, many of the same problems faced with traditional machine learning applications are now being rediscovered in the world of Generative AI applications.
The models are getting better and better.
But it’s the workflows and experience that counts.
Chip lists the common pitfalls she sees as the following:
- Using Generative AI when you don’t need Generative AI — for example, getting an LLM to process and optimize electricity usage instead of just not using large appliances during the day.
- Confusing “bad product” with “bad AI” — as said before, the models are getting better and better every day. In fact, many undiscovered use cases could probably be accomplished even if the models didn’t get any better. More often than not, it’s the user experience/workflow that’s bad rather than the AI.
- Starting too complex (my favourite) — do you need an agentic workflow (see above) or could it be done with direct API calls/workflow steps? Can your use case be accomplished with good prompting rather than fine-tuning?

Beware of starting an ML project with a system that’s too complex. What’s the simplest way to solve the problem (or at least see if it’s viable)? Start there and add complexity when necessary. Source: Chip Huyen blog.
- Over-indexing on early success - demos in the AI world are often awesome and incredible. But making them work in the real world is a lot harder. Often you'll get 80% of your results in the first few weeks. And every improvement after that becomes harder and harder.
- Forgetting human evaluation — One of the first things I tell my clients when working on AI systems is to look at your data. And if they don't have any data to look at, I ask: how do you know how the AI is performing on your task? Human evaluation of results should be built into the AI pipeline (e.g. train a model → check results → improve data → retrain the model).
- Crowdsourcing use cases — Does the AI project you're working on provide improvements for your business or is it being done to keep up with what other people are doing? Just because another company is doing something, doesn't mean you need to. I had a chat recently with an engineer at a flight sales company. And he said they aren't even running a recommendation model on their flight sales, only pure search and time/price based filters. The business is one of the biggest flight sales businesses in Australia. Could AI make it better? Yes, likely. But adding AI is not always a guaranteed benefit.
Devin, the AI-powered software engineer, was announced at the start of 2024 with a lot of hype and promise.
The idea was you would have an AI-powered software engineer as a virtual employee you could chat to.
Naturally, the Answer.AI team decided to put Devin’s skills to the test.
Their results?
Over a month of experiments with 20 total tasks across creating a new project, performing research and analyzing existing code, they had:
- 14 failures
- 3 successes (including their first 2 tests, see “Over-indexing on early success” above)
- 3 inconclusive results
Not the most ideal outcome.
But this kind of technology is still new, so it may improve over time.
However, it’s another reason to be skeptical of AI demos and instead get hands-on with the tools yourself.
See a full breakdown of all the tasks and results in the Answer.AI Devin review blog post.
Open-source
The first month of the year was sensational for open-source AI.
Let’s check out what’s new!

Jina AI’s ReaderLM-v2 helps convert a webpage’s raw HTML into markdown, making it easier for an LLM to extract information from the text. Source: Image by the author.
- Static embedding models in SentenceTransformers = ~400x faster on CPU — Embedding models enable you to represent text (and other forms of data) as numerical values and then find other similar numerical values. The new series of static embedding models in SentenceTransformers run up to ~400x faster on CPU whilst maintaining up to ~85% of the performance of larger models, perfect for on-device or in-browser deployment (see the short sketch after the figure below).

The performance of static embedding models isn’t as high as some other models but they are much much faster. Source: Hugging Face blog.
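Here’s a quick sketch of what using one of the new static embedding models looks like on CPU. The model ID is the English retrieval variant as I remember it from the release blog post, so verify the exact name on the Hugging Face Hub:

# Quick sketch: static embeddings on CPU with SentenceTransformers.
# The model ID below is assumed from the release blog post (check the Hub for the exact name).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")

sentences = [
    "How do I fine-tune a vision model?",
    "Steps for fine-tuning an image classification model.",
    "Best pizza restaurants near me.",
]

embeddings = model.encode(sentences)                    # shape: (3, embedding_dim)
similarities = model.similarity(embeddings, embeddings) # pairwise similarity scores
print(similarities)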
- Mistral add to their already epic open-source AI game with Mistral Small 3 under Apache 2.0 — Mistral Small 3 is a 24B parameter latency-optimized open-source LLM on par with larger models such as Llama 3.3 70B whilst being ~3x faster. It's also competitive with GPT-4o-mini in head-to-head human evaluations. Even better, since it's a base model, it's ready to have reasoning models such as DeepSeek-R1 (see below) distilled into it. Running Mistral Small 3 requires ~55GB of GPU RAM in bf16 or fp16 but it can be run (when quantized) on an RTX 4090 or a MacBook Pro with 32GB of RAM via ollama and MLX. Read the release blog post on Mistral's website for more.
- You can now use the timm library directly in Hugging Face transformers — The outstanding timm library (PyTorch Image Models) is now available directly in Hugging Face transformers. This means you can now use a transformers pipeline for streamlined inference as well as fine-tune your own image model with the Trainer API. I’m a big fan of the timm library, it's one of my most used for computer vision tasks and it's where we get the base models for running custom FoodVision models in Nutrify.
- DeepSeek release DeepSeek-R1 and several distilled variants — Just last month, DeepSeek released DeepSeek-V3, a model on par with or better than GPT-4o. And now they've released DeepSeek-R1, a reasoning LLM (a model that “thinks” before it answers a question to find the best possible way to answer it) on par with or better than OpenAI's new o1 model. And best of all, the weights and research are open-source! The DeepSeek-R1 model (671B total parameters) can also be used to improve smaller models, such as Llama-3.3-70B, to create DeepSeek-R1-Distill-Llama-70B, a version of the Llama-70B model capable of reasoning. This is one of the most exciting releases in the AI world over the past couple of years. Now many people will have access to the best of the best AI there is, capable of running on their own hardware (see the sketch after the figure below for one way to try a distilled variant). You can try DeepSeek-R1 on their website or via providers such as NVIDIA, and the research paper is also available on GitHub (one of my favourite highlights below).

From the DeepSeek-R1 paper, it seems that the LLM trained with reinforcement learning on chains of thought finds “aha moments” in its thinking. Source: DeepSeek-R1 paper.
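And here’s a rough sketch of trying one of the smaller distilled variants with the transformers library. The repo name is the 1.5B Qwen distill as I remember it from the Hugging Face Hub, and the generation settings are placeholders, so verify against the model card:

# Rough sketch: chat with a small distilled DeepSeek-R1 model via transformers.
# The repo name is assumed from the Hugging Face Hub (verify there before running).
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
output = pipe(messages, max_new_tokens=512)

# R1-style models emit their chain of thought (often inside <think> tags) before the answer.
print(output[0]["generated_text"][-1]["content"])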
- DeepSeek release Janus-Pro-7B, a unified model for understanding and generation — Janus-Pro combines multimodal understanding (e.g. VLM-style interactions such as “what is in this picture?”) with generation (e.g. “create an image of a cup of coffee on a table”). I've found it to be an ok image generation model and an ok image understanding model. Would be cool to see how a scaled-up version of this architecture works. See below for a comparison of various image generation models.

Comparison of several different best-in-class image generation models all run with the same prompt. Some perform better than others when it comes to adherence. But it can be subjective. flux-1.1-pro looks the best to me but ideogram-v2 seems to have gotten most of the elements, even if they’re in the wrong order.
- Open-source DeepSeek-R1 reproduction effort begins (by Hugging Face team) — The DeepSeek-R1 model weights have barely been public for a week. And already there are several attempts to try and replicate them (the DeepSeek team released the weights for the models but not the training code). How beautiful! The Hugging Face team have released a library called open_r1 with the goal of reproducing all the steps in the DeepSeek-R1 paper. Having the workflow for creating DeepSeek-R1 style models will be an incredible boon to the AI field. See the blog post writeup on Hugging Face for more.
- MiniMax open-source MiniMax-Text-01 and MiniMax-VL-01 — Two foundation models on par with the performance of GPT-4o and Claude-3.5-Sonnet from MiniMax. MiniMax-Text-01 is a 456B parameter LLM capable of handling a super long context of 4 million input tokens (that’s about 3 million words; for context, the entire King James Bible has ~780,000 words). And MiniMax-VL-01 is a multimodal model that further trains MiniMax-01 on 512 billion vision-language tokens to enable the LLM to see. See the MiniMax-AI GitHub and read the paper for more.
- OpenBMB release MiniCPM-o-2.6, an omni modal model comparable with GPT-4o — This model packs a punch. At only 8B parameters (quite small for a modern large model), it outperforms several larger models such as Gemini 1.5 Pro and Claude 3.5 Sonnet on several vision benchmarks. It's also capable of automatic speech recognition (ASR) and speech-to-text (STT) translation. The model is small enough to be hosted on a single RTX 4090 and streamed to a mobile device. See the technical blog post for more.
- Kokoro-82M is a tiny but mighty text-to-speech model — This is one of the best text-to-speech models I've heard. And with several different voices too. All small enough to run in the browser or on a local device. Try out the demo with your own custom text and see how it goes. A cool project would be to see if you could deploy this model in an application and make text-to-speech as simple as copy and paste. The model is available through the kokoro package on GitHub as well as through providers such as Replicate.
- The Qwen team release two new Qwen2.5 updates — Qwen2.5-1M series is a collection of two open-source LLMs (14B and 7B) with a 1 million token context input capability, one of the longest available inputs of any open-source model. And Qwen2.5-VL is a collection of VLMs (3B, 7B and 72B) with incredible performance for their size. The new series of VLMs are now capable of outputting bounding box and point coordinates for different items in an image, for example, "return bounding box coordinates in JSON format for all XYZ in the following image”. The VLMs are also capable of taking videos as input. See the Qwen2.5-VL demo on Hugging Face to try the models out and read the blog post announcement for more.
- Tencent’s new Hunyuan3D-2 is capable of creating high-detail 3D models from 2D images/text — Input a 2D image of an item, have one model remove the background and another model create the 3D mesh of it. You can then use this 3D model in game development or as the starting point for further work. If you don't have a 2D image, you can use text as well. In my experience, it works really well with images that look like they could already be 3D items. Try out the model yourself in the Hugging Face demo.

From a 2D image of a burger to a 3D model in a minute or so. Tencent’s Hunyuan3D-2 model is the best text-to-3D model I’ve seen so far and it’s available as open-source.
- IDEA-Research release ChatRex, a VLM capable of answering questions about images whilst grounding its responses — Most VLMs these days simply answer questions in relation to an image in the form of a text reply. ChatRex goes a step further by providing bounding box coordinates for a given item in its answer. For example, if you asked it “where is the brown dog?”, it would describe where the brown dog is in the image whilst also returning bounding box coordinates for that dog. It does this through the use of a two-stage pipeline, one to detect all items (either "coarse" for large items or “fine” for smaller items) using a UPN (Universal Proposal Network) and then using an LLM to filter through those detected items, effectively framing the task as retrieval rather than straight bounding box coordinate output. See the research paper for more.
Other releases and talks
- [Release] OpenAI release CUA (Computer-Using Agent), a system for enabling GPT-4o-like models to take steps on your computer.
- [Presentation] Economist Benedict Evans gives a presentation on how AI is eating the world, including insights into how some changes may take longer than we think.
- [Video] Evan Zhou shares how he replicated an open-source version of Apple's Genmoji in a day by fine-tuning an open-source FLUX model.
- [Video/Podcast] Oriol Vinyals (Co-lead of Google Gemini) goes on the DeepMind Podcast and discusses the evolution of agents and Gemini 2.0. An excellent listen into how many different fields of research combined to create models like Gemini 2.0.
- [Video/Podcast] Jürgen Schmidhuber, a prolific researcher in the field of AI, goes on Machine Learning Street Talk and discusses everything from creating AI Agents over 30 years ago to what it might look like when humans combine with a form of AGI (he's a big advocate for it being a good thing... so am I!). Listen to this talk for an inspiring look into the future (and past) and set a reminder to do the same thing in 6 months' time.
- [Video] Jeffrey Codes studied AI for one year trying to change careers from web development. In this video, he shares the resources and paths he took and what he’d do differently if he were to do it over again.
Final note
It’s an exciting time to be in the world of AI.
More and more powerful models are hitting the open-source world.
Which means that developers and people building applications have a chance to integrate some of the most powerful models into their workflows.
I’m looking forward to building and seeing the next wave of AI-powered applications.
I agree with Andrew Ng’s sentiment here.

See you next month!
What a massive month for the ML world in January!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
www.mrdbourke.com | YouTube
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.