๐ŸŽ Give the #1 gift request of 2024... a ZTM membership gift card! ๐ŸŽ

AI & Machine Learning Monthly Newsletter

Daniel Bourke

57th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I'm an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in September 2024 as an A.I. & Machine Learning Engineer... let's get you caught up!

My Work 👇

I released a new project-based course: Build a custom text classification model and demo with Hugging Face Transformers

Text classification is one of the most common business problems out there.

  • Is that email spam or not spam?
  • Should that insurance claim go to department 1 or department 2?

As a machine learning engineer, I've built text classification models for other companies and I've created them for my own company (Nutrify).

In this project, we'll code a custom text classification model using Hugging Face Datasets, Hugging Face Transformers and Hugging Face Spaces.

We'll follow the motto of "data → model → demo".

Meaning by the end of the project, you'll have your own shareable demo live on Hugging Face.

The project we'll work through together is Food Not Food, a text classification model which predicts whether a sentence is about food or not.


We'll start with a dataset, build a model to fit it and finally ship the model as an interactive demo anyone can use in their browser.

Along the way we'll also learn a bunch about the Hugging Face ecosystem (one of the most important platforms in the modern AI era).
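Before reaching for a transformer, it can help to see the shape of the task itself. Here's a toy keyword baseline for "food not food" (nothing like the course's actual Hugging Face model, just an illustration of the sentence → label mapping):

```python
# Toy baseline for the "Food Not Food" task: label a sentence "food"
# if it mentions a known food word, else "not_food". The real solution
# fine-tunes a Hugging Face Transformer on labelled examples instead.
FOOD_WORDS = {"pizza", "apple", "sushi", "bread", "coffee", "salad", "pasta"}

def food_not_food(sentence: str) -> str:
    tokens = {token.strip(".,!?").lower() for token in sentence.split()}
    return "food" if tokens & FOOD_WORDS else "not_food"

print(food_not_food("I had sushi and coffee for lunch"))   # food
print(food_not_food("My car broke down on the highway"))   # not_food
```

A keyword list breaks as soon as the wording varies ("I grabbed a bite"), which is exactly why you train a model on labelled data instead.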

If you're interested and want to learn more, check out the following:

P.S. Don't forget you can always ask me questions on the ZTM Discord. My handle is "@mrdbourke".

From the Internet

1. [Blog] Why Copilot is making programmers worse at programming by Darren Horrocks

AI and LLM-powered tools such as GitHub Copilot can be incredibly helpful programming assistants.

I've used Copilot extensively but have since taken a break from it because I found myself relying on it too much.

Yes, it was incredibly helpful for simple things like remembering the matplotlib API (e.g. creating a subplot of 10 images with different indexes).
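For reference, that kind of matplotlib boilerplate looks something like this (a 2×5 grid of random placeholder images; the figure size and layout choices here are my own):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
images = [rng.random((28, 28)) for _ in range(10)]  # placeholder "images"

# A 2x5 grid of subplots, one image per axis, each titled with its index
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(10, 4))
for i, (ax, img) in enumerate(zip(axes.flat, images)):
    ax.imshow(img, cmap="gray")
    ax.set_title(f"Image {i}")
    ax.axis("off")
fig.tight_layout()
```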

However, there were several times I got too confident with it and let it write a little too much code that I ended up having to debug.

In the end it may have been quicker to just write it thoughtfully myself.

Copilot tools help me write more code but often what I'm looking to do is write more thoughtful code.

A quote from the article:

AI assistants, while useful, often allow developers to bypass these steps. For instance, rather than deeply understanding the underlying structure of algorithms or learning how to write efficient loops and recursion, programmers can now just accept auto-generated code snippets.

One of the most significant risks of relying on tools like Copilot is the gradual erosion of fundamental programming skills. In the past, learning to code involved hands-on problem-solving, debugging, and a deep understanding of how code works at various levels, from algorithms to low-level implementation details.

There are many more good arguments in the article too.

Like the one about learning opportunities arising when you have to search for the answer rather than it being presented to you.

2. ColPali + ColQwen2 change the document retrieval game (improve your multimodal RAG system)

A new way of performing document retrieval emerged a couple of months ago.

And that's ColPali (first mentioned in ML/AI Monthly July 2024).

The idea of ColPali is: embed the whole page of a document (e.g. a PDF), text, figures, images and all, and then match query embeddings to the embedded page.

This is a much less complicated system than using OCR (Optical Character Recognition) to recognize text, then layout models to recognize the layout, and then chunking the text into small pieces to embed.

I made a tutorial on RAG (Retrieval Augmented Generation) that uses the text chunking method above.

However, that method only works if the text you get is already formatted well.

Instead, ColPali (and now ColQwen2) enables you to embed each page of a document, embed your query, compare the query embedding to the page embeddings and return the best-matching pages.

You can then use a VLM (vision-language model) to answer questions about the relevant pages.


An overview of how ColPali/ColQwen2 work in practice. Source: Daniel van Strien's blog.

If your PDFs and documents are rich in visual information as well as text, I'd recommend trying ColPali/ColQwen2 to power your RAG system.
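Under the hood, the matching step is ColBERT-style "late interaction": a page is scored by taking, for each query token embedding, its best similarity to any patch embedding on that page, then summing. A minimal numpy sketch (random vectors stand in for real ColPali/ColQwen2 outputs):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between a query and one page.
    query_emb: (num_query_tokens, dim), page_emb: (num_patches, dim),
    rows L2-normalised so dot products are cosine similarities."""
    sims = query_emb @ page_emb.T          # (num_query_tokens, num_patches)
    return float(sims.max(axis=1).sum())   # best patch per query token, summed

def normed(shape, rng):
    x = rng.normal(size=shape)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = normed((8, 128), rng)                        # 8 query token embeddings
pages = [normed((196, 128), rng) for _ in range(3)]  # 3 pages, 196 patches each
best_page = max(range(len(pages)), key=lambda i: maxsim_score(query, pages[i]))
```

The real models produce these embeddings from page images and query text; the retrieval step itself is just this scoring loop over pages.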

For more on these workflows, check out the following:

3. Pixels have area by Lucas Beyer

Lucas Beyer is one of my favourite computer vision (and now VLM) researchers.

I'd highly recommend checking out his Twitter/X for papers + other ML tidbits.

His latest post shares an insight I hadn't thought about.

Pixels have an area.

And if you're not aware of this, it may throw your evaluation metrics off (e.g. imagine an off-by-one error across thousands/millions of pixels in a large image dataset).

4. Beating GPT-4 with open-source models by .txt

The team at .txt (pronounced "dot tee ex tee") show how to fine-tune a small language model (Mistral/Phi-3) to beat GPT-4 on function calling (a technique where a language model learns to call a coded function to perform some kind of action).

And if you'd like to learn more about generating structured outputs (e.g. JSON) with LLMs, I'd recommend checking out the rest of their blog posts.
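If function calling is new to you: the model is trained to emit a structured call (usually JSON) that your own code parses and dispatches. The tool names and JSON shape below are made up for illustration:

```python
import json

# Hypothetical tools the model is allowed to call
TOOLS = {
    "add": lambda a, b: a + b,
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch(llm_output: str):
    """Parse a JSON function call emitted by the model and run it."""
    call = json.loads(llm_output)
    return TOOLS[call["name"]](**call["arguments"])

# What a fine-tuned model might be trained to emit for "what is 2 + 3?"
result = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')
```

Structured generation libraries like .txt's Outlines constrain decoding so the model can only emit output matching your schema, which is a big part of why small fine-tuned models become competitive here.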

5. Tools on Hugging Face you may not be aware of by Derek Thomas

Hugging Face is the place to be for open-source models, datasets and demos.

But there are many more tools the platform offers such as Webhooks (do something when something else is changed), ZeroGPU (free GPU compute for demos), multi-process Docker (run two different applications simultaneously) and more.

In a recent blog post, Derek Thomas walks you through how to create a workflow that pulls data from a source, processes it and stores it as a processed dataset, all using free Hugging Face tools.


A workflow for creating a continually updated and processed dataset on Hugging Face through the power of Webhooks, ZeroGPU and Docker by Derek Thomas.

6. To use an LLM or not to use an LLM? (do's and don'ts of Generative AI)

For those who are new to the field of AI, it may seem as though LLMs (Large Language Models) are all there is.

It is true that LLMs are very useful, but there are many other kinds of AI models used for a wide range of problems.

To learn more about the different use cases of AI as well as the different kinds of models to use for each, check out Christopher Tao's article Do Not Use LLMs or Generative AI For These Use Cases.

7. How do you evaluate LLMs (and other AI models)? by Clémentine Fourrier

It's one thing to use an LLM in an existing product such as ChatGPT, but it's another thing to evaluate your own LLM before publishing and sharing it with others.

Clémentine Fourrier writes for the Hugging Face blog about several ways LLM evaluation gets done, such as:

  • Benchmarks (e.g. academic datasets with automated testing).
  • Human as judge (ask people to rate different model outputs against each other).
  • Model as judge (use LLMs/trained models to rate the outputs of other models).

The article also discusses the why behind evaluation, such as making sure your training is performing well, seeing which model performs best (leaderboards/comparisons) and tracking where the field overall is going (can a model do X yet?).
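The benchmark style of evaluation is the easiest to sketch: compare model answers against references and report a score. Toy data below, and exact matching only (real benchmarks typically use more forgiving matching or a judge model):

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference
    (case-insensitive, whitespace-trimmed)."""
    pairs = list(zip(predictions, references))
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in pairs)
    return hits / len(pairs)

preds = ["Paris", "4", "blue whale"]   # imagined model outputs
refs = ["paris", "4", "Blue Whale"]    # benchmark reference answers
accuracy = exact_match_accuracy(preds, refs)
```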

8. Anthropic showcases a new contextual retrieval technique for improved retrieval results

Quick breakdown: rather than just embedding a chunk of text, get a model to generate the context related to that text and embed that as well.

That way, when you have a query, you can retrieve not only the relevant chunk but also the context around it.

Stacking this method alongside a handful of other tricks led to significant improvement across several retrieval experiments.

Techniques included:

  • Retrieve the top 20 relevant samples (this proved better than top 5 or 10).
  • Use a combination of contextual embeddings, contextual BM25 and a reranker for the best results.
  • Make sure your embedding model can handle the added context: if its context window is too short, your sequence may get truncated (e.g. if your embedding model can handle 512 tokens but your input text is 1,000 tokens, only the first 512 tokens will get used).
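In code, the core trick is just "prepend generated context to each chunk before indexing". A sketch where `generate_context` is a stub standing in for the LLM call Anthropic describes:

```python
def generate_context(document: str, chunk: str) -> str:
    # Stub: in practice an LLM writes a short sentence situating the
    # chunk within the overall document.
    return f"From a document beginning '{document[:40]}...':"

def contextualize(document: str, chunks: list[str]) -> list[str]:
    """Prepend situating context so each chunk stands alone for BM25/embeddings."""
    return [f"{generate_context(document, chunk)} {chunk}" for chunk in chunks]

doc = "ACME Corp Q2 2023 financial report. Revenue grew 3% over the quarter."
chunks = ["Revenue grew 3% over the quarter.", "Operating costs fell slightly."]
indexed = contextualize(doc, chunks)  # embed/index these instead of the raw chunks
```

The contextualized strings are what you embed and feed to BM25; at query time you retrieve as usual and optionally rerank the candidates.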


Workflow for contextual retrieval with reranking for the best results. Source: Anthropic blog.

9. Open-source tools, guides and models

The world of open-source VLMs continues to flourish!

  • Allen AI releases Molmo, a series of state-of-the-art VLMs with permissive licences. One of my favourite things was the "data quality wins over quantity" point in the blog post. The researchers collected a series of 712k images with very high quality captions as the basis of their model. Although seemingly large, this dataset is 100-1000x smaller than other datasets used to train similar quality models. Their largest model Molmo-72B is on par or better than GPT-4o and Claude 3.5 Sonnet on a wide range of VLM and LLM benchmarks.
  • Ovis1.6-Gemma2-9B is an open-source VLM that combines Gemma 2 and SigLIP for outstanding results.
  • Maestro is a fast-developing open-source library for fine-tuning VLMs. See an example Jupyter Notebook for fine-tuning Florence2.
  • Reader-LM-1.5 (and reader-lm-0.5) are open-source language models that are designed to turn HTML into markdown. Bonus: there's also an open-source Python library, markdownify on GitHub, for doing the same thing in a manual way.
  • SAM2-Studio is a native Mac application for running SAM 2 (Segment Anything 2) locally to create segmentation masks for images.
  • RobustSAM is an open-source model that improves SAM (Segment Anything) masks to be higher quality in noisy images.
  • Marqo shares a guide to fine-tuning CLIP models for your own use case.
  • ZML is a high-performing inference stack for machine learning built on top of the programming language Zig.
  • Wikimedia open-sources a structured version of the Wikipedia English and French entries on Hugging Face.
  • Finegrain open-sources a high-quality box cutting model on Hugging Face. This model is able to cut out items from images at high quality given a text prompt, e.g. "cup of coffee". See the demo to try the model out.
  • Apple publishes Mycelium, an open-source visualization tool for neural networks.
  • Stability AI releases a tutorial on how to fine-tune a Stable Diffusion 3 Medium model.

10. Talks and presentations

11. Bonus: strange internet samples

When you're surprised at the outputs of an LLM, VLM or image generator, just remember that the internet runs incredibly deep.

As in, no matter how weird you think the output of a generative model may be, chances are, there's something like it in its training set.

For example, there's an image of a dog in a microwave (thankfully it doesn't look real) as sample 564969 in the very commonly used COCO dataset (1,405 research papers have cited this dataset in 2024 as of September).

Not the strangest example on the internet for sure. But even common academic benchmarks have things you may not have ever thought of.


See you next month!

What a massive month for the ML world in September!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

More from Zero To Mastery

The No BS Way To Getting A Machine Learning Job

Looking to get hired in Machine Learning? Our ML expert tells you how. If you follow his 5 steps, we guarantee you'll land a Machine Learning job. No BS.

6-Step Framework To Tackle Machine Learning Projects (Full Pipeline)

Want to apply Machine Learning to your business problems but not sure if it will work or where to start? This 6-step guide makes it easy to get started today.

Python Monthly Newsletter 💻🐍

58th issue of Andrei Neagoie's must-read monthly Python Newsletter: Python History, Python Drama, Python Built-Ins, and much more. Read the full newsletter to get up-to-date with everything you need to know from last month.