59th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I'm an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
First of all, happy holidays!
Second of all, I recently entered a Kaggle Competition to use Google's Gemini API for a "long context" use case.
Long context means harnessing Gemini's ability to process a large number of input tokens (e.g. Gemini Flash can handle 1M input tokens and Gemini Pro can handle 2M input tokens).
The more input tokens = the more information the model can use to produce an output.
I had an insurance company call me recently and ask how much my home and contents insurance should cover.
And I said, "good question…"
So I recruited Gemini's help.
I took a video lap of my house (Gemini can handle video inputs) and asked Gemini to output every major item, the timestamp it appears at and an estimated worth, all in a structured format.
And it turns out it worked!
As a bonus, Gemini can also create bounding boxes around each of the target items.
You can watch the video breakdown of the project on YouTube and get all of the code/prompts to make it happen on Kaggle.
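If you'd like to try something similar, here's a minimal sketch of the workflow using the google-generativeai Python SDK. The file name, prompt wording and model choice are assumptions; see the Kaggle notebook for the exact code and prompts.

# A minimal sketch: upload a walkthrough video to Gemini and ask for structured outputs.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the walkthrough video (Gemini handles video inputs natively)
video_file = genai.upload_file(path="house_walkthrough.mp4")  # placeholder file name
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

model = genai.GenerativeModel("gemini-1.5-flash")

prompt = (
    "List every major item in this video. For each item return JSON with: "
    "item_name, timestamp, estimated_value."
)

response = model.generate_content([video_file, prompt])
print(response.text)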
On the whole, Gemini is very cool. In the past, the capability to process videos in such a way would've required several machine learning models (e.g. one for vision, another for transcription and another for extracting item names from text)…
Plus a lot more engineering work!
One of the most important things about any AI project is having a good set of evaluations.
This means whenever a new model comes along, you can evaluate it against your own use case and see if it's worth adopting or not.
Good evaluations take time to create.
But so they should.
Evaluations are test cases for your AI product.
Models change but evals last forever.
To bootstrap evaluations, you can use an LLM as a judge.
For example, pass a sample through an LLM and get it to rate how high quality that sample was.
The actual workflow will depend on what your ideal data inputs and outputs are.
Once you've got some samples that have been rated by an LLM as a judge, you can continually refine this procedure to create better and better test cases.
Hamel Husain's guide on creating an LLM as a judge to evaluate your samples walks through this process.
Bonus: See Hamel's blog post Your AI Product Needs Evals for more on evaluations.
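As a rough illustration (a simplified sketch, not Hamel's exact workflow), an LLM-as-a-judge loop can be as simple as a prompt asking a strong model to pass or fail each sample against your criteria. The model name, sample data and criteria below are placeholders.

# A simplified LLM-as-a-judge sketch (not Hamel's exact workflow; model, samples
# and criteria are placeholders).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-pro")

samples = [
    {"input": "Summarise our refund policy.",
     "output": "Refunds are available within 30 days of purchase..."},
]

for sample in samples:
    judge_prompt = (
        "You are judging the quality of an AI assistant's answer.\n"
        f"Question: {sample['input']}\n"
        f"Answer: {sample['output']}\n"
        "Reply with PASS or FAIL for accuracy and helpfulness, then a short critique."
    )
    verdict = judge.generate_content(judge_prompt)
    print(verdict.text)

Reading the judge's critiques (and correcting them by hand) is how you refine these into proper test cases over time.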
One of my favourite things is using machine learning to enhance an existing workflow.
Airbnb have some great use cases for that.
To create their "Room Tour" feature, which enables someone to see the different rooms in a stay, they created a computer vision model to classify rooms into different categories (e.g. "Bedrooms", "Bathrooms", "Kitchen").
To do so, they leveraged their existing database of many different photos of rooms and fine-tuned a Vision Transformer to output their custom classes.
Some cool findings: the error rate dropped by ~5% for every doubling of the dataset size, and ensemble learning (combining models) combined with model distillation (teaching a smaller model to copy a larger model) improved results the most.
Airbnb's computer vision error rate declining based on the amount of data given to the model. Source: Airbnb tech blog.
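Airbnb's exact code isn't public, but fine-tuning a Vision Transformer on custom classes looks roughly like this with Hugging Face transformers (the checkpoint, class names and image path below are placeholders):

# A rough sketch of fine-tuning a Vision Transformer for custom image classes
# (not Airbnb's actual code; checkpoint, labels and image path are placeholders).
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

labels = ["bedroom", "bathroom", "kitchen", "living_room", "other"]
checkpoint = "google/vit-base-patch16-224-in21k"

processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# From here you'd fine-tune on labelled room photos (e.g. with the Trainer API),
# then classify new listing photos like this:
image = Image.open("listing_photo.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(-1).item()])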
Ameru.ai is a company creating smart bins: bins that use computer vision to identify and sort waste into several categories.
The goal?
Make sure the right waste goes to the right place.
To do so, the company uses Label Studio to label data, PyTorch to train their computer vision models, and an NVIDIA Jetson Nano with an 8MP camera mounted to the bin to make predictions on the waste.
I absolutely love this use case!
See the demo video for footage of the bins in action or read the blog post for an expanded explanation of the product.
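The inference side of a setup like this can be surprisingly small. Here's a rough sketch (not Ameru's actual code; the weights file, class names and camera index are placeholders):

# A rough sketch of on-device waste classification (not Ameru's actual code;
# weights file, class names and camera index are placeholders).
import cv2
import torch
from torchvision import transforms

classes = ["landfill", "recycling", "compost"]
model = torch.load("waste_classifier.pt", map_location="cpu")  # a fine-tuned PyTorch model
model.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

camera = cv2.VideoCapture(0)  # the bin-mounted camera
ok, frame = camera.read()
if ok:
    inputs = preprocess(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).unsqueeze(0)
    with torch.no_grad():
        prediction = classes[model(inputs).argmax(-1).item()]
    print(f"Detected waste category: {prediction}")
camera.release()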
Machine learning in the browser is becoming more and more possible!
With Transformers.js v3 you can now directly install it via NPM:
// Install the library
npm i @huggingface/transformers
// Import the library
import { pipeline } from "@huggingface/transformers";
// Use the library
// Create a feature-extraction pipeline
const extractor = await pipeline(
"feature-extraction",
"mixedbread-ai/mxbai-embed-xsmall-v1",
{ device: "webgpu" },
);
// Compute embeddings (runs on the local GPU straight through the browser!!)
const texts = ["Hello world!", "This is an example sentence."];
const embeddings = await extractor(texts, { pooling: "mean", normalize: true });
console.log(embeddings.tolist());
There are many upgrades in v3, including WebGPU support, 120 supported model architectures (including Whisper, Phi-3 and Florence-2) and 25 new example projects and templates.
For example, check out the demo of OpenAI's Whisper (a model for speech-to-text) running directly in the browser (I tried it on a sample of my own audio and it worked incredibly well/fast on my MacBook Pro M1).
Finbarr Timbers writes a short history of Vision-Language Models and how the landscape has converged to a similar structure over the past two years (hint: combine an LLM with a vision encoder).
Steps for creating a modern VLM. Source: Finbarr Timbers AI blog.
A great read for those looking to learn more about VLMs, including recent models such as Mistral's Pixtral.
Ever wonder why the Transformer architecture requires a positional encoding?
Well, Christopher Fleetwood's blog post paints an incredible picture of how it works.
It starts by explaining other forms of positional encoding such as absolute and sinusoidal.
And then talks about where these approaches might fall down.
Before going on to explain how RoPE (Rotary Position Embedding) makes up for the weaknesses of the others and why it's used in many modern Transformer architectures today.
The blog post is full of great visuals and extra resources to learn more about positional embeddings and their history.
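If you'd like to see the core idea in code, here's a minimal, unoptimised sketch of RoPE: each pair of feature dimensions in a query or key vector gets rotated by an angle that depends on its position (an illustration of the technique, not the implementation from the post).

# A minimal, unoptimised RoPE sketch: rotate pairs of dimensions by a position-dependent angle.
import torch

def apply_rope(x, base=10000):
    # x: (seq_len, dim) query or key vectors, dim must be even
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per pair of dimensions
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    # Angle for each (position, pair) combination
    angles = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1) * inv_freq  # (seq_len, half)
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat(
        [x1 * torch.cos(angles) - x2 * torch.sin(angles),
         x1 * torch.sin(angles) + x2 * torch.cos(angles)],
        dim=-1,
    )

q = torch.randn(8, 64)     # 8 tokens, 64-dim queries
q_rotated = apply_rope(q)  # dot products between rotated vectors now encode relative position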
Docling takes in PDF, HTML, Word or other similar documents and outputs the contents as JSON or Markdown.
How?
Using two open-source document analysis models:
If you need a way to process large amounts of PDF documents into more manageable markdown and JSON files, try docling out.
Docling processing pipeline from the Docling technical report.
I've tried it a few times already and it seems to work really well.
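The usage I tried was along these lines (a minimal sketch based on docling's documented Python API; check the docling docs for the current interface, and the file names are placeholders):

# A minimal docling usage sketch (based on the project's documented Python API;
# check the docling docs for the current interface, file names are placeholders).
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")  # also accepts URLs, Word documents, HTML, etc.

markdown = result.document.export_to_markdown()
with open("report.md", "w") as f:
    f.write(markdown)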
Anthropic's new prompt improver goes through a ~6-step process to generate examples, create a Mermaid flowchart of the steps (Mermaid is like a combination of a programming and a diagram language) and then rewrite the overall prompt a few times for instruction-following and clarity.
Anthropicโs new prompt improver tool. Available in the Anthropic developer console.
We're now entering the era where local (local meaning running on your own computer) LLMs and VLMs are going to start being the norm.
These models will have good baseline capabilities, however, they will still have to be guided towards your own specific use case.
In Daniel van Strien's case, he wanted to organize all of the screenshots on his desktop.
A simple task if you've got some spare time, but also a task that would be a good challenge for recent VLMs.
He combined LM Studio with the Pixtral VLM from Mistral and gave it instructions to sort screenshots into various categories as well as write down information from them.
Aside: I wonder how far you could take this with SigLIP (a model that has been trained to contrast vision and language)? E.g. use SigLIP to zero-shot categorise the images into ["meme", "document", "receipt", "diagram", "unknown"], where the "unknown" option could be assigned to images scoring below a certain threshold.
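A quick sketch of what that aside might look like with the Hugging Face zero-shot image classification pipeline (the checkpoint, screenshot path and threshold are assumptions you'd tune):

# A quick sketch of the SigLIP aside above (checkpoint, image path and threshold are assumptions).
from PIL import Image
from transformers import pipeline

classifier = pipeline("zero-shot-image-classification",
                      model="google/siglip-base-patch16-224")
labels = ["meme", "document", "receipt", "diagram"]

image = Image.open("screenshot.png")  # placeholder path
results = classifier(image, candidate_labels=labels)  # list of {"label", "score"} sorted by score

# SigLIP scores each label independently (sigmoid), so a low top score can map to "unknown"
top = results[0]
category = top["label"] if top["score"] > 0.1 else "unknown"
print(category, top)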
Workflow for organizing screenshots with a VLM. First the VLM reads and analyses the screenshot and then it gets moved to the appropriate location based on the outputs of the VLM. Source: Daniel van Strien blog.
RAG = Retrieval Augmented Generation.
In other words, retrieving documents related to a query and using them to improve LLM outputs.
For example, you could ask "what are our policies related to damaged delivery boxes?" and the retrieval system would look up the company documentation on dealing with damaged boxes and then return the highest scoring related resources.
These resources could then be used as input to an LLM to generate a response based on the actual company documentation.
All of this could be processed with text only.
However, what if there were some images showcasing examples of damaged boxes too?
That's where vision-based RAG comes in.
Instead of only looking for text, vision-based RAG looks for both text and images/figures.
It does this by embedding the screenshot/image of the page/resource and then performing similarity search on those embeddings with the query.
The results so far show that vision-based RAG can provide substantial improvements over pure text-based RAG.
But of course, this will all be dependent on your use case, so make sure to always test on your own problems.
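At its core, the retrieval step is a similarity search over embeddings. Here's a minimal sketch assuming you already have a model that can embed both text queries and page images into the same space (e.g. a ColPali/SigLIP-style model); the embeddings below are random placeholders.

# A minimal sketch of the retrieval step in vision-based RAG. It assumes page images
# were embedded offline with a vision-language embedding model (placeholder data here).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_embedding, page_embeddings, top_k=3):
    # page_embeddings: list of (page_id, vector) pairs computed offline from page screenshots
    scored = [(page_id, cosine_similarity(query_embedding, vector))
              for page_id, vector in page_embeddings]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

pages = [("page_1", np.random.rand(128)), ("page_2", np.random.rand(128))]
query = np.random.rand(128)
print(retrieve(query, pages))

# The top-k page images (plus any extracted text) then go to a multimodal LLM
# alongside the original question to generate the final answer.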
Two resources I've found helpful if you'd like to learn more about vision-based RAG systems:
VisRAG (visual retrieval augmented generation) overall workflow. Source: VisRAG GitHub.
Models trained on datasets generated with gemini-1.5-flash-002 performed better in some cases, while models trained on Claude Sonnet-generated datasets performed better in others. Goes to show that not all LLM output data is created equal.
I really liked this note in the Vespa visual RAG in practice blog post:
Building an AI system that works is one thing.
Making sure itโs a nice user experience is the next step.
What a massive month for the ML world in November!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.