๐ŸŽ Give the #1 gift request of 2024... a ZTM membership gift card! ๐ŸŽ

AI & Machine Learning Monthly Newsletter

Daniel Bourke

59th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I'm an A.I. & Machine Learning Engineer who also teaches beginner-friendly machine learning courses.

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in November 2024 as an A.I. & Machine Learning Engineer... let's get you caught up!

My Work 👇

First of all, happy holidays!

Second of all, I recently entered a Kaggle Competition to use Google's Gemini API for a "long context" use case.

Long context means harnessing Gemini's ability to process a large number of input tokens (e.g. Gemini Flash can handle 1M input tokens and Gemini Pro can handle 2M input tokens).

The more input tokens = the more information the model can use to produce an output.

I had an insurance company call me recently and ask how much my home and contents insurance should cover.

And I said, "good question…"

So I recruited Geminiโ€™s help.

I took a video lap (Gemini can handle video inputs) of my house and asked Gemini to output every major item and the timestamp it occurred at as well as an estimated worth in a structured format.

And it turns out it worked!

As a bonus, Gemini can also create bounding boxes around each of the target items.

You can watch the video breakdown of the project on YouTube and get all of the code/prompts to make it happen on Kaggle.
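
If you just want the gist of how it fits together, here's a minimal sketch using the google-generativeai Python SDK. The file name, model choice and prompt wording below are my own placeholders rather than the exact ones from the notebook.

import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Upload the walkthrough video via the Gemini File API and wait for processing
video_file = genai.upload_file(path="house_walkthrough.mp4")  # hypothetical file name
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

# Ask for a structured inventory: item, timestamp, estimated value
prompt = (
    "List every major household item you see in the video. "
    "Return JSON with fields: item_name, timestamp, estimated_value."
)

model = genai.GenerativeModel("gemini-1.5-flash")  # long-context model
response = model.generate_content([video_file, prompt])
print(response.text)  # JSON-like inventory you can post-process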

On the whole, Gemini is very cool. In the past, the capability to process videos in such a way would've required several machine learning models (e.g. one for vision, another for transcription and another for extracting item names from text)…

Plus a lot more engineering work!

From the Internet

1. How to create an LLM as a judge by Hamel Husain

One of the most important things about any AI project is having a good set of evaluations.

This means whenever a new model comes along, you can evaluate it against your own use case and see if it's worth adopting or not.

Good evaluations take time to create.

But so they should.

Evaluations are test cases for your AI product.

Models change but evals last forever.

To bootstrap evaluations, you can use an LLM as a judge.

For example, pass a sample through an LLM and get it to rate how high quality that sample was.

The actual workflow will depend on what your ideal data inputs and outputs are.

Once you've got some samples that have been rated by an LLM as a judge, you can continually refine this procedure to create better and better test cases.
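
As a rough illustration of a single judge call (not Hamel's exact setup), here's what it might look like with the OpenAI Python client; the model name, rubric and 1-5 scale are placeholders you'd tune for your own product.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str) -> str:
    """Ask an LLM to rate an answer against a simple rubric."""
    rubric = (
        "You are grading an AI assistant's answer.\n"
        "Rate the answer from 1 (poor) to 5 (excellent) for correctness and helpfulness.\n"
        "Reply with the score followed by a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return response.choices[0].message.content

print(judge("What is RoPE?", "RoPE is a rotary positional embedding used in Transformers."))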

Hamel Husain's guide on creating an LLM as a judge to evaluate your samples walks through this process.

Bonus: See Hamel's blog post Your AI Product Needs Evals for more on evaluations.

2. How Airbnb use Vision Transformers (ViTs) for creating listing room tours

One of my favourite things is using machine learning to enhance an existing workflow.

Airbnb have some great use cases for that.

To create their "Room Tour" feature which enables someone to see different rooms in a stay, they created a computer vision model to classify the rooms into different categories (e.g. "Bedrooms", "Bathrooms", "Kitchen").

To do so, they leveraged their existing database of many different photos of rooms and fine-tuned a Vision Transformer to output their custom classes.

Some cool findings: the error rate dropped ~5% for every doubling of the data size, and ensemble learning (combining models) combined with model distillation (teaching a smaller model to copy a larger model) improved results the most.

Airbnb's computer vision error rate declining based on the amount of data given to the model. Source: Airbnb tech blog.
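
Airbnb's exact training code isn't public, but the general recipe of fine-tuning a pre-trained ViT on custom room classes looks roughly like this with the Hugging Face transformers library (the checkpoint, image path and class names below are my placeholders).

import torch
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image

# Hypothetical room classes; Airbnb's real label set is larger
labels = ["bedroom", "bathroom", "kitchen"]

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
    ignore_mismatched_sizes=True,  # swap the original 1000-class head for our own
)

# One training step on a single (image, label) pair; in practice you'd use a Trainer/DataLoader
image = Image.open("listing_photo.jpg")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs, labels=torch.tensor([labels.index("bedroom")]))
outputs.loss.backward()  # continue fine-tuning with your optimizer of choice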

3. Using Computer Vision for Waste Management

Ameru.ai is a company creating smart bins: bins that use computer vision to identify and sort waste into several categories.

The goal?

Make sure the right waste goes to the right place.

To do so, the company uses Label Studio to label data, PyTorch to train their computer vision models, and an Nvidia Jetson Nano computer with an 8MP camera mounted to the bin to make predictions on the waste.
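
Ameru's actual models and code aren't public, but the on-device prediction loop for this kind of setup looks roughly like this with PyTorch/torchvision (the weights file, camera frame and waste categories below are placeholders).

import torch
from torchvision import models, transforms
from PIL import Image

# Hypothetical waste categories; Ameru's real label set will differ
classes = ["compost", "recycling", "landfill"]

# A small CNN is a reasonable fit for an edge device like the Jetson Nano
model = models.mobilenet_v3_small(num_classes=len(classes))
model.load_state_dict(torch.load("waste_classifier.pt", map_location="cpu"))  # placeholder weights
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# In production this frame would come from the bin's camera
frame = Image.open("bin_camera_frame.jpg")
with torch.no_grad():
    logits = model(preprocess(frame).unsqueeze(0))
print(classes[logits.argmax(dim=1).item()])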

I absolutely love this use case!

See the demo video for footage of the bins in action or read the blog post for an expanded explanation of the product.

Ameru's waste labelling workflow with Label Studio.

4. HuggingFace release transformers.js v3

Machine learning in the browser is becoming more and more possible!

With Transformers.js v3, you can now install it directly via NPM:

// Install the library
npm i @huggingface/transformers

// Import the library
import { pipeline } from "@huggingface/transformers";

// Use the library 
// Create a feature-extraction pipeline
const extractor = await pipeline(
  "feature-extraction",
  "mixedbread-ai/mxbai-embed-xsmall-v1",
  { device: "webgpu" },
); 

// Compute embeddings (runs on the local GPU straight through the browser!!)
const texts = ["Hello world!", "This is an example sentence."];
const embeddings = await extractor(texts, { pooling: "mean", normalize: true });
console.log(embeddings.tolist());

There are many upgrades in v3, including WebGPU support, 120 supported model architectures (including Whisper, Phi-3 and Florence-2) and 25 new example projects and templates.

For example, check out the demo of OpenAI's Whisper (a model for speech-to-text) running directly in the browser (I tried it on a sample of my own audio and it worked incredibly well/fast on my MacBook Pro M1).

5. Vision Language Models (VLMs) explored and explained by Finbarr Timbers

Finbarr Timbers writes a short history of Vision-Language Models and how the landscape has converged to a similar structure over the past two years (hint: combine an LLM with a vision encoder).


Steps for creating a modern VLM. Source: Finbarr Timbers AI blog.

A great read for those looking to learn more about VLMs, including recent models such as Mistral's Pixtral.
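
To make the "LLM + vision encoder" recipe concrete, here's a heavily simplified PyTorch sketch of the common pattern: encode the image, project its features into the LLM's embedding space, then treat them as extra tokens in front of the text. Real VLMs differ in the details, so read this as an illustration only.

import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """A toy illustration of the common VLM structure, not a real model."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a ViT/SigLIP-style image encoder
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps image features into the LLM's embedding space
        self.llm = llm                                   # a decoder-only language model

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patch_features = self.vision_encoder(image)       # (batch, num_patches, vision_dim)
        image_tokens = self.projector(patch_features)      # (batch, num_patches, llm_dim)
        # Prepend the projected image patches as extra "tokens" before the text
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs)                             # next-token logits as usual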

6. A guide to RoPE (Rotary Positional Embeddings)

Ever wonder why the Transformer architecture requires a positional encoding?

Well, Christopher Fleetwood's blog post paints an incredible picture of how it works.

It starts by explaining other forms of positional encoding such as absolute and sinusoidal.

And then talks about where these approaches might fall down.

Before then explaining how RoPE makes up for the weaknesses of the others and why RoPE is used in many modern Transformer architectures today.

The blog post is full of great visuals and extra resources to learn more about positional embeddings and their history.
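
If you prefer code to prose, here's a minimal sketch of the core RoPE operation: rotate consecutive pairs of dimensions of the query/key vectors by an angle that grows with position, which is what makes attention scores depend on relative position. The blog post covers the reasoning behind it far better.

import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of dimensions of x (seq_len, dim) by position-dependent angles."""
    seq_len, dim = x.shape
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)           # (seq_len, 1)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)       # (dim/2,)
    angles = positions * inv_freq                                                  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin   # standard 2D rotation applied to each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Rotating queries and keys this way makes their dot product depend on relative position
q = apply_rope(torch.randn(16, 64))
k = apply_rope(torch.randn(16, 64))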

7. [Tools 🛠] Docling is a new open-source library for getting documents ready for AI models and RAG pipelines

Docling takes in PDF, HTML, Word or other similar documents and outputs the contents as JSON or Markdown.

How?

Using two open-source document analysis models:

  1. Layout analysis model – this model uses RT-DETR to find the layout components of a document, for example, captions, footnotes, formulas, text, titles, tables and more.
  2. TableFormer – a transformer trained to recover the structure of a table from an image of it.

If you need a way to process large amounts of PDF documents into more manageable Markdown and JSON files, try Docling out.


Docling processing pipeline from the Docling technical report.
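
Usage itself is only a few lines. Here's a sketch based on my read of the Docling README at the time of writing; the input file is a placeholder and the exact API may have changed since, so check the repository.

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("annual_report.pdf")  # placeholder path

# Export the parsed document to Markdown (JSON export is also available)
markdown = result.document.export_to_markdown()
print(markdown[:500])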

8. [Tool 🔨] Anthropic Introduces a Prompt Improver Tool

I've tried it a few times already and it seems to work really well.

It goes through a ~6-step process to generate examples, create a Mermaid flowchart of the steps (Mermaid is a text-based diagramming language) and then rewrites the overall prompt a few times for clarity and better instruction-following.


Anthropic's new prompt improver tool. Available in the Anthropic developer console.

9. Daniel van Strien shows how you can run a local VLM to organize your screenshots

We're now entering the era where local (meaning running on your own computer) LLMs and VLMs are going to start being the norm.

These models will have good baseline capabilities, however, they will still have to be guided towards your own specific use case.

In Daniel van Strien's case, he wanted to organize all of the screenshots on his desktop.

A simple task if you've got some spare time but also a task that would be a good challenge for recent VLMs.

He combined LM Studio with the Pixtral VLM from Mistral and gave it instructions to sort screenshots into various categories as well as write down information from them.

Aside: I wonder how far you could take this with SigLIP (a model that has been trained to contrast vision and language)? E.g. use SigLIP to zero-shot categorise the images into ["meme", "document", "receipt", "diagram", "unknown"] where the "unknown" option could be those below a certain threshold.
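
For what it's worth, here's roughly what that aside could look like with the transformers zero-shot image classification pipeline. The checkpoint name, screenshot path and 0.2 threshold are just my guesses, not something from Daniel's post.

from transformers import pipeline

# SigLIP checkpoints work with the zero-shot image classification pipeline
classifier = pipeline(
    "zero-shot-image-classification",
    model="google/siglip-base-patch16-224",  # placeholder checkpoint
)

candidate_labels = ["meme", "document", "receipt", "diagram"]
results = classifier("screenshot.png", candidate_labels=candidate_labels)

# SigLIP scores each label independently, so a simple threshold can act as an "unknown" bucket
best = max(results, key=lambda r: r["score"])
category = best["label"] if best["score"] > 0.2 else "unknown"
print(category)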


Workflow for organizing screenshots with a VLM. First the VLM reads and analyses the screenshot and then it gets moved to the appropriate location based on the outputs of the VLM. Source: Daniel van Strien blog.

10. Move over text-based RAG, here comes vision-based RAG

RAG = Retrieval Augmented Generation.

In other words, getting related documents to a query to improve LLM outputs.

For example, you could ask "what are our policies related to damaged delivery boxes?" and the retrieval system would look up the company documentation on dealing with damaged boxes and then return the highest scoring related resources.

These resources could then be used as input to an LLM to generate a response based on the actual company documentation.

All of this could be processed with text only.

However, what if there were some images showcasing examples of damaged boxes too?

Thatโ€™s where vision-based RAG comes in.

Instead of only looking for text, vision-based RAG looks for both text and images/figures.

It does this by embedding the screenshot/image of the page/resource and then performing similarity search on those embeddings with the query.

The results so far show that vision-based RAG can provide substantial improvements over pure text-based RAG.

But of course, this will all be dependent on your use case, so make sure to always test on your own problems.
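
To make the retrieval step concrete, here's a very rough sketch of the embed-and-search idea using a generic CLIP checkpoint from transformers. Real vision-based RAG systems (like those in the resources below) use retrievers purpose-built for document images, so treat this purely as an illustration; the page file names are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed page screenshots once, up front
pages = ["policy_page_1.png", "policy_page_2.png", "policy_page_3.png"]
images = [Image.open(p) for p in pages]
image_inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    page_embeds = model.get_image_features(**image_inputs)
page_embeds = page_embeds / page_embeds.norm(dim=-1, keepdim=True)

# Embed the query and retrieve the most similar page by cosine similarity
query = "what are our policies related to damaged delivery boxes?"
text_inputs = processor(text=[query], return_tensors="pt", padding=True)
with torch.no_grad():
    query_embed = model.get_text_features(**text_inputs)
query_embed = query_embed / query_embed.norm(dim=-1, keepdim=True)

scores = (query_embed @ page_embeds.T).squeeze(0)
print(pages[scores.argmax().item()])  # best-matching page to pass to the LLM/VLM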

Two resources I've found helpful if you'd like to learn more about vision-based RAG systems:

  1. Visual RAG in practice by Vespa AI
  2. VisRAG paper and GitHub repository.


VisRAG (visual retrieval augmented generation) overall workflow. Source: VisRAG GitHub.

🔥 Open-source & Research Rapid Fire Round

Bonus: When building AI apps, donโ€™t forget about the experience

I really liked this note in the Vespa visual RAG in practice blog post.

Building an AI system that works is one thing.

Making sure it's a nice user experience is the next step.


See you next month!

What a massive month for the ML world in November!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.
