AI & Machine Learning Monthly Newsletter - July 2024

Daniel Bourke

55th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:

I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in July 2024 as an A.I. & Machine Learning Engineer... let's get you caught up!

My Work 👇

Several new projects are in the works... Stay tuned to the ZTM website for more in the coming weeks (drop your email into that purpley blue box to be notified)! 🤗

From the Internet 🌐

1. Meta Release Segment Anything 2

Segment Anything 2 (SAM 2) is live and it’s open-source (Apache 2.0 license)!

SAM 2 is a foundation model capable of producing high-quality image segmentation masks.

You can create image segmentations via a point in an image (e.g. click on a subject and have the model draw a mask on it) or by using boxes as input (e.g. input a bounding box on a subject and have the model draw a mask on the subject within the box).

SAM 2 also works on video, meaning you can input a video, choose an item to segment in a single frame and have that item segmented throughout the video.


Example workflow of using SAM 2 to turn bounding boxes into segmentation masks. This kind of workflow can dramatically speed up data labelling. For example, you could start by drawing boxes (faster than segmentation labels) and then have SAM 2 create the segmentation masks for you.
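If you want to try that box-to-mask workflow yourself, here's a minimal sketch using the image predictor from Meta's segment-anything-2 repository (the config name, checkpoint path, image file and box coordinates below are placeholders, so check the repo's README for the exact values):

# Minimal sketch: prompt SAM 2 with a bounding box and get back a segmentation mask.
# Config/checkpoint names, the image file and the box coordinates are placeholders.
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Build the model from a downloaded checkpoint + config (see the SAM 2 repo)
sam2_model = build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
predictor = SAM2ImagePredictor(sam2_model)

# Load an image and compute its embedding once
image = np.array(Image.open("dog.jpeg").convert("RGB"))
predictor.set_image(image)

# Prompt with a bounding box (x_min, y_min, x_max, y_max) around the subject
box = np.array([100, 150, 400, 500])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape)  # (1, H, W) boolean mask for the boxed subject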

SAM 2 comes in four variants, all of which use the Hiera model (an efficient form of vision transformer) as a backbone:

  1. SAM 2 Tiny — 38.9M parameters, 149MB, 47.2 FPS (frames per second)
  2. SAM 2 Small — 46M parameters, 176MB, 43.4 FPS/53.0 FPS compiled
  3. SAM 2 Base Plus — 80.8M parameters, 309MB, 34.8 FPS/43.8 FPS compiled
  4. SAM 2 Large — 224.4M parameters, 856MB, 24.2 FPS/30.2 FPS compiled

For more on the SAM 2 models, I’d highly recommend checking out the following:

2. Meta Release Llama 3.1 Large Language Models

Meta continues their streak of open-sourcing incredible AI models with the new Llama 3.1 LLMs.

Llama 3.1 comes in three variants, each on par with the state of the art or best in class for its size:

  1. Llama 3.1 8B (8B = 8 billion parameters) — the smallest but still very capable variant, small enough to run on consumer GPUs.
  2. Llama 3.1 70B — a mid-tier variant for harder workloads.
  3. Llama 3.1 405B — the largest variant with performance on par with GPT-4 and Claude 3.5 Sonnet.

Llama 3.1 405B is perhaps the most impressive, as it’s the first open-source model to rival GPT-4 and Claude 3.5 Sonnet, and even outperform them on several tasks.

Developers are also able to use the outputs of Llama 3.1 405B for synthetic data generation and model distillation (using its outputs to improve other models).

The model architecture of Llama 3.1 405B follows the decoder-only transformer model architecture (rather than a mixture of experts).


The Llama 3.1 405B model architecture follows the decoder-only transformer model pattern. Source: Meta Llama 3.1 blog.
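If you want to get hands-on quickly, here's a minimal sketch of chatting with the 8B Instruct variant via Hugging Face transformers (it assumes you've accepted Meta's license for the gated model on the Hub and have a GPU available):

# Minimal sketch: chat with Llama 3.1 8B Instruct via the transformers pipeline.
# Assumes the gated model license has been accepted on the Hugging Face Hub.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain model distillation in two sentences."},
]

outputs = pipe(messages, max_new_tokens=128)
print(outputs[0]["generated_text"][-1]["content"])  # the assistant's reply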

Learn more and get your hands on Llama 3.1 at the following resources:

3. Mistral and NVIDIA team up to create an Apache 2.0 LLM

Mistral NeMo is an open-source 12B LLM with a context window of up to 128k tokens (75,000-100,000 words).

It was also trained with quantization awareness so it can be run in FP8 (floating point 8) for faster inference.

It’s also multilingual, with support for French, German, Spanish, Italian, Portuguese, Russian, Chinese and Japanese.

Find more about the model at:

4. ColPali is a new way to perform multi-modal retrieval on PDFs: just embed the page

Finding information in PDFs at scale can be challenging.

If the PDF is all text-based, you can often extract it and search it.

You can also extract the text, turn it into embeddings and perform similarity matching.

Finally, you can also use a RAG (Retrieval Augmented Generation) strategy to retrieve documents/passages based on a query and then generate a response based on the items retrieved (I cover an end-to-end example of this on my YouTube channel or you can watch it below).

However, PDFs are rarely all text-based.

They contain images, figures, tables, graphs and more.

Previously, models and search systems would try to segment each of these items out one by one.

But that’s quite a complicated workflow.

ColPali is a new approach that, instead of trying to segment each section of a PDF, embeds the whole page.

The page embeddings can then be queried, and a VLM (Vision-Language Model, in this case, PaliGemma) can be used to generate a response based on the pages returned by the similarity search.

And it turns out that this approach is fast (0.39s/page to embed vs 7.22s/page for layout detection) and achieves incredible results (best in class on many retrieval benchmarks).
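Under the hood, ColPali scores pages with a ColBERT-style "late interaction" (MaxSim) between query token embeddings and page patch embeddings. Here's a small sketch of that scoring step with random placeholder tensors (in practice the embeddings come from the ColPali model itself):

# Sketch of late-interaction (MaxSim) scoring, the retrieval step ColPali uses.
# Shapes and values here are placeholders; real embeddings come from the ColPali model.
import torch

num_pages, patches_per_page, dim = 100, 1024, 128
num_query_tokens = 12

page_embeddings = torch.randn(num_pages, patches_per_page, dim)  # one multi-vector per page
query_embeddings = torch.randn(num_query_tokens, dim)            # one vector per query token

# For each query token, take its max similarity over a page's patches,
# then sum across query tokens to get that page's score.
sim = torch.einsum("qd,npd->nqp", query_embeddings, page_embeddings)  # (pages, q_tokens, patches)
scores = sim.max(dim=-1).values.sum(dim=-1)                           # (pages,)

top_pages = scores.topk(k=5).indices
print(top_pages)  # indices of the 5 most relevant pages for this query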


ColPali workflow and architecture overview. Instead of detecting each portion of the PDF and separating it, ColPali embeds the whole PDF page and performs retrieval via those page embeddings. Source: ColPali paper.

If you’re building a RAG pipeline and working with PDFs, I’d highly recommend checking out ColPali and the resources around it:

5. Vision-Language Models (VLMs) Explained by Hugging Face

Vision Language Models (VLMs) are becoming more and more popular.

These models embed vision and language data in the same space.

For example, a VLM such as Florence 2 is capable of locating objects, captioning images, performing optical character recognition (OCR) and more.


A Vision Language Model is capable of many tasks that blend the language and vision spaces together. Source: Hugging Face VLMs explained blog post.

Hugging Face has many VLMs available in the transformers library such as llava-hf/llava-1.5-7b-hf.
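For a quick taste, here's a minimal sketch of asking that LLaVA model a question about an image with transformers (it roughly follows the model card example, so swap in your own image and adjust dtype/device for your hardware):

# Sketch: ask llava-hf/llava-1.5-7b-hf a question about an image via transformers.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Example image (swap in your own)
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA 1.5 prompt format: the <image> token marks where the image goes
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))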

If you’d like to learn more about VLMs and how they work, I’d suggest reading Hugging Face’s Vision Language Models Explained blog post.

There’s also a section in the blog post on how you can fine-tune a VLM to your own data.

6. An open-source version of BM25 (a powerful lexical search algorithm)

BM25 is a popular and powerful lexical search algorithm which estimates how relevant a document is given a query.

And BM25S is a new open-source implementation of the BM25 algorithm which is Python-based (it only requires numpy and scipy).

Lexical search means the search looks for exact matches to a query, for example, “What is machine learning?” would look for: “what”, “is”, “machine”, “learning?”.

BM25S is fast too, capable of searching thousands of documents per second, with results comparable to or better than existing BM25 implementations.

A short code sample for using BM25S looks like:

# pip install bm25s
import bm25s

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# Create the BM25 model and index the corpus
retriever = bm25s.BM25(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))

# Query the corpus and get top-k results
query = "who's your best friend?"
results, scores = retriever.retrieve(bm25s.tokenize(query), k=2)

# Let's see what we got!
doc, score = results[0, 0], scores[0, 0]
print(f"Rank {1} (score: {score:.2f}): {doc}")

>>> Rank 1 (score: 0.88): a dog is the human's best friend and loves to play

You can get the full BM25S code on GitHub.

7. Notes on training a customized LLM to beat GPT-4 by Alex Strick van Linschoten

Alex Strick van Linschoten has published an incredible series of articles which discuss how he fine-tuned a series of open-source LLMs (e.g. Llama 3-8b, Mistral-7b) to get better results than GPT-4 on a specific task.

His task was to extract structured information from an unstructured war report (for example, turn a paragraph of text into a JSON of statistics).


Results comparing several fine-tuned LLM models versus GPT models. Notice how the fine-tuned Mistral model performs over 10 points higher than GPT-4o. Source: Alex Strick van Linschoten blog.

For more details on how he did it, I’d recommend going through the following series:

8. LLM Evaluation doesn’t have to be complicated by Phil Schmid

Phil Schmid shares a guide for evaluating LLMs.

In short, you can bootstrap an evaluation dataset by starting with an LLM as a judge.

For example, you can ask an LLM (e.g. meta-llama/Meta-Llama-3.1-70B-Instruct) to rate a response from 1-5 and then use those ratings in a human-in-the-loop setting to review the worst/best rated samples and iterate from there, slowly improving the dataset over time.
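A minimal version of that judging step is just a prompt template plus a call to the judge model. Here's a rough sketch using the Hugging Face Inference API (it assumes you have an HF token configured and access to the judge model; swap in whichever client and model you actually use):

# Sketch of an LLM-as-a-judge rating step via the Hugging Face Inference API.
# Assumes an HF token is configured and the judge model is available to you.
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3.1-70B-Instruct")

JUDGE_PROMPT = """Rate the answer to the question on a scale of 1 (very poor) to 5 (excellent).
Reply with the number only.

Question: {question}
Answer: {answer}"""

def judge_answer(question: str, answer: str) -> int:
    response = client.chat_completion(
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        max_tokens=5,
    )
    return int(response.choices[0].message.content.strip()[0])  # leading digit = the 1-5 rating

# Rate a (question, answer) pair, then have a human review the lowest/highest rated samples
print(judge_answer("What is machine learning?", "Machine learning is learning patterns from data."))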

Evaluations are one of the most important things to create for ML models: if you have a good set of evaluations and your model does well on them, you can feel confident deploying it.

9. What AI engineers should know about search by Doug Turnbull

Doug does search engineering at Reddit (aka optimizing a lot of searches).

His writeup on search for AI engineers is required reading for anyone building RAG applications.

My favourite tip is number 1:

1. The fanciest solutions don’t matter as much as getting a good evaluation framework set up to evaluate the quality of search results.

Evals, evals, evals!
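In practice, a basic search evaluation framework can be as simple as a handful of queries with human-labelled relevant documents and a metric like recall@k. A tiny sketch (the queries, document IDs and search_fn are all placeholders for your own system):

# Tiny sketch of a search evaluation: average recall@k over labelled queries.
# Queries, document IDs and `search_fn` are placeholders for your own system.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# Each query maps to the ids of documents a human judged relevant (toy labels)
labelled_queries = {
    "how to reset my password": {"doc_42", "doc_7"},
    "refund policy": {"doc_13"},
}

def evaluate(search_fn, k: int = 10) -> float:
    """Average recall@k of `search_fn(query) -> list of doc ids` over the labelled set."""
    scores = [recall_at_k(search_fn(query), relevant, k) for query, relevant in labelled_queries.items()]
    return sum(scores) / len(scores)

# evaluate(my_search_function) -> one number you can track as you change the search system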

10. NVIDIA’s guide on generating synthetic data with LLMs

Many machine learning projects are hard to get started because of the lack of data.

However, with the release of Llama 3.1 405B and its power to create synthetic data, the “getting started” problem for ML projects will be less of a thing.

You could use Llama 3.1 405B to generate synthetic text and NVIDIA’s Nemotron-4 340B reward model to rate the outputs, producing a high-quality dataset.


Example of a synthetic data generation pipeline starting with documents and different personas. Source: NVIDIA blog.
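At its simplest, a pipeline like that is a loop over documents and personas: a generator model writes candidate examples and a reward model scores them so you keep only the best. A rough sketch, where generate and score are hypothetical placeholders for however you call Llama 3.1 405B and the Nemotron reward model:

# Rough sketch of a persona-driven synthetic data pipeline. `generate` and `score`
# are hypothetical placeholders for your Llama 3.1 405B and Nemotron-4 340B reward
# model calls (e.g. via an API or your own deployment).
PERSONAS = ["a curious high-school student", "a skeptical domain expert"]

PROMPT = """You are {persona}. Based only on the document below, write one question
you would ask and a faithful answer to it.

Document:
{document}"""

def synthesize(documents: list[str], min_score: float = 3.5) -> list[dict]:
    dataset = []
    for document in documents:
        for persona in PERSONAS:
            candidate = generate(PROMPT.format(persona=persona, document=document))  # hypothetical generator call
            if score(candidate) >= min_score:  # hypothetical reward model call
                dataset.append({"persona": persona, "document": document, "qa": candidate})
    return dataset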

There’s a step-by-step notebook to go with the blog post as well.

Papers that caught my eye 👁


Architecture overview of PaliGemma, using SigLIP as the image encoder, as well as a Gemma language model. This enables the model to interact with visual data using language. Source: PaliGemma paper.


Comparison of fine-tuned models versus base LLM models with weighted performance across 31 different tasks. GPT-4 has incredible base performance; however, it gets outperformed by smaller fine-tuned LLMs. Source: LoRA Land paper.

Presentations, Tutorials and Talks 👩‍🏫

See you next month!

What a massive month for the ML world in July!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

More from Zero To Mastery

The No BS Way To Getting A Machine Learning Job

Looking to get hired in Machine Learning? Our ML expert tells you how. If you follow his 5 steps, we guarantee you'll land a Machine Learning job. No BS.

6-Step Framework To Tackle Machine Learning Projects (Full Pipeline)

Want to apply Machine Learning to your business problems but not sure if it will work or where to start? This 6-step guide makes it easy to get started today.

Python Monthly Newsletter 💻🐍

56th issue of Andrei Neagoie's must-read monthly Python Newsletter: Python Web Apps, Python Oopsie with Apple, Some Advice From a 15+ Vet, and much more. Read the full newsletter to get up-to-date with everything you need to know from last month.