55th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
- Complete Machine Learning and Data Science Bootcamp: Zero to Mastery
- Get TensorFlow Developer Certified: Zero to Mastery
- PyTorch for Deep Learning: Zero to Mastery
I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things in machine learning I've found in the last month.
Here's what you might have missed in July 2024 as an A.I. & Machine Learning Engineer... let's get you caught up!
My Work 👇
Several new projects are in the works... Stay tuned to the ZTM website for more in the coming weeks (drop your email into that purpley blue box to be notified)! 🤗
From the Internet 🌐
1. Meta Release Segment Anything 2
Segment Anything 2 (SAM 2) is live and it’s open-source (Apache 2.0 license)!
SAM 2 is a foundation model capable of producing high-quality image segmentations.
You can create image segmentations via a point in an image (e.g. click on a subject and have the model draw a mask on it) or by using boxes as input (e.g. input a bounding box on a subject and have the model draw a mask on the subject within the box).
SAM 2 also works on video, meaning you can input a video, choose an item to segment in a single frame and have that item segmented throughout the video.

Example workflow of using SAM 2 to turn bounding boxes into segmentation masks. This kind of workflow can dramatically speed up data labelling. For example, you could start by drawing boxes (faster than segmentation labels) and then have SAM 2 create the segmentation masks for you.
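If you want to try this yourself, here's a minimal sketch of box-prompted image segmentation based on the predictor interface shown in the SAM 2 GitHub repo (the checkpoint/config paths, image path and box coordinates are placeholders, so check the repo README for the exact files):

# Install: clone https://github.com/facebookresearch/segment-anything-2 and run pip install -e .
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Build the image predictor from a downloaded checkpoint + config (placeholder paths)
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

image = np.array(Image.open("your_image.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # Prompt with a bounding box (x_min, y_min, x_max, y_max) around the subject
    masks, scores, _ = predictor.predict(box=np.array([100, 100, 400, 400]))

print(masks.shape)  # (num_masks, height, width) masks for the prompted subject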
SAM 2 comes in four variants, all of which use Hiera (an efficient form of vision transformer) as a backbone:
- SAM 2 Tiny — 38.9M parameters, 149MB, 47.2 FPS (frames per second)
- SAM 2 Small — 46M parameters, 176MB, 43.4 FPS/53.0 FPS compiled
- SAM 2 Base Plus — 80.8M parameters, 309MB, 34.8 FPS/43.8 FPS compiled
- SAM 2 Large — 224.4M parameters, 856MB, 24.2 FPS/30.2 FPS compiled
For more on the SAM 2 models, I’d highly recommend checking out the following:
2. Meta Release Llama 3.1 Large Language Models
Meta continues their streak of open-sourcing incredible AI models with the new Llama 3.1 LLMs.
Llama 3.1 comes in three variants, each on par with the state of the art or best in class for its size:
- Llama 3.1 8B (8B = 8 billion parameters) — the smallest but still very capable variant, small enough to run on consumer GPUs.
- Llama 3.1 70B — a mid-tier variant for harder workloads.
- Llama 3.1 405B — the largest variant with performance on par with GPT-4 and Claude 3.5 Sonnet.
Llama 3.1 405B is perhaps the most impressive, as it is the first open-source model to rival GPT-4 and Claude 3.5 Sonnet, outperforming them on several tasks.
Developers are also able to use the outputs of Llama 3.1 405B for synthetic data generation as well as model distillation (use the outputs from Llama 3.1 405B to improve other models).
The model architecture of Llama 3.1 405B follows the decoder-only transformer model architecture (rather than a mixture of experts).

The Llama 3.1 405B model architecture follows the decoder-only transformer model pattern. Source: Meta Llama 3.1 blog.
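If you just want to try the smallest variant, here's a minimal sketch using the Hugging Face transformers text-generation pipeline (this assumes you've accepted the Llama 3.1 license on Hugging Face and have a GPU with enough memory; the bfloat16 setting is my choice rather than a requirement):

# pip install transformers accelerate
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain model distillation in one sentence."},
]

outputs = pipe(messages, max_new_tokens=128)
print(outputs[0]["generated_text"][-1]["content"])  # the assistant's reply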
Learn more and get your hands on Llama 3.1 at the following resources:
- Llama 3.1 blog post.
- Llama GitHub repo.
- Llama 3.1 research paper.
- Llama 3.1 models on Hugging Face.
- Mark Zuckerberg on why open-source AI is the path forward.
3. Mistral and NVIDIA team up to create an Apache 2.0 LLM
Mistral NeMo is an open-source 12B LLM with a context window of up to 128k tokens (75,000-100,000 words).
It was also trained with quantization awareness, so it can be run in FP8 (8-bit floating point) for faster inference.
It's also multilingual, capable of being used in French, German, Spanish, Italian, Portuguese, Russian, Chinese and Japanese.
Find more about the model at:
- Mistral NeMo blog post.
- Get the base model and instruction tuned model on Hugging Face.
- Bonus: Mistral also just released Mistral Large 2, their new flagship model which is on par with other top-tier LLMs. It’s available via API as well as on Hugging Face (though a license is required for commercial usage).
4. ColPali is a new way to perform multi-modal retrieval on PDFs: just embed the page
Finding information in PDFs at scale can be challenging.
If the PDF is all text-based, you can often extract it and search it.
You can also extract the text, turn it into embeddings and perform similarity matching.
Finally, you can also use a RAG (Retrieval Augmented Generation) strategy to retrieve documents/passages based on a query and then generate a response based on the items retrieved (I cover an end-to-end example of this on my YouTube channel or you can watch it below).
However, PDFs are rarely all text-based.
They contain images, figures, tables, graphs and more.
Previously, models and search systems would try to segment each of these items out one by one.
But that’s quite a complicated workflow.
ColPali is a new approach that, instead of trying to segment each section of a PDF, embeds the whole page.
The page embeddings can then be queried and a VLM (Vision-Language Model, in this case PaliGemma) can be used to generate a response based on the pages returned via similarity scoring.
And it turns out that this approach is both quick to embed a page (0.39s/page vs 7.22s/page for layout detection) and achieves incredible results (best in class for many retrieval benchmarks).

ColPali workflow and architecture overview. Instead of detecting each portion of the PDF and separating it, ColPali embeds the whole PDF page and performs retrieval via those page embeddings. Source: ColPali paper.
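Under the hood, ColPali scores a query against a page using ColBERT-style late interaction: each query token embedding is matched against every page patch embedding, the best match per query token is kept and those maxima are summed into a page score. Here's a tiny sketch of that scoring step with random tensors standing in for real ColPali embeddings (the shapes are illustrative only):

import torch

# Placeholder embeddings: in practice these come from the ColPali model
query_emb = torch.randn(20, 128)        # 20 query token embeddings, 128 dims each
page_embs = torch.randn(5, 1030, 128)   # 5 pages x ~1030 patch/token embeddings each

# Late interaction (MaxSim): for each query token, take its best-matching patch
# on a page, then sum those maxima to get the page's score
sims = torch.einsum("qd,pnd->pqn", query_emb, page_embs)  # (pages, query_tokens, patches)
page_scores = sims.max(dim=-1).values.sum(dim=-1)         # (pages,)

print(f"Best matching page index: {page_scores.argmax().item()}")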
If you’re building a RAG pipeline and working with PDFs, I’d highly recommend checking out ColPali and the resources around it:
- [Paper] ColPali: Efficient Document Retrieval with Vision Language Models.
- ColPali GitHub repo.
- ColPali blog post on Hugging Face + ColPali model on Hugging Face.
- Bonus: Vespa blog post with example of a retrieval pipeline using ColPali.
5. Vision-Language Models (VLMs) Explained by Hugging Face
Vision Language Models (VLMs) are becoming more and more popular.
These models embed vision and language data in the same space.
For example, a VLM such as Florence 2 is capable of object detection, image captioning, optical character recognition (OCR) and more.

A Vision Language Model is capable of many tasks blending the language and vision space together. Source: Hugging Face VLMs explained blog post.
Hugging Face has many VLMs available in the transformers library such as llava-hf/llava-1.5-7b-hf.
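As a taste of what that looks like in practice, here's a minimal sketch using the transformers image-to-text pipeline with that LLaVA checkpoint (the prompt template follows the format on the model card; treat the exact arguments as a starting point rather than gospel):

# pip install transformers accelerate pillow
import torch
from transformers import pipeline
from PIL import Image

pipe = pipeline(
    "image-to-text",
    model="llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

image = Image.open("your_image.jpg")
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 100})
print(outputs[0]["generated_text"])  # prompt + the model's answer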
If you’d like to learn more about VLMs and how they work, I’d suggest reading Hugging Face’s Vision Language Models Explained blog post.
There’s also a section in the blog post on how you can fine-tune a VLM to your own data.
6. An open-source version of BM25 (a powerful lexical search algorithm)
BM25 is a popular and powerful lexical search algorithm which estimates how relevant a document is given a query.
And BM25S is a new open-source implementation of the BM25 algorithm which is Python-based (it only requires numpy and scipy).
Lexical search means the search looks for exact matches to a query, for example, “What is machine learning?” would look for: “what”, “is”, “machine”, “learning?”.
BM25S is fast too: it's capable of searching thousands of documents per second and is comparable to or better than existing BM25 implementations.
A short code sample for using BM25S looks like:
# pip install bm25s
import bm25s
# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
# Create the BM25 model and index the corpus
retriever = bm25s.BM25(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))
# Query the corpus and get top-k results
query = "who's your best friend?"
results, scores = retriever.retrieve(bm25s.tokenize(query), k=2)
# Let's see what we got!
doc, score = results[0, 0], scores[0, 0]
print(f"Rank 1 (score: {score:.2f}): {doc}")
>>> Rank 1 (score: 0.88): a dog is the human's best friend and loves to play
You can get the full BM25S code on GitHub.
7. Notes on training a customized LLM to beat GPT-4 by Alex Strick van Linschoten
Alex Strick van Linschoten has published an incredible series of articles which discuss how he fine-tuned a series of open-source LLMs (e.g. Llama 3-8b, Mistral-7b) to get better results than GPT-4 on a specific task.
His task was to extract structured information from an unstructured war report (for example, turn a paragraph of text into a JSON of statistics).

Results comparing several fine-tuned LLMs versus GPT models. Notice how the fine-tuned Mistral model scores over 10 points higher than GPT-4o. Source: Alex Strick van Linschoten blog.
For more details on how he did it, I’d recommend going through the following series:
- My fine-tuned models beat OpenAI’s GPT-4.
- How to think about creating a dataset for LLM fine-tuning evaluation.
- Evaluating the Baseline Performance of GPT-4-Turbo for Structured Data Extraction.
8. LLM Evaluation doesn’t have to be complicated by Phil Schmid
Phil Schmid shares a guide for evaluating LLMs.
In short, you can bootstrap an evaluation dataset by starting with an LLM as a judge.
For example, you can ask an LLM (e.g. meta-llama/Meta-Llama-3.1-70B-Instruct) to rate a response from 1-5 and then use those ratings in a human-in-the-loop setting to review the worst/best rated samples and iterate from there, slowly improving the dataset over time.
Evaluations are one of the most important things to create for ML models: if you have a good set of evaluations and your model does well on them, you can feel confident deploying it.
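To make the LLM-as-a-judge idea concrete, here's a rough sketch of a judge prompt and rating helper; the generate() function is a placeholder for whichever inference API you use (e.g. an endpoint serving Llama 3.1 70B Instruct) and the 1-5 rubric is just an example to adapt to your task:

# Hypothetical judge prompt, adapt the rubric to your own task
JUDGE_PROMPT = """You are grading an answer to a question.

Question: {question}
Answer: {answer}

Rate the answer from 1 (useless) to 5 (excellent) for correctness and helpfulness.
Respond with only the number."""

def judge(question: str, answer: str, generate) -> int:
    """Ask an LLM to rate an answer, where generate is any text-in/text-out
    function backed by your judge model (placeholder, not a real API)."""
    raw = generate(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(raw.strip()[0])  # keep the first digit as the 1-5 rating (sketch only)

# The lowest and highest rated samples then go to a human for review,
# and the dataset improves iteration by iteration.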
9. What AI engineers should know about search by Doug Turnbull
Doug does search engineering at Reddit (aka optimizing a lot of searches).
His writeup on search for AI engineers is required reading for anyone building RAG applications.
My favourite tip is number 1:
1. The fanciest solutions don’t matter as much as getting a good evaluation framework setup to evaluate the quality of search results.
Evals, evals, evals!
10. NVIDIA’s guide on generating synthetic data with LLMs
Many machine learning projects are hard to get started on because of a lack of data.
However, with the release of Llama 3.1 405B and its power to create synthetic data, the “getting started” problem for ML projects will be less of a thing.
You could use Llama 3.1 405B to generate synthetic text and NVIDIA's Nemotron-4 340B reward model to rate the outputs, producing a high-quality dataset.

Example of a synthetic data generation pipeline started with documents and different personas. Source: NVIDIA blog.
There’s a step-by-step notebook to go with the blog post as well.
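As a rough sketch of the overall idea (not NVIDIA's exact pipeline), synthetic data generation here boils down to prompting a strong instruct model with a document plus a persona, then keeping only the generations a reward model scores highly; the generate() and reward_score() functions and the 0.8 threshold below are placeholders for your own model endpoints and quality bar:

# Placeholder functions: wire these to Llama 3.1 405B and a reward model endpoint
def generate(prompt: str) -> str: ...
def reward_score(question: str, answer: str) -> float: ...

personas = ["a curious high school student", "a sceptical domain expert"]
documents = ["<your source document text here>"]

synthetic_dataset = []
for doc in documents:
    for persona in personas:
        # Ask a question in the voice of the persona, then answer it from the document
        question = generate(f"You are {persona}. Ask one question about:\n{doc}")
        answer = generate(f"Answer the question using only the document.\n\nDocument:\n{doc}\n\nQuestion: {question}")
        # Keep only high-quality pairs according to the reward model
        if reward_score(question, answer) > 0.8:
            synthetic_dataset.append({"question": question, "answer": answer})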
Papers that caught my eye 👁
- Apple released a paper, Apple Intelligence Foundation Language Models, formalising how they trained their foundation models for Apple Intelligence (coming in iOS 18 + macOS Sequoia). Interesting to note that they used Google’s TPUs to train their models rather than NVIDIA’s GPUs.
- DataComp-LM: In search of the next generation of training sets for language models introduces a data competition for language models (DCLM) where the models are kept the same (412M to 7B parameters) but the data inputs can change. The first round of models is available on Hugging Face (DCLM-7B), which achieves similar performance to Mistral-7B and Llama 3 8B despite using an estimated 6.6x less compute. See more on the mlfoundations/dclm GitHub.
- PaliGemma: A versatile 3B VLM for transfer discusses the latest open-source VLM (Vision-Language Model) from the Google Research team. It is a combination of SigLIP-So400M (as the vision encoder) and Google’s Gemma 2B as the language model. The model is capable of many visual/language based tasks but also tasks such as object detection and segmentation. You can get the PaliGemma models on Hugging Face.

Architecture overview of PaliGemma, using SigLIP as the image encoder, as well as a Gemma language model. This enables the model to interact with visual data using language. Source: PaliGemma paper.
- Are language models actually useful for time series forecasting? dives into whether language models provide improvements for time series forecasting or not. Their research shows that despite using far more compute, language models generally don’t offer much improvement compared to replacing the LLM component with an attention layer.
- LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report showcases how smaller, more-focused LLMs (such as Mistral 7B), when fine-tuned, can outperform GPT-4 on several tasks. The Predibase fine-tuning leaderboard also has a few more models which show improvements when fine-tuning.

Comparison of fine-tuned models versus base LLM models with weighted performance on 31 different tasks. GPT-4 has incredible base performance; however, it gets outperformed by smaller fine-tuned LLMs. Source: LoRA Land paper.
Presentations, Tutorials and Talks 👩🏫
- [Free Course] Parlance Labs have published a huuuuuuuuge course on LLMs with topics/modules covering: RAG, LLM evaluation, applications, fine-tuning, prompt engineering and more.
- [Tutorial] Replicate shows how to get the best from Stable Diffusion 3 (the latest image generation model from Stability AI). One of the main tips being that you can now use much longer prompts.
- [Keynote] Andrej Karpathy’s keynote address at the UC Berkeley AI Hackathon has some great tips for people wanting to learn machine learning/AI. My favourite: “Spend your 10,000 hours”, as in, like many skills, learning ML & AI thoroughly takes time.
- [Video Tutorial] Piotr Skalski has an excellent tutorial on fine-tuning Florence 2 (an open-source VLM from Microsoft) on the Roboflow YouTube channel.
- [Video Tutorial] Artem Kirsanov’s video on backpropagation is one of the highest-quality tutorials on the internet. I’d highly recommend checking it out if you’re interested in knowing more about the algorithm that powers most modern deep learning models.
- [Tutorials] The Hugging Face Cookbook has been building up many different tutorials on LLMs and more. I’ve been reading through one of the newer ones on RAG evaluation.
See you next month!
What a massive month for the ML world in July!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.