[January 2024] AI & Machine Learning Monthly Newsletter 💻🤖

49th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:

I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

What you missed in January 2024 as an A.I. & Machine Learning Engineer…

My Latest Work 👇

Nutrify is live on the App Store! 🍍

My brother and I have been working on a food education app called Nutrify.

And now version 1.0 is live!

Nutrify uses computer vision to allow you to take a photo of food and learn about it.

Stay tuned for an official launch video coming soon to my YouTube channel.

Nutrify: The Food App live on the App Store — take a photo of food and learn about it! If you’ve got an iPhone with iOS 16.0+, search “nutrify” on the App Store and look for the pineapple.

From the Internet 👇

Two sensational blog posts from Vicky Boykis

What’s new with ML in production?

Much of modern ML is compression, as in, compressing a larger data source (images, text, audio) into a smaller but meaningful data source (vector, embeddings).

The question is, who does the compression?

Is it OpenAI’s API? Is it an open-source model? Is it a compression algorithm? Or is it a custom trained model?

Of course, answers will depend on a number of things. Cost requirements, latency requirements, privacy requirements, performance requirements. And often each of these will come with various levels of priority.

For example, using OpenAI’s API may be off limits because of your app’s strict privacy requirements.

Or perhaps you can’t use a cloud-based API at all because of the latency requirements of your model.

These are some of the main considerations when it comes to ML in production.

And they're explored in Vicky Boykis’s recent blog post What's new with ML in production?

Most of modern machine learning is figuring out how to compress your data into a useful representation. And some algorithms are better than others. Source: What’s new with ML in production?

What does it actually take to build a production-grade ML app?

Every day a new machine learning demo appears in the world. And then a few days later the old one is forgotten because there’s a new one. And so on and so on.

For every 1,000 demos, perhaps 1 makes it into a production-grade application.

This is what Vicky Boykis did with Viberary, an application to search for books based on ✨vibe✨.

As in, if I search for “sci-fi but not too much with a twist of romance and adventure”, the results would hopefully include books which matched that theme (e.g. my own book Charlie Walks).

Vicky’s post-mortem of Viberary, Retro on Viberary, is full of practical tips on what it takes to build a production-grade ML system and ship it so others can use it.

Such as what model to use:

I chose msmarco-distilbert-base-v3, which is middle of the pack in terms of performance, and critically, is also tuned for cosine similarity lookups, instead of dot product, another similarity measure that takes into account both magnitude and direction.

Cosine similarity only considers direction rather than size, making cosine similarity more suited for information retrieval with text because it’s not as affected by text length, and additionally, it’s more efficient at handling sparse representations of data.
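To make the difference concrete, here's a minimal sketch with NumPy. The vectors are made-up toy "embeddings" (an assumption for illustration, not outputs of msmarco-distilbert-base-v3), and the function names are my own.

```python
import numpy as np

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity that grows with both alignment and vector magnitude."""
    return float(np.dot(a, b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity based on direction only: the dot product of the unit vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (made-up numbers, not from a real model).
query = np.array([1.0, 2.0, 3.0])
short_doc = np.array([2.0, 4.0, 6.0])  # points in the same direction as the query
long_doc = short_doc * 10              # same direction, 10x the magnitude

# The dot product changes when the vector gets longer...
print(dot_product(query, short_doc), dot_product(query, long_doc))

# ...but cosine similarity stays the same because the direction is unchanged.
print(cosine_similarity(query, short_doc), cosine_similarity(query, long_doc))
```

If two vectors are normalised to unit length first, the two measures give the same ranking, which is why some vector stores only ask you to pick one.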

Tips on product building being an iterative process (arguably, as soon as you launch a product, it’s the worst it’ll ever be, because you’ll upgrade it over time):

The important thing is to keep benchmarking the current model against previous models and to keep iterating and keep on building.

And finally, my favourite:

The satisfaction of shipping products that you’ve built is unparalleled.

How to Fine-Tune LLMs in 2024 with Hugging Face by Philipp Schmid

Philipp Schmid’s blog is one of my favourite resources for practical modern machine learning use cases.

In his latest article, he shares how to fine-tune an open-source LLM (e.g. Llama 2, Mistral and more) in 2024 on your own dataset.

Starting with the most important step: defining your use case. Many times I’ve seen ML projects start out with “let’s use XYZ model” because it’s the latest and greatest rather than having a practical use case for it.

Perhaps fine-tuning isn’t necessary for your use case. Or maybe you don’t have enough data yet.

Either way, Phil’s blog post is a sensational introduction to fine-tuning LLMs.

Interconnects.ai, a fantastic deep dive into all things LLMs

Nathan Lambert writes Interconnects.ai, a newsletter about AI and everything around it. I recently found two articles interesting and helpful:

1) It's 2024 and they just want to learn

An article detailing the fact that the recipe for training performant LLMs is pretty well-known now. As in, with large enough data, a Transformer architecture and compute power, even researchers relatively new to the game can create state-of-the-art (SOTA) models.

Of course, there's always room for improvement. It's also an ode to the open-source community: although many cannot run the biggest models, well-performing and practical LLMs are becoming more and more accessible locally.

2) RLHF learning resources in 2024

RLHF stands for “Reinforcement Learning from Human Feedback”. And it's the technique used to "align" LLMs to what humans prefer.

In essence, you give a human a set of LLM outputs from a given input and they rate their favourite output based on the input. Repeat this a few hundred thousand (or million) times and you can train a reinforcement learning model to tune an LLM so its outputs more closely align with the human rated preferences.
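As a rough sketch of how those ratings can become a training signal: one common formulation (an assumption here, not something taken from the article) trains a reward model with a pairwise loss, so the output a human preferred gets a higher score than the one they rejected. The scores below are made-up numbers standing in for a reward model's outputs.

```python
import numpy as np

def pairwise_preference_loss(chosen_scores, rejected_scores) -> float:
    """Mean of -log(sigmoid(chosen - rejected)) over all preference pairs.

    The loss is small when the preferred output scores higher than the
    rejected one and large when the ordering is the wrong way around.
    """
    diff = np.asarray(chosen_scores, dtype=float) - np.asarray(rejected_scores, dtype=float)
    # -log(sigmoid(d)) == log(1 + exp(-d)), computed in a numerically stable way.
    return float(np.mean(np.logaddexp(0.0, -diff)))

# Hypothetical reward scores for two (chosen, rejected) pairs.
agrees_with_humans = pairwise_preference_loss([2.0, 1.5], [0.5, -1.0])
disagrees_with_humans = pairwise_preference_loss([0.5, -1.0], [2.0, 1.5])

print(agrees_with_humans, disagrees_with_humans)
```

The trained reward model then scores new LLM outputs during the reinforcement learning step.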

The article collects a plethora of resources to learn more on the topic.

Getting an LLM to output what you want with RLAIF, Reinforcement Learning from AI Feedback by Cameron Wolfe

One of the key ingredients to modern LLMs is RLHF (as discussed above). In other words, getting an LLM’s output to align with human preferences. However, this approach requires a large amount of manual effort.

What if you could use an existing LLM to create the preferences for you?

For example, given an input and two candidate outputs from one LLM, get another LLM to rate which is better (by some criteria). This is the concept of RLAIF.

And as it turns out, it’s just as effective and in some cases more effective than pure RLHF. Cameron Wolfe’s article on RLAIF collects a bunch of different resources on the topic as well as an analysis of the original RLAIF paper.
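Here's a minimal sketch of the labelling step, assuming the judge is any callable that scores an output for a given prompt. `label_preference` and `toy_judge` are hypothetical names, and `toy_judge` is a simple word-overlap stand-in where a real pipeline would call an LLM.

```python
from typing import Callable

def label_preference(
    prompt: str,
    output_a: str,
    output_b: str,
    judge: Callable[[str, str], float],
) -> dict:
    """Ask a judge to score two candidate outputs and record which one it prefers."""
    score_a = judge(prompt, output_a)
    score_b = judge(prompt, output_b)
    if score_a >= score_b:
        chosen, rejected = output_a, output_b
    else:
        chosen, rejected = output_b, output_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def toy_judge(prompt: str, output: str) -> float:
    """Hypothetical stand-in for an LLM judge: counts words shared with the prompt."""
    prompt_words = set(prompt.lower().split())
    output_words = set(output.lower().split())
    return float(len(prompt_words & output_words))

pair = label_preference(
    prompt="Summarise why the sky is blue",
    output_a="The sky is blue because air scatters blue light more than red light.",
    output_b="Bananas are yellow.",
    judge=toy_judge,
)
print(pair["chosen"])
```

The resulting record has the same prompt/chosen/rejected shape a human-labelled preference would have, so the rest of the pipeline doesn't need to change.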

RLHF (Reinforcement Learning from Human Feedback) versus RLAIF (Reinforcement Learning from AI Feedback). Source: RLAIF paper.

How do Transformers work? (code + blog post tutorial)

Mat Miller has published a sensational tutorial on how to build your own GPT (Generative Pretrained Transformer) from scratch.

All the way from tokenization (turning words into numbers) to data loading to scaling up the Transformer architecture.

Best of all, the notebook is completely runnable in Google Colab.
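To give a feel for the tokenization step, here's a minimal character-level sketch. This is an assumption for illustration only (the tutorial's own tokenizer may work differently), and real LLMs typically use subword tokenizers with much larger vocabularies.

```python
text = "hello world"

# Build a vocabulary: every unique character gets an integer ID.
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(s: str) -> list[int]:
    """Turn a string into a list of integer token IDs."""
    return [char_to_id[ch] for ch in s]

def decode(ids: list[int]) -> str:
    """Turn a list of integer token IDs back into a string."""
    return "".join(id_to_char[i] for i in ids)

token_ids = encode("hello")
print(token_ids)          # [3, 2, 4, 4, 5]
print(decode(token_ids))  # hello
```

These integer IDs are what get looked up in an embedding table before being passed through the Transformer.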

Golf ball tracking in the PGA Tour with computer vision on AWS 🏌️

AWS share a real-world use case of object detection and tracking by creating a ball tracking pipeline for the PGA tour.

Using several different video feeds as inputs, they create a cascade of convolutional neural networks (CNNs) to detect non-players, players and golf balls.

The thing that stood out to me the most was the emphasis on data labelling. They created their own custom dataset and then fine-tuned several YOLOv7 models which happened to perform very well on their target problem.

Golf ball tracking pipeline with a cascade of different CNN models for different tasks on AWS. Source: AWS blog.

Finetune LLMs on your own consumer hardware using tools from the PyTorch and Hugging Face ecosystem

The PyTorch team show how you can take a Llama-2 7B model from requiring 112GB of GPU VRAM to fine-tune to taking only ~7-10GB (which is possible to do on a T4 GPU on Google Colab).

“With QLoRA we are matching 16-bit fine-tuning performance across all scales and models, while reducing fine-tuning memory footprint by more than 90%— thereby allowing fine-tuning of SOTA models on consumer-grade hardware.”

Turns out you can customize an open-source LLM for your own use case with off the shelf tools quite efficiently.

There's a notebook to go with it on Google Colab too.
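Where might a number like 112GB come from? Here's a back-of-the-envelope sketch, assuming full fine-tuning in 32-bit precision with the Adam optimizer (4 bytes per parameter for weights, 4 for gradients and 8 for the two Adam states). The breakdown is my assumption and it ignores activations.

```python
PARAMS = 7e9  # Llama-2 7B, rounded to 7 billion parameters

def gigabytes(num_bytes: float) -> float:
    """Convert bytes to gigabytes (1 GB = 1e9 bytes)."""
    return num_bytes / 1e9

# Full fine-tuning in 32-bit precision with Adam (assumed breakdown).
weights = PARAMS * 4      # 4 bytes per 32-bit weight
gradients = PARAMS * 4    # one 32-bit gradient per weight
adam_states = PARAMS * 8  # two 32-bit optimizer states per weight
full_finetune_gb = gigabytes(weights + gradients + adam_states)

# QLoRA-style setup: frozen base weights stored in 4-bit (0.5 bytes each).
quantized_weights_gb = gigabytes(PARAMS * 0.5)

print(f"Full fine-tuning: ~{full_finetune_gb:.0f} GB")
print(f"4-bit base weights: ~{quantized_weights_gb:.1f} GB")
```

The gap between the ~3.5GB of 4-bit weights and the quoted ~7-10GB would be taken up by things like the small trainable adapter weights, their optimizer states and activations.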

Open-source Courses and Tools 🔬

A free and open LLM course on GitHub — llm-course is a packed GitHub repo by Maxime Labonne with a bunch of materials and resources to learn LLMs from beginner to building LLM-based applications and deploying them.
LLM Engineering: Structured Outputs by Jason Liu and Weights & Biases (free) — One of the best use cases I’ve found with LLMs is turning unstructured data such as paragraphs of text into structured data such as dictionaries or JSON. I worked on a machine learning project a couple of years ago to do just that, extract structured diagnosis details from unstructured doctor’s notes (raw text). But LLMs at the time were nowhere near as good as they are now. The LLM Engineering: Structured Outputs course teaches you how to use the Instructor library to ensure the outputs you get from an LLM are structured, leveraging Pydantic (a type-checking library for Python). Here’s a code example from the open-source Instructor library:

import instructor
from openai import OpenAI
from pydantic import BaseModel

# Enables `response_model`
client = instructor.patch(OpenAI())

class UserDetail(BaseModel):
    name: str
    age: int

user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,
    messages=[
        {"role": "user", "content": "Extract Jason is 25 years old"},
    ]
)

assert isinstance(user, UserDetail)
assert user.name == "Jason"
assert user.age == 25

Clipper is an open-source tool that turns HTML from webpages into Markdown format, which is useful for RAG or Retrieval Augmented Generation applications.
Surya is an open-source line-level text detection model capable of detecting individual lines of text in a document. It's built by Vik Paruchuri, the same person who created Marker (turn PDFs into markdown, mentioned in Machine Learning Monthly December 2023). Bonus: Check out Vik's GitHub profile, it's full of awesome repos and projects all related to making higher quality open-source AI datasets.

Example of the Surya model extracting line-level text from an article. Source: Surya GitHub.

Sparrow is an open-source data extraction tool to get structured data from forms, invoices, receipts and various other unstructured data sources. Get JSON data from your images and PDFs to use in your RAG pipelines!

Research Papers 📰

A theme I’m noticing in ML research right now is the rise of synthetic data.

Over the past 2-3 years, text generation and image generation models got really good.

Now, for many common tasks, it’s possible to bootstrap a dataset by just generating synthetic data with existing foundation models.

For example, say you wanted to build a text classifier for certain classes.

You could get a model like GPT-4 to generate you 100 starter examples of each class and then build from there.

Or if you were short on images for a certain class of image classification data, you could try and get Stable Diffusion to generate images for that class to boost it.

A repo I’m watching is syn-rep-learn by Google Research.

There are already three papers in the repo (one of which is mentioned below) exploring the limits and possibilities of using synthetic data for vision representation learning, and my guess is that text and other modalities may come next.

Scaling Laws of Synthetic Images for Model Training… for Now

Researchers explore the possibility of training CLIP (Contrastive Language-Image Pretraining) models with synthetic images and find that they perform well but real data still wins out (for now).

It seems that SynCLR (a paper mentioned in last month's AI/ML Monthly issue) in the same repo may have improved upon this already.

Bonus: The Google Colab X account shared a notebook recently with code examples to generate images with Imagen (Google's AI Image Generation model) with Gemini (Google's big LLM model) as the prompter.

The scaling laws of synthetic data paper finds that real data still wins out on image classification tasks. However, synthetic data may be more robust to out-of-distribution examples. Source: Scaling laws for synthetic data paper.

Improving Text Embeddings with Large Language Models

Why train on billions of unsupervised examples when you can get an LLM to generate premium samples?

Researchers use LLMs to create synthetic data for hundreds of thousands of text embedding tasks across ~100 languages and then train a text embedding model on top of them.

And it turns out that the synthetic data alone can provide excellent representations, but training with synthetic data as well as real data sets a new state-of-the-art on the MTEB (Massive Text Embedding Benchmark) leaderboard.

See the embedding model on Hugging Face.

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

It turns out that “easy” data can be as good as “hard” data for fine-tuning/prompting LLMs.

Specifically, an LLM prompted with 3rd grade level questions can perform almost as well on a set of college questions as an LLM prompted with similar kinds of college questions.

The takeaway: LLMs seem to have most of their knowledge in the pretrained weights (from unsupervised pretraining or the “P” in ChatGPT) and tuning/prompting them with easy data (e.g. data that's easy to collect) relevant to your target problem may result in a more efficient pipeline than collecting hard examples.

Read the accompanying blog post for more highlights.

Easy data seems to get about 70-100% of the results of hard data. To me this shows that most of an LLM’s knowledge is embedded in the pretrained weights and then unlocked via fine-tuning/prompting. Source: The Unreasonable Effectiveness of Easy Training Data for Hard Tasks paper.

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

What happens when you mix visual search techniques with the world knowledge of an LLM? You get an MLLM (Multimodal Large Language Model) with much better visual capabilities.

V* introduces a framework called SEAL or Show, SEArch and TelL.

To get finer-grained visual searching abilities, a visual search mechanism is engaged if the image encoder's visual features are not sufficient for answering a query.

The visual search mechanism turns on a visual working memory which contains four blocks: a question block, a global image block, a searched targets block and a target location block.

The overall mechanism creates a pattern similar to how people zoom in on images when asked a specific question about them. Compared to other models such as GPT-4V and Gemini Pro, V* performs much better at fine-grained visual search.

Though it's still far worse than humans (98%+ for humans versus 74%+ for V*).

See the code on GitHub, try out the demo on Hugging Face.

Example demo of V* performing visual search on an image to find a glass and then answer the question of what colour the liquid is inside it. Note: you might have to zoom in to find the glass (I did). Source: V* demo on Hugging Face.

Videos, Podcasts & Presentations 📺

[Video/podcast] - Naval Ravikant, David Deutsch and Brett Hall discuss everything from whether ChatGPT understands what it’s doing (spoiler: no) to where the growth of knowledge comes from in a beautiful two-part series (talk 1, talk 2).
[Presentation] - Benedict Evans, a technology analyst, did a wonderful presentation on AI and how it influences almost everything else.
[Presentation] - Sofie Van Landeghem, a maintainer of the open-source library spaCy, shared a presentation on how to use LLMs for rapid prototyping of production systems (hint: use LLMs to do a bunch of data annotation, then fine-tune into smaller, easier to manage models).

Quick links 🔗

See you next month!

What a massive month for the ML world in January!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.