๐ŸŽ Give the #1 gift request of 2024... a ZTM membership gift card! ๐ŸŽ

AI & Machine Learning Monthly Newsletter 💻🤖

Daniel Bourke

52nd issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I'm an A.I. & Machine Learning Engineer who also teaches beginner-friendly machine learning courses.

I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in April 2024 as an A.I. & Machine Learning Engineer... let's get you caught up!

My Work 👇

MLOps Lessons Learned (talk)

The MLOps Discord (MLOps stands for machine learning operations, essentially all the parts that go around building a machine learning model) was hosting a "Lessons Learned" session for people building ML/AI-powered apps.

I applied and was invited along to talk about Nutrify (a computer vision powered application to help people learn about food).

In the talk, I share 4 of my own biggest lessons building an AI startup from scratch:

  1. Keep it simple
  2. Itโ€™s always about the data
  3. Deploy and see the whole picture
  4. Model optimization matters

From the Internet 🤠

1. How Vikas Paruchuri got into Deep Learning

A sensational article with practical tips on how to get a role in deep learning.

My favourite takeaways were:

  1. Get really good at programming. "No matter what era of AI I've been in, data cleaning has been >70% of my work."
  2. Book learning → teaching. Learn from books and resources, then teach the material to others or make something with it. Putting what you learn into practice and teaching others is a fundamental way to solidify your knowledge.
  3. Read the foundational papers from 2015-2022 and you'll be able to converse with many people about modern AI techniques (see image below).
  4. Find a problem that interests you and publish your work. Open-sourcing is a great way to get noticed by the community.


A list of foundational papers from 2015-2022 for the modern era of deep learning collected by Vikas Paruchuri. A good exercise would be to take each of these papers and reimplement them in a deep learning framework of your choice. Source: Vikas Paruchuri blog.

I liked this article so much I've added it to my list of resources for how to get a job in machine learning.

2. Lessons learned after 500 million GPT tokens by Ken Kantzer

Kenโ€™s company Truss helps accountants organize client documents.

To do so, they use GPT.

More specifically, GPT-4 (85%) and GPT-3.5 (15%).

So far their team has processed close to half a billion tokens via the GPT API (around 750,000 pages of text).

Some takeaways they found:

  • GPT doesn't really hallucinate if you give it valid input text. For example, "here's some text, extract some things." (I've had a similar experience.)
  • Keep prompts simple and language-focused. They sometimes found that the more verbose the prompt, the worse the model performed, and that a simpler prompt often worked better (though this will vary by use case).
  • You don't need LangChain or much else beyond the OpenAI API; the chat API alone is really good. I've often found this too: for simple workflows, the chat API works really well. I'm yet to discover a use case that requires LangChain, though my needs for the GPT API are similar to Truss's, "here's some text, format it/change it in a certain way" (a minimal sketch of this pattern follows this list).
  • How do I keep up with all the stuff happening in AI/LLMs? You don't need to. Big general improvements to model performance tend to outweigh niche improvements. It can be hard to keep up with something new coming out all the time, so it's best to stick to building something that works first (e.g. with a big model like GPT-4) and then iterate from there.
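
To make the "here's some text, extract some things" pattern concrete, here's a minimal sketch using the OpenAI chat completions API (Python SDK v1 style). The document text, prompt and extracted fields are my own illustration, not Truss's actual setup:

# Minimal sketch of the "here's some text, extract some things" pattern.
# The prompt and fields are illustrative, not Truss's actual pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

document_text = "Invoice #1042 from Acme Pty Ltd, dated 12 March 2024, total $1,250.00."

response = client.chat.completions.create(
    model="gpt-4",  # Ken's team reportedly routes ~85% of calls to GPT-4
    messages=[
        {"role": "system", "content": "Extract the requested fields from the text and reply with JSON only."},
        {"role": "user", "content": f"Text:\n{document_text}\n\nExtract: vendor, date, total_amount."},
    ],
    temperature=0,  # keep extraction deterministic
)

print(response.choices[0].message.content)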

Bonus: Ken wrote a follow-up blog post called GPT is the Heroku of AI, which explains how GPT's zero-shot abilities dramatically lower the barrier to entry for building AI/ML-based apps. I totally agree.

What GPT (and similar foundation models) can do with little to no data is an incredible kickstart to any AI project. Start the project fast with a GPT-like model and then move to your own custom models.

3. LLM in the wild use case: DoorDash Product Knowledge Graph Enhancement

DoorDash is an app to help people order food or groceries from almost anywhere.

So you can imagine how many products they have to record in their database.

Some listings are also the same product but are called different things at different stores.

For example, Carton of Water Bottles 12x600ml could be called 600ml Water Bottles (12 Pack) somewhere else.

How do you link these up?

Or what if thereโ€™s a new product being added to the database?

How do you enrich its entry with extra information so people can search for it?

LLMs to the rescue!

On their engineering blog, DoorDash details how they enrich product information using LLMs to extract attributes from unstructured data in four steps (a rough sketch of the flow follows the list):

  1. Unstructured product descriptions are first classified with existing in-house classifiers.
  2. Products/SKUs that cannot be tagged confidently get sent to an LLM for brand extraction.
  3. The extraction output is passed to a second LLM, which retrieves similar brands and example item names from an internal knowledge graph to decide whether the extracted brand is a duplicate entry.
  4. The new brand enters the knowledge graph and the in-house classifier is retrained with the new annotations.
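
Here's a rough, hypothetical sketch of that control flow. All function names and the confidence threshold are my own placeholders; DoorDash's post describes the pipeline but doesn't publish code:

# Hypothetical sketch of the classifier-first, LLM-fallback flow described above.
# Every name and the threshold below are placeholders, not DoorDash's code.
CONFIDENCE_THRESHOLD = 0.9  # assumption: below this, fall back to the LLM

def enrich_product(description, classifier, llm, knowledge_graph):
    # Step 1: try the cheap in-house classifier first
    brand, confidence = classifier.predict(description)
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"brand": brand, "source": "classifier"}

    # Step 2: low-confidence SKUs go to an LLM for brand extraction
    extracted_brand = llm.extract_brand(description)

    # Step 3: a second LLM pass checks the extraction against similar brands
    # retrieved from the knowledge graph to catch duplicate entries
    candidates = knowledge_graph.similar_brands(extracted_brand)
    resolved_brand = llm.resolve_duplicates(extracted_brand, candidates)

    # Step 4: the new brand enters the knowledge graph and the annotation
    # later becomes training data for retraining the classifier
    knowledge_graph.add_brand(resolved_brand)
    return {"brand": resolved_brand, "source": "llm"}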

An excellent example of combining traditional machine learning models (classifiers) with LLMs to enhance a workflow.

Doing this means that a customer's search doesn't necessarily have to include brand names or actual product names.

They can just search "organic handwash" and several brands will be returned.

From an ad perspective, product owners can also advertise their own products for relevant searches.


DoorDash's workflow for enhancing product metadata information via traditional classifiers and LLMs. Source: DoorDash engineering blog.

4. Training LLMs in the wild at a startup by Yi Tay (founder of Reka.ai)

Yi Tay left Google Brain about a year ago to found Reka, a generative AI company.

In that short time, starting from scratch, Reka has built infrastructure and trained generative AI models that are competitive with or better than Gemini Pro and GPT-3.5, and now close to GPT-4 and Claude 3 Opus.

In a story-style blog post, Tay shares lessons learned on what it's like to build world-class models entirely from scratch.

Namely:

  1. Acquiring compute: LLMs take exceptionally large amounts of compute. Where do you get it? And even once you have it, how do you make it reliable?
  2. GPUs vs TPUs: having spent years at Google, Tay used almost exclusively TPUs (Google's in-house Tensor Processing Unit) and seems to think TPUs are much more reliable than GPUs.
  3. Multi-cluster setups and data movement: when you need 100s or sometimes 1000s of GPUs, you sometimes need to combine or use several different clusters. This is not trivial, as LLMs need to train on terabytes of data, and moving that much data around takes time.
  4. Code in the wild vs at a big tech company: PyTorch was the framework of choice for training Reka models because of its use in industry, but there are many in-house frameworks at Google that make training larger models easier.
  5. Less principled, more YOLO: at a startup you don't have the resources and backing of a big tech company. At a big company, scaling models generally happens systematically, but when you need to get something out as a new company, there isn't always time to be systematic.

Bonus: Yi Tay posts some great takes and insights into ML/AI on X/Twitter, like a recent discussion on architecture scaling vs data scaling.

5. Innovation through prompting by Ethan Mollick

Professor Ethan Mollick explores several use cases where educators leverage AI and prompting to help with their materials.

As a teacher, I read this and got plenty of ideas for my own future materials.

One example use case: making quizzes from existing materials, with the resources needed to answer each question automatically linked back to the source material.
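
As a quick illustration of that idea (my own prompt wording, not Mollick's), a quiz-generation prompt might look something like this:

# Illustrative prompt template for generating quizzes linked back to source material.
# The wording is my own, not taken from Mollick's article.
quiz_prompt = """You are a teaching assistant.
From the course material below, write 5 multiple-choice questions.
For each question include:
- four answer options (exactly one correct),
- the correct answer,
- the section heading a student should re-read to answer it.

Course material:
{material}
"""

print(quiz_prompt.format(material="<paste lecture notes or course material here>"))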

6. How Pinterest built text-to-SQL

Text-to-SQL means entering a natural language prompt such as "users with over 100 followers who posted in the last week from a European country" and getting back a valid SQL query.

For example (demo query):

SELECT user_id, followers_count, post_date, country
FROM users
WHERE followers_count > 100
  AND post_date >= CURRENT_DATE - INTERVAL '7 days'
  AND country IN ('Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 
                  'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 
                  'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy', 
                  'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 
                  'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 
                  'Spain', 'Sweden', 'United Kingdom')
ORDER BY followers_count DESC;

This is helpful because generally more people will be well-versed in natural language versus writing SQL.

But how do you get this to work across 100s or even 1000s of different SQL tables?

What if you're not sure of the right table to query?

Once again, LLMs to the rescue!

You can create a vector index of table summaries and historical queries against them (e.g. turn the table metadata + history into embeddings).

Then query the vector index based on the embedded query text.

The top N potential tables are then returned along with more details about the tables to create a prompt for an intermediate LLM.

The intermediate LLM sorts the top N potential tables and metadata into a smaller pool of candidates and these new top K tables are returned to the user.

The user confirms the top K tables and the text-to-SQL pipeline gets executed in a RAG (Retrieval Augmented Generation) style workflow.
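
As a rough illustration of the retrieval step (not Pinterest's actual code), you could index table summaries with an embedding model and pull back the closest tables for a natural language question. The model choice and table summaries below are placeholders:

# Rough sketch of the "find candidate tables for a question" retrieval step.
# The embedding model and table summaries are placeholders, not Pinterest's setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

table_summaries = [
    "users: one row per user, includes user_id, followers_count, country",
    "posts: one row per post, includes user_id, post_id, post_date",
    "ads: advertising campaigns and spend per advertiser",
]

question = "users with over 100 followers who posted in the last week from a European country"

# Embed the table summaries (offline) and the question (online), then rank by similarity
table_embeddings = model.encode(table_summaries, normalize_embeddings=True)
question_embedding = model.encode(question, normalize_embeddings=True)
scores = util.cos_sim(question_embedding, table_embeddings)[0]

# Top-N candidate tables would then go into the prompt for the intermediate LLM
top_n = scores.argsort(descending=True)[:2]
for idx in top_n:
    print(table_summaries[int(idx)])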

One of the most important findings was that table documentation played a crucial role in search hit rate performance.

Search hit rate without table documentation in the embeddings was 40%.

But performance increased linearly with additional documentation up to 90%.


Pinterest text-to-SQL pipeline which combines an offline and an online system. Source: Pinterest Engineering Blog.

7. The 900 most popular open source AI tools on GitHub, reviewed

Chip Huyen is one of my favourite people in the world of ML.

I have her book Designing Machine Learning Systems on the shelf next to me.

In one of her latest posts, she goes through 900+ of the most popular open source AI tools and picks out some of the biggest trends across the past couple of years.

These trends are broken down into four layers: applications, application development, model development and infrastructure.


The New AI Stack created by Chip Huyen after reviewing 900+ of the most popular AI tools on GitHub. Source: Chip Huyen blog.

Check out Chip's blog post to get a bunch of insight into new tools and trends on the horizon.

One standout thing I noticed is that a considerable number of tools get a lot of stars/hype to begin with but end up living fast and dying young (i.e. their popularity falls off quickly).

8. mixedbread.ai Introduces Binary MRL embeddings for a 64x efficiency gain

Embeddings are learned data representations.

They can take complex data samples and turn them into useful pieces of information that can be compared to each other.

And mixedbread.ai make some of the best text embedding models out there.

Their newest release, Binary Matryoshka Representation Learning (Binary MRL), allows for both a smaller embedding vector size (e.g. 512 vs 1024 dimensions) as well as a smaller data representation size (e.g. binary vs float32) for a whopping 64x efficiency gain while maintaining ~90% of performance.

This means you can store an embedding in just 64 bytes of memory (vs the standard 4,096 bytes for a 1024-dimension embedding in float32).

This translates to a cost reduction from $14,495.85 (3.81TB at $3.8 per GB/month) to $226.50 (59.60GB) per month to store 1B (1 billion) embeddings on x2gd instances on AWS.

And even better, all of this can be implemented in a few lines of code via the sentence_transformers library:

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# 1. Load an embedding model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# 2. Encode some text and select MRL dimensions
mrl_embeddings = model.encode(
    ["Who is German and likes bread?", "Everybody in Germany."], normalize_embeddings=True)[..., :512] 

# 3. Apply binary quantization
binary_embeddings = quantize_embeddings(mrl_embeddings, precision="binary")
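
As a quick sanity check on the storage numbers, you can inspect the byte sizes directly, continuing the snippet above (assuming both outputs are NumPy arrays, as in the sentence_transformers documentation):

# Quick sanity check on the sizes (assumes both outputs are NumPy arrays).
print(mrl_embeddings.shape, mrl_embeddings.nbytes // len(mrl_embeddings))
# (2, 512) -> 2,048 bytes per float32 embedding (already 2x smaller than 1024 dims)
print(binary_embeddings.shape, binary_embeddings.nbytes // len(binary_embeddings))
# 512 bits packed into 64 bytes per embedding (64x smaller than 1024-dim float32)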

9. A handful of quick releases

  • Label Studio 1.12.0 makes it easier to add a foundation model backend for assisted data annotation. Benefit: quality data is one of the biggest bottlenecks to ML projects. Improving data with label assistance from foundation models can improve downstream tasks.
  • REKA Core is a new state-of-the-art generative AI model similar to GPT-4, Claude and Gemini with very good results (this is the model from Yi Tay and colleagues mentioned above).
  • Stable Diffusion 3 API is live with incredible image generation capabilities.
  • Snowflake (a big cloud database company) recently released three AI tools for LLMs: Arctic embedding models (open-source), the Snowflake Arctic LLM for enterprise (open-source) and a Snowflake text-to-SQL copilot (for use within Snowflake).
  • Meta released the first round of Llama 3 LLMs, Llama 3 8B and Llama 3 70B, with a larger 405B model still in training. These models are text-only for now but more features are said to be coming later. There's also a research paper on the way; once that comes out, I'll include it in an upcoming issue.
  • Schedule-Free is an optimizer that promises equivalent or better performance compared to optimizers with decreasing learning rate schedules (e.g. cosine decay); a short usage sketch follows this list.
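
For reference, here's roughly how the schedulefree package is used, based on the project's README at the time of writing; treat the exact names as an assumption and check the repo before relying on them. One notable detail is that the optimizer itself is switched between train and eval modes:

# Rough usage sketch of the Schedule-Free optimizer, based on the project's README
# at the time of writing; check the repo for the current API.
import torch
import schedulefree

model = torch.nn.Linear(10, 2)  # toy model for illustration
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

model.train()
optimizer.train()  # Schedule-Free optimizers are switched to train mode too
for _ in range(100):
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()
optimizer.eval()  # and to eval mode before evaluation or checkpointing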

10. Machine Learning Research Papers


Machine-assisted manual annotation pipeline from the COCONut paper. Models are used to propose the initial labels before people are asked to review them (it's much easier to review labels than it is to create them from scratch). Source: COCONut paper.

11. Talks, Tutorials and Presentations

See you next month!

What a massive month for the ML world in April!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can check out all Zero To Mastery courses.
