49th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I'm an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
My brother and I have been working on a food education app called Nutrify.
And now version 1.0 is live!
Nutrify uses computer vision to allow you to take a photo of food and learn about it.
Stay tuned for an official launch video coming soon to my YouTube channel.
Nutrify: The Food App, live on the App Store. Take a photo of food and learn about it! If you've got an iPhone with iOS 16.0+, search "nutrify" on the App Store and look for the pineapple.
Much of modern ML is compression, as in, compressing a larger data source (images, text, audio) into a smaller but meaningful data source (vector, embeddings).
The question is, who does the compression?
Is it OpenAI's API? Is it an open-source model? Is it a compression algorithm? Or is it a custom trained model?
Of course, answers will depend on a number of things. Cost requirements, latency requirements, privacy requirements, performance requirements. And often each of these will come with various levels of priority.
For example, using OpenAI's API may be off limits because of your app's strict privacy requirements.
Or perhaps you can't use a cloud-based API at all because of the latency requirements of your model.
These are some of the main considerations when it comes to ML in production.
And they're explored in Vicky Boykis's recent blog post What's new with ML in production?
Most of modern machine learning is figuring out how to compress your data into a useful representation. And some algorithms are better than others. Source: What's new with ML in production?
Every day a new machine learning demo appears in the world. Then a few days later it's forgotten because there's a new one. And so on and so on.
For every 1,000 demos, perhaps 1 makes it into a production-grade application.
This is what Vicky Boykis did with Viberary, an application to search for books based on ✨vibe✨.
As in, if I search for "sci-fi but not too much with a twist of romance and adventure", the results would hopefully include books that match that theme (e.g. my own book Charlie Walks).
Vicky's post-mortem of Viberary, Retro on Viberary, is full of practical tips on what it takes to build a production-grade ML system and ship it so others can use it.
Such as what model to use:
I chose msmarco-distilbert-base-v3, which is middle of the pack in terms of performance, and critically, is also tuned for cosine similarity lookups, instead of dot product, another similarity measure that takes into account both magnitude and direction. Cosine similarity only considers direction rather than size, making cosine similarity more suited for information retrieval with text because it's not as affected by text length, and additionally, it's more efficient at handling sparse representations of data.
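To make the cosine versus dot product difference concrete, here's a minimal sketch in plain Python. The vectors are made-up toy values (not real msmarco-distilbert-base-v3 embeddings), and treating "a longer text" as "a larger-magnitude vector" is a simplifying assumption for illustration only.

```python
import math

def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy "embeddings" (hypothetical values, chosen by hand)
query = [1.0, 2.0, 3.0]
doc_small = [1.0, 2.0, 3.0]     # same direction as the query
doc_large = [10.0, 20.0, 30.0]  # same direction, 10x the magnitude

dot_small = dot(query, doc_small)
dot_large = dot(query, doc_large)
cos_small = cosine_similarity(query, doc_small)
cos_large = cosine_similarity(query, doc_large)

print(f"dot:    {dot_small:.1f} vs {dot_large:.1f}")
print(f"cosine: {cos_small:.3f} vs {cos_large:.3f}")
```

The dot product scores the larger vector ten times higher, while cosine similarity scores both documents the same because they point in the same direction.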
Tips on product building being an iterative process (arguably, as soon as you launch a product, it's the worst it'll ever be, because you'll upgrade it over time):
The important thing is to keep benchmarking the current model against previous models and to keep iterating and keep on building.
And finally, my favourite:
The satisfaction of shipping products that you've built is unparalleled.
Philipp Schmid's blog is one of my favourite resources for practical modern machine learning use cases.
In his latest article, he shares how to fine-tune an open-source LLM (e.g. Llama 2, Mistral and more) in 2024 on your own dataset.
Starting with the most important step: defining your use case. Many times I've seen ML projects start out with "let's use XYZ model" because it's the latest and greatest rather than having a practical use case for it.
Perhaps fine-tuning isn't necessary for your use case. Or maybe you don't have enough data yet.
Either way, Phil's blog post is a sensational introduction to fine-tuning LLMs.
Nathan Lambert writes Interconnects.ai, a newsletter about AI and everything around it. I recently found two articles interesting and helpful:
1) It's 2024 and they just want to learn
An article detailing the fact that the recipe for training performant LLMs is pretty well-known now. As in, with large enough data, a Transformer architecture and compute power, even researchers relatively new to the game can create state-of-the-art (SOTA) models.
Of course, there's always room for improvement. It's also an ode to the open-source community: although many cannot run the biggest models, well-performing and practical LLMs are becoming more and more accessible locally.
2) RLHF learning resources in 2024
RLHF stands for "Reinforcement Learning from Human Feedback". And it's the technique used to "align" LLMs to what humans prefer.
In essence, you give a human a set of LLM outputs from a given input and they rate their favourite output based on the input. Repeat this a few hundred thousand (or million) times and you can train a reinforcement learning model to tune an LLM so its outputs more closely align with the human rated preferences.
The article collects a plethora of resources to learn more on the topic.
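As a rough illustration of how "a human picked this output over that one" becomes a training signal, here's a sketch of a pairwise (Bradley-Terry style) preference loss, which is one common choice for training a reward model. This is my assumption about a typical setup rather than something taken from the article, and the reward values below are made-up numbers standing in for a reward model's scores.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Small when the human-preferred output scores higher than the rejected one,
    large when the ordering is wrong.
    """
    x = -(reward_chosen - reward_rejected)
    # Numerically stable softplus(x) = log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical reward scores for one (chosen, rejected) pair
loss_agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_unsure = preference_loss(reward_chosen=0.0, reward_rejected=0.0)
loss_disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)

print(loss_agrees, loss_unsure, loss_disagrees)
```

Minimising this loss over many human-rated pairs pushes the reward model to score preferred outputs higher, and that reward model can then be used to tune the LLM.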
One of the key ingredients to modern LLMs is RLHF (as discussed above). In other words, getting an LLM's output to align with human preferences. However, this approach requires a large amount of manual effort.
What if you could use an existing LLM to create the preferences for you?
For example, given an input and two candidate outputs from one LLM, get another LLM to rate which is better (by some criteria). This is the concept of RLAIF.
And as it turns out, it's just as effective and in some cases more effective than pure RLHF. Cameron Wolfe's article on RLAIF collects a bunch of different resources on the topic as well as an analysis of the original RLAIF paper.
RLHF (Reinforcement Learning from Human Feedback) versus RLAIF (Reinforcement Learning from AI Feedback). Source: RLAIF paper.
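Here's a small sketch of what the AI labelling step could look like in code. Everything in it is hypothetical: `label_preference` and `toy_judge` are names I made up, and `toy_judge` is a deliberately silly stand-in (it prefers the shorter answer) for what would really be a call to a judge LLM.

```python
from typing import Callable, Dict

# A judge takes (prompt, response_a, response_b) and returns "A" or "B"
Judge = Callable[[str, str, str], str]

def label_preference(prompt: str, response_a: str, response_b: str, judge: Judge) -> Dict[str, str]:
    """Ask a judge which response is better and return a preference record."""
    choice = judge(prompt, response_a, response_b)
    if choice not in ("A", "B"):
        raise ValueError(f"Judge must return 'A' or 'B', got {choice!r}")
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Hypothetical stand-in for a judge LLM: prefers the shorter response."""
    return "A" if len(response_a) <= len(response_b) else "B"

record = label_preference(
    prompt="What is 2 + 2?",
    response_a="The answer might be 5, or perhaps 4.",
    response_b="4",
    judge=toy_judge,
)
print(record)
```

The resulting chosen/rejected records have the same shape as human preference data, which is why they can slot into the same training pipeline.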
Mat Miller has published a sensational tutorial on how to build your own GPT (Generative Pretrained Transformer) from scratch.
All the way from tokenization (turning words into numbers) to data loading to scaling up the Transformer architecture.
Best of all, the notebook is completely runnable in Google Colab.
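If "turning words into numbers" sounds abstract, here's about the simplest version of it: a character-level tokenizer. This is a sketch under my own assumptions (the tutorial may well use a different tokenization scheme), and the tiny text is just a placeholder for a real training corpus.

```python
# Placeholder corpus; a real one would be far larger
text = "hello world"

# Vocabulary: every unique character, in sorted order
vocab = sorted(set(text))

# Lookup tables between characters and integer IDs
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(s: str) -> list:
    """Turn a string into a list of integer token IDs."""
    return [char_to_id[ch] for ch in s]

def decode(ids: list) -> str:
    """Turn a list of integer token IDs back into a string."""
    return "".join(id_to_char[i] for i in ids)

token_ids = encode("hello")
print(token_ids)
print(decode(token_ids))
```

Note that `encode` will raise a `KeyError` for any character that wasn't in the corpus, which is one reason production tokenizers use subword schemes instead.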
AWS share a real-world use case of object detection and tracking by creating a ball tracking pipeline for the PGA tour.
Using several different video feed inputs, they create a cascade of convolutional neural networks (CNNs) to detect non-players, players and golf balls.
The thing that stood out to me the most was the emphasis on data labelling. They created their own custom dataset and then fine-tuned several YOLOv7 models which happened to perform very well on their target problem.
Golf ball tracking pipeline with a cascade of different CNN models for different tasks on AWS. Source: AWS blog.
The PyTorch team show how you can take a Llama-2 7B model from requiring 112GB of GPU VRAM to fine-tune to taking only ~7-10GB (which is possible to do on a T4 GPU on Google Colab).
"With QLoRA we are matching 16-bit fine-tuning performance across all scales and models, while reducing fine-tuning memory footprint by more than 90%, thereby allowing fine-tuning of SOTA models on consumer-grade hardware."
Turns out you can customize an open-source LLM for your own use case with off-the-shelf tools quite efficiently.
There's a notebook to go with it on Google Colab too.
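For a feel of where a number like 112GB can come from, here's a back-of-the-envelope sketch. It assumes full fine-tuning in 32-bit precision with an Adam-style optimizer (weights, gradients and two optimizer states per parameter), and it ignores activations and framework overhead. Those assumptions are mine, so treat the output as rough arithmetic rather than a measurement.

```python
PARAMS = 7e9   # ~7 billion parameters
GB = 1e9       # decimal gigabytes

def full_finetune_gb(n_params: float, bytes_per_value: int = 4) -> float:
    """Weights + gradients + two Adam states, all at the same precision."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    adam_states = 2 * n_params * bytes_per_value
    return (weights + gradients + adam_states) / GB

def quantized_weights_gb(n_params: float, bits: int = 4) -> float:
    """Memory for the frozen base weights alone when stored at `bits` per value."""
    return n_params * bits / 8 / GB

full_gb = full_finetune_gb(PARAMS)
base_4bit_gb = quantized_weights_gb(PARAMS)

print(f"Full fine-tuning (fp32 + Adam): ~{full_gb:.0f} GB")
print(f"4-bit base weights only:        ~{base_4bit_gb:.1f} GB")
```

Under these assumptions the frozen 4-bit base model takes only a few GB, and the rest of the ~7-10GB budget goes to things like the small trainable adapter weights, their optimizer states and activations.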
llm-course is a packed GitHub repo by Maxime Labonne with a bunch of materials and resources to learn LLMs from beginner to building LLM-based applications and deploying them.

    import instructor
    from openai import OpenAI
    from pydantic import BaseModel

    # Enables `response_model`
    client = instructor.patch(OpenAI())

    class UserDetail(BaseModel):
        name: str
        age: int

    user = client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=UserDetail,
        messages=[
            {"role": "user", "content": "Extract Jason is 25 years old"},
        ]
    )

    assert isinstance(user, UserDetail)
    assert user.name == "Jason"
    assert user.age == 25

Example of Surya model extracting line-level text from an article. Source: Surya GitHub.
A theme I'm noticing in ML research right now is the rise of synthetic data.
Over the past 2-3 years, text generation and image generation models got really good.
Now, for many common tasks, it's possible to bootstrap a dataset by just generating synthetic data with existing foundation models.
For example, say you wanted to build a text classifier for certain classes.
You could get a model like GPT-4 to generate you 100 starter examples of each class and then build from there.
Or if you were short on images for a certain class of image classification data, you could try and get Stable Diffusion to generate images for that class to boost it.
A repo I'm watching is syn-rep-learn by Google Research.
There are already three papers in the repo (one of which is mentioned below) exploring the limits and possibilities of using synthetic data for vision representation learning, and my guess is that text and other modalities may come next.
Researchers explore the possibility of training CLIP (Contrastive Language-Image Pre-training) models with synthetic images and find that they perform well but real data still wins out (for now).
It seems that SynCLR (a paper mentioned in last month's AI/ML Monthly issue) in the same repo may have improved upon this already.
Bonus: The Google Colab X account recently shared a notebook with code examples to generate images with Imagen (Google's AI image generation model) with Gemini (Google's big LLM) as the prompter.
Scaling laws of synthetic data finds that real data still wins out on image classification tasks. However, synthetic data may be more robust to out-of-distribution examples. Source: Scaling laws for synthetic data paper.
Why train on billions of unsupervised examples when you can get an LLM to generate premium samples?
Researchers use LLMs to create synthetic data for hundreds of thousands of text embedding tasks across ~100 languages and then train a text embedding model on top of it.
And it turns out that the synthetic data alone can provide excellent representations, but training with synthetic data as well as real data sets a new state-of-the-art on the MTEB (Massive Text Embedding Benchmark) leaderboard.
See the embedding model on Hugging Face.
It turns out that "easy" data can be as good as "hard" data for fine-tuning/prompting LLMs.
Specifically, an LLM prompted with 3rd grade level questions can perform almost as well on a set of college questions as an LLM prompted with similar kinds of college questions.
The takeaway: LLMs seem to have most of their knowledge in the pretrained weights (from unsupervised pretraining, or the "P" in ChatGPT) and tuning/prompting them with easy data (e.g. data that's easy to collect) relevant to your target problem may result in a more efficient pipeline than collecting hard examples.
Read the accompanying blog post for more highlights.
Easy data seems to get about 70-100% of the results of hard data. To me this shows that most of an LLM's knowledge is embedded in the pretrained weights and then unlocked via fine-tuning/prompting. Source: The Unreasonable Effectiveness of Easy Training Data for Hard Tasks paper.
What happens when you mix visual search techniques with the world knowledge of an LLM? You get an MLLM (Multimodal Large Language Model) with much better visual capabilities.
V* introduces a framework called SEAL or Show, SEArch and TelL.
For finer-grained visual searching abilities, a visual search mechanism is engaged if the image encoder's visual features are not sufficient for answering a query.
The visual search mechanism turns on a visual working memory which contains four blocks: a question block, a global image block, a searched targets block and a target location block.
The overall mechanism creates a pattern similar to how people zoom in on images when asked a specific question about them. Compared to other models such as GPT-4V and Gemini Pro, V* performs much better at fine-grained visual search.
Though it's still far worse than humans (98%+ for humans versus 74%+ for V*).
See the code on GitHub, try out the demo on Hugging Face.
Example demo of V* performing visual search on an image to find a glass and then answer the question of what colour the liquid is inside it. Note: you might have to zoom in to find the glass (I did). Source: V* demo on Hugging Face.
What a massive month for the ML world in January!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.