52nd issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
The MLOps Discord (MLOps stands for machine learning operations, essentially all the parts that go around building a machine learning model) was hosting a “Lessons Learned” session for people building ML/AI-powered apps.
I applied and was invited along to talk about Nutrify (a computer vision-powered application to help people learn about food).
In the talk, I share 4 of my own biggest lessons building an AI startup from scratch:
A sensational article with practical tips on how to get a role in deep learning.
My favourite takeaways were:
A list of foundational papers from 2015-2022 for the modern era of deep learning collected by Vikas Paruchuri. A good exercise would be to take each of these papers and reimplement them in a deep learning framework of your choice. Source: Vikas Paruchuri blog.
I liked this article so much I’ve added it to my list of resources for how to get a job in machine learning.
Ken’s company Truss helps accountants organize client documents.
To do so, they use GPT.
More specifically, GPT-4 (85%) and GPT-3.5 (15%).
So far their team has processed close to half a billion tokens via the GPT API (around 750,000 pages of text).
Bonus: Ken wrote a follow-up blog post called GPT is the Heroku of AI, which explains how GPT’s zero-shot abilities dramatically lower the barrier to entry for building AI/ML-based apps. I totally agree.
Some of the things GPT (and similar foundation models) can do with little to no data is an incredible kickstart to any AI project. Start the project fast with a GPT-like model and then move to your own custom models.
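To make this concrete, here’s a minimal sketch of zero-shot document classification in the spirit of Truss’s use case. This is my own hypothetical example using the openai Python client; the prompt, categories and document are assumptions, not their actual setup:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Hypothetical client document (not from Truss)
document_text = "INVOICE #1042\nABC Plumbing Pty Ltd\nTotal due: $1,500.00 by 30 June"

# Zero-shot classification: no labelled training data or custom model required
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Classify the document into one of: invoice, receipt, "
                    "bank_statement, tax_form, other. Reply with the label only."},
        {"role": "user", "content": document_text},
    ],
)

print(response.choices[0].message.content)  # e.g. "invoice"

Once a workflow like this proves valuable, you can collect the inputs and outputs and use them to train a smaller, cheaper custom model.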
DoorDash is an app to help people order food or groceries from almost anywhere.
So you can imagine how many products they have to record in their database.
Some products are the same item but are called different things at different stores.
For example, Carton of Water Bottles 12x600ml could be called 600ml Water Bottles (12 Pack) somewhere else.
How do you link these up?
Or what if there’s a new product being added to the database?
How do you enrich its entry with extra information so people can search for it?
LLMs to the rescue!
On their engineering blog, DoorDash details how they use LLMs to enrich product information by extracting structured attributes from unstructured data in four steps:
An excellent example of combining traditional machine learning models (classifiers) with LLMs to enhance a workflow.
Doing this means that a customer’s search doesn’t necessarily have to include brand names or actual product names.
They can just search “organic handwash” and several brands will be returned.
From an ad perspective, product owners can also advertise their own products for relevant searches.
DoorDash’s workflow for enhancing product metadata information via traditional classifiers and LLMs. Source: DoorDash engineering blog.
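As a rough illustration of the product-linking idea (my own minimal sketch using the sentence_transformers library, not DoorDash’s actual pipeline), you could embed product names and link entries whose embeddings are highly similar:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

# Two differently named listings that describe the same product
names = ["Carton of Water Bottles 12x600ml", "600ml Water Bottles (12 Pack)"]
embeddings = model.encode(names, normalize_embeddings=True)

# Cosine similarity close to 1.0 suggests the listings refer to the same item
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
if similarity > 0.9:  # the threshold is an assumption, tune it on your own data
    print(f"Likely the same product (similarity={similarity:.2f})")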
Yi Tay left Google Brain about a year ago to found Reka, a generative AI company.
In that short time, starting from scratch, Reka built infrastructure and trained generative AI models competitive with or better than Gemini Pro and GPT-3.5, and now close to GPT-4 and Claude 3 Opus.
In a story-style blog post, Tay shares lessons learned on what it’s like to build world-class models entirely from scratch.
Namely:
Bonus: Yi Tay posts some great takes and insights into ML/AI on X/Twitter. Like a recent discussion on architecture scaling vs data scaling.
Professor Ethan Mollick looks into a world and several use cases where educators leverage AI and prompting to help with their materials.
As a teacher, I read this and got plenty of ideas for my own future materials.
One example use case: making quizzes from existing materials and then having the resources to answer the question automatically linked back to source materials.
Text-to-SQL means entering a natural language prompt such as “users with over 100 followers who posted in the last week from a European country” and getting back a valid SQL query.
For example (demo query):
SELECT user_id, followers_count, post_date, country
FROM users
WHERE followers_count > 100
AND post_date >= CURRENT_DATE - INTERVAL '7 days'
AND country IN ('Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus',
'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France',
'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy',
'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands',
'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia',
'Spain', 'Sweden', 'United Kingdom')
ORDER BY followers_count DESC;
This is helpful because far more people are well-versed in natural language than in writing SQL.
But how do you get this to work across 100s or even 1000s of different SQL tables?
What if you’re not sure of the right table to query?
Once again, LLMs to the rescue!
You can create a vector index of table summaries and historical queries against them (e.g. turn the table metadata + history into embeddings).
Then query the vector index based on the embedded query text.
The top N potential tables are then returned along with more details about the tables to create a prompt for an intermediate LLM.
The intermediate LLM narrows the top N potential tables and their metadata down to a smaller pool of candidates, and these top K tables are returned to the user.
The user confirms the top K tables and the text-to-SQL pipeline gets executed in a RAG (Retrieval Augmented Generation) style workflow.
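Here’s a minimal sketch of the table-retrieval step (my own simplification with sentence_transformers and NumPy, not Pinterest’s production code): embed each table’s summary, embed the question, and return the most similar tables:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical table summaries (in practice: table metadata + historical queries)
table_summaries = {
    "users": "User accounts: user_id, country, followers_count, created_at.",
    "posts": "Posts made by users: post_id, user_id, post_date, content.",
    "ads": "Ad campaigns: campaign_id, budget, impressions, clicks.",
}

table_names = list(table_summaries)
table_embeddings = model.encode(list(table_summaries.values()), normalize_embeddings=True)

question = "users with over 100 followers who posted in the last week"
query_embedding = model.encode(question, normalize_embeddings=True)

# Cosine similarity (dot product of normalized vectors), then take the top N tables
scores = table_embeddings @ query_embedding
for i in np.argsort(-scores)[:2]:  # top N = 2
    print(table_names[i], f"{scores[i]:.3f}")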
One of the most important findings was that table documentation played a crucial role in search hit rate performance.
Search hit rate without table documentation in the embeddings was 40%.
But performance increased linearly with additional documentation up to 90%.
Pinterest text-to-SQL pipeline which combines an offline and an online system. Source: Pinterest Engineering Blog.
Chip Huyen is one of my favourite people in the world of ML.
I have her book Designing Machine Learning Systems on the shelf next to me.
In one of her latest posts, she goes through 900+ of the most popular open source AI tools and picks out some of the biggest trends across the past couple of years.
These trends are broken down into four layers: applications, application development, model development and infrastructure.
The New AI Stack created by Chip Huyen after reviewing 900+ of the most popular AI tools on GitHub. Source: Chip Huyen blog.
Check out Chip’s blog post to get a bunch of insights into new tools and trends on the horizon.
One standout thing I noticed is that a considerable number of tools get a large number of stars/hype to begin with but end up living fast and dying young (i.e. their popularity falls off quickly).
Embeddings are learned data representations.
They can take complex data samples and turn them into useful pieces of information that can be compared to each other.
And mixedbread.ai make some of the best text embedding models out there.
Their newest release, Binary Matryoshka Representation Learning (Binary MRL) allows for both a smaller embedding vector size (e.g. 512 vs 1024) as well as a smaller data representation size (e.g. binary vs float32) for a whopping 64x efficiency gain with 90% of performance maintained.
This means you can store an embedding in just 64 bytes of memory (vs the standard 4,096 bytes for a 1024-dimensional float32 embedding).
This translates to a cost reduction from $14,495.85 (3.81TB at $3.8 per GB/month) to $226.50 (59.60GB) per month to store 1B (1 billion) embeddings on x2gd instances on AWS.
And even better, all of this can be implemented in a few lines of code via the sentence_transformers library:
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings
# 1. Load an embedding model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
# 2. Encode some text and select MRL dimensions
mrl_embeddings = model.encode(
    ["Who is German and likes bread?", "Everybody in Germany."],
    normalize_embeddings=True,
)[..., :512]
# 3. Apply binary quantization
binary_embeddings = quantize_embeddings(mrl_embeddings, precision="binary")
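As a small follow-up (my own addition): the "binary" precision returns bit-packed vectors, so each 512-dimensional embedding occupies 512 / 8 = 64 bytes, and two embeddings can be compared cheaply via Hamming distance:

import numpy as np

# Continuing from the snippet above: each row is a bit-packed 512-dim embedding
print(binary_embeddings.shape)  # (2, 64) -> 64 bytes per embedding

# Hamming distance: XOR the packed bytes and count the bits that differ
diff = np.bitwise_xor(binary_embeddings[0], binary_embeddings[1]).view(np.uint8)
hamming_distance = int(np.unpackbits(diff).sum())
print(hamming_distance)  # lower = more similar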
Machine-assisted manual annotation pipeline from COCONut paper. Models are used to propose the initial labels before people are asked to review them (it’s much easier to review than it is to create labels from scratch). Source: COCONut paper.
What a massive month for the ML world in April!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.