39th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Hey there, Daniel here.
I'm a Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, I've done my best to keep things to the point.
Enough about me! You're here for this month's Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
A few students asked me the question:
"Why bother building your own custom ML models when ChatGPT/GPT-4 will be better?"
The short answer?
ChatGPT/GPT-4 are excellent but that doesn't mean you won't still need your own models.
Why?
There are lots of reasons.
Why bother building your own custom ML models when ChatGPT/GPT-4 will be better?
— Daniel Bourke (@mrdbourke) March 29, 2023
A few reasons:
1. *Lots* of companies have private data that can't go to an API call (e.g. if you don't want to send all your data to OpenAI/Microsoft, build your own model)
2. ChatGPT/GPT-4 are…
Hello friends!
In light of the enormous releases in the past month (GPT-4, GPT-4 plugins, Google's PaLM API, etc), I've decided to sidestep covering all of those.
There are plenty of other blog posts and newsletters covering recent large language model (LLM) drops so I've decided to cover other things happening in ML.
The first two even shine some valuable critical thinking onto the recent releases!
There was a recent open letter to slow down the training of larger and larger language models (e.g. GPT-5+) to figure out if they're safe before continuing.
The letter has been signed by people such as Elon Musk and Emad Mostaque (founder of Stability.ai).
After the letter went public, a tidal wave of arguments for and against started to come in.
One of my favourites was the writing from the AI Snake Oil newsletter.
Their post, A misleading open letter about sci-fi AI dangers ignores the real risks, is about the risks people think AI will cause in the future versus the risks it poses now.
AI speculative (potential future) risks versus real (current) risks. Source: AI Snake Oil blog.
After reading through their post, I found another of theirs on how GPT-4 may have broken the number 1 rule of machine learning: don't mix your training and testing data.
This data leakage could explain why the model is so good at certain tasks and benchmarks.
Because the benchmarks are in the training data.
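To make the leakage idea concrete, here's a minimal sketch of a verbatim-overlap check between a training set and a benchmark. The function and data are mine for illustration; real contamination checks (like the n-gram overlap analysis OpenAI ran for GPT-3) are far more involved:

```python
def contamination_rate(train_examples, test_examples):
    """Fraction of test examples that appear verbatim in the training data.

    A crude proxy for train/test leakage; illustrative only.
    """
    train_set = set(train_examples)
    overlap = sum(1 for example in test_examples if example in train_set)
    return overlap / len(test_examples)


train = ["the cat sat on the mat", "benchmark question 1", "some web text"]
test = ["benchmark question 1", "benchmark question 2"]
print(contamination_rate(train, test))  # 0.5
```

If a benchmark question scores a non-zero rate here, the model may have simply memorised the answer rather than reasoned its way to it.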
Their writings link to many other sources to critically evaluate the power of large language models which I've found very helpful.
I've signed up for their emails to get more in the future.
Hussein Mehanna is head of AI/ML at Cruise (a self-driving car company) and argues that in spite of all the major progress in generative models, they still require human assistance.
As in, they require human intervention to succeed in any endeavour.
So they're not going to replace humans (yet).
You might trust ChatGPT to draft an essay or email for you.
But you wouldn't blindly let it make decisions for you all day (yet).
Mehanna goes on to relate ChatGPT and similar models to the world of self-driving cars.
Current generative models are similar to level 1-3 autonomous cars.
They provide help but require a human to be in control.
Whereas level 4 and level 5 self-driving systems require almost no human input whatsoever.
What happens when generative AI gets to this stage?
I don't know.
Weβre on a rollercoaster of strange times ahead!
pandas is one of the most used (if not the most used) data science libraries in Python.
Almost every notebook of mine starts with import pandas as pd.
Because it's so widely used, many of the upgrades happen behind the scenes so your code doesn't break on the front end.
But this update is big enough to call it pandas 2.0.
Marc Garcia, one of the core developers of pandas, writes about how pandas is starting to shift away from a NumPy backend towards an Apache Arrow backend.
In short, the Apache Arrow backend allows for a bunch of upgrades in pandas, making it faster (especially for strings), more memory-efficient and better at handling missing values.
pandas 2.0 speedups with an example on a dataframe with 2.5 million rows. Source: Marc Garcia blog.
Rob Mulla has a fantastic video on YouTube breaking down each of these updates with examples.
How do you tell if an Airbnb listing is on a Lakefront?
One way would be to just go off what the person who listed the place wrote in the description.
But what if their idea of a Lakefront was different to someone else's?
Well, maybe you could just go off GPS data and see how far the place was from a lake.
But what if the place was close to a lake on the map, but when you arrive there's a giant mountain between the house and the lake?
Ok then, how about the photos?
Or the reviews of other people?
Or the lists people have saved homes to? What if they call their lists "lakefront"?
How about all of them?
That's what Airbnb has done with their machine learning powered categorisation models.
Airbnb building 3x machine learning models for different tasks with humans-in-the-loop. Source: Airbnb Tech Blog.
Because there are so many different places on Airbnb, it would be impossible for someone to sit there and verify that they've all been categorised correctly into Lakefront, Countryside, Golf, Desert, National Parks, Surfing and so on, so machine learning to the rescue!
The Airbnb Tech Blog shares how they went about creating a dataset with humans-in-the-loop to pull a bunch of different features into several machine learning models to help automatically classify the listings.
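The multi-signal idea above can be sketched in a few lines. This is purely illustrative (the field names and thresholds are made up, not Airbnb's), but it shows how several weak signals can be combined into features for a classifier:

```python
from dataclasses import dataclass, field


@dataclass
class Listing:
    description: str
    distance_to_lake_km: float
    saved_list_names: list = field(default_factory=list)


def lakefront_signals(listing: Listing) -> dict:
    """Combine weak signals (text, location, user behaviour) into features
    a downstream classifier could use. Names and thresholds are illustrative."""
    return {
        "description_mentions_lake": "lakefront" in listing.description.lower(),
        "close_to_lake": listing.distance_to_lake_km < 0.5,
        "saved_as_lakefront": any(
            "lakefront" in name.lower() for name in listing.saved_list_names
        ),
    }


listing = Listing("Cozy lakefront cabin", 0.2, ["Lakefront trip 2023"])
signals = lakefront_signals(listing)
print(signals)  # all three signals fire for this listing
```

No single signal is trustworthy on its own, but agreement across several of them is much harder to fake.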
The Multithreaded Blog from StitchFix has a great example of how generative AI is making its way into the industry.
StitchFix is an online fashion retailer and they're currently using generative AI to craft headlines and copy (marketing text) for their advertisements as well as product descriptions for their products.
Instead of writing headlines and ads from scratch, their copywriters generate multiple headlines and ad options and edit them to make sure they suit the product.
The outstanding LAION AI have released an open-source implementation of DeepMind's Flamingo.
OpenFlamingo is a framework that enables training and evaluation of large multimodal models (LMMs).
Given an input image and input text, OpenFlamingo is able to produce an output completion.
Example of OpenFlamingo's input and output capabilities. Given a text and image input example, the model can then complete the text output given a new image. Source: LAION AI blog.
The authors of the model state that it's still a work in progress (expect updates soon) but it's already very capable.
And to go along with OpenFlamingo, the Beijing Academy of Artificial Intelligence (BAAI) have also open-sourced EVA-02, the highest performing open-source model on ImageNet with 90.0% top-1 accuracy! All with 700M fewer parameters than their previous model.
This goes to show how much room there still is to benefit from improved pretraining schedules (which is what they did to achieve such incredible results).
They've also open-sourced EVA-CLIP which produces the highest zero-shot classification (no previous images seen) score on ImageNet at 82.0%!
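"Zero-shot classification" in the CLIP/EVA-CLIP sense means picking the class whose text embedding is most similar to the image embedding, with no task-specific training. Here's a toy sketch with made-up 3-dimensional embeddings (real models produce vectors with hundreds of dimensions from learned image and text encoders):

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the class name whose text embedding is most similar
    to the image embedding (CLIP-style zero-shot classification)."""
    return max(
        text_embeddings,
        key=lambda name: cosine_similarity(image_embedding, text_embeddings[name]),
    )


# Made-up embeddings for illustration only
text_embeddings = {"dog": [1.0, 0.1, 0.0], "cat": [0.0, 1.0, 0.1]}
image_embedding = [0.9, 0.2, 0.0]
print(zero_shot_classify(image_embedding, text_embeddings))  # dog
```

Because the classes are just text prompts, you can swap in new class names without retraining anything.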
What an outstanding couple of releases for the open-source world of multi-modal models and computer vision!
You can see the LAION AI OpenFlamingo releases at:
And you can see all of the EVA vision releases at:
The largest open-source dataset of labelled 3D objects has dropped!
800k+ annotated 3D objects on everything from cars to food to rooms to shoes.
Bring on the text-to-3D object generation!
One of the great teachers of ML, Sebastian Raschka, has an excellent blog post on how to train your PyTorch models faster.
He shows how you can fine-tune a DistilBERT model with vanilla PyTorch in 21.33 min but then speeds it up to an incredible 1.8 min (a ~11x speedup!).
PyTorch training speed improvements with various training setups. Source: Sebastian Raschka blog.
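One of the simplest tricks from the post is dropping to lower precision. Here's a minimal CPU-friendly sketch of bfloat16 autocast (Raschka's benchmarks also use GPU mixed precision, torch.compile and multi-GPU training, which need more setup; the model and batch below are toys):

```python
import torch

# A tiny model and batch, just to demonstrate the autocast context
model = torch.nn.Linear(64, 2)
x = torch.randn(8, 64)

# Inside autocast, matmul-heavy ops run in bfloat16 instead of float32,
# one of the mixed-precision tricks behind the speedups
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.shape)  # torch.Size([8, 2])
```

On a GPU you'd use device_type="cuda" and usually pair it with a gradient scaler for training.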
As we've all seen over the past 10 years, model architectures have been getting better and better.
However, data is still (and always has been) the bottleneck for machine learning.
As the EVA vision results showed, with the right pretraining recipe (using various sources of data in different ways), you can improve performance dramatically.
The DataPerf benchmark is the new place for showing the power of data-centric algorithms.
Something we've known all along. Without the right data, how can you possibly build a great model? Source: Google AI blog.
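A flavour of what "data-centric" work looks like in code: instead of changing the model, you change the data. Below is a toy example of one such step, keeping only examples where independent labelling functions agree (my illustration, not an actual DataPerf benchmark task):

```python
def select_clean_examples(examples, labelers):
    """Keep only (example, label) pairs where every labelling function agrees.

    A simplified data-cleaning step; illustrative only.
    """
    clean = []
    for example in examples:
        labels = {labeler(example) for labeler in labelers}
        if len(labels) == 1:  # all labelers gave the same answer
            clean.append((example, labels.pop()))
    return clean


# Two toy labelers that disagree on one example
labeler_a = lambda x: "positive" if "good" in x else "negative"
labeler_b = lambda x: "positive" if "great" in x or "good" in x else "negative"

examples = ["good movie", "great movie", "bad movie"]
print(select_clean_examples(examples, [labeler_a, labeler_b]))
```

Here "great movie" gets dropped because the two labelers disagree on it; improving the labelers (the data side) would recover it without touching any model.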
Benchmarks include:
One of my favourite new podcasts/newsletters is Latent Space.
Itβs a blend of software development and AI.
And I love it.
Such as this recent post on You Are Not Too Old (to Pivot Into AI).
It's full of excellent advice and resources too.
Such as this handy Tweet on getting started with MLOps.
What a massive month for the ML world in March 2023!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.