Machine Learning Monthly Newsletter 💻🤖

Daniel Bourke

39th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.

Hey there, Daniel here.

I'm a Machine Learning Engineer who also teaches beginner-friendly machine learning courses.

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

What you missed in March 2023 as a Machine Learning Engineer…

My work 👇

A few students asked me the question:

"Why bother building your own custom ML models when ChatGPT/GPT-4 will be better?"

The short answer?

ChatGPT/GPT-4 are excellent, but that doesn't mean you won't still need your own models.

Why?

There are lots of reasons.

From the Internet

Hello friends!

In light of the enormous releases in the past month (GPT-4, GPT-4 plugins, Google's PaLM API, etc.), I've decided to sidestep covering all of those.

There are plenty of other blog posts and newsletters covering recent large language model (LLM) drops, so I've decided to cover other things happening in ML.

The first two even bring some valuable critical thinking to the recent releases!

1. Is AI Snake Oil?

There was a recent open letter calling for a pause on training larger and larger language models (e.g. GPT-5+) to figure out whether they're safe before continuing.

The letter has been signed by people such as Elon Musk and Emad Mostaque (founder of Stability.ai).

After the letter went public, a tidal wave of arguments for and against started to come in.

One of my favourites was the writing from the AI Snake Oil newsletter.

Their post A misleading open letter about sci-fi AI dangers ignores the real risks is about the risks people think AI will cause in the future versus the risks it poses right now.

AI speculative (potential future) risks versus real (current) risks. Source: AI Snake Oil blog.

After reading through their post, I found another of theirs on how GPT-4 may have broken the number 1 rule in machine learning: don't mix your training and testing data.

This data leakage could explain why the model is so good at certain tasks and benchmarks.

Because the benchmarks are in the training data.
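
To make the leakage idea concrete, here's a tiny made-up example of checking a benchmark for exact-match contamination (real checks are harder, since text can leak in paraphrased form):

```python
# Tiny made-up example: exact-match contamination check between a training set
# and a benchmark. Real-world checks also need to catch paraphrased overlap.
train_examples = {"What is 2 + 2?", "Translate 'hello' into French."}
benchmark_examples = {"What is 2 + 2?", "Summarise this paragraph."}

leaked = train_examples & benchmark_examples
print(f"{len(leaked)}/{len(benchmark_examples)} benchmark examples found in training data: {leaked}")
```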

Their writings link to many other sources that critically evaluate the power of large language models, which I've found very helpful.

I've signed up for their emails to get more in the future.

2. The real revolution in AI is still to come by Hussein Mehanna

Hussein Mehanna is head of AI/ML at Cruise (a self-driving car company) and argues that, in spite of all the major progress in generative models, they still require human assistance.

As in, they require human intervention to succeed in any endeavour.

So they're not going to replace humans (yet).

You might trust ChatGPT to draft an essay or email for you.

But you wouldn't blindly let it make decisions for you all day (yet).

Mehanna goes on to relate ChatGPT and similar models to the world of self-driving cars.

Current generative models are similar to level 1-3 autonomous cars.

They provide help but require a human to be in control.

Whereas level 4 and level 5 self-driving systems require almost no human input whatsoever.

What happens when generative AI gets to this stage?

I don't know.

We're on a rollercoaster of strange times ahead!

3. Pandas 2.0 is coming! — faster and more compatible

pandas is one of the most used (if not the most used) data science libraries in Python.

Almost every notebook of mine starts with import pandas as pd.

Because it's so widely used, many of the upgrades happen behind the scenes so your code doesn't break on the front end.

But this update is big enough to call it pandas 2.0.

Marc Garcia, one of the core developers of pandas, writes about how pandas is starting to shift away from a NumPy backend towards an Apache Arrow backend.

In short, the Apache Arrow backend allows for a bunch of upgrades in pandas to make it:

  • Better at handling missing values.
  • Faster (Apache Arrow's in-memory data format is generally much faster than NumPy's).
  • More interoperable with other libraries. For example, the Polars library (like pandas but written in the Rust programming language) could run a data processing pipeline (faster than pandas), with pandas then used for exporting (it supports more export formats).
  • Better at handling data types than NumPy. For example, NumPy stores boolean values as 8 bits, whereas Arrow stores them as 1 bit (an 8x saving on storage).
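
If you'd like to try the new backend out, here's a minimal sketch (requires pandas 2.0+ with pyarrow installed; the CSV filename is a placeholder):

```python
import pandas as pd  # requires pandas>=2.0 and pyarrow

# Arrow-backed dtypes support missing values natively (no silent upcasting to float).
s = pd.Series([1, 2, None], dtype="int64[pyarrow]")
print(s.dtype)  # int64[pyarrow]

# I/O functions can also return Arrow-backed DataFrames directly.
df = pd.read_csv("listings.csv", dtype_backend="pyarrow")  # placeholder filename
```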

pandas 2.0 speedups with an example on a dataframe with 2.5 million rows. Source: Marc Garcia blog.

Rob Mulla has a fantastic video on YouTube breaking down each of these updates with examples.

4. Building Airbnb Categories with ML & Human in the Loop (Part 2)

How do you tell if an Airbnb listing is on a Lakefront?

One way would be to just go off what the person who listed the place wrote in the description.

But what if their idea of a Lakefront was different to someone else's?

Well, maybe you could just go off GPS data and see how far the place was from a lake.

But what if the place was close to a lake on the map but when you arrive there's a giant mountain between the house and the lake?

Ok then, how about the photos?

Or the reviews of other people?

Or the lists people have saved homes to? What if they call their lists "lakefront"?

How about all of them?

That's what Airbnb has done with their machine learning-powered categorisation models.

Airbnb building 3x machine learning models for different tasks with humans in the loop. Source: Airbnb Tech Blog.

Because there are so many different places on Airbnb, it would be impossible for someone to sit there and verify that they've all been categorised correctly into Lakefront, Countryside, Golf, Desert, National Parks, Surfing and more. So, machine learning to the rescue!

The Airbnb Tech Blog shares how they went about creating a dataset with humans in the loop to pull a bunch of different features into several machine learning models that help automatically classify the:

  • Quality of the listing (is it inspiring, high quality or low quality?)
  • Best cover image (if someone is only going to look at one image, which one is most attractive?)
  • Category of the listing (is it Lakefront, Beachfront, Desert, etc?)
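
Airbnb's actual models aren't public, but as a toy sketch of the core idea (pooling several text signals about a listing into one category classifier), something like this captures the spirit (all data and labels below are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sketch: pool description, review and wishlist-name signals into one text field.
# All data is made up; Airbnb's real system uses far richer features and models.
listings = [
    "cosy cabin on the lake with private jetty | reviews: beautiful lakefront views | saved to: lakefront escapes",
    "modern apartment in the city centre | reviews: close to nightlife | saved to: city breaks",
]
labels = ["Lakefront", "City"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(listings, labels)

print(model.predict(["house with a dock on the lake and stunning lakefront sunsets"]))
```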

5. Expert in the Loop Generative AI at StitchFix

The Multithreaded Blog from StitchFix has a great example of how generative AI is making its way into the industry.

StitchFix is an online fashion retailer, and they're currently using generative AI to craft headlines and copy (marketing text) for their advertisements, as well as descriptions for their products.

Instead of writing headlines and ads from scratch, their copywriters use AI to generate multiple headline and ad options, then edit them to make sure they suit the product.
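
The post doesn't share code, but the general pattern looks something like this sketch (the model choice and prompt are my assumptions, using OpenAI's 2023-era chat completions API):

```python
import openai  # assumes the 2023-era openai library with OPENAI_API_KEY set

# Sketch of the pattern: generate several candidates, let a human expert pick and edit.
# The model and prompt are assumptions, not StitchFix's actual setup.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a short ad headline for a linen summer dress."}],
    n=5,  # several candidate headlines rather than one "final" answer
)

# A copywriter then reviews, edits and approves a candidate before it ships.
for i, choice in enumerate(response.choices, 1):
    print(f"{i}. {choice.message.content}")
```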

6. Two big new open-source vision models

The outstanding LAION AI have released an open-source implementation of DeepMind's Flamingo.

OpenFlamingo is a framework that enables training and evaluation of large multimodal models (LMMs).

Given an input image and input text, OpenFlamingo is able to produce an output completion.

Example of OpenFlamingo's input and output capabilities. Given a text and image input example, the model can then complete the text output given a new image. Source: LAION AI blog.

The authors of the model state that it's still a work in progress (expect updates soon) but it's already very capable.
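
If you want to poke around, initialising the model looks roughly like this sketch based on my reading of the project's README (the language model and tokenizer paths are placeholders, and the exact arguments may have changed since release):

```python
# pip install open-flamingo
from open_flamingo import create_model_and_transforms

# Rough sketch based on the project's README; paths are placeholders and
# argument names may differ between releases.
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="path/to/language/model",  # placeholder
    tokenizer_path="path/to/tokenizer",          # placeholder
    cross_attn_every_n_layers=4,
)
```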

And to go along with OpenFlamingo, the Beijing Academy of Artificial Intelligence (BAAI) have also open-sourced EVA-02, the highest performing open-source model on ImageNet with 90.0% top-1 accuracy! All with 700M fewer parameters than their previous model.

This goes to show how much room there still is to benefit from improved pretraining schedules (which is what they did to achieve such incredible results).

They've also open-sourced EVA-CLIP, which produces the highest zero-shot classification (no previous images seen) score on ImageNet at 82.0%!
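
If zero-shot classification is new to you, here's a minimal sketch of the technique using the open_clip library (the model and checkpoint names are generic stand-ins from the open_clip README; the same workflow applies to the EVA-CLIP weights):

```python
import torch
import open_clip
from PIL import Image

# Minimal zero-shot classification sketch. "dog.jpg" is a placeholder image and
# the model/checkpoint names are generic stand-ins, not the EVA-CLIP weights.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("dog.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a dog", "a photo of a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # No dog/cat training images needed, just text prompts to compare against.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # higher probability for the matching prompt
```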

What an outstanding couple of releases for the open-source world of multi-modal models and computer vision!

You can see the LAION AI OpenFlamingo releases at:

And you can see all of the EVA vision releases at:

7. Objaverse — the world's largest open-source dataset of 3D objects!

The largest open-source dataset of labelled 3D objects has dropped!

800k+ annotated 3D objects on everything from cars to food to rooms to shoes.

Bring on the text-to-3D object generation!

8. Techniques to train your PyTorch models (much) faster

One of the great teachers of ML, Sebastian Raschka, has an excellent blog post on how to train your PyTorch models faster.

He shows how you can fine-tune a DistilBERT model with vanilla PyTorch in 21.33 minutes, then speeds it up to an incredible 1.8 minutes (an ~11x speed-up!).
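
His post covers a stack of techniques (things like mixed-precision training and multi-GPU setups). As one self-contained example, here's a minimal sketch of automatic mixed-precision training in plain PyTorch, with a tiny random model and data so the pattern runs on its own (requires a CUDA GPU):

```python
import torch

# Minimal automatic mixed-precision (AMP) training sketch.
# Tiny random model/data stand in for a real setup; requires a CUDA GPU.
device = "cuda"
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    inputs = torch.randn(32, 128, device=device)
    targets = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid float16 gradient underflow
    scaler.step(optimizer)         # unscales gradients, then steps the optimizer
    scaler.update()
```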

PyTorch training speed improvements with various training setups. Source: Sebastian Raschka blog.

9. DataPerf benchmark (like Kaggle but for improving datasets)

As we've all seen over the past 10 years, model architectures have been getting better and better.

However, data is still (as it always has been) the bottleneck for machine learning.

As the EVA vision results showed, with the right pretraining recipe (using various sources of data in different ways), you can improve performance dramatically.

The DataPerf benchmark is the new place for showing the power of data-centric algorithms.

Something we've known all along. Without the right data, how can you possibly build a great model? Source: Google AI blog.

Benchmarks include:

  • Vision training data selection — rather than using all of your vision data, can you select a portion of it to achieve similar or even better results?
  • Speech training data selection — similar to the above, rather than training on hundreds of thousands of speech recordings, could you train on fewer and achieve the same results?
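
As a toy illustration of the data selection idea behind both benchmarks, here's a sketch on synthetic data (the uncertainty-based selection strategy is just one made-up heuristic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy sketch of training data selection: can a well-chosen 20% of the training
# set get close to training on all of it? The selection heuristic is made up.
X, y = make_classification(n_samples=2000, random_state=42)
X_train, y_train, X_test, y_test = X[:1500], y[:1500], X[1500:], y[1500:]

full = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score examples with a proxy model and keep the 20% it is least confident on.
confidence = full.predict_proba(X_train).max(axis=1)
keep = np.argsort(confidence)[: int(0.2 * len(X_train))]
subset = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])

print(f"Full data accuracy:  {full.score(X_test, y_test):.3f}")
print(f"20% subset accuracy: {subset.score(X_test, y_test):.3f}")
```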

10. Latent Space Podcast/Newsletter

One of my favourite new podcasts/newsletters is Latent Space.

It's a blend of software development and AI.

And I love it.

Such as this recent post on You Are Not Too Old (to Pivot Into AI).

It's full of excellent advice and resources too.

Such as this handy Tweet on getting started with MLOps.

See you next month!

What a massive month for the ML world in March 2023!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.
