Machine Learning Monthly 💻🤖 March 2021

Daniel Bourke

15th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.

Hey everyone!

Daniel here, I'm 50% of the instructors behind the Complete Machine Learning and Data Science: Zero to Mastery course and our new TensorFlow for Deep Learning course! I also write regularly about machine learning on my own blog and make videos on the topic on YouTube.

Welcome to the 15th edition of Machine Learning Monthly. A 500ish (+/-1000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

What you missed in March as a Machine Learning Engineer…

My work 👇

Video version of this article is live. Check it out here!

The Zero to Mastery TensorFlow for Deep Learning course has launched!

During the COVID lockdowns of 2020, I decided to brush up my TensorFlow skills by getting TensorFlow Developer Certified. I made a video about how I did it (in a month or so) and afterwards had a bunch of questions from others asking how they could do it too.

The video talked about different resources I used but I wanted to go a step further.

So to answer the question of “I want to learn TensorFlow, where do I start?” I created the Zero to Mastery TensorFlow for Deep Learning course.

It teaches 3 things:

  1. The fundamentals of deep learning (a machine learning paradigm taking the world by storm)
  2. The fundamentals of TensorFlow (a framework used to write deep learning algorithms, e.g. Neural Networks)
  3. How to pass the TensorFlow Developer Certification (this is optional but requires 1 & 2)

The entire course is code-first, which means I use code to explain different concepts and link to external non-code-first resources for those who'd like to learn more.

Be sure to check out:

From the interwebs 🕸

Why machine learning is hard 😤

In traditional software development, you've got inputs to a system and a desired output. How you get to the desired outputs is completely up to you.

So when something doesn't work, you might go back through all of the steps you took (code you wrote) to get from the inputs to outputs.

However, ML changes the order here a little.

Instead of you writing code to get to your desired output, an algorithm figures out which steps to take to get there.

How so?

By leveraging data.

Leveraging data is a broad term, but you can see we've added another variable to the mix. So now you've got something extra to troubleshoot.

  • Do you have enough data?
  • Is it of good quality?
  • Where did it come from?

All of these points could take you down a long road.

Instead of only programming a computer, you end up programming data as well.
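To make the difference concrete, here's a minimal sketch (a toy example of my own, not taken from Zayd's post). The task, the threshold of 100 and the example rows are all hypothetical. The first function is traditional software, where you choose the rule. The second "learns" the rule from labelled data, so the data becomes something you have to debug too.

```python
# Hypothetical toy task: decide whether a transaction amount counts as "large".

def handwritten_rule(amount):
    # Traditional software: the programmer picks the threshold directly.
    return amount > 100


def learn_threshold(examples):
    # Machine learning (heavily simplified): try each observed amount as a
    # threshold and keep the one that makes the fewest mistakes on the labels.
    candidates = sorted(amount for amount, _ in examples)

    def mistakes(threshold):
        return sum((amount > threshold) != label for amount, label in examples)

    return min(candidates, key=mistakes)


# Same code, two datasets. The second has two mislabelled rows (40 and 90).
clean_data = [(10, False), (40, False), (90, False), (120, True), (300, True)]
noisy_data = [(10, False), (40, True), (90, True), (120, True), (300, True)]

clean_threshold = learn_threshold(clean_data)
noisy_threshold = learn_threshold(noisy_data)
print(clean_threshold, noisy_threshold)
```

On the clean labels the learned threshold lands at 90. With two flipped labels it drops to 10, even though not a single line of code changed. That's the extra thing to troubleshoot.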

Zayd Enam's great blog post Why machine learning is hard explores this paradigm in depth with visual examples.

Start with less fancy networks (principles for training neural networks) 🤵

When BERT came out, our machine learning team wanted to drop everything: rip out all our existing NLP machine learning models and replace them with BERT.

"2019 is the year of BERT!"

But when we tried, we didn't get far.

Although BERT performed exceptionally on a wide variety of tasks, it wasn't suited for our production environment.

At the time, none of the hardware we were using could deploy it.

So we found ourselves with a severe case of machine learning engineer's disease: wanting to only ever use the biggest baddest latest model.

HuggingFace's Transformers made using BERT and other well-performing NLP models a breeze.

So when I read Simple considerations for simple people building fancy neural networks by Victor Sanh, it was music to my ears.

Victor discusses some of the principles learned for developing machine learning models in practice, including:

  1. Start by putting machine learning aside (also Google's #1 rule of machine learning)
  2. Continue as if you just started machine learning (start as simple as possible, experiment, experiment, experiment)

And many more...

The extra resources at the end of the post are also a goldmine for training principles.

Airbnb's Multimodal Deep Learning Neural Network (WIDeText) 🕳

What it is: Airbnb have a lot of photos (over 390M as of June 2020). And those photos contain a bunch of information.

But what about all the information around the photos, such as captions, geo-location and quality?

If a photo is worth 1,000 words, surely the information around the photo could make it worth 10,000 words.

Airbnb's WIDeText (Wide, Image, Dense and Text channels) multimodal model seeks to gather as much information from a photo as possible to classify what kind of room the photo is of.

Using multimodal data sources, Airbnb were able to achieve up to 15% better overall accuracy versus using images only.
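If you're wondering what "multimodal" looks like in code, here's a rough sketch. It assumes the simplest possible fusion: turn each channel into a feature vector, concatenate them and feed the result to one linear classifier. The feature values, weights and the two room types are made up for illustration, and the real WIDeText architecture may combine its channels differently.

```python
import math


def softmax(scores):
    # Turn raw scores into probabilities that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channels(wide, image, dense, text):
    # Assumed fusion strategy: plain concatenation of the channel features.
    return wide + image + dense + text


def classify(features, weights, biases):
    # One linear layer followed by softmax over the room types.
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical per-channel features for a single photo.
wide = [1.0, 0.0]    # e.g. a one-hot categorical feature
image = [0.2, 0.7]   # e.g. an image embedding
dense = [0.5]        # e.g. a numeric feature such as photo quality
text = [0.9, 0.1]    # e.g. a caption embedding

ROOM_TYPES = ["bedroom", "kitchen"]
weights = [
    [0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.0],  # made-up weights for "bedroom"
    [0.0, 0.1, 0.1, 0.4, 0.2, 0.0, 0.5],  # made-up weights for "kitchen"
]
biases = [0.0, 0.0]

features = fuse_channels(wide, image, dense, text)
probabilities = classify(features, weights, biases)
print(dict(zip(ROOM_TYPES, probabilities)))
```

In a real system each channel would come from its own sub-network (a CNN for images, a text encoder for captions) and the weights would be learned rather than typed in by hand.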

Why it matters: You may have encountered single datatype problems (e.g. vision only) before but many of the world's problems have multiple datatypes associated with them.

In Airbnb's case, knowing what type of room a photo is associated with helps their customers search for particular kinds of homes and in turn creates a better experience. However, labelling 390M images by hand is not a scalable task.

Enter WIDeText, a great example of multimodal models being used in the wild.


Architecture layout of Airbnb's WIDeText and an example problem of classifying the room type of a photo based on multiple input sources. Source: WIDeText: A Multimodal Deep Learning Framework

Learning from Videos (more multimodal modelling) 📺

What it is: As the world becomes more and more video-first, the challenge of deriving information from videos becomes even more important. Facebook's Learning from Videos project aims to use audio, visual and textual input (all together) to understand the contents of a video.

A live example of this can already be found in Instagram Reels' recommendation system. For example, using the audio embeddings from a series of videos to recommend other videos with similar sounds.
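Here's a small sketch of the "similar sounds" idea. It assumes every video already has an audio embedding (a list of numbers) and that similarity is measured with cosine similarity. That measure is my assumption, not something confirmed about the Reels system, and the video names and embedding values are invented.

```python
import math


def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend(query_id, embeddings, top_k=2):
    # Rank every other video by how close its audio embedding is to the query.
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), video_id)
        for video_id, vector in embeddings.items()
        if video_id != query_id
    ]
    scored.sort(reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


# Hypothetical 3-dimensional audio embeddings (real ones are far larger).
audio_embeddings = {
    "guitar_cover": [0.9, 0.1, 0.0],
    "acoustic_jam": [0.8, 0.2, 0.1],
    "dog_barking": [0.0, 0.1, 0.9],
    "piano_solo": [0.6, 0.6, 0.0],
}

recommendations = recommend("guitar_cover", audio_embeddings)
print(recommendations)  # ['acoustic_jam', 'piano_solo']
```

The barking dog ends up last because its embedding points in a very different direction from the guitar cover.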

Why it matters: Imagine being able to ask a system to go through your videos and find "all the times I played with my dog in the park". Doing so would require multimodal understanding of at least vision and text (the text being the query translated into vision features).

And since access to a camera is becoming more and more available, there's no way all the videos being produced could be labelled by hand. So the plethora of video data being created is a goldmine for self-supervised learning techniques (see more on this below).

Read more on Facebook's approach to Learning from Videos in the blog post.

The dark matter of intelligence (self-supervised learning) 🧠

What it is: When you're a child, you experience millions of visual and auditory sensations (as well as other senses) before you know the names of them. By the time you're 16 you've seen other people drive cars many times so when you go to learn yourself you can pick it up fairly well within 20 or so hours. However, many deep learning models start learning from scratch and require many labelled examples. It's no wonder many top AI researchers are skeptical of this method going forward.

Self-supervised learning aims to mimic how a child might learn from a young age. In machine learning terms, a self-supervised learning algorithm is exposed to many different unlabelled examples and is asked to start forming an internal representation of how those examples relate to each other. In this case, examples could be sentences of text (how do different words relate to each other) or random images gathered from online.
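For the text case, one common way to get a learning signal without any human labels is to hide a word and ask the model to predict it (the trick behind BERT-style models). The sketch below only builds the training pairs. The function name and the `[MASK]` token are illustrative choices, and this isn't necessarily the exact setup discussed in the blog post below.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    # Each word takes a turn being hidden. The hidden word is the "label",
    # so the labels come from the data itself rather than from a human.
    words = sentence.split()
    examples = []
    for index, target in enumerate(words):
        masked = words[:index] + [mask_token] + words[index + 1:]
        examples.append((" ".join(masked), target))
    return examples


examples = make_masked_examples("the cow eats grass")
for masked_sentence, target in examples:
    print(masked_sentence, "->", target)
```

One unlabelled sentence turns into four (input, target) pairs. Scale that up to billions of sentences and you have plenty of training signal with zero labelling effort.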

In a recent blog post, Chief AI Scientist at Facebook, Yann LeCun, discusses why he believes self-supervised learning is the dark matter of intelligence.

Why it matters: Labelling all the data in the world is unfeasible. It's also contrary to how we learn as humans. Show an eight-year-old a photo of a cow and they'd be able to recognise that cow in many different scenarios instantly, whereas a machine learning model would often need to be shown several examples of the cow in different scenarios before it catches on.

Bonus: To see an example of self-supervised learning achieving outstanding results in computer vision, see SEER, a self-supervised model trained on 1 billion random and public Instagram images. Thanks to self-supervised pretraining, SEER achieves performance levels which rival the best supervised learning algorithms after training on far less labelled data.

📄 Paper of the month: ResNets are back baby!

What it is: ResNets first hit the computer vision scene in 2015, taking the title of best performing computer vision model on ImageNet. However, since their debut other model architectures like EfficientNet seem to have replaced them.

But, like always in the deep learning field, things change. And in the case of ResNets, there have been a few upgrades to the way things are done since their release.

By combining recent best-practice upgrades to training methods (learning rate decay), regularization methods (label smoothing) and architecture improvements (squeeze and excitation), ResNets have taken back the crown from EfficientNets.

More specifically, a new branch of ResNets, ResNet-RS (ResNets re-scaled), matches the performance of many EfficientNet models whilst being 1.7x-2.7x faster on TPUs and 2.1x-3.3x faster on GPUs.

Performance of ResNet-RS versus EfficientNets. Source:
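Two of the upgrades mentioned above, label smoothing and learning rate decay, fit in a few lines each. This is a sketch under my own assumptions: a smoothing factor of 0.1, a cosine-shaped decay (one common schedule) and a starting learning rate of 0.1. Check the paper for the settings the authors actually used.

```python
import math


def smooth_labels(one_hot, epsilon=0.1):
    # Label smoothing: take a little confidence away from the true class
    # and spread it evenly over all classes.
    num_classes = len(one_hot)
    return [y * (1.0 - epsilon) + epsilon / num_classes for y in one_hot]


def cosine_decay(initial_lr, step, total_steps):
    # Learning rate decay (cosine flavour, an assumed choice): start at
    # initial_lr and glide smoothly down to 0 by the final step.
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * step / total_steps))


smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0])
schedule = [cosine_decay(0.1, step, 100) for step in (0, 50, 100)]
print(smoothed)
print(schedule)
```

With four classes and a smoothing factor of 0.1, the true class target becomes roughly 0.925 and every other class roughly 0.025. The learning rate starts at 0.1, is about half of that at the midpoint and reaches 0 at the end.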

Why it matters: Sometimes the architecture which works, works well for a reason. ResNets have worked well for years and their underlying mechanisms are quite simple. But recent architectures have introduced layers of complexity to achieve better results (discussed more in the paper). However, it was found that with a few tweaks, many of them simplifications, the original ResNets still perform outstandingly well.

It raises the question: How much effort should be dedicated to improving and iterating on what works versus searching for another option?

  • Read more on the new baseline recommendation for computer vision in the paper:

    • Side note: this is an incredibly well-written paper. I especially love how the authors broke down each of their changes (training, regularization, architecture) and discussed the performance tradeoffs for each.
  • Also see a great summary of the benefits by Kaggle Master Artsiom Sanakoyeu (thank you Ashik for the tip!)

Super-powered speech 🗣

What it is: Two frameworks for dealing with everything speech (speech to text, speaker recognition, text to speech and more).

The first, Coqui (also the name of a beautiful Puerto Rican frog), is a startup and a framework aiming to provide open-source speech tools for everyone. Founded by Josh Meyer (someone I've been following in the speech community for a long time), Coqui provides two libraries, TTS (text to speech) and STT (speech to text). Both are based on deep learning and have been battle-tested in research and production.

The second, SpeechBrain, is a PyTorch-powered speech toolkit which is able to achieve state-of-the-art results in speech recognition, speaker recognition, speech enhancement and more.

Why it matters: A few short years ago, my team and I were tasked with a speech recognition project for a large financial institution. However, speech tools were very limited at the time and when we tried to apply them, we often ran into a plethora of issues whilst achieving sub-par results.

With toolkits like Coqui and SpeechBrain, it looks like speech is now gaining some of the benefits offered by computer vision and natural language processing libraries.

Find details for both on GitHub:

To generalise: speed up online learning test error, slow down offline learning train error ❌

What it is: Google recently released a paper discussing the Deep Bootstrap Framework, a framework to guide design choices in deep learning.

In the ideal world (unlimited data), you should aim to optimize your network as quickly as possible (using convolutions, skip-connections such as those in ResNets, pre-training and more), whereas in the real world (limited data) you should try to slow down optimization (using regularization, data augmentation).

Why it matters: Using the Deep Bootstrap Framework, researchers can guide their design choices for generalization (does your model work on unseen data) through the lens of optimization.

Does your change increase the speed of online test error optimization (faster is better)?

And does your change decrease the speed of offline train error optimization (slower is better)?

The ideal change answers yes to both of these.
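As I understand the paper, the two questions above come from splitting the real-world test error into two parts. The notation below is my own shorthand: f_t is the model after t training steps, "real" means trained by reusing a limited dataset and "ideal" means trained on fresh samples at every step.

```latex
\[
\mathrm{TestError}\!\left(f_t^{\mathrm{real}}\right)
= \underbrace{\mathrm{TestError}\!\left(f_t^{\mathrm{ideal}}\right)}_{\text{online (ideal world) error}}
+ \underbrace{\left[\mathrm{TestError}\!\left(f_t^{\mathrm{real}}\right)
- \mathrm{TestError}\!\left(f_t^{\mathrm{ideal}}\right)\right]}_{\text{gap between the two worlds}}
\]
```

The identity itself is trivial (the ideal-world term is added and subtracted). The useful part is the paper's observation that the gap tends to stay small, which is why a design that learns fast in the ideal world is a good bet in the real world.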

For more:

See you next month!

What a massive month for the ML world in March!

As always, let me know if there's anything you think should be included in a future post. Liked something here? Tell a friend!

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel | YouTube

PS. You can see video versions of these articles on my YouTube channel (usually a few days after the article goes live). Watch previous month's here.

By the way, I'm a full time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a couple of our courses below or see all Zero To Mastery courses by visiting the courses page.
