March 31st, 2021 · 12 min read
15th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Daniel here, I'm 50% of the instructors behind the Complete Machine Learning and Data Science: Zero to Mastery course and our new TensorFlow for Deep Learning course! I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Welcome to the 15th edition of Machine Learning Monthly. A 500ish (+/-1000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Video version of this article is live. Check it out here!
During the COVID lockdowns of 2020, I decided to brush up on my TensorFlow skills by getting TensorFlow Developer Certified. I made a video about how I did it (in a month or so) and afterwards had a bunch of questions from others asking how they could do the same.
The video talked about different resources I used but I wanted to go a step further.
So to answer the question of “I want to learn TensorFlow, where do I start?” I created the Zero to Mastery TensorFlow for Deep Learning course.
It teaches 3 things:
The entire course is code-first, which means I use code to explain different concepts and link external non-code-first resources for those who'd like to learn more.
Be sure to check out:
In traditional software development, you've got inputs to a system and a desired output. How you get to the desired outputs is completely up to you.
So when something doesn't work, you might go back through all of the steps you took (code you wrote) to get from the inputs to outputs.
However, ML changes the order here a little.
Instead of you writing code to get to your desired output, an algorithm figures out which steps to take to get there.
By leveraging data.
Leveraging data is a broad term but you can see we've added another variable to the mix. So now you've got something extra to troubleshoot.
All of these points could take you down a long road.
Instead of only programming a computer, you end up programming data as well.
Zayd Enam's great blog post Why machine learning is hard explores this paradigm in-depth with visual examples.
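To make that contrast concrete, here's a minimal sketch in plain Python (illustrative only, no ML library): a hand-written rule versus the same rule recovered from data by gradient descent.

```python
# A hand-written rule: the programmer specifies every step.
def fahrenheit(celsius):
    return celsius * 9 / 5 + 32

# A learned rule: the algorithm discovers the steps (weights) from data.
data = [(c, fahrenheit(c)) for c in range(-10, 11)]  # (input, desired output) pairs

w, b = 0.0, 0.0          # start knowing nothing
lr = 0.001               # learning rate
for _ in range(20000):   # gradient descent on mean squared error
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # approaches 1.8 and 32.0
```

When the learned rule is wrong, you now debug two things: the code *and* the data it saw.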
When BERT came out, our machine learning team wanted to drop everything: rip out all our existing NLP machine learning models and replace them with BERT.
"2019 is the year of BERT!"
But when we tried, we didn't get far.
Although BERT performed exceptionally on a wide variety of tasks, it wasn't suited for our production environment.
At the time, none of the hardware we were using could deploy it.
So we found ourselves with a severe case of machine learning engineer's disease: wanting to only ever use the biggest baddest latest model.
HuggingFace's Transformers made using BERT and other well-performing NLP models a breeze.
So when I read Simple considerations for simple people building fancy neural networks by Victor Sanh, it was music to my ears.
Victor discusses some of the principles he's learned developing machine learning models in practice, including:
And many more...
The extra resources at the end of the post are also a goldmine for training principles.
What it is: Airbnb have a lot of photos (over 390M as of June 2020). And those photos contain a bunch of information.
But what about all the information around the photos? Such as captions, geo-location and quality.
If a photo is worth 1,000 words, surely the information around the photo could make it worth 10,000 words.
Airbnb's WIDeText (Wide, Image, Dense and Text channels) multimodal model seeks to gather as much information from a photo as possible to classify what kind of room the photo is of.
Using multimodal data sources, Airbnb were able to achieve up to 15% better overall accuracy versus using images only.
Why it matters: You may have encountered single datatype problems (e.g. vision only) before but many of the world's problems have multiple datatypes associated with them.
In Airbnb's case, knowing what type of room a photo is associated with helps their customers search for particular kinds of homes and in turn creates a better experience. However, labelling 390M images by hand is not a scalable task.
Enter, WIDeText, a great example of multimodal models being used in the wild.
Architecture layout of Airbnb's WIDeText and an example problem of classifying the room type of a photo based on multiple input sources. Source: WIDeText: A Multimodal Deep Learning Framework
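To give a rough feel for the idea (a toy sketch in plain Python, not Airbnb's actual code), a multimodal model embeds each channel separately and then fuses the embeddings into one vector for the classifier:

```python
# Toy stand-ins for the learned channel encoders.
def embed_image(pixels):          # stand-in for a CNN image channel
    return [sum(pixels) / len(pixels)]

def embed_text(caption):          # stand-in for a text channel
    return [len(caption.split()), caption.lower().count("bed")]

def embed_dense(features):        # dense channel: already numeric
    return list(features)

def fuse(pixels, caption, features):
    # Concatenate all channel embeddings into one feature vector,
    # which would then feed the final classification layer.
    return embed_image(pixels) + embed_text(caption) + embed_dense(features)

vector = fuse([0.2, 0.4, 0.6], "Cozy bedroom with queen bed", [51.5, -0.1])
print(vector)  # one fused feature vector carrying image, text and dense signals
```

The classifier sees all channels at once, which is where the extra accuracy over an image-only model comes from.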
What it is: As the world becomes more and more video-first, the challenge of deriving information from videos becomes even more important. Facebook's Learning from Videos project aims to use audio, visual and textual inputs (all together) to understand the contents of a video.
A live example of this can already be found in Instagram Reels' recommendation system. For example, using the audio embeddings from a series of videos to recommend other videos with similar sounds.
Why it matters: Imagine being able to ask a system to go through your videos and find "all the times I played with my dog in the park". Doing so would require multimodal understanding of at least vision and text (the text being the query translated into vision features).
And since cameras are becoming more and more accessible, there's no way all the videos being produced could be labelled by hand. So the plethora of video data being created is a gold-mine for self-supervised learning techniques (see more on this below).
Read more on Facebook's approach to Learning from Videos in the blog post.
What it is: When you're a child, you experience millions of visual and auditory sensations (as well as other senses) before you know the names of them. By the time you're 16 you've seen other people drive cars many times so when you go to learn yourself you can pick it up fairly well within 20 or so hours. However, many deep learning models start learning from scratch and require many labelled examples. It's no wonder many top AI researchers are skeptical of this method going forward.
Self-supervised learning aims to mimic how a child might learn from a young age. In machine learning terms, a self-supervised learning algorithm is exposed to many different unlabelled examples and is asked to start forming an internal representation of how those examples relate to each other. In this case, examples could be sentences of text (how do different words relate to each other) or random images gathered from online.
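A toy sketch of the principle (plain Python, nothing like Facebook's actual models): the supervision is manufactured from raw, unlabelled text by masking a word and predicting it from its neighbours.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()

# Self-supervision: the training signal comes free from the data itself.
# Every adjacent word pair is a "label" no human had to write.
after = defaultdict(Counter)   # after[w][y]: y directly follows w
before = defaultdict(Counter)  # before[w][y]: y directly precedes w
for x, y in zip(corpus, corpus[1:]):
    after[x][y] += 1
    before[y][x] += 1

def predict_masked(left, right):
    # Score candidates by how often they fit between the two neighbours.
    scores = Counter()
    for cand, n in after[left].items():
        scores[cand] += n
    for cand, n in before[right].items():
        scores[cand] += n
    return scores.most_common(1)[0][0]

print(predict_masked("the", "sat"))  # a plausible filler ("cat"/"dog"), learned with zero labels
```

Models like BERT do essentially this at vastly greater scale, with learned representations instead of raw counts.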
In a recent blog post, Chief AI Scientist at Facebook, Yann LeCun, discusses why he believes self-supervised learning is the dark matter of intelligence.
Why it matters: Labelling all the data in the world is unfeasible. It's also counterintuitive to how we learn as humans. Show an eight-year-old a photo of a cow and they'd be able to recognise that cow in many different scenarios instantly, whereas a machine learning model would often need to be shown several examples of the cow in different scenarios before it catches on.
Bonus: To see an example of self-supervised learning achieving outstanding results in computer vision, see SEER, a self-supervised model trained on 1 billion random and public Instagram images. Thanks to self-supervised pretraining, SEER achieves performance levels which rival the best supervised learning algorithms after training on far less labelled data.
What it is: ResNets first hit the computer vision scene in 2015, taking the title of best performing computer vision model on ImageNet. However, since their debut other model architectures like EfficientNet seem to have replaced them.
But, like always in the deep learning field, things change. And in the case of ResNets, there have been a few upgrades to the way things are done since their release.
By combining recent best-practice upgrades to training methods (learning rate decay), regularization methods (label smoothing) and architecture improvements (squeeze and excitation), ResNets have taken back the crown from EfficientNets.
More specifically, a new branch of ResNets, ResNet-RS (ResNets re-scaled), matches the performance of many EfficientNet models whilst being 1.7-2.7x faster on TPUs and 2.1x-3.3x faster on GPUs.
Performance of ResNet-RS versus EfficientNets. Source: https://arxiv.org/pdf/2103.07579.pdf
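One of those regularization upgrades, label smoothing, is simple enough to sketch in a few lines of plain Python (an illustration, not the paper's implementation):

```python
import math

def smooth_labels(one_hot, eps=0.1):
    # Replace hard 0/1 targets with slightly softened ones:
    # the true class gets 1 - eps, the rest share eps equally.
    k = len(one_hot)
    return [y * (1 - eps) + eps / k for y in one_hot]

def cross_entropy(targets, probs):
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

hard = [0.0, 1.0, 0.0, 0.0]
soft = smooth_labels(hard)        # ~[0.025, 0.925, 0.025, 0.025]
probs = [0.01, 0.97, 0.01, 0.01]  # an over-confident prediction

# Smoothed targets penalise over-confidence, nudging the model to keep
# a little probability mass on the other classes during training.
print(cross_entropy(hard, probs) < cross_entropy(soft, probs))  # True
```

In TensorFlow or PyTorch this is usually a single argument on the loss function rather than hand-rolled code like the above.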
Why it matters: Sometimes the architecture which works, works well for a reason. ResNets have worked well for years and their underlying mechanisms are quite simple. But recent architectures have introduced layers of complexity to achieve better results (discussed more in the paper). However, it turns out that with a few tweaks, many of them simplifications, the original ResNets still perform outstandingly well.
It raises the question: How much effort should be dedicated to improving and iterating on what works versus searching for another option?
Read more on the new baseline recommendation for computer vision in the paper: https://arxiv.org/pdf/2103.07579.pdf
What it is: Two frameworks for dealing with everything speech (speech-to-text, speaker recognition, text-to-speech and more).
The first, Coqui (also the name of a beautiful Puerto Rican frog), is a startup and a framework aiming to provide open-source speech tools for everyone. Founded by Josh Meyer (someone I've been following in the speech community for a long time), Coqui provides two libraries, TTS (text-to-speech) and STT (speech-to-text), both based on deep learning and battle-tested in research and production.
The second, SpeechBrain, is a PyTorch-powered speech toolkit able to achieve state-of-the-art results in speech recognition, speaker recognition, speech enhancement and more.
Why it matters: A few short years ago, my team and I were tasked with a speech recognition project for a large financial institution. However, speech tools were very limited at the time and when we tried to apply them, we often ran into a plethora of issues whilst achieving sub-par results.
Now with toolkits like Coqui and SpeechBrain, it looks like speech is gaining some of the benefits already offered by computer vision and natural language processing libraries.
Find details for both on GitHub:
What it is: Google recently released a paper discussing the Deep Bootstrap Framework, a framework to guide design choices in deep learning.
In the ideal world (unlimited data), you should aim to optimize your network as quickly as possible (using convolutions, skip-connections such as those in ResNets, pre-training and more), whereas in the real world (limited data) you should try to slow down optimization (using regularization and data augmentation).
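As a tiny illustration of the "slow down optimization" toolbox (a toy sketch in plain Python, not the paper's code), data augmentation shows the model a slightly different version of each example every time:

```python
import random

def augment(image, rng):
    # image: 2D list of pixel values. Two classic augmentations:
    # a random horizontal flip and a little pixel noise. Each epoch
    # the model sees a slightly different version of the same photo,
    # which slows memorisation (training error falls more slowly)
    # in exchange for better generalisation.
    out = [row[::-1] for row in image] if rng.random() < 0.5 else [row[:] for row in image]
    return [[px + rng.gauss(0, 0.01) for px in row] for row in out]

rng = random.Random(42)
image = [[0.1, 0.2], [0.3, 0.4]]
print(augment(image, rng))  # a perturbed copy; the original is untouched
```

Regularization tricks like dropout and weight decay slow down train-error optimization in a similar spirit.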
Why it matters: Using the Deep Bootstrap Framework, researchers can guide their design choices for generalization (does your model work on unseen data) through the lens of optimization.
Does your change increase the speed of online test error optimization (faster is better)?
And does your change decrease the speed of offline train error optimization (slower is better)?
The ideal change answers yes to both of these.
What a massive month for the ML world in March!
As always, let me know if there's anything you think should be included in a future post. Liked something here? Tell a friend!
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
PS. You can see video versions of these articles on my YouTube channel (usually a few days after the article goes live). Watch previous months' videos here.
By the way, I'm a full time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a couple of our courses below or see all Zero To Mastery courses by visiting the courses page.