23rd issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Daniel here, I'm 50% of the instructors behind Zero To Mastery's Machine Learning and Data Science Bootcamp course and our new TensorFlow for Deep Learning course! I also write regularly about machine learning and on my own blog as well as make videos on the topic on YouTube.
Welcome to this edition of Machine Learning Monthly. A 500ish (+/-1000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
I put Apple's new hardware to the test on the machine learning front, running a bunch of machine learning code to see how well it performed. I made a video and wrote an article with the results. Turns out, Apple's new hardware is faster than Google Colab (the free version).
I recently hit 100,000 subscribers on YouTube (thank you thank you thank you) and to celebrate, I hosted a 10-hour machine learning coding livestream to build a full-stack machine learning app from scratch. You can watch the whole stream on YouTube.
I've started working on a full-stack machine learning project to take a photo of food and learn about it. This combines my interests of health and nutrition and machine learning. My goal is to build a data flywheel (collect data, model, improve model, collect data, repeat). Stay tuned for video updates on my YouTube.
Using a series of newer training techniques, such as TrivialAugment, longer training, learning rate decay and varying training and inference image sizes, PyTorch added 4.5% accuracy to their baseline ResNet models on ImageNet.
The blog post breaks down each additional update and how it contributed to adding more predictive power to the models.
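One of the recipe's ingredients, learning rate decay, can be sketched in a few lines. Here's a cosine schedule as an illustration (the exact schedule and hyperparameters PyTorch used are detailed in their post):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.5):
    """Cosine learning rate decay from base_lr at step 0 down to 0 at the end."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

# Learning rate starts at base_lr and smoothly decays to 0 over training
cosine_lr(0, 100)    # 0.5 (start of training)
cosine_lr(50, 100)   # ~0.25 (halfway)
cosine_lr(100, 100)  # ~0.0 (end of training)
```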
How much different training techniques add to the updated PyTorch models in torchvision. Source: PyTorch blog.
Using a few lines of code you can improve the training and inference speed of your existing Scikit-Learn models.
By adding the following to your scripts:
```python
# Add in the Intel extension for Scikit-Learn
from sklearnex import patch_sklearn
patch_sklearn()

# Normal Scikit-Learn code (note the module is sklearn.ensemble, singular)
from sklearn.ensemble import RandomForestClassifier

# All Scikit-Learn code from here on will be sped up...
```
There have been extensive tests done on various models in Scikit-Learn and it looks like there are improvements all throughout.
TensorFlow 2.7 came out with an improved debugging experience (fewer but more helpful error messages get printed out so you can fix your code sooner).
There's also a bunch of new models available on TensorFlow Hub including MLP-Mixer, Vision Transformers, Wav2Vec2, RoBERTa, ConvMixer, DistilBERT, YOLOv5 and more.
One of the more breaking changes comes in the first point of the release notes:
One of the larger breaking changes when upgrading to TensorFlow 2.7.0. Source: TensorFlow release notes.
This means if you don't explicitly define the input shape to your models, you might get an error about incorrect shapes. This is because the `model.fit()` function will no longer transform inputs from shape `(batch_size,)` to `(batch_size, 1)`.
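If you hit this, one fix is to expand the last dimension of your inputs yourself before passing them to the model. A minimal NumPy sketch (in TensorFlow, `tf.expand_dims` works the same way):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])            # shape (batch_size,) -> here (3,)
x_expanded = np.expand_dims(x, axis=-1)  # shape (batch_size, 1) -> here (3, 1)
print(x_expanded.shape)  # (3, 1)
```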
For students of the Zero To Mastery TensorFlow course, this was seen in notebook 01 but has since been fixed. You can see the discussion and code example on the course GitHub.
New research from the Google AI team compared the performance of model ensembles and cascades and found groups of smaller models can outperform larger models in speed and accuracy.
A model ensemble is when the predictions of multiple models are combined to make decisions. For example you could average the predictions of two models to potentially perform better than one.
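As a minimal sketch (plain Python, with hypothetical models standing in for real ones), averaging two models' prediction probabilities looks like:

```python
def ensemble_predict(x, models):
    """Average the prediction probabilities of several models."""
    preds = [model(x) for model in models]
    num_models = len(preds)
    num_classes = len(preds[0])
    return [sum(p[i] for p in preds) / num_models for i in range(num_classes)]

# Two hypothetical models that output class probabilities
model_a = lambda x: [0.6, 0.4]
model_b = lambda x: [0.8, 0.2]
ensemble_predict("some_sample", [model_a, model_b])  # ~[0.7, 0.3]
```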
A model cascade is when a sample goes through one model and the prediction is given if the model is confident enough (e.g. the prediction probability is above a value such as 0.8) and if it's not, it continues to another model. For example, an EfficientNetB0 model (smaller) could be the first model and it will make adequate predictions for easier samples, however, harder samples may be passed to an EfficientNetB1 model (larger).
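A minimal cascade sketch (plain Python, hypothetical models, and a 0.8 confidence threshold as in the example above):

```python
def cascade_predict(x, models, threshold=0.8):
    """Try models in order; stop as soon as one is confident enough."""
    for model in models[:-1]:
        probs = model(x)
        if max(probs) >= threshold:
            return probs          # confident enough, stop early
    return models[-1](x)          # fall back to the last (largest) model

# Hypothetical small and large models
small_model = lambda x: [0.9, 0.1]   # confident, so the cascade stops here
large_model = lambda x: [0.55, 0.45]
cascade_predict("easy_sample", [small_model, large_model])  # returns [0.9, 0.1]
```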
Model ensemble versus model cascade. Source: Model Ensembles Are Faster Than You Think.
The research showed that ensembles are the most cost-effective in the large computation regime (e.g. two EfficientNetB5 models are better than one EfficientNetB7 model in terms of compute and performance).
And cascades outperform single models in all (small and large) computation regimes.
For example, a cascade of EfficientNetB2 + B4 + B4 + B4 outperforms a single B5 model because the vast majority of easier samples (67.6%) are taken care of using EfficientNetB2 (saving most of the compute time) and the rest of the samples are passed down the cascade, decreasing with every passthrough.
Results of different combinations of ensembles and cascades in terms of accuracy per FLOP. FLOPs stands for floating point operations and is a measure of compute cost: more FLOPs required = more computing required = more time. Source: Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models.
What's the simplest strategy for augmenting image data?
That's what the authors of the TrivialAugment paper asked themselves.
And do you know what they came up with?
Don't even think of a strategy. Just let it be random.
Seriously.
Take a list of 14 different types of image augmentations (rotations, flips, shifts, colour changes) and then a set of different intensities to apply those at (0 to 30 with 0 being none and 30 being full) and that's it.
In pseudocode it looks like:
```
procedure TrivialAugment(x: image)
    Sample an augmentation a from list_of_augmentations
    Sample a strength m from {0, ..., 30}
    return a(x, m)
end procedure
```
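In runnable form, the procedure is just two random choices. A sketch with placeholder augmentations (the real method samples from 14 image ops, and torchvision now ships a `TrivialAugmentWide` transform):

```python
import random

# Placeholder augmentations; the real list has 14 ops
# (rotations, flips, shifts, colour changes, etc.)
def rotate(image, magnitude):
    return ("rotate", image, magnitude)

def flip(image, magnitude):
    return ("flip", image, magnitude)

AUGMENTATIONS = [rotate, flip]

def trivial_augment(image):
    a = random.choice(AUGMENTATIONS)  # sample one augmentation at random
    m = random.randint(0, 30)         # sample a strength from 0..30
    return a(image, m)
```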
TrivialAugment outperforms or equals almost all other automatic augmentation strategies with far less (basically zero) overhead.
Paper: https://arxiv.org/abs/2103.10158
Code: https://github.com/automl/trivialaugment
Kaggle conducted a survey of 25,000 data scientists and ML engineers to figure out what's happening in the world of data science and ML.
The survey includes information on everything from education to employment to technologies in use.
Some of my favourite trends include:
Lifelong learning. Many data scientists and machine learning engineers continually study to upgrade their skills. Zero To Mastery offers a fantastic foundation but as I always say, it'll be up to you to keep learning as you progress in the field.
Ongoing learning slide from Kaggle State of Data Science and Machine Learning 2021 survey.
TensorFlow and Scikit-Learn are the most popular machine learning frameworks. And now with Intel's Scikit-Learn boost, you know if you learn Scikit-Learn not only will you be using one of the most common machine learning frameworks, it'll run fast too. And although used less in total, PyTorch has continued to grow year-over-year.
Machine Learning Frameworks slide from Kaggle State of Data Science and Machine Learning 2021 survey.
You may have seen the contrastive loss function being used in recent self-supervised learning papers and previous issues of machine learning monthly. I've been seeing it everywhere (such as in the below resource).
Brian Williams breaks it down in a wonderfully written blog post.
Contrastive loss takes the output of the network for a positive example and calculates its distance to an example of the same class and contrasts that with the distance to negative examples. Said another way, the loss is low if positive samples are encoded to similar (closer) representations and negative examples are encoded to different (farther) representations.
For example, say you're training a neural network to classify handwritten digits from 0 to 9. You want your model to push the representations of the number 0 away from the numbers 1-9. And similarly for the other digits.
The blog post goes through and describes the loss function using intuition and simple code examples.
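A minimal sketch of a pairwise contrastive loss (my own simplification with a hypothetical margin of 1.0; see the blog post for the full treatment):

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Pull same-class embeddings together, push different-class apart."""
    d = euclidean_distance(emb_a, emb_b)
    if same_class:
        return d ** 2                  # low loss when positives are close
    return max(0.0, margin - d) ** 2   # low loss when negatives are far

contrastive_loss([0.0, 0.0], [0.1, 0.0], same_class=True)   # small: positives are close
contrastive_loss([0.0, 0.0], [2.0, 0.0], same_class=False)  # 0.0: negatives are far apart
```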
Turing Bletchley is a 2.5 billion parameter model that can match images with language in 94 different languages.
It does so by using transformer-based image and text encoders to encode (turn into numbers) images and image captions from the web. The model was then trained using contrastive loss on image and text encodings to understand the relationship between images and language.
Model architecture of Turing Bletchley. The encoders are comprised of transformer-based models like BERT large. Source: Microsoft Turing Bletchley blog post.
To enable multiple different languages, another model was trained using contrastive loss on text captions in English and various other languages.
This is important because the vast majority of current models deal only with English captions. However, for many people, English isn't their main language.
Microsoft also released a Turing Bletchley demo page where you can see what kind of queries (text and image) can be used to retrieve different images.
Example of Turing Bletchley understanding the query "different types of vintage cars" to return images containing schematics of various car types. Source: Microsoft Turing Bletchley demo page.
Data drift is when the data your model makes predictions on differs from the data it was trained on.
As in, your model might be trained on images from 2010 but the images it sees in 2021 are far different. That's an extreme example but it describes what can happen over time.
Models often get trained on a snapshot of data but the world is constantly changing.
One way to adapt to such changes is to label new incoming data in the same fashion as older data.
And then retrain a model on the older and new data.
It's a never ending cycle. New data, new labels, new models.
Label Studio released a blog post describing the different types of data drift, covariate drift (the features change but their relationship to the target variable remains) and concept drift (the features are no longer predictive of the target variable), and how to handle them with adequate data labelling.
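As a crude illustration of spotting covariate drift, here's a hypothetical mean-shift check in plain Python (production tools use proper statistical tests, not a single mean comparison):

```python
def mean(values):
    return sum(values) / len(values)

def mean_shift(train_feature, prod_feature):
    """Relative shift in a feature's mean between training and production data."""
    train_mean = mean(train_feature)
    return abs(mean(prod_feature) - train_mean) / (abs(train_mean) + 1e-8)

train_ages = [25, 30, 35, 40]  # hypothetical feature values at training time
prod_ages = [45, 50, 55, 60]   # the same feature observed in production
mean_shift(train_ages, prod_ages)  # large shift -> worth investigating for drift
```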
Julien Simon from HuggingFace talks about how machine learning models can now be injected into applications within a few lines of code.
The article is full of quotes I love.
“Every data scientist and machine learning engineer should obsess about getting their models in production, as quickly and as often as possible.” An okay production model beats a great sandbox model every time.
Once your model is in production, you can start to test it on real-time data.
Become an ML-savvy Software and DevOps engineer rather than a data scientist.
This is continuing on from the point above. Yes, data science is valuable but without something working in production, there might not be any data to do science on.
And finally:
Instead of “Let’s build and train our own deep learning model from scratch” how about ”Let’s pick a proven off the shelf model, fine-tune it on our own data, and be home early for dinner.”
I love it. I'm all for the latter. Get something working early and customize more when needed.
What a massive month for the ML world in November.
As always, let me know if there's anything you think should be included in a future post.
Liked something here? Tell a friend!
In the meantime, keep learning, keep creating, keep dancing.
See you next month, Daniel
P.S. Newsletter video update: I've decided to spend time working on different kinds of YouTube videos so the text-based issue of machine learning monthly (what you're reading) will remain but the video versions have stopped. Plenty more to come.
By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a couple of our courses below or see all Zero To Mastery courses by visiting the courses page.