27th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Daniel here, I'm 50% of the instructors behind Zero To Mastery's Machine Learning and Data Science Bootcamp course and our new TensorFlow for Deep Learning course! I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Welcome to this edition of Machine Learning Monthly. A 500ish (+/-1000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
MLOps (machine learning operations) is the process of going from data to predictive models to intelligence.
The following image showcases all of the tools in the MLOps space:
Ummm...
What?
I've never heard of 80% of these and I spend all day writing ML code and dancing around the ML space.
Clearly there's a lot going on.
Mihail Eric's piece argues that this is to be expected. Since MLOps is still a new field, of course, things are going to be all over the place.
And that's kind of what you'd want to begin with.
Different people trying different things and then seeing what turns out the best.
My advice for getting started in the world of MLOps?
Keep it simple. Build something end-to-end (deploy the models you build in notebooks) and see what the whole process is like first-hand.
That's what I'm doing with Nutrify.
Although there are plenty of shiny tools I'd like to try, I'm finding I can do most of what I need with well-known tools like TensorFlow and plain vanilla JavaScript.
You may have seen the term feature store floating around the ML space lately.
A feature store holds precomputed values (features) that a machine learning model needs as inputs but that the person (or service) calling the model might not have on hand at prediction time.
For example, let's say you're Uber and at the start of each day you compute the demand forecast as well as how many drivers you had on the road yesterday.
This sounds trivial but rather than compute it every single time you want to make a prediction, you might store this value and query it instead (querying is often faster than computing).
So when someone requests an Uber, instead of recomputing the already-calculated value, it's pulled from the feature store and incorporated into a model that predicts ETA.
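To make the Uber example concrete, here's a minimal, hypothetical sketch in Python of the precompute-then-look-up pattern. The dictionary below stands in for a real low-latency feature store, and the feature names and "model" are made up for illustration.

# Hypothetical sketch: compute expensive features once in a batch job,
# then look them up at prediction time instead of recomputing them.
feature_store = {}  # stand-in for a real low-latency key-value store

def compute_daily_features(city):
    # Imagine this aggregates yesterday's trip data - slow, so it runs once per day
    return {"demand_forecast": 1.23, "drivers_yesterday": 4567}

def refresh_features(city):
    feature_store[city] = compute_daily_features(city)  # batch job each morning

def predict_eta_minutes(city, distance_km):
    stored = feature_store[city]  # fast lookup instead of an expensive recomputation
    # Combine precomputed + request-time features in a made-up "model"
    return 2.0 * distance_km + 0.5 * stored["demand_forecast"] + 0.0001 * stored["drivers_yesterday"]

refresh_features("sydney")
print(predict_eta_minutes("sydney", distance_km=5.0))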
This is only a high-level example; there are other ways of adding features to a model.
There are pros and cons to each.
And that's what Lak discusses in his post. He finishes it off with a nice decision tree for deciding if you need one or not (generally not; it often adds quite a lot of complexity for what it's worth, though if you have the resources, it can improve latency/performance).
"Do you need to use a feature store?" decision tree. Source: Lak Lakshmanan.
With the upcoming ZTM PyTorch course, I've been paying close attention to everything and anything PyTorch.
With that being said, there have been a fair few updates to PyTorch across the board (I'm making sure to include the most useful of these in the new course).
TorchData and functorch. TorchData contains modular data loading steps for constructing flexible data pipelines (and all ML projects start with the data!), while functorch, inspired by JAX, enables several function transforms that are currently hard to do with pure PyTorch. See the PyTorch blog for more.
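To give a flavour of both, here's a quick sketch based on the early torchdata/functorch releases (module paths may change between versions):

# TorchData: build a data pipeline out of small, composable "datapipe" steps
from torchdata.datapipes.iter import IterableWrapper

pipe = IterableWrapper(range(10)).map(lambda x: x * 2).shuffle().batch(4)
for batch in pipe:
    print(batch)  # order varies because of shuffle()

# functorch: JAX-style composable function transforms on top of PyTorch
import torch
from functorch import grad, vmap

def f(x):
    return (x ** 2).sum()

x = torch.randn(5)
print(grad(f)(x))             # gradient of f with respect to x (equals 2 * x)
print(vmap(torch.square)(x))  # vectorise an element-wise op over the batch dimension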
Combining vision and language data sources has been the theme of ML Monthly for the last few months.
But by the looks of Meta AI's latest round of research, they've started to combine almost everything they can.
From images to videos to 3D depth maps, Omnivore can handle them all. Omnivore combines several vision modalities into one model.
My favourite part is that they used all off-the-shelf datasets to build the model (all of the data is publicly available).
Omnivore can handle multiple different vision modalities and still perform better than models specifically trained for a certain modality. Source: Omnivore: A Single Model for Many Visual Modalities
FLAVA: A Foundational Language and Vision Alignment Model can handle 35 different tasks... getting closer and closer to one model to handle them all. FLAVA also uses a large number of public datasets (referred to as PMD: Public Multimodal Datasets in the paper). The FLAVA architecture combines a text encoder, image encoder and multi-modal encoder (image and text) to learn as much as possible from each data source.
CM3: A Causal Masked Multimodal Model of the Internet uses nearly a terabyte of pure HTML code to create the first hyper-text language and image model. Because of the scale, the model is able to generate some gnarly images given a text prompt.
Images generated based on text-prompts given to CM3. Source: CM3 paper.
But thatβs not all, CM3 is capable of filling in masked portions of images, masked portions of text and even doing the reverse of the image above, generating captions when given an image.
We've talked about Google's Health AI efforts in previous issues of ML Monthly but they've recently released a whole bunch of research (and shipped products).
I downloaded the Google Fit app and tried out the respiratory and heart rate trackers. I had mixed results depending on what kind of lighting I was in. A cool project would be to replicate this.
Everyone wants to make their models train faster.
And one way to do so is to use a GPU.
But let's say you've got a GPU and you've seen a good speedup, how do you push things further?
Or how do you figure out what's preventing your model from training faster?
Horace He works on the PyTorch team and shares his learnings on how to make your deep learning models go brrrrr using three first principles: compute, memory bandwidth, and overhead.
He uses the analogy of a factory to explain things: memory is the supplies warehouse, the factory running is the compute, and shipping supplies back and forth between the two is the memory bandwidth cost (everything else is overhead).
Computing as a factory. Memory stores all the supplies, moving them back and forth costs bandwidth, and the factory (GPU) does all the computing. Source: Making Deep Learning Go Brrrr From First Principles by Horace He.
One of my favourite takeaways from the article was the power of operator fusion.
You want to minimize the time spent transferring data and operator fusion is one of the best ways to do so.
So instead of calling operations one by one, you can chain them together.
# No operator fusion (extra overhead from materialising the intermediate value)
import torch

x = torch.randn(1000)
x1 = torch.cos(x)
x2 = torch.cos(x1)

# Same operation using operator fusion (ops chained into a single expression)
x2 = torch.cos(torch.cos(x))
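In practice you usually wouldn't fuse operations by hand; a compiler can do it for you. Here's one option as a sketch (not the only way), using TorchScript, whose fuser can merge chains of pointwise ops like this into a single GPU kernel:

import torch

@torch.jit.script  # TorchScript's fuser can merge chained pointwise ops into one kernel
def double_cos(x: torch.Tensor) -> torch.Tensor:
    return torch.cos(torch.cos(x))

print(double_cos(torch.randn(1000)))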
Rachel Thomas is the co-founder of fast.ai, one of my favourite AI organizations.
In this piece, she argues that a lot of math education overemphasizes techniques rather than meaning.
And this scares a lot of people off because as soon as you miss a single technique, you feel like you're "not a math person".
When really, learning math is like learning any skill, with time and effort you improve.
I like the concept of teaching/learning the whole game rather than just focusing on a single technique.
When you learn to drive a car, you don't necessarily need to know how an internal combustion engine works.
You learn to drive from place to place.
And that momentum carries you forward if you'd like to learn more.
Head of Tesla AI, Andrej Karpathy, travels back in time to one of the first-ever working neural networks: a 1989 paper by Yann LeCun et al., Backpropagation Applied to Handwritten Zip Code Recognition.
Perhaps surprisingly, Andrej states:
this paper reads remarkably modern today, 33 years later - it lays out a dataset, describes the neural net architecture, loss function, optimization, and reports the experimental classification error rates over training and test sets.
It sounds a lot like some of the papers I read this week.
Karpathy replicated (as best he could) the original training setup and was able to pull off training the whole network in ~90 seconds on an M1 MacBook Air CPU (a ~3,000x speedup over the original paper).
And by applying some modern techniques such as dropout, the Adam optimizer and data augmentation, Andrej was able to reduce the paper's original error rate by 60%.
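For a rough idea of what those modernisations look like in PyTorch (a sketch, not Andrej's actual code; the layer sizes assume 16x16 greyscale digits like the 1989 paper):

import torch
from torch import nn
from torchvision import transforms

# Dropout inside the model: regularisation that didn't exist back in 1989
model = nn.Sequential(
    nn.Conv2d(1, 12, kernel_size=5), nn.ReLU(),
    nn.Flatten(),
    nn.Dropout(p=0.25),
    nn.Linear(12 * 12 * 12, 10),  # 16x16 input -> 12 channels of 12x12 after the conv
)

# Adam optimizer instead of plain SGD
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# Light data augmentation for small digit images
augment = transforms.Compose([
    transforms.RandomAffine(degrees=5, translate=(0.05, 0.05)),
    transforms.ToTensor(),
])

print(model(torch.randn(8, 1, 16, 16)).shape)  # sanity check: torch.Size([8, 10])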
Towards the end, Andrej offers his predictions for the future of neural networks.
The main one being that perhaps the concept of a single network for a single task will be old news in the future (much like where research is heading now by combining modalities).
And it's crazy to think that in another 33 years you might be able to train today's state-of-the-art models on commodity hardware in a few minutes.
You can see the code on GitHub.
I've just discovered the mlxtend library by Sebastian Raschka, author of the popular Machine Learning with PyTorch and Scikit-Learn book.
I don't know what took me so long to find it.
It's full of helpful features you often need in machine learning and data science that aren't quite ready to go in some other libraries.
One of my favourites is plotting a confusion matrix with plot_confusion_matrix().
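For example (a quick sketch with made-up confusion matrix values):

import numpy as np
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_confusion_matrix

# Rows = true labels, columns = predicted labels (made-up numbers)
conf_mat = np.array([[50, 2],
                     [5, 43]])

fig, ax = plot_confusion_matrix(conf_mat=conf_mat, show_normed=True, colorbar=True)
plt.show()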
There's a new SOTA (state of the art) for ImageNet with Model soups achieving 90.94% top-1 accuracy.
We're getting closer and closer to 91%!
Model soups combine the weights of other models to form a "soup".
The usual process is to train a bunch of models (via hyperparameter tuning) and then discard all of them except the best.
But model soups keep all of the extra models and combine their weights, either by uniform averaging or greedily (if adding a model's weights doesn't improve the overall model, that model is discarded).
A model soup saves on inference compute compared to an ensemble because it ends up being only one model (rather than multiple).
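Here's a minimal sketch of the uniform-averaging idea in PyTorch (averaging every floating-point parameter across models that share an architecture); the paper's greedy variant adds a validation check before each model joins the soup:

import copy
import torch

def uniform_soup(models):
    # Average the weights of several trained models with identical architectures
    soup_state = copy.deepcopy(models[0].state_dict())
    for key, value in soup_state.items():
        if value.is_floating_point():
            soup_state[key] = torch.stack(
                [m.state_dict()[key] for m in models]
            ).mean(dim=0)
    soup_model = copy.deepcopy(models[0])
    soup_model.load_state_dict(soup_state)
    return soup_model

# Usage (hypothetical models): soup = uniform_soup([model_a, model_b, model_c])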
It's hard to overstate the progress in machine learning over the past 10 years...
I mean, in their latest NVIDIA GTC (GPU technology conference) 2022 keynote, NVIDIA states they've increased accelerated computing by 1,000,000x over the last 10 years.
Check out machine learning, it's leaving the chart.
Via a combination of specialized hardware (GPUs) and software (CUDA), NVIDIA has accelerated machine learning computing off the charts over the past 10 years. Source: NVIDIA GTC 2022 keynote.
And their latest hardware continues the trend.
The H100 Tensor Core GPU (H is for Hopper, as in, Grace Hopper) offers up to 9x faster training and 30x faster inference over the previous generation (A100).
What???
I mean we knew it would be faster...
But those speedups are crazy!
It's a server-side GPU, so potentially that means new NVIDIA consumer GPUs are on the way too.
There's a bunch more from NVIDIA GTC 2022 as well. I've been watching some of the presentations and workshops, particularly the ones on PyTorch.
It requires a signup, but they're free to watch on the NVIDIA GTC website.
What a massive month for the ML world in March!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month, Daniel
By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a couple of our courses below or check out all Zero To Mastery courses.