24th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Daniel here, I'm 50% of the instructors behind Zero To Mastery's Machine Learning and Data Science Bootcamp course and our new TensorFlow for Deep Learning course! I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Welcome to this edition of Machine Learning Monthly. A 500ish (+/-1000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
OpenAI released a new swag of models capable of generating images from text. For example, using a text prompt such as "a dinner plate with a steak, roast potatoes and mushroom sauce on it", the model generated the following:
Image generated by OpenAI's GLIDE model.
Aside from generating images, there's another side of GLIDE capable of inpainting (filling in a masked region of an image). You select a region of an image and the model fills in the region based on a text prompt.
Example of OpenAI's GLIDE inpainting various images.
The GLIDE model is trained on the same dataset as OpenAI's previous DALL-E. The difference is in how the images are created: GLIDE uses text-conditioned diffusion (a diffusion model slowly adds random noise to a sample and then learns how to reverse the process).
The paper found that diffusion-generated images were rated more favourably than those produced by previous methods.
Try out the sample notebooks on the OpenAI GLIDE GitHub.
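To make the "slowly adds random noise" part concrete, here's a minimal sketch of the forward diffusion process in PyTorch (the variable names and noise schedule are my own illustrative choices, not GLIDE's actual code):

```python
import torch

# A toy forward diffusion process: start with a clean image tensor and
# progressively blend in Gaussian noise. A diffusion model is then trained
# to reverse this process, one step at a time.
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)        # noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # fraction of signal kept

def add_noise(x0, t):
    """Return a noised version of image batch x0 at timestep(s) t."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise, noise

# Example: noise a batch of 4 fake RGB images, each at a random timestep
x0 = torch.rand(4, 3, 64, 64)
t = torch.randint(0, num_steps, (4,))
noisy_images, target_noise = add_noise(x0, t)
```

During training, the model's job is to predict `target_noise` given `noisy_images` (and, in GLIDE's case, a text prompt), which is what lets it turn pure noise back into an image at generation time.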
Aside: Even Nutrify (the ML project I'm currently working on) classifies the generated image from GLIDE as "Beef". Perhaps GLIDE-style models will turn out to be a way to generate synthetic data.
Nutrify.app picking up on the "Beef" contained in the image generated by OpenAI's GLIDE model. And yes, Nutrify looks plain now and currently only works for 78 foods but there's plenty more to come. Stay tuned.
In a study from JAMA (Journal of the American Medical Association) of 15,307 memory clinic participants, machine learning algorithms (such as Random Forest and XGBoost) outperformed previous detection models at predicting 2-year dementia incidence.
The machine learning models needed only 6 variables (out of a total of 256), such as sex, hearing, weight and posture, to achieve an accuracy of at least 90% and an area under the curve of 0.89.
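For a rough sense of what training a small-variable tabular model like this looks like (with made-up stand-in data and feature names, not the study's actual dataset), here's a scikit-learn sketch:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical stand-in data: 6 tabular variables similar in spirit to the
# study's (sex, hearing, weight, posture, ...) and a binary label for 2-year
# dementia incidence. The real study used 15,307 participants.
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "sex": rng.integers(0, 2, n),
    "hearing": rng.integers(0, 3, n),
    "weight_kg": rng.normal(70, 12, n),
    "posture": rng.integers(0, 3, n),
    "age": rng.normal(75, 8, n),
    "memory_score": rng.normal(25, 5, n),
})
y = rng.integers(0, 2, n)  # random labels, purely for illustration

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```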
My father has dementia and I know how helpful early diagnosis can be, so it's incredible to see two of my worlds, machine learning and health, collide like this.
Colin Raffel is a faculty researcher at HuggingFace. And in his recent blog post, he calls for machine learning models to be built like open-source software.
Usually, models are trained by a single entity (often a large company) and then used in its services or made accessible through weight sharing (and used for transfer learning).
However, Colin paints a picture of building a machine learning model the way open-source software is built, with potentially thousands of people around the world contributing to a single model, just as large open-source projects (like TensorFlow or PyTorch) have hundreds of contributors.
For example, a research facility with limited access to compute could train version 1.0 of a model and share it with others, who could update specific parts of the model; those changes could then be verified before being incorporated back into the original model.
I love this idea because it doesnât make sense to always be training large models from scratch.
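To picture the "verify and incorporate changes" step, here's a toy sketch of how a maintainer might merge a contributor's retrained layers back into a shared base model (the `merge_contribution` helper is purely illustrative, not Colin's proposal or any existing library's API):

```python
import torch

def merge_contribution(base_state, contributed_state, layers_to_update):
    """Merge a contributor's updated weights for specific layers into the
    base model's state dict, leaving every other layer untouched."""
    merged = dict(base_state)
    for name in layers_to_update:
        merged[name] = contributed_state[name]
    return merged

# Toy example: a contributor retrains only the final layer of a shared model.
base = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Linear(10, 2))
contributor = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Linear(10, 2))

merged_state = merge_contribution(
    base.state_dict(),
    contributor.state_dict(),
    layers_to_update=["1.weight", "1.bias"],  # only the final layer changed
)
base.load_state_dict(merged_state)  # the "version 1.1" model
```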
One of my favourite machine learning companies joins one of my other favourite machine learning companies.
Mentioned back in the July 2021 edition of Machine Learning Monthly, Gradio is one of the simplest ways to demo your machine learning models.
Using Gradio to create an interactive demo of a food recognition model. Notice the shareable link; these last for 24 hours when you first create the demo and can be used by others. See the example code used to make the demo on Google Colab.
Now Gradio will be incorporated directly into HuggingFace Spaces (a place to host interactive ML demos for free).
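If you haven't used Gradio before, a basic image classification demo takes only a few lines (the `predict` function and its labels below are placeholders for your own model's inference code):

```python
import gradio as gr

# Placeholder prediction function: swap in your own model here.
# For a "label" output, Gradio expects a dict of {class_name: confidence}.
def predict(image):
    return {"pizza": 0.7, "steak": 0.2, "sushi": 0.1}

demo = gr.Interface(
    fn=predict,
    inputs="image",   # an image upload widget
    outputs="label",  # a label + confidence display
    title="Food recognition demo",
)

demo.launch(share=True)  # share=True creates a temporary public link
```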
As a 2022 resolution, all of the models I build this year will be deployed in some way using HuggingFace Spaces.
Erik Bernhardsson is back again with another fantastic article discussing the future of cloud computing.
And since cloud computing is so vital to machine learning (many of the machine learning models I build are trained with cloud resources), I've included the article here.
He forecasts that large cloud vendors (like AWS, GCP and Azure) will continue to provide access to lower levels of compute and storage, instead of providing many different services on top.
And software vendors will build on top of these lower-level pieces of hardware, offering their own custom solutions.
The top row shows what's currently available whereas the bottom row is what Erik predicts might change.
Again, these are predictions but Erik has a fair bit of skin in the game when it comes to building large-scale data services. After all, he did build the first recommendation engine at Spotify.
This is really cool.
I showed this one to my friend so he could show his kids (and himself) and they could watch their drawings come to life.
Researchers from Facebook AI (now Meta AI) developed a method (a series of four machine-learning based steps) to turn 2D humanoid-like drawings into living, dancing animations.
The blog post explains how the model(s) work, but the live demo is where the real fun is.
I tried it out with a drawing of my own, introducing G. Descent:
A drawing of G. Descent, a 2D smiling stick figure.
And with a few steps on the demo page, G. Descent turned into skipping G. Descent:
From 2D drawing to skipping character. There are many more different types of animation you can try such as kickboxing, dancing, waving and hopscotch.
If there's a trend going on in the machine learning world right now, it's the combination of multi-modal data sources (data from more than one source), especially vision data (images) and language (captions, text, labels).
GLIP combines object detection and language awareness by training on a massive dataset of image-text pairs.
The use of language alongside vision helps GLIP achieve 1-shot performance (predicting after using only 1 training image) comparable with a fully-supervised Dynamic Head model.
GLIP trains an object detection model and language grounding model at the same time. Instead of traditional object detection model labels (e.g. one label per box like [dog, cat, mouse]), GLIP reformulates object detection as a grounding task by aligning each region/box to phrases in a text prompt.
Example output of mapping detection regions in an image to a text prompt.
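The grounding idea itself is simple to sketch: score how well each detected region matches each token in the prompt, then assign regions to the best-matching phrases. Here's a toy illustration of that alignment step (not GLIP's actual code):

```python
import torch

# Toy grounding: similarity between region features (from the vision backbone)
# and token embeddings (from the language encoder) via a dot product.
num_regions, num_tokens, dim = 5, 8, 256
region_features = torch.randn(num_regions, dim)
token_embeddings = torch.randn(num_tokens, dim)

# alignment_scores[i, j] = how well region i matches token j
alignment_scores = region_features @ token_embeddings.T

# Each region gets "grounded" to the token it aligns with most strongly
best_token_per_region = alignment_scores.argmax(dim=1)
print(best_token_per_region)
```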
Again in line with this issue's theme of combining text & image data, Microsoft has a new page dedicated to its goal for achieving multimodal intelligence (using multiple sources of data).
Project Florence is Microsoft's new all-encompassing vision framework able to take in image and text data to retrieve images, classify them, detect objects in them, answer visual questions and even detect actions in video.
Microsoft's Florence: A New Foundation for Computer Vision architecture. Source: Microsoft Research Page.
And Project Florence-VL (vision and language) collects all of Microsoft's research in the vision and language space to help power Florence (including GLIP from above).
If you're interested in how the future of combining vision and language looks, be sure to check out Microsoft's Project Florence-VL research page; it's a treasure trove of exciting research.
The dot product is one of the most used operations across many different deep learning architectures.
But why?
I recently discovered a terrific thread on StackExchange explaining (from multiple perspectives) why the dot product gets used so often in neural networks.
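In short, a dot product is both a weighted sum (what a neuron computes before its activation function) and a similarity measure (what attention scores and embedding lookups rely on). A quick illustration:

```python
import numpy as np

# 1. A neuron's pre-activation is a weighted sum of its inputs,
#    which is exactly a dot product between inputs and weights.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
pre_activation = np.dot(inputs, weights)  # 0.5*0.8 + (-1.2)*0.1 + 3.0*(-0.4)

# 2. The dot product also measures how aligned two vectors are, which is the
#    basis of attention scores and embedding similarity.
a = np.array([1.0, 0.0])
b = np.array([0.9, 0.1])
c = np.array([-1.0, 0.0])
print(np.dot(a, b))  # large positive = similar directions
print(np.dot(a, c))  # negative = opposite directions
```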
Tip: Find a topic tag on any of the StackExchange or StackOverflow websites, such as "neural-networks" or "pandas", and filter the questions and answers by most frequent or most votes and you'll see a plethora of problems people often run into.
Example of filtering questions on StackExchange for questions tagged with "neural-networks" for most frequently visited. Source: StackExchange CrossValidated page.
This one astounded me.
The rise of transformers is well known. And one of the main components of the transformer model architecture is the attention mechanism.
However, how exactly the transformer architecture achieves such good results is much debated.
A recent paper from Sea AI Lab argues that it's the building blocks around the attention mechanism that give the transformer such performant capabilities, and that the "token mixer" (usually attention or a spatial MLP) can be swapped out for other options and still get excellent results.
In fact, they substituted the attention mechanism with a non-parametric (no learning) pooling layer (yes, a pooling layer) in a vision transformer and achieved results equal to or better than traditional transformer models with less compute.
They call the general architecture the MetaFormer (a transformer model with a specific token mixer layer) and their version of the MetaFormer the PoolFormer, where the token mixer layer is a pooling layer.
MetaFormer architecture design layout compared to various other forms of the MetaFormer such as the traditional transformer (with attention) and the PoolFormer (with pooling as the token mixer). The results of the different architecture setups can be seen on the right with the PoolFormer achieving the best accuracy for the least compute. Source: PoolFormer: MetaFormer is Actually What You Need for Vision paper.
Not only is the paper fantastically written, but the authors also provide a series of ablation studies at the end comparing architecture changes with different setups (such as swapping the GELU activation for ReLU), and the code is all available on GitHub (I love seeing this!).
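To see how simple the swap is, here's a rough PyTorch sketch of a MetaFormer block with average pooling as the token mixer (my simplified reading of the paper's idea, not the official implementation):

```python
import torch
from torch import nn

class PoolingTokenMixer(nn.Module):
    """A non-parametric token mixer: average pooling instead of attention.
    Subtracting the input keeps only the mixing residual."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)

    def forward(self, x):  # x: [batch, channels, height, width]
        return self.pool(x) - x

class MetaFormerBlock(nn.Module):
    """General MetaFormer layout: norm -> token mixer -> norm -> channel MLP,
    each wrapped in a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)
        self.token_mixer = PoolingTokenMixer()
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * 4, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * 4, dim, kernel_size=1),
        )

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

# Example: pass a batch of feature maps through one block
block = MetaFormerBlock(dim=64)
print(block(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```

Swap `PoolingTokenMixer` for an attention layer and you get (roughly) a standard transformer block, which is the paper's point.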
What a massive month for the ML world in December!
As always, let me know if there's anything you think should be included in a future post.
Liked something here? Tell a friend using those widgets on the left!
In the meantime, keep learning, keep creating, keep dancing.
See you next month, Daniel
By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a couple of our courses below or see all Zero To Mastery courses by visiting the courses page.