29th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Daniel here, I'm 50% of the instructors behind Zero To Mastery's Machine Learning and Data Science Bootcamp course and our new TensorFlow for Deep Learning course! I also write regularly about machine learning on my own blog and make videos on the topic on YouTube.
Welcome to this edition of Machine Learning Monthly. A 500ish (+/-1000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
A fantastic overview of what should be taken into consideration when deploying a machine learning model into production (putting a model in an application/service someone can use).
The main one being: always be iterating.
And one of my favourites: "I used to think that machine learning was about the models; actually, machine learning in production is all about the pipelines."
I'm figuring this out whilst building my own project, Nutrify.
Making a prototype model is quite easy.
But making sure that prototype model improves over time is the grand prize.
Improvement over time comes from making sure your whole pipeline (data collection, data verification, data modelling and model monitoring) fits together.
There's a theme going around the machine learning world right now.
And that's using a language model for everything.
If language models can perform well at the tasks they were originally designed for in natural language processing (NLP), why not try and apply them to other things?
In essence, asking the question: "what isn't a language?"
DeepMind's recent research does this by creating Gato, a single model that can play Atari games, caption images, chat, stack blocks with a real robot arm and much more.
Importantly, it does all of this with the same weights for each problem.
How?
Based on the context of its inputs, it decides what it should output, whether that's text or actions to manipulate a robot arm.
This is a very exciting step in the direction of a real generalist agent, one agent capable of many tasks.
Machine learning is one of the only fields I can think of where the state of the art gets replaced within weeks.
Last month's Machine Learning Monthly covered DALL·E 2, OpenAI's best generative vision model yet.
And now Google's Imagen takes the cake, with an arguably simpler architecture.
Imagen uses a pretrained large language model, T5 (Text-To-Text Transfer Transformer), to encode a text prompt such as "A Golden Retriever dog wearing a blue checkered beret and dotted turtleneck" into an embedding. That embedding conditions a text-to-image diffusion model to generate an image, which super-resolution diffusion models then upscale (there's a tiny code sketch of the text-encoding step after the figure below).
Outline of the Imagen architecture: a large language model (T5-XXL) turns text into an embedding, which then conditions Efficient U-Net diffusion models to generate and upsample an image to 1024x1024 resolution. Source: Imagen homepage.
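To make that first step concrete, here's a minimal sketch of the text-encoding part only, using the open-source T5 encoder from Hugging Face Transformers. This isn't Imagen's code: Imagen uses a frozen T5-XXL encoder, whereas "t5-small" here just keeps the example lightweight.

```python
# Minimal sketch of the "text -> embedding" step an Imagen-style model starts with.
# Imagen itself uses a frozen T5-XXL encoder; "t5-small" is only for a lightweight demo.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

prompt = "A Golden Retriever dog wearing a blue checkered beret and dotted turtleneck"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Shape: (1, sequence_length, hidden_size). These token embeddings are what
    # would condition the text-to-image diffusion model in an Imagen-style setup.
    text_embeddings = encoder(**inputs).last_hidden_state

print(text_embeddings.shape)
```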
The vast pretraining of the text-only language model is vital to the quality of images being created.
With much of the research coming out, it seems the bigger the model and the bigger the dataset, the better the results.
I'm blown away by some of the samples Imagen's able to create.
Various sample images created by Imagen with text prompts. My favourite is on the left. Source: Imagen paper.
Thinking out loud…
If large-scale language and image models use internet-scale data to train and learn.
And then they generate data based on this internet-scale data.
And then this data is published to the internet.
And then the models learn again on updated internet-scale data (with the addition of the data also generated by machine learning models).
What kind of feedback loop does this create?
It's like the machine learning version of Ouroboros (the snake eating its own tail).
Jay Alammar publishes some of the highest quality machine learning education articles on the internet.
And in light of all the recent research coming out on large language models (LLMs), I've been reading his various pieces on the topic.
Overview of semantic search: turn an archive of documents into embeddings (numerical representations), turn the search query into an embedding as well, then return the archive entries whose embeddings are closest to the query embedding. Source: Semantic Search by Jay Alammar.
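To make the diagram concrete, here's a minimal sketch of embedding-based search. The encoder (Universal Sentence Encoder from TensorFlow Hub), the archive sentences and the query are my own choices for illustration, not anything from Jay's article.

```python
# Minimal semantic search sketch: embed an archive, embed a query,
# return the archive entries closest to the query embedding.
import numpy as np
import tensorflow_hub as hub

# Universal Sentence Encoder (one of many possible text encoders)
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

archive = [
    "How to train an image classification model",
    "Best hiking trails near Sydney",
    "A gentle introduction to diffusion models",
]

archive_embeddings = embed(archive).numpy()  # embed the archive once up front
query_embedding = embed(["generative models that create images"]).numpy()

def normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between the query and every archive entry
scores = (normalise(archive_embeddings) @ normalise(query_embedding).T).squeeze()

for i in np.argsort(scores)[::-1]:  # highest similarity first
    print(f"{scores[i]:.3f}  {archive[i]}")
```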
TensorFlow 2.9 was recently released with improved CPU performance thanks to a partnership with Intel along with many other helpful features listed on the full release page on GitHub.
There was also a fantastic blog post and video on how TensorFlow is being used to help control Crown of Thorns Starfish (COTS) populations on the Great Barrier Reef.
COTS occur naturally on the reef, clinging to and feeding on coral.
However, when their numbers get out of control from causes such as excess fertilizer runoff (more nutrients in the water) and overfishing (fewer predators), they can consume coral faster than it can regenerate.
Google partnered with CSIRO (Australia's science agency) to develop object detection models running on underwater hardware to track the COTS population spread.
One of my favourite takeaways was the use of NVIDIA's TensorRT for faster inference times. Using TensorFlow-TensorRT, the team were able to achieve a 4x inference speedup on the small GPU they were using (important, since the GPU had to sit on a small device capable of operating underwater). A rough sketch of what that kind of conversion looks like is below.
Using TensorFlow to build a custom object detection model to find and track Crown of Thorns Starfish (COTS) populations underwater. Source: TensorFlow YouTube Channel.
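For anyone curious, here's a generic sketch of a TensorFlow-TensorRT conversion (not the COTS team's actual pipeline; the SavedModel paths are placeholders, and depending on your TensorFlow version the precision may need to be passed via trt.TrtConversionParams instead).

```python
# Rough sketch of converting a TensorFlow SavedModel with TF-TRT for faster inference.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="cots_detector_saved_model",  # placeholder path to a SavedModel
    precision_mode=trt.TrtPrecisionMode.FP16,            # half precision suits small GPUs
)
converter.convert()                  # build the TensorRT-optimised graph
converter.save("cots_detector_trt")  # load this SavedModel for inference
```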
In another post, On-device Text-to-Image Search with TensorFlow Lite Searcher Library, the TensorFlow Lite team share how you can create a semantic text-to-image search model capable of running on-device.
For example, instead of just searching for images based on metadata, such as "Japan" returning images with a GPS location of Japan, you could search for "trees in Japan" and the text-to-image model will return the images whose embeddings are closest to the embedding of your query, in other words, photos that actually show trees in Japan.
I'm excited to see this kind of functionality added to apps like Apple's Photos or Google Photos to make your own images more searchable, and I'm also thinking about how to add it to Nutrify, except in reverse.
Instead of hand labelling individual images of food, train a contrastive model to learn a joint representation of language and images, then use an image to search for the language embedding most related to it.
In essence, learning the outputs (the language) as well as the inputs (the images). There's a rough sketch of this idea in code below.
Results for searching images with the query "man riding a bike". The most similar image embeddings to the text embedding of the query get returned. Source: TensorFlow blog.
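And to make the reversed (image-to-language) idea concrete, here's a rough sketch using CLIP via Hugging Face Transformers. The candidate food labels and image path are placeholders I've made up, and this isn't Nutrify's code, just the general pattern: embed the image, embed candidate descriptions, keep the closest match.

```python
# Image -> language search: pick the food label whose embedding best matches an image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

food_labels = ["a photo of an avocado", "a photo of bacon", "a photo of a banana"]
image = Image.open("some_food_photo.jpg")  # placeholder image path

inputs = processor(text=food_labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-text similarity scores, one per candidate label
probs = outputs.logits_per_image.softmax(dim=-1)
best = probs.argmax(dim=-1).item()
print(food_labels[best], round(probs[0, best].item(), 3))
```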
Using Mathpix to turn the formula for attention from the paper Attention Is All You Need into editable LaTeX.
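For reference, the formula in question comes out as LaTeX along these lines:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```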
Plenty of interesting research coming out in the vision/language space.
The crossover of language models into vision is getting real.
And the lines between the two different domains are blurring (because models once used for language modelling are now being used for vision).
Language Models Can See: Plugging Visual Controls in Text Generation – What if you could add visual controls to your already existing language models?
For example, input an image and have a language model output text based on that image.
Even better, what if you didn't have to perform any training?
What if you could just use off-the-shelf pretrained language models and contrastive models like CLIP (Contrastive Language-Image Pretraining) to make sure your text outputs match the target image?
MAGIC (iMAge-Guided text generatIon with CLIP) is a method that does just that.
And since it doesn't require any extra training or gradient updates at inference time, it outperforms previous methods with a 27x speedup. See the code and demos on GitHub and the paper on arXiv.
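MAGIC plugs the CLIP guidance directly into the language model's decoding loop. The sketch below is not the MAGIC algorithm itself, just a much cruder illustration of the ingredients it combines: sample a few candidate captions from an off-the-shelf language model, then use CLIP to keep the one that best matches the image (the models, prompt and image path here are my own placeholders).

```python
# Crude image-guided text generation: generate candidates with a language model,
# then rerank them with CLIP so the kept text matches the image.
# This is NOT MAGIC's token-by-token CLIP-guided decoding, only the rough idea.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

lm_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_on_beach.jpg")  # placeholder image path
prompt = "A photo of"

# 1. Sample a handful of candidate captions from the language model
input_ids = lm_tokenizer(prompt, return_tensors="pt").input_ids
candidate_ids = lm.generate(
    input_ids, do_sample=True, top_p=0.9, max_new_tokens=15,
    num_return_sequences=5, pad_token_id=lm_tokenizer.eos_token_id,
)
candidates = [lm_tokenizer.decode(ids, skip_special_tokens=True) for ids in candidate_ids]

# 2. Score every candidate against the image with CLIP and keep the best match
inputs = clip_processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = clip(**inputs).logits_per_image.squeeze(0)
print(candidates[scores.argmax().item()])
```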
Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT) – What do you get when you combine a vision transformer (ViT) and a language model (seeing the trend yet?) and then fine-tune them on object detection datasets?
You get an object detector that can find objects described by almost any text query.
As in, you could pass in a list of potential objects such as "bacon, egg, avocado" and have your object detector detect these items in an image, despite the model never being explicitly trained on these classes.
See the paper on arXiv and the code and example demos on GitHub.
Testing out OWL-ViT with my own images. Left image query: ["cooked chicken", "corn", "barbeque sauce", "sauce", "carrots", "butter"], right image query: ["bacon", "egg", "avocado", "lemon", "car"]. An important point is that the object detection model was never explicitly trained on any of these classes. Source: OWL-ViT demo Colab notebook with my own images.
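If you'd like to try the same kind of open-vocabulary queries in code, here's a rough sketch using the OWL-ViT checkpoints available through Hugging Face Transformers rather than the official demo Colab. The image path and queries are placeholders, and the exact post-processing helper may vary between Transformers versions.

```python
# Zero-shot object detection with OWL-ViT: detect objects described by free-form text.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("breakfast.jpg")  # placeholder image path
text_queries = [["bacon", "egg", "avocado", "lemon", "car"]]  # one list of queries per image

inputs = processor(text=text_queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into boxes/scores/labels in the original image's coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(text_queries[0][label.item()], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```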
One of the founders of HuggingFace, Clement Delangue, went on the Robot Brains podcast to discuss how HuggingFace wants to become the GitHub for machine learning. And with all the incredible work they're doing in the open-source space, I think they're well on their way.
Listen on YouTube, Apple Podcasts, or Spotify.
What a massive month for the ML world in May!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month, Daniel
By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a couple of our courses below or check out all Zero To Mastery courses.