Machine Learning Monthly Newsletter 💻🤖

Daniel Bourke

29th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.

Daniel here, I'm 50% of the instructors behind Zero To Mastery's Machine Learning and Data Science Bootcamp course and our new TensorFlow for Deep Learning course! I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Welcome to this edition of Machine Learning Monthly. A 500ish (+/-1000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

What you missed in May as a Machine Learning Engineer…

My work 👇

From the Internet 🥅

Lessons from deploying deep learning to production by Peter Gao

A fantastic overview of what should be taken into consideration when deploying a machine learning model into production (putting a model in an application/service someone can use).

The main one being: always be iterating.

And one of my favourites, “I used to think that machine learning was about the models, actually machine learning in production is all about the pipelines.”

I’m figuring this out whilst building my own project Nutrify.

Making a prototype model is quite easy.

But making sure that prototype model improves over time is the grand prize.

Improvement over time comes from making sure your pipeline, data collection, data verification, data modelling and model monitoring all fit together.
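Here's a minimal, hypothetical sketch of that loop — every function is a stand-in I've made up for illustration, the point is simply that each stage feeds the next:

```python
# Hypothetical stand-ins for each pipeline stage (not from the article) —
# the value is in the loop, not in any single stage.
def collect_data():        return ["img_001.jpg", "img_002.jpg", "img_003.png"]  # new raw samples
def verify_data(samples):  return [s for s in samples if s.endswith(".jpg")]     # drop bad files
def train_model(samples):  return {"name": "prototype_v2", "trained_on": len(samples)}
def evaluate(model):       return {"accuracy": 0.87}                             # offline metrics
def deploy(model):         print(f"Deploying {model['name']}...")
def monitor(model):        print(f"Monitoring {model['name']} in production...")

clean = verify_data(collect_data())
model = train_model(clean)
if evaluate(model)["accuracy"] > 0.85:  # assumed bar: only ship if it beats the current model
    deploy(model)
monitor(model)  # what you learn here drives the next round of data collection
```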

Gato: A generalist agent by DeepMind

There’s a theme going around the machine learning world right now.

And that’s using a language model for everything.

If language models can perform well at the tasks they were originally designed for in natural language processing (NLP), why not try and apply them to other things?

In essence, asking the question: “what isn’t a language?”

DeepMind’s recent research does this by creating a model that can play Atari games, caption images, chat, stack blocks with a real robot arm and much more.

Importantly, it does all of this with the same weights for each problem.

How?

Based on the context of its inputs, it decides what it should output, whether that's text or actions to manipulate a robot arm.

This is a very exciting step in the direction of a real generalist agent, one agent capable of many tasks.
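To make that idea more concrete, here's a toy sketch of the "everything becomes one token sequence" trick — the real Gato uses learned image/text tokenizers and discretized continuous values, so these helpers are made up purely for illustration:

```python
# Toy, hypothetical tokenizers (not Gato's actual ones) to show how different
# modalities can be flattened into a single sequence for one transformer.
def tokenize_text(text):
    return [f"txt:{word}" for word in text.split()]

def tokenize_actions(joint_values, bins=1024):
    # Discretize continuous robot-arm values (assumed to be in [0, 1]) into integer bins.
    return [f"act:{min(int(v * bins), bins - 1)}" for v in joint_values]

# One flat sequence mixing modalities — the surrounding context tells the model
# whether the next tokens should be text or actions.
sequence = tokenize_text("stack the red block on the blue block") + tokenize_actions([0.12, 0.87, 0.45])
print(sequence)
```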

Creating wild and very real images with Imagen by Google AI

Machine learning is one of the only fields I can think of where the state of the art gets replaced within weeks.

Last month’s machine learning monthly covered DALL·E 2, OpenAI’s best generative vision model yet.

And now Google’s Imagen takes the cake, with an arguably simpler format.

It uses a pretrained large language model, T5 (Text-To-Text Transfer Transformer), to encode a text query such as “A Golden Retriever dog wearing a blue checkered beret and dotted turtleneck” into an embedding, which then conditions a text-to-image diffusion model to generate an image and super-resolution diffusion models to upscale it.

imagen-architecture

Outline of the Imagen architecture. A large language model (T5-XXL) turns text into an embedding, which then conditions Efficient U-Net diffusion models to generate and upsample an image to 1024x1024 resolution. Source: Imagen homepage.
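As a rough mental model of the flow in that figure — with toy stand-in functions, since the real components are large diffusion models — the whole thing is a three-stage cascade:

```python
import numpy as np

# Hypothetical stand-ins for the three stages in the figure above (not real Imagen code).
def t5_xxl_encode(prompt: str) -> np.ndarray:
    return np.zeros(4096)                         # stub text embedding (dimension arbitrary here)

def base_text_to_image(text_emb: np.ndarray) -> np.ndarray:
    return np.zeros((64, 64, 3))                  # 64x64 text-conditioned diffusion sample

def super_resolve(image: np.ndarray, text_emb: np.ndarray, size: int) -> np.ndarray:
    return np.zeros((size, size, 3))              # text-conditioned super-resolution diffusion

text_emb = t5_xxl_encode("A Golden Retriever dog wearing a blue checkered beret")
image = base_text_to_image(text_emb)
image = super_resolve(image, text_emb, 256)       # 64x64 -> 256x256
image = super_resolve(image, text_emb, 1024)      # 256x256 -> 1024x1024 final image
```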

The vast pretraining of the text-only language model is vital to the quality of images being created.

With much of the research coming out, it seems the bigger the model and the bigger the dataset, the better the results.

I’m blown away by some of the samples Imagen’s able to create.

imagen-samples

Various sample images created by Imagen with text prompts. My favourite is on the left. Source: Imagen paper.

Thinking out loud…

If large-scale language and image models use internet-scale data to train and learn.

And then they generate data based on this internet-scale data.

And then this data is published to the internet.

And then the models learn again on updated internet-scale data (with the addition of the data also generated by machine learning models).

What kind of feedback loop does this create?

It’s like the machine learning version of Ouroboros (the snake eating its own tail).

Jay Alammar's fantastic articles on large language models (LLMs)

Jay Alammar publishes some of the highest quality machine learning education articles on the internet.

And in light of all the recent research coming out on LLMs, I’ve been reading his various pieces on the topic.

  • Intro to large language models — What even is an LLM? In short, an LLM is like autocomplete on steroids. By reading vast amounts of text, an LLM learns which word is most likely to appear alongside another. As a simple example, in the sentence “a _____ jumped over the fence”, the word “dog” is far more likely than the word “car” (see the small example after this list).
  • Semantic search with language models — Semantic search matches search queries to results by meaning rather than by exact keywords. For example, instead of just matching the search term “dog” against articles with “dog” in the title, it matches the embedding of the word “dog” to articles with similar semantic embeddings, which could include “pet”, “animal”, “canine”, “labrador” etc.

basic-semantic-search-overview

Overview of semantic search. Start with an archive corpus of different materials and turn them into embeddings (numerical representations); turn the search query into an embedding as well; the results are the archive embeddings closest to the query embedding. Source: Semantic Search by Jay Alammar.

  • The Illustrated RETRO transformer — Many LLMs are based on the transformer architecture. And many of them are huge in size (GPT-3 is 175 billion parameters). But what if you could enhance their performance and make them smaller by leveraging a database of known facts (e.g. questions and answers) alongside semantic search? That's exactly what DeepMind's RETRO (Retrieval-Enhanced TRansfOrmer) does.
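To see the “autocomplete on steroids” idea from the first article in action, here's a tiny example of my own (not from Jay's posts) using the Hugging Face transformers library — a masked language model scores which words plausibly fill the blank:

```python
# Assumes the transformers library is installed; bert-base-uncased is just a small,
# convenient masked language model for demonstrating "which word fits here?".
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill("A [MASK] jumped over the fence."):
    print(f"{prediction['token_str']:>10}  {prediction['score']:.3f}")
# Animal words like "dog" come out far more likely than unrelated words like "car".
```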

Hot off the Press from the TensorFlow Blog

TensorFlow 2.9 was recently released with improved CPU performance thanks to a partnership with Intel along with many other helpful features listed on the full release page on GitHub.

There was also a fantastic blog post and video on how TensorFlow is being used to help control Crown of Thorns Starfish (COTS) populations on the Great Barrier Reef.

COTS occur naturally on the reef, attaching to and feeding on coral.

However, when their numbers get out of control from causes such as excess fertilizer runoff (more nutrients in the water) and overfishing (fewer predators), they can consume coral faster than it can regenerate.

Google partnered with CSIRO (Australia’s science agency) to develop object detection models running on underwater hardware to track the COTS population spread.

One of my favourite takeaways was the use of NVIDIA’s TensorRT for faster inference times. Using TensorFlow-TensorRT, the team were able to achieve a 4x inference speedup on the small GPU they were using (important since the GPU had to be on a small device capable of being underwater).
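As a rough sketch of what that conversion looks like with TensorFlow-TensorRT (the SavedModel path and the FP16 setting here are my assumptions, not details from the post):

```python
# Minimal TF-TRT conversion sketch — assumes TensorFlow 2.x with TensorRT available.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="cots_detector/saved_model",   # hypothetical path to your SavedModel
    precision_mode=trt.TrtPrecisionMode.FP16,            # lower precision often speeds up inference
)
converter.convert()                               # replace supported subgraphs with TensorRT engines
converter.save("cots_detector/saved_model_trt")   # serve this optimized SavedModel for inference
```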

cots-detected

Using TensorFlow to build a custom object detection model to find and track Crown of Thorns Starfish (COTS) populations underwater. Source: TensorFlow YouTube Channel.

In another post, On-device Text-to-Image Search with TensorFlow Lite Searcher Library, the TensorFlow Lite team share how you can create a semantic text-to-image search model capable of running on-device.

For example, instead of just searching for images based on metadata, such as “Japan” returning images with a GPS location of Japan, you could search for “trees in Japan” and the text-to-image model will (sketched below):

  1. Embed the text (turn the text into numbers).
  2. Match the embedded text to a library of embedded images (created from a database of images turned into numbers).
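Here's a rough sketch of those two steps — the embedding function is a random stand-in (the real setup uses a trained dual encoder and an efficient on-device index), so only the mechanics are meaningful:

```python
import numpy as np

# Stand-in embedding function — in practice this is a trained text/image dual encoder.
def embed(item: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    vector = rng.normal(size=512)
    return vector / np.linalg.norm(vector)        # unit-length embedding

# Step 2's "library of embedded images": precompute one embedding per photo.
photo_ids = ["IMG_001 (Tokyo park)", "IMG_002 (beach)", "IMG_003 (forest trail)"]
photo_embeddings = np.stack([embed(photo) for photo in photo_ids])

# Step 1: embed the text query, then rank photos by cosine similarity
# (a dot product, since the embeddings are unit-length).
query_embedding = embed("trees in Japan")
scores = photo_embeddings @ query_embedding
for idx in np.argsort(scores)[::-1]:
    print(f"{photo_ids[idx]}: {scores[idx]:.3f}")
```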

I’m excited to see this kind of functionality added to apps like Apple Photos or Google Photos to make your own images more searchable. I’m also thinking about how to add it to Nutrify, except in reverse.

Instead of hand labelling individual images of food, train a contrastive model to learn a shared representation of language and images, then use an image to search for the language embedding most related to it.

In essence, learning the inputs as well as the outputs.

results-from-text-to-image

Results for searching images with the query “man riding a bike”. The most similar image embeddings to the text embedding of the query get returned. Source: TensorFlow blog.

Cool tools 🔭

mathpix-demo

Using Mathpix to turn the formula for attention from the paper Attention Is All You Need into editable LaTeX.
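For reference, here's what that attention formula looks like as editable LaTeX (my own transcription of the formula from the paper, not Mathpix's actual output):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
```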

  • The easiest way to create machine learning demos gets better with Gradio 3.0 — Gradio enables you to create machine learning demos in Python with a few lines of code. And with the 3.0 update, your demos can now be multi-step interfaces. So if you've got a machine learning model that's more than just one input and one output, you can build it with Gradio Blocks (see the minimal example below).
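Here's a minimal Blocks demo, adapted from the pattern in Gradio's documentation (the greet function is just a placeholder for your model):

```python
import gradio as gr

# Placeholder for a real model — swap in your own prediction function.
def greet(name):
    return f"Hello {name}!"

# Blocks lets you lay out multiple components and wire them together explicitly.
with gr.Blocks() as demo:
    name = gr.Textbox(label="Name")
    greeting = gr.Textbox(label="Greeting")
    greet_button = gr.Button("Greet")
    greet_button.click(fn=greet, inputs=name, outputs=greeting)

demo.launch()
```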

Papers and research 📰

Plenty of interesting research coming out in the vision/language space.

The crossover of language models into vision is getting real.

And the lines between the two different domains are blurring (because models once used for language modelling are now being used for vision).

Language Models Can See: Plugging Visual Controls in Text Generation — What if you could add visual controls to your already existing language models?

For example, input an image and have a language model output text based on that image.

Even better, what if you didn't have to perform any training?

What if you could just use off-the-shelf pretrained language models and a contrastive model like CLIP (Contrastive Language-Image Pretraining) to make sure your text outputs match the target image?

MAGIC (iMAge-Guided text generatIon with CLIP) is a method that does just that.

And since it doesn't perform any extra training or optimization at inference time, it outperforms previous methods with a 27x speedup. See the code and demos on GitHub and the paper on arXiv.

Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT) — What do you get when you combine a vision transformer (ViT) and a language model (seeing the trend yet?) and then fine-tune their outputs on object detection datasets?

You get an object detector that's able to detect objects for almost any text.

As in, you could pass in a list of potential objects such as "bacon, egg, avocado" and have your object detector find these items in an image, despite the model never actually being trained on these classes.

See the paper on arXiv and the code and example demos on GitHub.

owl-vit-object-detection

Testing out OWL-ViT with my own images. Left image query: ["cooked chicken", "corn", "barbeque sauce", "sauce", "carrots", "butter"], right image query: ["bacon", "egg", "avocado", "lemon", "car"]. An important point is that the object detection model was never explicitly trained on any of these classes. Source: OWL-ViT demo Colab notebook with my own images.

Podcast 🔊

One of the founders of HuggingFace, Clement Delangue, went on the Robot Brains podcast to discuss how HuggingFace wants to become the GitHub for machine learning. And with all the incredible work they’re doing in the open-source space, I think they’re well on their way.

Listen on YouTube, Apple Podcasts, or Spotify.


See you next month!

What a massive month for the ML world in May!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month, Daniel

www.mrdbourke.com | YouTube

By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a couple of our courses below or check out all Zero To Mastery courses.

More from Zero To Mastery

ZTM Career Paths: Your Roadmap to a Successful Career in Tech

Whether you’re a beginner or an experienced professional, figuring out the right next step in your career or changing careers altogether can be overwhelming. We created ZTM Career Paths to give you a clear step-by-step roadmap to a successful career.

Top 7 Soft Skills For Developers & How To Learn Them

Your technical skills will get you the interview. But soft skills will get you the job and advance your career. These are the top 7 soft skills all developers and technical people should learn and continue to work on.

Python Monthly Newsletter 💻🐍

30th issue of Andrei Neagoie's must-read monthly Python Newsletter: Dunder methods, PyScript, code audits, and Python patterns. All this and more. Read the full newsletter to get up-to-date with everything you need to know from last month.