Machine Learning Monthly Newsletter 💻🤖

Daniel Bourke

45th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.

Hey there, Daniel here.

I’m a Machine Learning Engineer who also teaches beginner-friendly machine learning courses with Zero To Mastery.

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

What you missed in September 2023 as a Machine Learning Engineer…

My Work 👇

I’ve been going through the ZTM Machine Learning courses making sure the code runs and the APIs and information are all up to date.

If you find any errors in your travels, please feel free to make a pull request or leave an issue on the relevant course GitHub page.

From the Internet 🥅

Open Source


Example of Facebook’s Nougat model reading a scanned academic document and turning it into markdown text. Source: Nougat homepage.

Tutorials 🧠

  • Jeremy Howard, creator of fast.ai, published a sensational lecture and hands-on tutorial for using and creating large language models, including how to run them on your own machine. Watch the video on YouTube.
  • Matrix multiplications are one of the most common operations in machine learning. But they can also be quite a challenge to understand. A new visualization tool, mm, is here to help change that. It lets you visualize matrix multiplications and similar operations fully interactively, right in the browser (see the short PyTorch refresher after this list). Read the release blog post, read the reference guide.


Example of visualizing a dot product operation on two matrices with mm. Source: PyTorch blog.

  • Roboflow have a terrific tutorial showing how to train a segmentation model using SegGPT and only 2 labelled images. SegGPT is a foundation model capable of segmenting similar items given a few examples. So you can manually label a few images and then use SegGPT to copy the style of those labels to other images programmatically. Read the blog post.
  • How to train your own object detection model. RTMDet is a state-of-the-art model capable of real-time object detection. There are several scales of RTMDet models: the smaller ones are lightning fast at a slight cost in performance, whereas the larger ones perform better but take a bit more compute. Roboflow has a sensational guide on how to train your own custom RTMDet model on your own dataset. See the blog post tutorial and try out the code on Google Colab.
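And here's the short PyTorch refresher on the operation mm visualizes, mentioned above (the shapes are arbitrary placeholders):

```python
import torch

# Matrix multiplication: (m, k) @ (k, n) -> (m, n)
A = torch.randn(3, 4)  # 3 rows, 4 columns
B = torch.randn(4, 5)  # 4 rows, 5 columns
C = A @ B              # same as torch.matmul(A, B)
print(C.shape)         # torch.Size([3, 5])

# A dot product is the 1D special case: (k,) @ (k,) -> scalar
x = torch.randn(4)
y = torch.randn(4)
print(x @ y)           # a single number
```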

Tips and tricks to optimize LLMs for production by Hugging Face

These days there are several different LLM APIs you can call into and get pretty good results.

But what if you wanted to host your own LLM models?

Doing so comes with a range of benefits: data privacy, speed, potential cost savings and no reliance on third parties.

One of the first issues you’ll likely face with trying to deploy your own LLMs is the model size.

The most performant LLMs come with billions of parameters and footprints that range from 10s to 100s of gigabytes.

Hugging Face has released a great blog post and tutorial on how to optimize your own LLM models with techniques such as:

  1. Quantization (using lower precision for smaller model sizes).
  2. Flash attention (more efficient attention mechanism).
  3. Improving the overall LLM architecture via better positional embeddings and key-value caching.

One of my favourites is using a lower precision to achieve a smaller model size and in turn run larger models on smaller GPUs.

For example, a single float32 value takes up 4 bytes.

So a 7B parameter model (e.g. Llama 7B) would require 7,000,000,000 * 4 bytes = 28,000,000,000 bytes = ~28GB VRAM to load.

But loading the model in float16 halves this to ~14GB.

And then loading it in 8-bit would halve it again to ~7GB.

So the rule of thumb is:

For every X billion parameters of an LLM, you need:

  • Float32 = ~4 * X GB of VRAM
  • Float16 = ~2 * X GB of VRAM
  • 8-bit = ~1 * X GB of VRAM
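As a quick sanity check, here's that rule of thumb as a tiny Python sketch (weights only, ignoring activations, the KV cache and other overhead):

```python
# Rough VRAM needed just to load the weights (ignores activations/KV cache/overhead)
def estimate_vram_gb(num_params_billion: float, bytes_per_param: int) -> float:
    return num_params_billion * bytes_per_param  # X billion params * bytes each ≈ GB

for precision, bytes_per_param in [("float32", 4), ("float16", 2), ("8-bit", 1)]:
    print(f"7B model in {precision}: ~{estimate_vram_gb(7, bytes_per_param)} GB VRAM")

# 7B model in float32: ~28 GB VRAM
# 7B model in float16: ~14 GB VRAM
# 7B model in 8-bit: ~7 GB VRAM
```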

Of course, lowering the precision sometimes comes with the tradeoff of worse model performance, so you’ll need to experiment to find the precision that works best for your use case.
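As a concrete example, here's a minimal sketch of loading the same model in float16 and 8-bit with Hugging Face Transformers. This assumes you have accelerate and bitsandbytes installed, and the model ID is just an example (gated models like Llama 2 also require accepting the licence on Hugging Face):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"  # example model ID, swap in the LLM you want to serve

# float16: ~2 bytes per parameter -> ~14GB for a 7B model
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # spreads weights across available GPUs/CPU
)

# 8-bit quantization via bitsandbytes: ~1 byte per parameter -> ~7GB for a 7B model
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
)
```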

For more on optimizing your LLMs for production, see the Hugging Face blog.

Basics of Reinforcement Learning for Large Language Models by Cameron Wolfe

Alongside supervised learning and self-supervised learning, reinforcement learning is one method of training a neural network.

In the case of large language models, reinforcement learning is used as a part of RLHF (reinforcement learning from human feedback).

To get a desired output from a large language model, one option would be to produce 1000s of samples of ideal outputs. However, this is often costly and time-consuming.

But since LLMs are pretty good to begin with, you can get them to produce a series of outputs, rate those outputs (with human feedback) and then use reinforcement learning to train the model to produce outputs with higher ratings.


An example rating interface for providing human feedback on large language model outputs. One huge advantage of this format is that rating is far easier to scale than creating good examples from scratch. Source: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
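A common way to turn these ratings into a training signal (for example, in InstructGPT-style RLHF) is a reward model trained with a simple pairwise ranking loss: given two outputs for the same prompt, push the score of the human-preferred one above the other. A minimal sketch with made-up placeholder scores:

```python
import torch
import torch.nn.functional as F

# Made-up reward model scores for pairs of outputs to the same prompts
reward_chosen = torch.tensor([1.3, 0.2, 2.1])     # scores for the human-preferred outputs
reward_rejected = torch.tensor([0.4, 0.9, -0.5])  # scores for the less-preferred outputs

# Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)),
# minimised when the preferred output consistently scores higher
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss)  # lower = the reward model agrees more with the human rankings
```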

The Deep (Learning) Focus newsletter by Cameron Wolfe, Ph.D. has great overviews of many deep learning concepts, including reinforcement learning and large language models.

I’d highly recommend checking it out and subscribing for more.

Can Large Language Models Reason? by Melanie Mitchell

It’s no secret LLMs produce incredible results.

But do they have the power to reason?

In Can Large Language Models Reason?, Melanie Mitchell, Professor at the Santa Fe Institute, defines reasoning as:


Definition of reasoning by Melanie Mitchell. Source: Can Large Language Models Reason? by Melanie Mitchell.

So when an LLM produces an output that seemingly requires multiple steps of reasoning, is it actually reasoning or is it statistically pattern matching based on its training dataset?

This question is still open-ended.

One thing I’ve noticed when creating larger-scale datasets of images (100k+ images) is that you start to lose touch with the ability to know what’s in the training data.

As in, without large amounts of time invested, there’s no way you could go through every sample.

And this is only with 100k+ samples.

I can’t imagine what it would be like with billions or even trillions of samples, as is the case with large language models.

You get to the point where your training set is so large that developing adequate tests for your models becomes harder and harder.

If your model is trained on nearly all of the text of the internet (including some existing machine learning benchmarks), data leakage becomes a big problem.

So perhaps for each new wave of large language models, a series of novel and unseen tests needs to be developed.

Melanie Mitchell discusses these kinds of ideas and more in:

  1. Can Large Language Models Reason?
  2. And broader AI topics in her sensational newsletter AI: A Guide for Thinking Humans.

Falcon 180B soars to the top of the open LLM leaderboard 📈

Falcon 180B (180B = 180 billion parameters) is the newest and largest open (open access, not necessarily open source) language model to land.

It was trained by the Technology Innovation Institute (TII) on 3.5 trillion tokens from their own RefinedWeb dataset, a filtered and large-scale deduplicated version of CommonCrawl (a free and open crawl of the web).

It outperforms all other open-source models (including Llama) and rivals closed-source models like GPT-4 by OpenAI and PaLM-2 by Google.


Performance of Falcon 180B compared to closed-source PaLM and PaLM-2 models. Source: Hugging Face blog post announcement of Falcon 180B.

There’s a little confusion around the “open” label though.

Falcon 180B is “open access” rather than “open source”, meaning it can be used commercially but under a restrictive licence. From my understanding of section 9 of the licence, you’re not able to offer Falcon 180B as a hosted service for others without a separate agreement (though since it’s such a large model, you’d need several A100 GPUs to do that anyway).

But on the whole, an incredible collection of models and research contributions by TII!

Read the blog post announcement, see the Falcon homepage, get the models and code on Hugging Face.

RoboAgent is a real-world robotic agent capable of learning via hallucinations and semantic augmentations

Most of the recent and most public AI advancements have come through large language models.

And subsequently, many of these models are interacted with through chat interfaces.

However, most of the world isn’t a chat interface.

So how do you bridge the gap between text-based models and the real world?

RoboAgent is a research project from Meta AI and Carnegie Mellon University that aims to figure this out.

Using a combination of real and augmented datasets, they create a robot that is capable of performing 12 manipulation skills (including slide drawer open, slide drawer close, pick, place, lift cap, replace cap) across 38 tasks (including make tea, serve soup, stow bowl) in 100s of diverse and unseen scenarios (different kitchens, different objects, different tasks).

One of my favourite figures shows how, with each increase in semantic augmentations (swapping objects in images for other semantically similar objects), the performance of the robot increases as well.


When provided with increasing amounts of semantic augmentations of an observation image, RoboAgent’s success improves quickly across various levels of generalization (L1 = object poses and lighting, L2 = background and distracting objects, L3 = novel tasks and environments). Source: RoboAgent website.

Read the paper, get the code on GitHub, view the project page.

SigLIP = CLIP with Sigmoid = Better results

SigLIP improves CLIP training results by using a sigmoid loss function instead of a contrastive loss (the “C” in CLIP), in turn creating Sigmoid Language-Image Pretraining.

The SigLIP base model also forms the backbone of OWL-ViT v2, an open-world object detection model (featured in ML monthly June 2023).

Notably, the model outperforms the previous state-of-the-art (EVA-CLIP) on ImageNet zero-shot classification (83.2% vs 82%) while using a roughly 10x smaller visual encoder (400M parameters vs 5B parameters).
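The core idea is simple: instead of a softmax-based contrastive loss over the whole batch, every image-text pair gets an independent binary (sigmoid) label. Here's a rough sketch of the loss described in the paper, where t and b are learnable temperature and bias terms (the embeddings below are random placeholders, not real model outputs):

```python
import torch
import torch.nn.functional as F

def sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    # Pairwise similarities for every image-text combination in the batch
    logits = img_emb @ txt_emb.T * t + b
    # +1 on the diagonal (matching pairs), -1 everywhere else (non-matching pairs)
    labels = 2 * torch.eye(logits.shape[0]) - 1
    # Each pair is treated as an independent binary classification problem
    return -F.logsigmoid(labels * logits).sum() / logits.shape[0]

# Random placeholder embeddings (L2-normalised, as in CLIP-style models)
img_emb = F.normalize(torch.randn(8, 512), dim=-1)
txt_emb = F.normalize(torch.randn(8, 512), dim=-1)
print(sigmoid_loss(img_emb, txt_emb))
```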


Results from the SigLIP model on several diverse images (none of which the model has seen) from the authors of the paper alongside one of my own images, a rock and sand drawing by my partner. Notice how the model matches many of the images to the text that most suits them. Source: SigLIP Colab Demo.

Read the paper, see the X announcement, get the code on GitHub, try the demo notebook.

Quick-fire round: releases, announcements and cool things 🔥

  • What can you do with $1k compute credits and a goal? Make CLIP multi-lingual! A very cool example of how to make the most of a small computing budget.
  • Cohere show how you can use LLMs to improve search results.
  • Emu is a new text-to-image generation model from Meta. One very notable finding from the paper was how much image generation quality improves when fine-tuning the base model on 100-1000 (not very many) aesthetic images. Read the paper.
  • OpenAI announce DALL-E 3, an incredible update to their image-generating model DALL-E 2. Coming to ChatGPT Plus in October and developers via API shortly after.
  • ChatGPT can now hear, see and speak. Visual inputs come to ChatGPT with GPT-4V (V for vision). The vision and voice version of ChatGPT is slowly rolling out over the next couple of weeks to Plus and Enterprise users and developers will get access soon after. For a good quick overview of the vision capabilities of GPT-4V, see the Roboflow blog.
  • TensorFlow 2.14.0 is out! Two of my favourite features include: 1. You can now install the Nvidia CUDA libraries through pip (pip install tensorflow[and-cuda]). 2. tf.keras models can now be compiled with steps_per_execution="auto" for a significant performance boost in Model.fit, Model.predict and Model.evaluate (see the sketch after this list).
  • Meta hosted their Connect conference for 2023 where they announced several new products including the Quest 3 and Ray-Ban Meta smart glasses (Ray-Ban sunglasses with cameras and AI built-in). There were also several talks about Llama and Meta's plans on expanding their open-source generative AI offerings.
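Here's a minimal sketch of the steps_per_execution="auto" option mentioned above (the model is just a toy to show where the argument goes):

```python
import tensorflow as tf

# Toy model just to show where the new compile argument goes
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

model.compile(
    optimizer="adam",
    loss="mse",
    steps_per_execution="auto",  # new in TF 2.14: Keras tunes how many batches run per call
)

# model.fit(...), model.evaluate(...) and model.predict(...) then benefit automatically
```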

See you next month!

What a massive month for the ML world in September!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month, Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

More from Zero To Mastery

Data Engineer vs Data Analyst vs Data Scientist - Which Is Best for Me?

Data is HOT right now. Great salaries, 1,000s of job opportunities, exciting + high-impact work. But what are the differences and which role is best for you?

The No BS Way To Getting A Machine Learning Job

Looking to get hired in Machine Learning? Our ML expert tells you how. If you follow his 5 steps, we guarantee you'll land a Machine Learning job. No BS.

Python Monthly Newsletter 💻🐍

46th issue of Andrei Neagoie's must-read monthly Python Newsletter: why did Python win, hidden Python features, and much more. Read the full newsletter to get up-to-date with everything you need to know from last month.