45th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Hey there, Daniel here.
I'm a Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
I've been going through the ZTM Machine Learning courses making sure the code runs and the APIs and information are all up to date.
If you find any errors in your travels, please feel free to make a pull request or leave an issue on the relevant course GitHub page:
Example of Facebook's Nougat model reading a scanned academic document and turning it into markdown text. Source: Nougat homepage.
Example of visualizing a dot product operation on two matrices with mm. Source: PyTorch blog.
These days there are several different LLM APIs you can call into and get pretty good results.
But what if you wanted to host your own LLM models?
Doing so comes with a range of benefits: data privacy, speed, potential cost savings and no reliance on third parties.
One of the first issues you'll likely face when deploying your own LLMs is the model size.
The most performant LLMs come with billions of parameters and footprints that range from 10s to 100s of gigabytes.
Hugging Face has released a great blog post and tutorial on how to optimize your own LLM models with techniques such as:
One of my favourites is using a lower precision to achieve a smaller model size and in turn run larger models on smaller GPUs.
For example, a single float32 value takes up 4 bytes.
So a 7B parameter model (e.g. Llama 7B) would require 7,000,000,000 * 4 bytes = 28,000,000,000 bytes = ~28GB VRAM to load.
But loading the model in float16 halves this to ~14GB.
And then loading it in 8-bit would halve it again to ~7GB.
So the rule of thumb is: for every X billion parameters of an LLM, you need roughly 4X GB of VRAM to load the weights in float32, 2X GB in float16/bfloat16 and X GB in 8-bit.
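Here's a minimal sketch of that arithmetic in Python (rough numbers, weights only, not counting activations or other overhead):

```python
# Back-of-the-envelope VRAM estimate for loading model weights.
bytes_per_param = {"float32": 4, "float16": 2, "int8": 1}

def estimate_vram_gb(num_params: float, dtype: str = "float32") -> float:
    return num_params * bytes_per_param[dtype] / 1e9

for dtype in bytes_per_param:
    print(f"7B model in {dtype}: ~{estimate_vram_gb(7e9, dtype):.0f} GB")

# 7B model in float32: ~28 GB
# 7B model in float16: ~14 GB
# 7B model in int8: ~7 GB
```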
Of course, lowering the precision sometimes comes with the tradeoff of worse model performance, so you'll need to experiment to find the best precision for your use case.
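If you use the transformers library, trying a lower precision is mostly a matter of a couple of keyword arguments. A minimal sketch, assuming a Llama 2 7B checkpoint (swap in whichever model you're using; device_map="auto" needs the accelerate package and 8-bit loading needs bitsandbytes):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint, use your own

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # load weights in float16 (~14GB instead of ~28GB)
    device_map="auto",          # spread layers across available device(s)
)

# 8-bit loading (roughly ~7GB for a 7B model):
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, load_in_8bit=True, device_map="auto"
# )
```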
For more on optimizing your LLMs for production, see the Hugging Face blog.
Alongside supervised learning and self-supervised learning, reinforcement learning is one method of training a neural network.
In the case of large language models, reinforcement learning is used as a part of RLHF (reinforcement learning from human feedback).
To get a desired output from a large language model, one option would be to produce 1000s of samples of ideal outputs. However, this is often costly and time-consuming.
But since LLMs are pretty good to begin with, you can instead get the model to produce a series of outputs, rate them (with human feedback) and use reinforcement learning to train the model to produce outputs with higher ratings.
An example rating interface for providing human feedback on large language model outputs. One huge advantage of this format is that rating is far easier to scale than creating good examples from scratch. Source: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
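At the core of RLHF is a reward model trained on those human ratings, usually with a simple pairwise loss: the preferred ("chosen") output should score higher than the other ("rejected") one. A minimal sketch (not the exact recipe from the paper above; the function and variable names are my own):

```python
import torch
import torch.nn.functional as F

def reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Push the reward model to score human-preferred outputs higher
    # than the rejected ones.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy example with made-up scores for two comparison pairs.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.9])
print(reward_loss(chosen, rejected))  # lower when chosen outscores rejected
```

The trained reward model's scores are then used as the reward signal when fine-tuning the LLM with a reinforcement learning algorithm such as PPO.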
The Deep (Learning) Focus newsletter by Cameron Wolfe, PhD, has a great overview of many machine learning concepts, including reinforcement learning and large language models.
I'd highly recommend checking it out and subscribing for more.
It's no secret LLMs produce incredible results.
But do they have the power to reason?
In Can Large Language Models Reason?, Melanie Mitchell, Professor at the Santa Fe Institute, defines reasoning as:
Definition of reasoning by Melanie Mitchell. Source: Can Large Language Models Reason? by Melanie Mitchell.
So when an LLM produces an output that seemingly requires multiple steps of reasoning, is it actually reasoning or is it statistically pattern matching based on its training dataset?
This question is still open.
One thing I've noticed when creating larger-scale datasets of images (100k+ images) is that you start to lose the ability to know what's in the training data.
As in, without large amounts of time invested, there's no way you could go through every sample.
And this is only with 100k+ samples.
I can't imagine what it would be like with billions or even trillions of samples, as is the case with large language models.
You get to the point where your training set is so large that developing adequate tests for your models becomes harder and harder.
If your model is trained on nearly all of the text of the internet (including some existing machine learning benchmarks), data leakage becomes a big problem.
So perhaps for each new wave of large language models, a series of novel and unseen tests need to be developed.
Melanie Mitchell discusses these kinds of ideas and more in:
Falcon 180B (180B = 180 billion parameters) is the newest and largest open (open access, not necessarily open source) language model to land.
It was trained by the Technology Innovation Institute (TII) on 3.5 trillion tokens from their own RefinedWeb dataset, a filtered and large-scale deduplicated version of CommonCrawl (a free and open crawl of the web).
It outperforms other open models (including Llama) and rivals closed-source models like GPT-4 by OpenAI and PaLM-2 by Google.
Performance of Falcon 180B compared to closed-source PaLM and PaLM-2 models. Source: Hugging Face blog post announcement of Falcon 180B.
There's a little confusion around the "open" label though.
Falcon 180B is "open access" rather than "open source", meaning it can be used commercially but under a restrictive licence. From my understanding of section 9 of the licence, you're not able to host the Falcon 180B model yourself (though since it's such a large model, you'd need several A100 GPUs to do so anyway).
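To put that size in context, here's the earlier rule of thumb applied to Falcon 180B (rough, weights-only numbers; real inference needs extra memory for activations and the KV cache):

```python
# Rough, weights-only footprint for a 180B-parameter model.
params = 180e9
print(f"float16: ~{params * 2 / 1e9:.0f} GB")  # ~360 GB
print(f"8-bit:   ~{params * 1 / 1e9:.0f} GB")  # ~180 GB
# Even in 8-bit that's more than two 80GB A100s worth of memory.
```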
But on the whole, an incredible collection of models and research contributions by TII!
Read the blog post announcement, see the Falcon homepage, get the models and code on Hugging Face.
Most of the recent and most public AI advancements have come through large language models.
And subsequently, many of these models are interacted with through chat interfaces.
However, most of the world isn't a chat interface.
So how do you bridge the gap between text-based models and the real world?
RoboAgent is a research project from Meta AI and Carnegie Mellon University that aims to figure this out.
Using a combination of real and augmented datasets, they create a robot that is capable of performing 12 manipulation skills (including slide drawer open, slide drawer close, pick, place, lift cap, replace cap) across 38 tasks (including make tea, serve soup, stow bowl) across 100s of diverse and unseen scenarios (different kitchens, different objects, different tasks).
One of my favourite figures shows how, with each increase in semantic augmentation (swapping objects in images for other objects with similar semantic details), the performance of the robot increased as well.
When provided with increasing amounts of semantic augmentations of an observation image, RoboAgent's success improves quickly across various levels of generalization (L1 = object poses and lighting, L2 = background and distracting objects, L3 = novel tasks and environments). Source: RoboAgent website.
Read the paper, get the code on GitHub, view the project page.
SigLIP improves CLIP training results by using the sigmoid loss function instead of contrastive loss (the "C" in CLIP), in turn creating Sigmoid Language-Image Pretraining.
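For intuition, here's a minimal sketch of the pairwise sigmoid loss described in the SigLIP paper (written from the paper's description; the variable names and the toy values for the temperature and bias are my own):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, temperature, bias):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * temperature + bias  # (batch, batch)
    # +1 for matching image-text pairs (the diagonal), -1 for all other pairs.
    labels = 2 * torch.eye(logits.size(0)) - 1
    # Every pair gets an independent binary (sigmoid) loss - no softmax
    # over the whole batch as in CLIP's contrastive loss.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# Toy usage with random embeddings.
img = torch.randn(4, 512)
txt = torch.randn(4, 512)
print(siglip_loss(img, txt, temperature=10.0, bias=-10.0))
```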
The SigLIP base model also forms the backbone of OWL-ViT v2, an open-world object detection model (featured in ML monthly June 2023).
Notably, the model outperforms the previous state-of-the-art (EVA-CLIP) on ImageNet 0-shot classification (83.2% vs 82%) while using a 10x smaller visual encoder (400M parameters vs EVA-CLIP's 5B).
Results from the SigLIP model on several diverse images (none of which the model has seen) from the authors of the paper alongside one of my own images, a rock and sand drawing by my partner. Notice how the model matches many of the images to the text that most suits them. Source: SigLIP Colab Demo.
Read the paper, see the X announcement, get the code on GitHub, try the demo notebook.
pip install tensorflow[and-cuda]).

2. tf.keras can now use steps_per_execution="auto" in Model.fit, Model.predict and Model.evaluate for a significant performance boost.
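As a quick illustration of the steps_per_execution change, here's a minimal toy Keras example (the model and data are made up purely to show where the argument goes):

```python
import numpy as np
import tensorflow as tf

# Tiny model just to demonstrate the compile argument.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])

model.compile(
    optimizer="adam",
    loss="mse",
    steps_per_execution="auto",  # TF 2.14+: let Keras tune batches per tf.function call
)

# Random data purely for demonstration.
X = np.random.rand(1024, 20).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

model.fit(X, y, epochs=2, batch_size=32)
```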
What a massive month for the ML world in September!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month, Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.