
Machine Learning Monthly Newsletter 💻🤖

Daniel Bourke

44th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.

Hey there, Daniel here.

I'm a Machine Learning Engineer who also teaches several beginner-friendly machine learning courses with Zero To Mastery.

I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, I've taken care to keep things to the point.

Enough about me! You're here for this month's Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

What you missed in August 2023 as a Machine Learning Engineer…

From Around the Internet 🕸

1. Even more open-source state of the art text embedding models!

A new set of state-of-the-art (SOTA) open-source text embeddings are available on Hugging Face.

These new embeddings, called bge (BAAI General Embeddings) and gte (General Text Embeddings), outperform previous sets of embeddings (including OpenAI's) with smaller model sizes and fewer embedding dimensions.

Wait, what are embeddings?

Embeddings are a numerical representation of some kind of data.

In this case, we're talking about text embeddings.

In essence, a numerical representation of a passage of text (e.g. "The quick dog ran up the hill" → [0.3, 0.2, 0.6, …]).

Embeddings can be used for many different use cases such as:

  • Classification (which numbers relate to certain categories?)
  • Similarity matching (how similar is a passage of text to another?)

And much more.
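As a rough sketch of what this looks like in code (the model id and the sentence-transformers library are my choices here, not from the article; any of the bge/gte checkpoints on Hugging Face should work similarly):

from sentence_transformers import SentenceTransformer, util

# Load one of the new open-source embedding models (model id assumed, swap in any bge/gte variant)
model = SentenceTransformer("thenlper/gte-small")

sentences = [
    "The quick dog ran up the hill",
    "A fast dog sprinted up the slope",
    "I enjoy cooking pasta on weekends",
]

# Each sentence becomes a fixed-size vector (its embedding)
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity: higher = more semantically similar
print(util.cos_sim(embeddings[0], embeddings[1]))  # similar meaning -> higher score
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated -> lower score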

2. Meta continues to crush it in the open-source AI space

Four (maybe more?) huuuuugeeee new model releases from Meta in the past month:

  1. AudioCraft - a foundation model which generates high-quality audio from text-based prompts.
  2. Code Llama - a Llama (Meta's open-source large language model) model specifically fine-tuned for code generation (see the quick sketch after this list).
  3. Seamless Communication (SeamlessM4T) - a foundational speech/text translation and transcription model that can perform speech-to-speech and speech-to-text translation for 100 input languages (speech + text), 100 output languages for text and 35 output languages for speech.
  4. DINOv2 + FACET - DINOv2 is Meta's state-of-the-art image encoder and it now comes with an Apache 2.0 licence, meaning it can be used for commercial use cases! And FACET (FAirness in Computer Vision EvaluaTion) is a diverse dataset with 50,000 images of people labelled by expert human annotators to help evaluate the fairness of computer vision models.
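If you want to take Code Llama for a spin, here's a minimal sketch using Hugging Face transformers (this uses the 7B base checkpoint on the Hugging Face Hub; you'll need enough GPU memory and the accelerate package for device_map="auto"):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "codellama/CodeLlama-7b-hf"  # 7B base model (Python and Instruct variants also exist)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Give it the start of a function and let it complete the rest
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))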

The models vary in the licences they're released under (e.g. commercial vs research-only) but it's outstanding to see work of this scale released openly.

Keep at it Meta!

3. Hugging Face models right in your database! (Hugging Face + Supabase)

Supabase is an open-source Firebase alternative built on a Postgres database.

It can be used for storing structured data (SQL) in an efficient way.

My brother and I are using it to store account data and metadata for our app Nutrify.

Supabase recently added support for storing vectors via pgvector (vectors can be embeddings, numerical representations of data, as discussed above).

And now they've partnered with Hugging Face to create those vectors via Transformers models (such as the new state-of-the-art open-source embedding models).

In essence, you can now go from data → transform to vector (embedding) → store in database → query later, all with Supabase and Hugging Face, two open-source tools!

import vecs
from vecs.adapter import Adapter, ParagraphChunker, TextEmbedding

vx = vecs.create_client("postgresql://<user>:<password>@<host>:<port>/<db_name>")

# Create a new collection with an associated adapter
docs = vx.get_or_create_collection(
    name="docs",
    # here comes the new part
    adapter=Adapter(
        [
            ParagraphChunker(skip_during_query=True),
            TextEmbedding(model='Supabase/gte-small'),
        ]
    )
)

# Upsert
docs.upsert(
    records=[
        (
         "vec0",
         "the diameter of a 747 ...", # <- inserting text!
         {"publish_year": 2019}
        )
    ]
)

# Search by text
docs.query(data="how many ping pong balls fit in a Boeing ...")

# Results: [...]

4. Upgrade your computer vision workflow with Roboflow's Supervision library

Roboflow Supervision is an open-source library to help with your computer vision pipelines.

It includes tools to:

  • Load datasets
  • Draw detections on an image or video
  • Count how many detections are in a zone
  • Evaluate detection models

It goes along with Roboflow's fantastic suite of computer vision tools, such as Notebooks (for computer vision examples), Autodistill (automatic computer vision data labelling) and Collect (automatic collection of computer vision data at different intervals).
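As a rough sketch of what it looks like in practice (assuming an ultralytics YOLOv8 model for the detections; the exact supervision API may differ slightly between versions):

import cv2
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8n.pt")       # small pretrained detector
image = cv2.imread("image.jpg")  # hypothetical input image

# Run detection and convert the results into supervision's Detections format
results = model(image)[0]
detections = sv.Detections.from_ultralytics(results)

# Draw the detected boxes onto a copy of the image
box_annotator = sv.BoxAnnotator()
annotated = box_annotator.annotate(scene=image.copy(), detections=detections)
cv2.imwrite("annotated.jpg", annotated)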

5. Segment Anything in High Quality (and with a fast model!)

Meta released the Segment Anything (SAM) model a couple of months ago now.

Since then there's been a flourishing of open-source variants to improve both its capabilities and its speed.

Two of the latest are HQ-SAM and Light HQ-SAM, where HQ stands for High Quality and Light stands for smaller and faster.

HQ-SAM refines the segmentation masks from SAM into higher-quality masks, improving the results and making them look better overall.

Light HQ-SAM is a distillation (a way to take what bigger models know and embed it into smaller models) of HQ-SAM which allows real-time segmentation due to having fewer parameters (40MB vs 5-10GB).

Both are open-source and available via the sam-hq GitHub repo.

Visual comparison of SAM vs SAM-HQ results. Source: sam-hq GitHub.
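The sam-hq repo mirrors the original Segment Anything predictor interface, so prompting it with a point looks roughly like this (the checkpoint filename and registry key below are assumptions; check the repo's README for the exact names):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor  # installed from the sam-hq repo

# Load an HQ-SAM checkpoint (filename/registry key assumed)
sam = sam_model_registry["vit_l"](checkpoint="sam_hq_vit_l.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground point (x, y) and get candidate masks back
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),  # 1 = foreground point
    multimask_output=True,
)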

6. Keras Core brings together PyTorch, TensorFlow and JAX into one framework

Keras Core is one of the most exciting things I've seen in deep learning frameworks since Hugging Face's transformers.

The goal of the project is to combine the usability of Keras with the backends of the major frameworks.

As in, it allows you to write your machine learning code in Keras (simple and easy to use) and have it run with TensorFlow, PyTorch OR JAX as the backend.

The big benefit here is you could build and train a model in Keras Core and then integrate your favourite functions from other frameworks.

Or you can find a model trained with PyTorch, import it into your Keras Core environment and have it run without hiccups.

Keras Core will eventually become Keras 3.0 and already has a bunch of functionality ready to use (e.g. many of the existing operations across PyTorch, TensorFlow and JAX, including data loading pipelines all work within Keras Core).

If successful, Keras 3.0 could mean that there's no longer the question of JAX, TensorFlow or PyTorch…? Because with Keras 3.0, you get any and all of them. Source: Keras Core announcement post.
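Here's a minimal sketch of the backend-switching workflow (the backend is chosen via the KERAS_BACKEND environment variable before the import; the tiny model itself is just a placeholder):

import os
os.environ["KERAS_BACKEND"] = "torch"  # or "tensorflow" or "jax" - set before importing

import keras_core as keras  # the package that will become Keras 3.0

# A tiny model written once in Keras, trainable on any of the three backends
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# model.fit(X_train, y_train, epochs=5)  # runs on the backend chosen above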

7. Freeze the model, reinforce the data instead (and get better results faster)

This is easily one of my favourite papers I've read in a long time.

Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement discusses how to improve the results of a model on a dataset by making the dataset better instead of making the model better.

Figure of the dataset reinforcement workflow. Source: Reinforce Data, Multiply Impact paper.

With data reinforced ImageNet (ImageNet+), the researchers were able to train an equivalent ResNet-50 model in 2.5x less time compared to knowledge distillation (KD) training.

And compared to standard ImageNet training, with ImageNet+ the researchers achieved the same results in 150 epochs vs 1000 epochs (6.9x faster) and with the same amount of training time (1000 epochs) achieved a 1.7% better result with the same architecture.

The results also transferred to a variety of architectures (CNNs, ViTs, MobileNets), datasets (Food101, Flowers102) and problem spaces (object detection and segmentation).

One of the main benefits is that the act of dataset reinforcement only needs to take place once, then the reinforced dataset can subsequently be used for later training with minimal overheads.
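To make the idea concrete, here's a conceptual sketch (my own simplification, not the authors' pipeline, which also stores augmentation parameters): reinforce the dataset once by caching a strong teacher's soft labels, then train any number of students against the cached labels with no teacher in the loop.

import torch
import torch.nn.functional as F

def reinforce_dataset(teacher, dataloader, save_path="reinforced.pt"):
    """One-off pass: store the teacher's soft predictions alongside the images."""
    teacher.eval()
    reinforced = []
    with torch.no_grad():
        for images, _ in dataloader:
            soft_labels = teacher(images).softmax(dim=-1)
            reinforced.append((images.cpu(), soft_labels.cpu()))
    torch.save(reinforced, save_path)

def train_student(student, save_path="reinforced.pt", epochs=10, lr=1e-3):
    """Later (and repeatedly): train students on the cached soft labels, no teacher needed."""
    reinforced = torch.load(save_path)
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    for _ in range(epochs):
        for images, soft_labels in reinforced:
            loss = F.cross_entropy(student(images), soft_labels)  # soft targets supported in PyTorch >= 1.10
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()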

The authors also found that training on reinforced datasets made the models far more robust to out of distribution (OOD) datasets such as ImageNet-(V2, A, R, C, Sketch).

I love this paper because it's an excellent example of "freeze the model and iterate on the dataset".

My next question is, how does the reinforced data paradigm work with an online dataset? As in, a dataset that continually changes and grows. My guess is the original dataset could be cached and then new samples could be reinforced and continually added to the system.

8. Google AI papers and updates

RO-ViT, or Region-aware pretraining for open-vocabulary object detection with vision transformers, uses random crops of an image to teach a detector to understand what "regions" are, and then uses these along with an image and text encoder to create an open-vocabulary detector.

This means the model is able to detect and localise classes in an image it has never seen before.

The model achieves +7.8 points AP (average precision) on the LVIS dataset rare class split.

A nice side effect of the improved region-aware embedding is that the model also sees an improvement in image-level representations, achieving state-of-the-art results on 9 out of 12 metrics for image-text retrieval.

AVIS (Autonomous Visual Information Seeking with Large Language Models) is a framework to combine LLMs with computer vision tools, web search tools and image search tools to answer complicated questions about items in images.

CHITA and CHITA++ (Fast as CHITA: Neural Network Pruning with Combinatorial Optimization) are an optimization-based pruning procedure for creating smaller yet still performant neural networks. The results show that CHITA and CHITA++ can prune up to 70% of a model's weights and still retain good performance.

One caveat: since CHITA performs unstructured pruning (any weight can be removed), resulting networks require software and hardware capable of supporting sparse computations.

9. How does Siri work?

Ever wonder how you can say "Hey Siri" or just "Siri" (in the upcoming iOS 17 and macOS Sonoma) and have your Apple device(s) start listening for commands?

Turns out there are a fair few steps that go into making sure Siri triggers when you want it to and doesn't trigger when you don't.

Namely:

  • Distinguishing a device's primary user from other speakers (is it you speaking or someone else?)
  • Identifying and rejecting false triggers from background noise (did you actually say "Hey Siri" or "Siri" or was it a random sound?)
  • Identifying and rejecting acoustic segments that are phonetically similar to the trigger phrases (did you actually say "Hey Siri" or "Siri" or were you saying "Hello Sally" to your friend?)
  • Supporting a shorter, phonetically challenging trigger phrase ("Siri") across multiple locales (a shorter phrase is harder to identify)

The workflow for triggering Siri on-device all the way to performing some kind of action. Source: Apple Machine Learning blog.

The above steps would be an excellent machine learning project to try and replicate.

You could:

  • Train a classifier to classify your voice from others.
  • Train another classifier to classify whether you said "Hey Siri" or not (this could be an audio-based model or a text-based model after you've transcribed the audio) - see the sketch after this list.
  • If the above two models predict positive, transcribe the audio (if you haven't already) and perform actions based on the text.
  • The actions could be something performed by a ChatGPT-like model.
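As a starting point for that second classifier, here's a minimal sketch of a "did you say the trigger phrase?" model using MFCC features and scikit-learn (the folder layout and file paths are hypothetical; you'd record the clips yourself):

import glob
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def mfcc_features(path):
    """Average MFCCs over time -> one fixed-size feature vector per clip."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

# Hypothetical folders of short audio clips
wake_clips = glob.glob("data/wake_phrase/*.wav")   # you saying the trigger phrase
other_clips = glob.glob("data/background/*.wav")   # background noise / other speech

X = np.array([mfcc_features(p) for p in wake_clips + other_clips])
y = np.array([1] * len(wake_clips) + [0] * len(other_clips))

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([mfcc_features("data/test_clip.wav")]))  # 1 = trigger phrase detected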

If you're looking for more machine learning project ideas, check out this list I made.

10. Challenges and Patterns in Large Language Models (LLMs)

Eugene Yan and Chip Huyen are two of my favourite writers in the machine learning and AI space.

Their recent blog posts (one from Chip, two from Eugene) delve into the challenges LLMs face as well as the problems and patterns LLMs can be used for.


11. Vicki Boykis' Treasure Trail of machine learning articles

I've said it before but every so often you stumble upon someone's work and proceed to read everything they've ever written.

Vicki Boykis is one of those people.

She's been working with data and machine learning for over 10 years at companies big and small.

I'd recommend exploring her blog and reading whatever sparks your interest.

Cool tools, videos and tweets 😎

Rapid fire time!

  • AI Jason is one of my favourite new YouTube channels. He showcases code examples of how you can use large language models for various projects. Such as being your own automatic research agent.
  • A research team at Apple (at least that's what the video title says) published a talk giving an overview of different Transformer architectures for multi-modal (images and text) language models. The video gives an excellent overview of the main different types of language models (encoder-only, encoder-decoder, decoder-only) and how they work with text and images.
  • Simon Willison has a fantastic talk/blog post discussing all things LLMs. From where they started and how they work to where we are now.
  • SynthID is a new tool from DeepMind to help identify whether an image has been generated by AI or not.
  • What if you could set a learning rate to train infinitely (and continually improve/stop when it doesn't)? Shital Shah discovered a hidden gem in the Scaling Vision Transformers paper.
  • Tesla's Full Self-Driving version 12 is coming soon and supposedly removes 300k+ lines of manual code in favour of pure machine learning code (e.g. pixels from cameras in → driving actions out). Video breakdown from AI DRIVR.
  • Normal Computing's outlines library is a helpful utility for guiding LLMs to generate things, such as guaranteed JSON output.
  • Hugging Face Transformers comes to Swift! Allowing you to run on-device LLMs on Apple hardware.
  • There's a new Hugging Face Open Object Detection Leaderboard to measure the scores of openly available object detection models. A few additions I'd love to see are: model size and computation requirements.

See you next month!

What a massive month for the ML world in August!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

More from Zero To Mastery

Data Engineer vs Data Analyst vs Data Scientist - Which Is Best for Me?

Data is HOT right now. Great salaries, 1,000s of job opportunities, exciting + high-impact work. But what are the differences and which role is best for you?

The No BS Way To Getting A Machine Learning Job

Looking to get hired in Machine Learning? Our ML expert tells you how. If you follow his 5 steps, we guarantee you'll land a Machine Learning job. No BS.

Python Monthly Newsletter 💻🐍

45th issue of Andrei Neagoie's must-read monthly Python Newsletter: Python oddities, cProfile trick, and GIL removal. All this and more. Read the full newsletter to get up-to-date with everything you need to know from last month.