
AI & Machine Learning Monthly Newsletter 💻🤖

Daniel Bourke

46th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I'm an A.I. & Machine Learning Engineer who also teaches beginner-friendly machine learning courses with Zero To Mastery.

I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

What you missed in October 2023 as an A.I. & Machine Learning Engineer…

My Work 👇

RAG Resources Repo: The theme of this month's issue is RAG (or Retrieval Augmented Generation).

It's a technique I've been using to improve the outputs of my LLMs. And with that, I've created a GitHub repo to collect all of the best resources I've found so far on learning the technique.

What is RAG?

Retrieval Augmented Generation, or RAG for short, is the process of having a Large Language Model (LLM) generate text based on relevant context retrieved for a given query.

  • Retrieval = Find relevant data (texts, images, etc) for a given query.
  • Augmented = Add the retrieved relevant data as context information for the query.
  • Generation = Generate responses based on the query and retrieved information.

The goal of RAG is to reduce hallucinations in LLMs (as in, prevent them from making up information that looks right but isn't).

Think of RAG as a tool to improve your calculator for words.

The workflow looks like this:

  1. Query (e.g. a question)
  2. Find resources relevant to the query (retrieval)
  3. Add the relevant resources to the query (augment)
  4. LLM creates a response to the query based on the context
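
To make the four steps concrete, here's a minimal sketch in Python using the sentence-transformers library for the retrieval step. The documents are toy data and `call_llm` is a hypothetical placeholder, so swap in your own corpus and LLM API.

```python
# A minimal RAG sketch: embed documents, retrieve the ones closest to a query,
# then build an augmented prompt for an LLM. `call_llm` is a hypothetical
# placeholder for whichever LLM API you use.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Premium members get free shipping on all orders.",
]
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """2. Retrieval: find the k documents most similar to the query."""
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top_k = scores.topk(k).indices.tolist()
    return [documents[i] for i in top_k]

def rag_answer(query: str) -> str:           # 1. Query comes in
    context = "\n".join(retrieve(query))      # 2. Retrieve
    prompt = (                                # 3. Augment
        f"Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)                   # 4. Generate (hypothetical LLM call)
```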

What can you use RAG for?

Let's say you were a large company with huge amounts of documentation.

You could build a traditional search engine on top of that documentation.

And this works fairly well.

But sometimes you want to go directly to the answer.

That's where RAG can help.

A RAG workflow would involve someone asking a question (typically called a query), retrieving passages in the documentation related to that query, and having an LLM generate an answer based on the relevant passages.

That way, instead of generating an answer based only on the LLM's training data, the answer comes from a combination of that training data and your company's own data.

The following resources will help you learn more about the topic of RAG.

But stay tuned to the rag-resources GitHub repo for future updates.

From the Internet 👇

RAG Resources

Forget RAG, the Future is RAG-Fusion by Adrian Raudaschl

RAG-Fusion is a take on RAG that enhances the original search query by creating other similar queries like it, searching for those and then combining the results.

This is similar to the process of "Hypothetical Document Embedding" (HyDE, mentioned in ML Monthly May 2023), as in asking a model "what would a document that answers this question look like?", generating that document and then searching for similar documents to it.
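
Here's a rough sketch of the idea. The reciprocal rank fusion step is the core of RAG-Fusion; `generate_similar_queries` and `search` are hypothetical placeholders for an LLM call and your search backend.

```python
# A sketch of the Reciprocal Rank Fusion (RRF) step at the heart of RAG-Fusion:
# several generated queries each return a ranked list of documents, and RRF
# merges them into one ranking.
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            # Documents ranked highly in multiple lists accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def rag_fusion_retrieve(query: str) -> list[str]:
    queries = [query] + generate_similar_queries(query)  # hypothetical LLM call
    ranked_lists = [search(q) for q in queries]          # hypothetical search backend
    return reciprocal_rank_fusion(ranked_lists)
```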


Diagram of RAG Fusion workflow. Source: Adrian Raudaschl blog.

[Blog] Prompt engineering for Claude's long context window

Claude is a Large Language Model (LLM) by Anthropic with similar performance to GPT-4. It's one of the best available proprietary LLMs.

In this blog post, the Anthropic team go through a series of experiments to improve Claude's performance (e.g. showing no examples, two examples and five examples). Turns out, with smaller models (Claude 1.2), the examples really matter whereas with larger models (Claude 2), examples still improve results but aren't as necessary.

Bonus: Anthropic also have an LLM cookbook on GitHub with examples such as how to iteratively search Wikipedia for information to help answer a question.
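
As a rough illustration of the few-shot pattern the post tests, here's how you might assemble a prompt with a couple of worked examples ahead of the real question. The example Q&A pairs are made up for illustration.

```python
# A sketch of few-shot prompting: prepend worked examples to the question so
# the model sees the expected format before answering.
EXAMPLES = [
    ("What is the capital of France?", "Paris"),
    ("What year did the Apollo 11 mission land on the Moon?", "1969"),
]

def build_few_shot_prompt(document: str, question: str) -> str:
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in EXAMPLES)
    return (
        f"{document}\n\n"
        f"Answer questions about the document above.\n\n"
        f"{shots}\n\n"
        f"Question: {question}\nAnswer:"
    )
```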


Results of using Claude LLM with varying levels of examples and scratchpad abilities. Source: Anthropic blog.

Retrieval augmented generation: Keeping LLMs relevant and current from the Stack Overflow blog.

A nice and easy-to-read overview of the RAG landscape, along with workflow and tool recommendations such as Emerging Architectures for LLM Applications by Andreessen Horowitz.

[Updated] Building RAG-based LLM Applications for Production by Goku Mohandas and Philipp Moritz.

One of the gold-standard posts on building a production-grade RAG application has been updated with new sections:

  • fine-tuning
  • prompt engineering
  • lexical search
  • reranking
  • and data flywheel

I highly recommend reading through this post to see what it's like to build a full-scale RAG-style application.
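
For a flavour of two of the new sections, here's a hedged sketch of lexical search plus reranking: BM25 fetches cheap candidates and a cross-encoder reranks them. It assumes the rank_bm25 and sentence-transformers packages and a toy corpus; the actual post uses a more involved setup.

```python
# Lexical search (BM25) to fetch candidates, then a cross-encoder to rerank.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Ray Serve lets you scale model deployments.",
    "Anyscale endpoints serve open-source LLMs.",
    "BM25 is a classic lexical ranking function.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search(query: str, candidates: int = 3, top_k: int = 2) -> list[str]:
    # Lexical stage: cheap BM25 scoring over the whole corpus.
    lexical = bm25.get_top_n(query.lower().split(), corpus, n=candidates)
    # Rerank stage: the cross-encoder scores each (query, doc) pair jointly.
    scores = reranker.predict([(query, doc) for doc in lexical])
    ranked = sorted(zip(scores, lexical), reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```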


Overview of a typical RAG workflow to get relevant responses from a query into an LLM to create a response. Source: Anyscale blog.

Blogs 📖

Decoding images from the brain with AI from Meta.

Researchers from Meta have successfully used signals from magnetoencephalography (MEG, a technique that measures the magnetic fields produced by brain activity) aligned with image encoder representations to generate images from brain waves.

Instead of using a text-to-image prompt, this is brain wave-to-image. Looks like the world of brain-computer interfaces over the next couple of years is going to be wild.


Movie showing the image shown to a person and then the output from an image generation model conditioned on brainwaves. Source: Meta AI blog.

Can LLMs learn from a single example? by Jeremy Howard from fast.ai.

The best way to fine-tune or use LLMs on custom data is still being figured out.

Jeremy Howard from fast.ai discovered a weird effect when trying to fine-tune an LLM on science exam questions... the loss would drop dramatically even with only one custom training sample.

This may mean that the majority of LLM knowledge is learned during pretraining and that the notion of fine-tuning may need to be completely rethought.

Bonus: Jeremy recently went on the Latent Space podcast to talk about his findings.

Keeping an eye on cattle with ML from AWS.

How do you monitor thousands of cattle on a large and growing dairy farm?

Ideally you want the cows to be as healthy as possible for:

    1. quality of life and
    2. saving costs (a sick or dead cow doesn't produce milk).

Current solutions involve days of work from multiple people counting and inspecting individual cows.

However, computer vision can potentially help. Using commodity hardware (cheap security cameras) and off-the-shelf models (YOLOv5), a team from AWS built a proof of concept to monitor cow health with computer vision.

The model measures a degree of "lameness" (when a cow is sick, it tends to bow its head and walk with smaller strides). Initial results show promising potential to expand the technology to similar use cases.
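
The post doesn't share the team's code, but as a sketch of the off-the-shelf starting point: a COCO-pretrained YOLOv5 model (loadable via torch.hub) can already detect the "cow" class, which a proof of concept like this could build on. The image filename is hypothetical.

```python
# Detect cows in a camera frame with an off-the-shelf, COCO-pretrained YOLOv5.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("cows_on_camera.jpg")   # hypothetical security camera frame
detections = results.pandas().xyxy[0]   # detections as a pandas DataFrame
cows = detections[detections["name"] == "cow"]

print(f"Detected {len(cows)} cows")
print(cows[["xmin", "ymin", "xmax", "ymax", "confidence"]])
```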

Is this a date? Using ML to recognize dates in file names from Dropbox.

How do you organize billions of files? It turns out dates are important.

I use dates in my own filenames. For many projects, I name the file with the date in reverse order (YYYY-MM-DD). I find it helpful to sort files in their chronological order.

Using a Transformer model and a custom labeled dataset, Dropbox created a model to help rename files to a consistent date format. Since releasing the model in August 2022, they've seen a 40% increase in renamed files.

The blog post is an excellent read on how to build a seemingly small feature that ends up used by millions.
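
Why use a Transformer at all? A quick regex baseline (sketched below, patterns illustrative) shows where rules start to break down: dates come in many formats, and some are inherently ambiguous.

```python
# A naive regex baseline for finding dates in filenames. Real filenames use
# many more formats than these two patterns, which is why Dropbox trained a model.
import re

DATE_PATTERNS = [
    re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)"),      # 2023-10-31 (YYYY-MM-DD)
    re.compile(r"(?<!\d)(\d{2})[._-](\d{2})[._-](\d{4})(?!\d)"),  # 04.05.2023
]

def find_dates(filename: str) -> list[str]:
    matches = []
    for pattern in DATE_PATTERNS:
        matches += [m.group(0) for m in pattern.finditer(filename)]
    return matches

print(find_dates("meeting_notes_2023-10-31_final.txt"))  # ['2023-10-31']
# Ambiguity a regex can't resolve: is 04.05.2023 April 5th or May 4th?
print(find_dates("invoice_04.05.2023.pdf"))              # ['04.05.2023']
```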

Unbundling AI by Benedict Evans.

Many of the largest startups and companies of the past 20 years have come from unbundling Craigslist. As in, taking an existing category on the Craigslist website and turning it into a company.

The reason?

Because Craigslist covered everything but it wasn't really anything.

Benedict Evans argues a similar case for AI, and in particular ChatGPT. Because ChatGPT covers everything, does this mean there's now room for more specialized companies built for specific AI use cases?

What is and why tokenization in ML? by Alex Strick van Linschoten.

LLMs work with natural language data. But behind the scenes, the computer still needs a way to convert that data into numbers. That's what a tokenizer does. It converts raw text into numerical form so that it can be used with a machine learning model.

In this blog post, Alex Strick van Linschoten goes through several different levels of tokenization methods and why they're required for machine learning systems.

As a bonus, Alex created a follow-up post with several examples of different tokenizers from FastAI, Hugging Face and more.


Different levels of tokenization (turning words into numbers). Source: Alex Strick van Linschoten blog.
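
To see subword tokenization in action, here's a small example with a Hugging Face tokenizer (assuming the transformers package and the bert-base-uncased checkpoint):

```python
# A quick look at subword tokenization: words get split into smaller pieces,
# each mapped to an integer ID the model consumes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization turns words into numbers!"
tokens = tokenizer.tokenize(text)
print(tokens)
# e.g. ['token', '##ization', 'turns', 'words', 'into', 'numbers', '!']

ids = tokenizer.encode(text)
print(ids)  # the integer IDs the model actually sees ([CLS]/[SEP] included)
```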

Multimodality and Large Multimodal Models (LMMs) by Chip Huyen.

What happens when you give an LLM access to data sources other than language (e.g. audio, images, tables + more)?

It becomes a Large Multimodal Model (LMM)!

This is the kind of model GPT-4V is. Because it can deal with text and images, it's considered multimodal. Multimodal stands for "multiple modalities", as in, multiple sources of data.

In this comprehensive post, Chip Huyen goes through the why of multimodality (life is multimodal, not just text), how to train multimodal models (mixing text and other sources) and research directions for LMMs (figuring out the best way to incorporate different sources of data).

Is AGI a philosophy or computer science problem? by David Deutsch.

A fascinating essay from 2012 (I read it as if it had been written last week and was shocked to realize it's over 10 years old) about whether Artificial General Intelligence (AGI) is a computer science, physics or philosophy problem.

Deutsch argues that humans' unique capability over AI is our ability to generate new explanations. And current AI systems are only capable of generating explanations contained within their training set.

As a follow on to this essay, a more recent episode with Deutsch speaking about AGI is available on the Reason is Fun podcast.

Open-Source Releases 🤗

  • Adept AI releases Fuyu-8B, a powerful multimodal model with a simplified architecture. Multimodal means it can take in text and images and produce text outputs. Blog post and demo on Hugging Face.


Example input and output of Fuyu multimodal model from Adept AI. The model takes in an image and is able to read the text in the image despite never being explicitly trained to do so. Source: Adept AI blog.

  • TorchVision 0.16.0 is out with improved transforms and augmentations under the torchvision.transforms.v2 namespace, as well as improved MPS (Metal Performance Shaders, in other words PyTorch on Mac) operations.
  • Gradio, the library that makes building ML demos easy, gets two new upgrades: Gradio 4.0 (a bunch of performance improvements and new features) and Gradio Lite (run machine learning demos right in the browser, no server required!). Gradio 4.0 release video and Gradio Lite release notes.
  • One of my favourite multimodal model libraries, OpenCLIP, had a huge round of updates in the last month, including new models such as SigLIP (best-performing zero-shot model on ImageNet) and MetaCLIP (Meta's improved reproduction of OpenAI's CLIP model), as well as two new results tables to see which models offer the best results at different scales and compute budgets. Get the code on GitHub.
  • OwLv2 (open-vocabulary object detection), a model that performs state-of-the-art zero-shot object detection, is now available in Hugging Face Transformers (a minimal code sketch follows the photo below). The models are also available on Google's Hugging Face page and you can find an example workflow on the Transformers-Tutorials GitHub.


Example of OwLv2 outputs with the prompt ["fries", "cheeseburger", "tomato sauce"].

The model is able to successfully identify almost all instances of the given classes. This is a game changer for automatic labelling!

Source: That's a photo of my lunch the other day 😄.
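
Here's a minimal sketch of how you might reproduce this with OwLv2 in Hugging Face Transformers, using the same text prompts (the image filename is hypothetical):

```python
# Zero-shot object detection with OwLv2: the classes are free-text prompts,
# no training required.
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("lunch.jpg")  # hypothetical filename
texts = [["fries", "cheeseburger", "tomato sauce"]]
inputs = processor(text=texts, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes/scores/labels at a confidence threshold.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{texts[0][int(label)]}: {score:.2f} at {box.tolist()}")
```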

Papers 📄

  • SoM or Set-of-Mark Prompting is a technique for overlaying visual prompts on an image to dramatically improve the abilities of LMMs (Large Multimodal Models) such as GPT-4V. Project page, GitHub, paper.
  • Measuring heart rate with noise-cancelling headphones? A team from Google Research released a paper called APG: Audioplethysmography for Cardiac Monitoring in Hearables. The paper shows how a pair of headphones can send a small soundwave into the ear and then measure tiny ear canal displacements from the echoes that come back. Processed through various transformations, these echoes can reveal metrics such as heart rate and heart rate variability. APG achieved consistently accurate heart rate (3.21% median error across 153 participants in all activity scenarios) and heart rate variability (2.70% median error in the inter-beat interval) measurements. It would be awesome to see this integrated into a future product! Read more in the blog post on Google Research.
  • Kosmos 2.5 is a Large Multimodal Model by Microsoft (they call it a Multimodal Literate Model) that excels at generating spatially-aware text as well as producing structured output that captures a document's styling and translates it into markdown. It does this thanks to pretraining on millions of text-based documents.
  • Demystifying CLIP Data is a paper by Meta that aims to reverse engineer the data used to train the original OpenAI CLIP model (in the original CLIP paper, the OpenAI authors never disclosed the exact details on how they acquired the data). The new model, named MetaCLIP (Metadata-Curated Language-Image Pretraining) successfully managed to outperform the original CLIP model given the same training setup and number of images. This means that the Meta team were able to successfully reproduce a similar (but better) quality dataset of 400M image-text pairs. One of my favourite findings from the paper was:

"We observe that data quality is much more important than quantity (different from existing open source efforts or ALIGN that mostly scale quantity)."

Get the code on GitHub.

Guides 🗺

  • spacy-llm is a package from the spaCy team to integrate LLMs into structured NLP pipelines, combining the generality of LLMs with the speed and robustness of spaCy.
  • KeyLLM: extract keywords with LLMs by Maarten Grootendorst. An excellent walkthrough of how to extract keywords from a document with an LLM in a few lines of code (a rough sketch follows this list).
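
As a taste of the KeyLLM workflow, here's a rough sketch with an OpenAI backend. Exact imports and setup vary between keybert and openai versions, so treat this as an outline and see the walkthrough for the details.

```python
# Keyword extraction with KeyLLM from the keybert package, roughly following
# the walkthrough. The API key is a hypothetical placeholder.
import openai
from keybert import KeyLLM
from keybert.llm import OpenAI

client = openai.OpenAI(api_key="sk-...")  # hypothetical API key
kw_model = KeyLLM(OpenAI(client))

documents = [
    "The website mentions that it only takes a couple of days to deliver.",
    "Machine learning models can extract keywords from raw documents.",
]
keywords = kw_model.extract_keywords(documents)
print(keywords)  # one list of keywords per document
```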

Quick News 📰

  • The State of AI 2023 report is out, with a focus on LLMs but also plenty more information from the non-LLM world of AI.
  • Reka announces multimodal AI assistant, Yasa-1.
  • Descript (a product for transcribing voice into text and then making it editable) announces a bunch of new features including better AI voice cloning, chapter generation (similar to chapters on YouTube), summarization and more.

Videos 📺

  • AI learns to play Pokémon with Reinforcement Learning: This video combined two of my favourite things in the world, AI and Pokémon. A wave of nostalgia and inspiration washed over me as I watched it. What an excellent debut video! Well done Peter!
  • Making Chat (ro)Bots: What happens when you connect a robot dog to an LLM? You get an epic tour guide! The team from Boston Dynamics connected their robot dog to an LLM and made it listen and speak.

See you next month!

What a massive month for the ML world in October!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

More from Zero To Mastery

Data Engineer vs Data Analyst vs Data Scientist - Which Is Best for Me?

Data is HOT right now. Great salaries, 1,000s of job opportunities, exciting + high-impact work. But what are the differences and which role is best for you?

The No BS Way To Getting A Machine Learning Job

Looking to get hired in Machine Learning? Our ML expert tells you how. If you follow his 5 steps, we guarantee you'll land a Machine Learning job. No BS.

Python Monthly Newsletter 💻🐍

47th issue of Andrei Neagoie's must-read monthly Python Newsletter: Python 3.12 is Out, CLI Tools With Python, Career Advice, and much more. Read the full newsletter to get up-to-date with everything you need to know from last month.