54th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I'm an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, I've taken the utmost care to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
It's typically a 500-ish word post (though often much longer) detailing some of the most interesting things in machine learning I've found in the last month.
My brother Josh and I are working on a startup called Nutrify (an app which uses computer vision to help people learn about foods). My brother does the iOS development and I build the machine learning models.
The other day we decided to film a "day in the life" style video of what a typical day at Nutrify looks like from a machine learning engineer's point of view.
S&P Global provides market insights on commodities trading markets.
To do this, they must extract information from trading calls such as:
Americas Crude oil: TUPI: May delivery heard offered July ICE Brent +$2.50/b, CIF Qindao
From a small piece of text like this, they use NLP models to extract markets, grades, pricing, timing, locations and more.
All of this has to happen in real-time (because trading markets move so fast) meaning that prediction time needs to be under 15ms.
With this kind of latency requirement, it means generative AI models are off the table (too slow).
So what did they do?
They used spaCy (a powerful open-source NLP library) to train supervised models as well as craft rule-based pipelines.
And they used Prodigy (an easy-to-use and customizable labelling tool) to create very specific supervised data, which they used to train intermediate models to help with further labelling.
The result?
Several hyper-specific models capable of real-time inference with a small footprint (~6MB per model). The small footprint means they can also be deployed in a wide range of regions for even faster inference.
How S&P Global created a human-in-the-loop data labelling pipeline to label ~100 examples, train a model, and then use that model to label future examples. Source: Explosion AI blog.
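To make the rule-based side of this concrete, here's a toy sketch of the kind of pattern-based extraction spaCy's Matcher/EntityRuler enables, written with plain regex so it stands alone. The field names and patterns are illustrative, not S&P Global's actual pipeline.

```python
import re

# One of the trading calls quoted above.
CALL = "Americas Crude oil: TUPI: May delivery heard offered July ICE Brent +$2.50/b, CIF Qindao"

# Hypothetical rule set: one pattern per field we want to pull out.
PATTERNS = {
    "grade": r"\b(TUPI)\b",
    "timing": r"\b(May|June|July) delivery\b",
    "price": r"\+\$\d+(?:\.\d+)?/b",
    "location": r"\bCIF \w+\b",
}

def extract(text: str) -> dict:
    """Apply each rule and keep the first match per field."""
    out = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            out[field] = match.group(0)
    return out

print(extract(CALL))
# {'grade': 'TUPI', 'timing': 'May delivery', 'price': '+$2.50/b', 'location': 'CIF Qindao'}
```

In practice, rules like these would be combined with trained statistical models (as the S&P Global team did) so that the pipeline handles both the predictable and the messy parts of the text.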
In last month's ML Monthly (May 2024), I shared an excellent article with practical tips for building with LLMs.
Well over the past month, the authors of the original article put out two more to complete the series (as well as a single website collecting them all).
One of my favourite quotes from Part 2 (operational) was:
Look at samples of LLM inputs and outputs every day: "Genchi Genbutsu, real things, real places".
Here, "Genchi Genbutsu" is a Japanese term often translated as "go and see for yourself".
In other words, if you're designing a machine learning system with probabilistic outputs, one of the best ways to test it is to look at actual examples of inputs and outputs every day.
This is especially important for LLMs considering how wide the output space can be.
Part 3 answers a great series of questions:
Questions to ask when building/designing with LLMs (and other kinds of ML models). Source: What We Learned from a Year of Building with LLMs (Part 3).
If you're building with LLMs (or any kind of machine learning), I'd highly recommend reading the series end-to-end at least twice.
It's well worth it.
Resources:
An excellent write-up on how Generative AI tools make it easier to create code, yet writing code is often one of the easiest parts of software engineering.
The real value comes in creating the system of software as a whole rather than just adding more lines of code.
I also really align with the quote on software being an apprenticeship industry.
The only real way to learn something is to do it, and do it wrong, over and over, iteratively improving over time until you start to get an idea of how to do it right.
But then again…
Due to the nature of the field, the right way today may be superseded in the future. And so the learning by doing continues.
Software is an apprenticeship industry; the best way to learn is by doing, doing and more doing. Or "experiment, experiment, experiment!". Source: Generative AI Is Not Going To Build Your Engineering Team For You by Charity Majors.
Google's new open-source LLM, Gemma 2, is now available in 4 variants (2 base models & 2 fine-tuned ones).
For now, there's a 9B parameter version and a 27B parameter version, each with best-in-class performance for its size, or performance comparable to much larger models (e.g. the 27B parameter version is comparable to the 70B parameter Llama 3).
Parameter counts like these mean the models can be deployed on hardware such as consumer GPUs (an NVIDIA RTX 4090 with 24GB of memory can run the ~18GB 9B parameter model) as well as single enterprise GPUs (an NVIDIA H100 with 80GB of memory can run the ~56GB 27B parameter model).
There's a smaller 2.6B parameter model coming soon too.
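As a quick back-of-envelope check on those memory figures: at 16-bit precision each parameter takes 2 bytes, so billions of parameters roughly doubles into gigabytes. This sketch ignores activations, KV cache and framework overhead, so real requirements are a bit higher (which is why the quoted 27B figure of ~56GB is slightly above the raw 54GB of weights).

```python
def weight_memory_gb(n_params_billions: float, bytes_per_param: int = 2) -> float:
    """Rough weight-only memory footprint in GB at 16-bit precision.

    Ignores activations, KV cache and framework overhead.
    """
    return n_params_billions * bytes_per_param

print(weight_memory_gb(9))    # 18.0 -> fits on a 24GB RTX 4090
print(weight_memory_gb(27))   # 54.0 -> fits on an 80GB H100
print(weight_memory_gb(2.6))  # 5.2  -> the upcoming small model
```

The same arithmetic also shows why 4-bit quantization (0.5 bytes per parameter) is so popular: it shrinks the 27B model to roughly 13.5GB, within reach of a single consumer GPU.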
Hugging Face and Google both have great write-ups with links to fine-tuning examples for custom use cases.
Resources:
Apple hosted their annual WWDC keynote during June and introduced Apple Intelligence.
The benefit of Apple Intelligence is that models run mostly on-device.
Running on-device means they can incorporate information you choose to let them use whilst remaining private.
Apple reports that Apple Intelligence features are largely powered by an on-device LLM with a size of ~3 billion parameters. This on-device LLM is equipped with a series of adapter modules which are specialized for certain tasks (e.g. summarization, proofreading, mail replies, tone adjustment and more).
And larger server-based models running on dedicated Apple Silicon servers are available when necessary.
Apple is able to achieve incredible results on-device by taking a large foundation model and equipping it with several adaptor modules which are specialized for certain tasks. Source: Apple ML blog.
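The adapter idea can be sketched in miniature: keep one frozen base weight matrix shared by every task, and add a small low-rank, task-specific update (LoRA-style) on top. The dimensions and values below are toy examples for illustration, not Apple's actual configuration.

```python
# One frozen base weight, plus a small low-rank task-specific "delta"
# that gets swapped in per task (summarization, proofreading, etc.).

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def adapted_weight(W, A, B):
    """Effective weight = frozen base W + low-rank update (A @ B)."""
    delta = matmul(A, B)
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

# Toy 2x2 base weight shared by every task...
W = [[1.0, 0.0], [0.0, 1.0]]
# ...and a rank-1 adapter (2x1 @ 1x2) specialised for one task.
A = [[0.5], [0.5]]
B = [[1.0, 1.0]]
print(adapted_weight(W, A, B))  # [[1.5, 0.5], [0.5, 1.5]]
```

Because only the small A and B matrices differ per task, many specializations can share one base model on-device, which is exactly what makes the ~3B parameter approach practical.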
There's also an incredibly cool feature in the new iPadOS 18 Calculator/Notes app: Math Notes.
You can now write mathematical expressions as you would on pen and paper and have the calculator app solve them inline.
Apple's new Math Notes feature in iPadOS 18 lets you write and solve mathematical expressions with Apple Pencil as you would with pen and paper. There must be several models working under the hood here, including handwriting recognition, handwriting generation (the numbers produced match your handwriting), variable recognition and more.
What a powerful homework helper tool to get the best of both worlds. Writing by hand and calculating by machine.
There are also a series of machine learning related developer videos I found interesting. If you're looking to develop ML-powered products for Apple devices, I'd highly recommend checking them out too.
Resources:
Microsoft have published a series of open-source computer vision models under the name Florence-2.
These computer vision models are capable of many tasks such as image captioning, object detection, object detection with grounding, region proposal, segmentation and more.
Florence-2 comes in two size variants: base (0.2B parameters) and large (0.7B parameters). At these sizes, the models can be deployed on relatively small devices compared to some other, larger foundation models out there.
To build the models, Microsoft researchers combined a DaViT vision encoder with a BERT text encoder. Outputs are created in an encoder-decoder style, with the decoder generating output text and location tokens based on the inputs.
The most impressive thing to me was the data engine used to create the dataset.
A pipeline was constructed to go from a large corpus of images (126M) → label a small corpus with specialist models → iterative data filtering and enhancement → final dataset for large-scale model training.
The final dataset results in 5B annotations (500M text annotations, 1.3B region-text annotations and 3.6B text-phrase-region annotations).
Florence-2 data engine scheme. Starting with a large corpus of images followed by several steps of iterative data labelling and refinement. Source: Florence-2 paper.
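The data-engine loop above can be sketched schematically. Every function here is a hypothetical stand-in (for the specialist labellers, the filters, and the retrained model), not Microsoft's actual pipeline.

```python
def label_with_specialists(image):
    """Stand-in for off-the-shelf specialist models producing annotations."""
    return {"image": image, "text": f"a photo of {image}"}

def is_high_quality(annotation):
    """Stand-in for the filtering / enhancement step: drop empty labels."""
    return bool(annotation["text"].strip())

def data_engine(images, rounds=2):
    """Label -> filter -> (retrain) -> relabel, repeated for a few rounds."""
    annotations = [label_with_specialists(img) for img in images]
    for _ in range(rounds):
        annotations = [a for a in annotations if is_high_quality(a)]
        # In the real pipeline, a model trained on `annotations` would
        # re-annotate the corpus here, improving label quality each round.
    return annotations

print(len(data_engine(["cat.jpg", "dog.jpg"])))  # 2
```

The key design choice is that labels get cheaper and better each round: expensive specialist models bootstrap the first pass, and the model trained on the filtered output takes over from there.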
There are several resources available online for learning about Florence-2, including a demo as well as a fine-tuning example notebook.
Example of Florence-2 inputs and outputs. Florence-2 was used to caption the image and then the caption was used to draw boxes around the image (caption to phrase grounding). This would be a very helpful workflow for bootstrapping an object detection dataset. Source: Florence-2 Demo on Hugging Face.
Resources:
Anthropic's flagship model, Claude 3.5 Sonnet, is live!
And it surpasses their previous flagship model, Claude 3 Opus, whilst being 80% cheaper ($3 vs $15 per million input tokens, and $15 vs $75 per million output tokens).
It also operates at 2x the speed of Claude 3 Opus.
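The pricing difference compounds quickly at scale. A quick worked example using the per-million-token prices above (the workload numbers are made up for illustration):

```python
def cost_usd(input_tokens, output_tokens, in_per_m, out_per_m):
    """Cost of one workload given per-million-token prices in USD."""
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# Same hypothetical 100k-in / 20k-out workload on each model:
sonnet = cost_usd(100_000, 20_000, 3, 15)
opus = cost_usd(100_000, 20_000, 15, 75)
print(round(sonnet, 2), round(opus, 2))  # 0.6 3.0 -> Sonnet is 80% cheaper
```
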
Finally, it also largely outperforms other similar models, such as GPT-4o and Gemini 1.5 Pro, on various academic benchmarks.
However, it's always important to test these models on your own benchmarks.
Large scale datasets are paramount when training Large Language Models (LLMs).
But for many LLMs, people often ask: where did the data come from?
The FineWeb paper and datasets seek to answer that.
For LLM datasets, the two main options are:
The FineWeb dataset takes the first approach, trying to replicate the creation of a high-quality dataset in a public and open manner.
Their paper reveals many of the important details they considered when creating the dataset as well as all of the ablation studies they performed to test it.
One of my favourite pieces of data filtering was using an open-source LLM (Llama-3-70B-Instruct) to annotate 500k samples for quality.
They then used these samples to create a classifier which could classify an article based on its quality level.
It turns out this approach worked quite well, with the documents classified as higher quality (e.g. educational in nature) producing better end results.
This shows how classifiers trained on LLM-created annotations can be used effectively for data filtering.
How Hugging Face used an LLM (Llama3-70B) to create 500k annotations on the quality of different articles. 0 being poor and 5 being high quality. They then used these annotations to successfully create a classifier to classify future documents at scale. This shows the effectiveness of using an LLM to create annotations. Source: Hugging Face FineWeb paper.
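Here's a toy sketch of that filtering recipe: LLM-assigned quality scores (0-5) on a small sample become labels, a cheap classifier is fit on them, and that classifier then filters documents at scale. The "classifier" below is a trivial keyword scorer with made-up data, not FineWeb's actual model.

```python
# Hypothetical education-flavoured vocabulary for the cheap feature.
EDU_WORDS = {"learn", "theorem", "lesson", "study", "explain", "course"}

def keyword_score(text):
    """Cheap feature: count of education-flavoured words in the text."""
    return sum(1 for w in text.lower().split() if w.strip(".,!?") in EDU_WORDS)

def fit_cutoff(samples, min_llm_score=3):
    """samples: list of (text, llm_score 0-5) pairs, as labelled by the LLM.
    Place the keep/drop cutoff midway between the mean feature value of
    high- and low-scored texts."""
    hi = [keyword_score(t) for t, s in samples if s >= min_llm_score]
    lo = [keyword_score(t) for t, s in samples if s < min_llm_score]
    return (sum(hi) / len(hi) + sum(lo) / len(lo)) / 2

# Tiny made-up "LLM-annotated" sample (FineWeb used 500k real ones).
samples = [
    ("A lesson to explain the theorem step by step.", 5),
    ("We study and learn from this course material.", 4),
    ("Buy now!!! Limited offer!!!", 0),
    ("Random chatter about nothing much.", 1),
]
cutoff = fit_cutoff(samples)
print(keyword_score("Study this lesson to learn more.") > cutoff)  # True
```

The real classifier is of course far richer, but the shape is the same: the LLM annotates once, the cheap model then filters billions of documents for free.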
Depth Anything V2 data engine which starts with purely synthetic data labelled with a high-quality teacher. The teacher model then labels a large real-world image dataset with high-quality pseudo labels. These real-world images and labels are then used to train smaller student models. Source: Depth Anything V2 project page.
Google releases two papers which highlight the potential for LLM use in personal healthcare. Their Personal Health LLMs (PH-LLMs) score on par with or better than experts on sleep and fitness regimes. See the blog post as well as Towards a Personal Health Large Language Model and Transforming Wearable Data Into Health Insights using Large Language Models for more.
Webpage → Markdown text: Jina AI releases their Reader API, which allows you to get an LLM-friendly input from a URL or web search by adding r.jina.ai to the front of it.
Cohere Toolkit is an open-source collection of pre-built components to build your own custom RAG applications.
Tiny Corp prepare to ship their first computer! An AI supercomputer in a box! It comes in two versions, red and green (the red one has AMD GPUs and the green one has NVIDIA GPUs). A highly inspiring story from the company looking to commoditize the petaflop.
What a massive month for the ML world in June!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.