AI & Machine Learning Monthly Newsletter 💻🤖

Daniel Bourke

54th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I'm an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:

I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

Here's what you might have missed in June 2024 as an A.I. & Machine Learning Engineer... let's get you caught up!

My Work 👇

[Video] A Day in the Life of a Machine Learning Engineer

My brother Josh and I are working on a startup called Nutrify (an app which uses computer vision to help people learn about foods). My brother does the iOS development and I build the machine learning models.

The other day we decided to film a "day in the life" style video of what a typical day at Nutrify looks like from a machine learning engineer's point of view.

From the Internet 🌐

1. NLP at 15,000 words per second with up to 99% accuracy and model artifacts of 6MB (case study)

S&P Global provides market insights on commodities trading markets.

To do this, they must extract information from trading calls such as:

Americas Crude oil: TUPI: May delivery heard offered July ICE Brent +$2.50/b, CIF Qindao

Inside this small piece of text, they use NLP models to extract markets, grades, pricing, timing, location and more.

All of this has to happen in real-time (because trading markets move so fast) meaning that prediction time needs to be under 15ms.

A latency requirement like this means generative AI models are off the table (too slow).

So what did they do?

They used spaCy (a powerful open-source NLP library) to train supervised models as well as craft rule-based pipelines.

And they used Prodigy (a labelling tool that's very easy to use and customize) to create very specific supervised data, which they used to train intermediate models to help with further labelling.

The result?

Several hyper-specific models capable of real-time inference with a small footprint (~6MB per model). The small footprint means they can also be deployed in a wide range of regions for even faster inference.
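For a feel of what the rule-based side of a pipeline like this can look like, here's a minimal spaCy sketch. The labels and patterns are made up for illustration (they're not S&P Global's actual schema):

```python
import spacy

# Start from a blank English pipeline and add a rule-based entity ruler.
# (S&P Global combined rules like these with trained statistical NER components.)
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical patterns for a commodities trading call
ruler.add_patterns([
    {"label": "GRADE", "pattern": "TUPI"},
    {"label": "TIMING", "pattern": [{"LOWER": "may"}, {"LOWER": "delivery"}]},
    {"label": "BENCHMARK", "pattern": [{"LOWER": "ice"}, {"LOWER": "brent"}]},
    {"label": "LOCATION", "pattern": [{"LOWER": "cif"}, {"IS_TITLE": True}]},
])

doc = nlp("Americas Crude oil: TUPI: May delivery heard offered July ICE Brent +$2.50/b, CIF Qindao")

# Print each extracted entity and its label
for ent in doc.ents:
    print(ent.label_, "->", ent.text)
```

Rule-based components like this run in microseconds per document, which is how latency budgets of a few milliseconds stay realistic.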


How S&P Global created a human-in-the-loop data labelling pipeline to label ~100 examples, train a model, and then use that model to label future examples. Source: Explosion AI blog.

2. What We Learned from a Year of Building with LLMs Part 2 & Part 3

In last month's ML Monthly (May 2024), I shared an excellent article with practical tips for building with LLMs.

Well over the past month, the authors of the original article put out two more to complete the series (as well as a single website collecting them all).

One of my favourite quotes from Part 2 (operational) was:

Look at samples of LLM inputs and outputs every day - "Genchi Genbutsu, real things, real places".

Where "Genchi Genbutsu" is a Japanese term often translated as "go and see for yourself".

In other words, if you're designing a machine learning system with probabilistic outputs, one of the best ways to test it is to look at actual examples of inputs and outputs every day.

This is especially important for LLMs considering how wide the output space can be.
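One lightweight way to put this into practice: sample a handful of logged calls each day and read them. A minimal sketch, assuming your app writes one JSON record per LLM call to a (hypothetical) llm_calls.jsonl file with "input" and "output" fields:

```python
import json
import random

# Hypothetical log file: one JSON object per line with "input" and "output" fields
with open("llm_calls.jsonl") as f:
    records = [json.loads(line) for line in f]

# Pull a small random sample to read by hand each day
for record in random.sample(records, k=min(10, len(records))):
    print("INPUT: ", record["input"])
    print("OUTPUT:", record["output"])
    print("-" * 80)
```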

Part 3 answers a great series of questions:


Questions to ask when building/designing with LLMs (and other kinds of ML models). Source: What We Learned from a Year of Building with LLMs (Part 3).

If you're building with LLMs (or any kind of machine learning), I'd highly recommend reading the series end-to-end at least twice.

It's well worth it.

Resources:

3. Generative AI Is Not Going To Build Your Engineering Team For You by Charity Majors

An excellent write-up on how generative AI tools make it easier to create code, but writing code is often one of the easiest parts of software engineering.

The real value comes in creating the system of software as a whole rather than just adding more lines of code.

I also really align with the quote on software being an apprenticeship industry.

Where the only real way to learn something is to do it and do it wrong, over and over, iteratively improving over time before you start to get an idea of how to do it right.

But then again…

Due to the nature of the field, the right way today may be superseded in the future. And so the learning by doing continues.


Software is an apprenticeship industry, the best way to learn is by doing, doing and more doing. Or "experiment, experiment, experiment!". Source: Generative AI Is Not Going To Build Your Engineering Team For You by Charity Majors.

4. Google Introduce Gemma 2 (a powerful open-source LLM)

Google's new open-source LLM, Gemma 2, is now available in 4 variants (2 base models & 2 fine-tuned ones).

For now, there's a 9B parameter version and a 27B parameter version, each with best-in-class performance for its size or performance comparable to much larger models (e.g. the 27B parameter version is comparable to the 70B Llama 3).

Parameter counts like these mean the models can be deployed on hardware such as consumer GPUs (an NVIDIA RTX 4090 with 24GB of memory can run the ~18GB 9B parameter model) as well as single enterprise GPUs (an NVIDIA H100 with 80GB of memory can run the ~56GB 27B parameter model).
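As a rough sketch, running the 9B instruction-tuned variant locally with Hugging Face Transformers looks something like the following (the google/gemma-2-9b-it checkpoint id is my assumption of the Hub name, and you'll need to have accepted the model licence on the Hub):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"  # assumed Hub id for the 9B instruction-tuned model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~18GB in bf16, fits on a 24GB consumer GPU
    device_map="auto",
)

inputs = tokenizer("What foods are high in vitamin C?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```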

There's a smaller 2.6B model coming soon too.

Hugging Face and Google both have great write-ups with links to fine-tuning examples for custom use cases.

Resources:

5. Apple Announces Apple Intelligence at WWDC (as well as many more ML features)

Apple hosted their annual WWDC keynote during June and introduced Apple Intelligence.

The benefit of Apple Intelligence is that models run mostly on-device.

Running on-device means they can incorporate information you choose to let them use whilst remaining private.

Apple reports that Apple Intelligence features are largely powered by an on-device LLM with a size of ~3 billion parameters. This on-device LLM is equipped with a series of adapter modules which are specialized for certain tasks (e.g. summarization, proofreading, mail replies, tone adjustment and more).

And larger server-based models running on dedicated Apple Silicon servers are available when necessary.


Apple is able to achieve incredible results on-device by taking a large foundation model and equipping it with several adaptor modules which are specialized for certain tasks. Source: Apple ML blog.
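Apple hasn't published the adapter code itself, but the idea is similar to LoRA-style adapters: keep the base model frozen and train a small low-rank update per task, so switching tasks only means swapping a few megabytes of weights. A minimal PyTorch sketch of the concept (not Apple's implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank adapter."""
    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base model weights stay frozen
        # Low-rank update: only these small matrices are trained per task
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))

# Swapping the task adapter (e.g. "summarization" vs "proofreading") means swapping
# only the small lora_a/lora_b weights, not the multi-gigabyte base model.
layer = LoRALinear(nn.Linear(4096, 4096), rank=16)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad), "trainable parameters")
```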

There's also an incredibly cool feature in the new iPadOS 18 Calculator/Notes App: Math Notes.

You can now write mathematical expressions as you would with pen and paper and have the Calculator app solve them inline.


Apple's new Math Notes feature in iPadOS 18 lets you write and solve mathematical expressions with Apple Pencil as you would with pen and paper. There must be several models working under the hood here including handwriting recognition, handwriting generation (the numbers produced match your handwriting), variable recognition and more.

What a powerful homework helper tool to get the best of both worlds: writing by hand and calculating by machine.

There's also a series of developer videos related to machine learning that I found interesting. If you're looking to develop ML-powered products for Apple devices, I'd highly recommend checking them out too.

Resources:

6. Microsoft Release Florence-2, a (relatively lightweight) open-source foundation vision model

Microsoft have published a series of open-source computer vision models under the name Florence-2.

These computer vision models are capable of many tasks such as image captioning, object detection, object detection with grounding, region proposal, segmentation and more.

Florence-2 comes in two size variants: base (0.2B parameters) and large (0.7B parameters). At these sizes, the models can be deployed on relatively small devices compared to some other, larger foundation models out there.

To build the models, Microsoft researchers combined a DaViT vision encoder with a BERT text encoder. Outputs are created in encoder-decoder style, with the decoder generating output text and location tokens based on the inputs.

The most impressive thing to me was the data engine used to create the dataset.

A pipeline was constructed to go from a large corpus of images (126M) → label a small corpus with specialist models → iterative data filtering and enhancement → final dataset for large-scale model training.

The final dataset contains ~5.4B annotations (500M text annotations, 1.3B region-text annotations and 3.6B text-phrase-region annotations).


Florence-2 data engine scheme. Starting with a large corpus of images followed by several steps of iterative data labelling and refinement. Source: Florence-2 paper.

There are several resources available online for learning about Florence-2, including a demo as well as a fine-tuning example notebook.


Example of Florence-2 inputs and outputs. Florence-2 was used to caption the image and then the caption was used to draw boxes around the image (caption to phrase grounding). This would be a very helpful workflow for bootstrapping an object detection dataset. Source: Florence-2 Demo on Hugging Face.
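For a feel of the workflow, here's a sketch of running Florence-2 for object detection with Transformers, loosely based on the pattern in the Hugging Face demo and model card. Treat the checkpoint id, task prompt strings and post_process_generation helper (which lives in the model's remote code) as assumptions to verify against the model card:

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # assumed Hub id; a large variant also exists
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Placeholder image URL, swap in your own image
image = Image.open(requests.get("https://example.com/laptops.jpg", stream=True).raw)
task = "<OD>"  # object detection; other tasks use prompts such as "<CAPTION>"

inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# post_process_generation comes from the model's remote code (see the model card)
result = processor.post_process_generation(generated_text, task=task, image_size=image.size)
print(result)  # e.g. {"<OD>": {"bboxes": [...], "labels": [...]}}
```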

Resources:

7. Anthropic Releases Claude 3.5 Sonnet, 80% cheaper than before and better than GPT-4o

Anthropic's flagship model, Claude 3.5 Sonnet, is live!

And it surpasses their previous flagship model, Claude 3 Opus, whilst being 80% cheaper ($3 vs $15 per million input tokens and $15 vs $75 per million output tokens).

It also operates at 2x the speed of Claude 3 Opus.

Finally, it also largely outperforms other similar models such as GPT-4o and Gemini 1.5 Pro on various academic benchmarks.

However, it's always important to test these models on your own benchmarks.
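If you want to run your own spot checks, the Anthropic Python SDK keeps it to a few lines (this assumes an ANTHROPIC_API_KEY environment variable is set, and the dated model string may change over time):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Summarise the trade-offs of on-device vs server-side LLMs."}
    ],
)
print(message.content[0].text)
```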

8. Hugging Face release FineWeb Dataset + Paper for better LLM training

Large-scale datasets are paramount when training Large Language Models (LLMs).

But for many LLMs, people often ask, where did the data come from?

The FineWeb paper and datasets seek to answer that.

For LLM datasets, the two main options are:

  1. A filtered version of Common Crawl (a public collection of web snapshots taken every few months).
  2. Custom web crawlers which read various web pages (such as OpenAI's bot and Anthropic's bot).

The FineWeb dataset takes the first approach, aiming to replicate the creation of a high-quality dataset in a public and open manner.

Their paper reveals many of the important details they considered when creating the dataset as well as all of the ablation studies they performed to test it.

One of my favourite pieces of data filtering was using an open-source LLM (Llama-3-70B-Instruct) to annotate 500k samples for quality.

They then used these samples to create a classifier which could classify an article based on its quality level.

It turns out this approach worked quite well, with documents classified as higher quality (e.g. educative in nature) producing better end results.

This shows how classifiers trained on LLM-created annotations can be used effectively for data filtering.
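As a simplified sketch of the idea (not FineWeb's exact setup, which the paper details), you can treat the LLM-assigned quality scores as labels and train a cheap classifier that then scores documents at web scale:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical LLM-annotated samples: (document text, quality score from 0 to 5)
texts = [
    "An introduction to photosynthesis for middle school students...",
    "BUY CHEAP WATCHES NOW!!! LIMITED OFFER!!!",
]
scores = [5, 0]

# Treat "educational quality >= 3" as the positive class
labels = [int(score >= 3) for score in scores]

# A cheap classifier that can then score billions of documents
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["A step-by-step derivation of the quadratic formula."]))
```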


How Hugging Face used an LLM (Llama3-70B) to create 500k annotations on the quality of different articles. 0 being poor and 5 being high quality. They then used these annotations to successfully create a classifier to classify future documents at scale. This shows the effectiveness of using an LLM to create annotations. Source: Hugging Face FineWeb paper.

9. More Papers, Tools and Open Source

  • Depth Anything V2 is a state-of-the-art depth (image-to-depth) model which shows how you can get incredible results using purely synthetic data. Paper, GitHub, Demo.


Depth Anything V2 data engine which starts with purely synthetic data labelled with a high-quality teacher. The teacher model then labels a large real-world image dataset with high-quality pseudo labels. These real-world images and labels are then used to train smaller student models. Source: Depth Anything V2 project page.
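If you want to try it on your own images, the model is straightforward to run via the Transformers depth-estimation pipeline (the checkpoint id below is my assumption, so check the project page for the current one):

```python
from PIL import Image
from transformers import pipeline

# Assumed Hub id for the small Transformers-compatible checkpoint
depth_estimator = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

image = Image.open("example.jpg")  # any local image
result = depth_estimator(image)

# The pipeline returns the predicted depth map as a PIL image
result["depth"].save("example_depth.png")
```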

10. Books, Talks, Tutorials and Presentations

Bonus

Tiny Corp prepare to ship their first computer! An AI supercomputer in a box! It comes in two versions, red and green (the red one has AMD GPUs and the green one has NVIDIA GPUs). A highly inspiring story from the company looking to commoditize the petaflop.

See you next month!

What a massive month for the ML world in June!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel

www.mrdbourke.com | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

More from Zero To Mastery

The No BS Way To Getting A Machine Learning Job

Looking to get hired in Machine Learning? Our ML expert tells you how. If you follow his 5 steps, we guarantee you'll land a Machine Learning job. No BS.

6-Step Framework To Tackle Machine Learning Projects (Full Pipeline)

Want to apply Machine Learning to your business problems but not sure if it will work or where to start? This 6-step guide makes it easy to get started today.

Python Monthly Newsletter 💻🐍

55th issue of Andrei Neagoie's must-read monthly Python Newsletter: PyCon US 2024 Recap, NVIDIA Loves Python, and much more. Read the full newsletter to get up-to-date with everything you need to know from last month.