54th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I'm an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, I've taken the utmost care to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
It's typically a 500-ish word post (though often much longer) detailing some of the most interesting things in machine learning I've found in the last month.
My brother Josh and I are working on a startup called Nutrify (an app which uses computer vision to help people learn about foods). My brother does the iOS development and I build the machine learning models.
The other day we decided to film a "day in the life" style video of what a typical day at Nutrify looks like from a machine learning engineer's point of view.
S&P Global provides market insights on commodities trading markets.
To do this, they must extract information from trading calls such as:
Americas Crude oil: TUPI: May delivery heard offered July ICE Brent +$2.50/b, CIF Qindao
From a small piece of text like this, they use NLP models to extract markets, grades, pricing, timing, locations and more.
All of this has to happen in real-time (because trading markets move so fast) meaning that prediction time needs to be under 15ms.
With this kind of latency requirement, it means generative AI models are off the table (too slow).
So what did they do?
They used spaCy (a powerful open-source NLP library) to train supervised models as well as craft rule-based pipelines.
And they used Prodigy (an easy-to-use and customizable labelling tool) to create very specific supervised data, which they used to train intermediate models to help with further labelling.
The result?
Several hyper-specific models capable of real-time inference with a small footprint (~6MB per model). The small footprint means they can also be deployed in a wide range of regions for even faster inference.
How S&P Global created a human-in-the-loop data labelling pipeline to label ~100 examples, train a model, and then use that model to label future examples. Source: Explosion AI blog.
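To make the rule-based side of this concrete, here's a toy sketch of the kind of pattern-based extraction spaCy's Matcher/EntityRuler enables, written with plain regex so it stands alone. The field names and patterns are illustrative, not S&P Global's actual pipeline.

```python
import re

# One of the trading calls quoted above.
CALL = "Americas Crude oil: TUPI: May delivery heard offered July ICE Brent +$2.50/b, CIF Qindao"

# Hypothetical rule set: one pattern per field we want to pull out.
PATTERNS = {
    "grade": r"\b(TUPI)\b",
    "timing": r"\b(May|June|July) delivery\b",
    "price": r"\+\$\d+(?:\.\d+)?/b",
    "location": r"\bCIF \w+\b",
}

def extract(text: str) -> dict:
    """Apply each rule and keep the first match per field."""
    out = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            out[field] = match.group(0)
    return out

print(extract(CALL))
# {'grade': 'TUPI', 'timing': 'May delivery', 'price': '+$2.50/b', 'location': 'CIF Qindao'}
```

In practice, rules like these would be combined with trained statistical models (as the S&P Global team did) so that the pipeline handles both the predictable and the messy parts of the text.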
In last month's ML Monthly (May 2024), I shared an excellent article with practical tips for building with LLMs.
Well over the past month, the authors of the original article put out two more to complete the series (as well as a single website collecting them all).
One of my favourite quotes from Part 2 (operational) was:
Look at samples of LLM inputs and outputs every day: "Genchi Genbutsu, real things, real places".
Here, "Genchi Genbutsu" is a Japanese term often translated as "go and see for yourself".
In other words, if you're designing a machine learning system with probabilistic outputs, one of the best ways to test it is to look at actual examples of inputs and outputs every day.
This is especially important for LLMs considering how wide the output space can be.
Part 3 answers a great series of questions:
Questions to ask when building/designing with LLMs (and other kinds of ML models). Source: What We Learned from a Year of Building with LLMs (Part 3).
If you're building with LLMs (or any kind of machine learning), I'd highly recommend reading the series end-to-end at least twice.
It's well worth it.
Resources:
An excellent write-up on how Generative AI tools make it easier to create code, yet writing code is often one of the easiest parts of software engineering.
The real value comes in creating the system of software as a whole rather than just adding more lines of code.
I also really align with the quote on software being an apprenticeship industry.
The only real way to learn something is to do it, and do it wrong, over and over, iteratively improving over time until you start to get an idea of how to do it right.
But then again…
Due to the nature of the field, the right way today may be superseded in the future. And so the learning by doing continues.
Software is an apprenticeship industry; the best way to learn is by doing, doing and more doing. Or "experiment, experiment, experiment!". Source: Generative AI Is Not Going To Build Your Engineering Team For You by Charity Majors.
Google's new open-source LLM, Gemma 2, is now available in 4 variants (2 base models & 2 fine-tuned ones).
For now, there's a 9B parameter version and a 27B parameter version, each with best-in-class performance for its size, or performance comparable to much larger models (e.g. the 27B parameter version is comparable to the 70B parameter Llama 3).
Parameter counts like these mean the models can be deployed on hardware such as consumer GPUs (an NVIDIA RTX 4090 with 24GB of memory can run the ~18GB 9B parameter model) as well as single enterprise GPUs (an NVIDIA H100 with 80GB of memory can run the ~56GB 27B parameter model).
There's a smaller 2.6B parameter model coming soon too.
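As a quick back-of-envelope check on those memory figures: at 16-bit precision each parameter takes 2 bytes, so billions of parameters roughly doubles into gigabytes. This sketch ignores activations, KV cache and framework overhead, so real requirements are a bit higher (which is why the quoted 27B figure of ~56GB is slightly above the raw 54GB of weights).

```python
def weight_memory_gb(n_params_billions: float, bytes_per_param: int = 2) -> float:
    """Rough weight-only memory footprint in GB at 16-bit precision.

    Ignores activations, KV cache and framework overhead.
    """
    return n_params_billions * bytes_per_param

print(weight_memory_gb(9))    # 18.0 -> fits on a 24GB RTX 4090
print(weight_memory_gb(27))   # 54.0 -> fits on an 80GB H100
print(weight_memory_gb(2.6))  # 5.2  -> the upcoming small model
```

The same arithmetic also shows why 4-bit quantization (0.5 bytes per parameter) is so popular: it shrinks the 27B model to roughly 13.5GB, within reach of a single consumer GPU.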
Hugging Face and Google both have great write-ups with links to fine-tuning examples for custom use cases.
Resources:
Apple hosted their annual WWDC keynote during June and introduced Apple Intelligence.
The benefit of Apple Intelligence is that models run mostly on-device.
Running on-device means they can incorporate information you choose to let them use whilst remaining private.
Apple reports that Apple Intelligence features are largely powered by an on-device LLM with a size of ~3 billion parameters. This on-device LLM is equipped with a series of adapter modules which are specialized for certain tasks (e.g. summarization, proofreading, mail replies, tone adjustment and more).
And larger server-based models running on dedicated Apple Silicon servers are available when necessary.
Apple is able to achieve incredible results on-device by taking a large foundation model and equipping it with several adaptor modules which are specialized for certain tasks. Source: Apple ML blog.
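The adapter idea can be sketched in miniature: keep one frozen base weight matrix shared by every task, and add a small low-rank, task-specific update (LoRA-style) on top. The dimensions and values below are toy examples for illustration, not Apple's actual configuration.

```python
# One frozen base weight, plus a small low-rank task-specific "delta"
# that gets swapped in per task (summarization, proofreading, etc.).

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def adapted_weight(W, A, B):
    """Effective weight = frozen base W + low-rank update (A @ B)."""
    delta = matmul(A, B)
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

# Toy 2x2 base weight shared by every task...
W = [[1.0, 0.0], [0.0, 1.0]]
# ...and a rank-1 adapter (2x1 @ 1x2) specialised for one task.
A = [[0.5], [0.5]]
B = [[1.0, 1.0]]
print(adapted_weight(W, A, B))  # [[1.5, 0.5], [0.5, 1.5]]
```

Because only the small A and B matrices differ per task, many specializations can share one base model on-device, which is exactly what makes the ~3B parameter approach practical.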
There's also an incredibly cool feature in the new iPadOS 18 Calculator/Notes app: Math Notes.
You can now write mathematical expressions as you would on pen and paper and have the calculator app solve them inline.
Apple's new Math Notes feature in iPadOS 18 lets you write and solve mathematical expressions with Apple Pencil as you would with pen and paper. There must be several models working under the hood here, including handwriting recognition, handwriting generation (the numbers produced match your handwriting), variable recognition and more.
What a powerful homework helper tool to get the best of both worlds. Writing by hand and calculating by machine.
There are also a series of machine learning related developer videos I found interesting. If you're looking to develop ML-powered products for Apple devices, I'd highly recommend checking them out too.
Resources:
Microsoft have published a series of open-source computer vision models under the name Florence-2.
These computer vision models are capable of many tasks such as image captioning, object detection, object detection with grounding, region proposal, segmentation and more.
Florence-2 comes in two size variants: base (0.2B parameters) and large (0.7B parameters). At these sizes, the models can be deployed on relatively small devices compared to some other, larger foundation models out there.
To build the models, Microsoft researchers combined a DaViT vision encoder with a BERT text encoder. Outputs are created in an encoder-decoder style, with the decoder generating output text and location tokens based on the inputs.
The most impressive thing to me was the data engine used to create the dataset.
A pipeline was constructed to go from a large corpus of images (126M) → label a small corpus with specialist models → iterative data filtering and enhancement → final dataset for large-scale model training.
The final dataset results in 5B annotations (500M text annotations, 1.3B region-text annotations and 3.6B text-phrase-region annotations).
Florence-2 data engine scheme. Starting with a large corpus of images followed by several steps of iterative data labelling and refinement. Source: Florence-2 paper.
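The data-engine loop above can be sketched schematically. Every function here is a hypothetical stand-in (for the specialist labellers, the filters, and the retrained model), not Microsoft's actual pipeline.

```python
def label_with_specialists(image):
    """Stand-in for off-the-shelf specialist models producing annotations."""
    return {"image": image, "text": f"a photo of {image}"}

def is_high_quality(annotation):
    """Stand-in for the filtering / enhancement step: drop empty labels."""
    return bool(annotation["text"].strip())

def data_engine(images, rounds=2):
    """Label -> filter -> (retrain) -> relabel, repeated for a few rounds."""
    annotations = [label_with_specialists(img) for img in images]
    for _ in range(rounds):
        annotations = [a for a in annotations if is_high_quality(a)]
        # In the real pipeline, a model trained on `annotations` would
        # re-annotate the corpus here, improving label quality each round.
    return annotations

print(len(data_engine(["cat.jpg", "dog.jpg"])))  # 2
```

The key design choice is that labels get cheaper and better each round: expensive specialist models bootstrap the first pass, and the model trained on the filtered output takes over from there.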
There are several resources available online for learning about Florence-2, including a demo as well as a fine-tuning example notebook.
Example of Florence-2 inputs and outputs. Florence-2 was used to caption the image and then the caption was used to draw boxes around the image (caption to phrase grounding). This would be a very helpful workflow for bootstrapping an object detection dataset. Source: Florence-2 Demo on Hugging Face.
Resources:
Anthropic's flagship model, Claude 3.5 Sonnet, is live!
And it surpasses their previous flagship model, Claude 3 Opus, whilst being 80% cheaper ($3 vs $15 per million input tokens, and $15 vs $75 per million output tokens).
It also operates at 2x the speed of Claude 3 Opus.
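The pricing difference compounds quickly at scale. A quick worked example using the per-million-token prices above (the workload numbers are made up for illustration):

```python
def cost_usd(input_tokens, output_tokens, in_per_m, out_per_m):
    """Cost of one workload given per-million-token prices in USD."""
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# Same hypothetical 100k-in / 20k-out workload on each model:
sonnet = cost_usd(100_000, 20_000, 3, 15)
opus = cost_usd(100_000, 20_000, 15, 75)
print(round(sonnet, 2), round(opus, 2))  # 0.6 3.0 -> Sonnet is 80% cheaper
```
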
Finally, it also largely outperforms other similar models, such as GPT-4o and Gemini 1.5 Pro, on various academic benchmarks.
However, it's always important to test these models on your own benchmarks.
Large scale datasets are paramount when training Large Language Models (LLMs).
But for many LLMs, people often ask: where did the data come from?
The FineWeb paper and datasets seek to answer that.
For LLM datasets, the two main options are:
The FineWeb dataset takes the first approach, trying to replicate the creation of a high-quality dataset in a public and open manner.
Their paper reveals many of the important details they considered when creating the dataset as well as all of the ablation studies they performed to test it.
One of my favourite pieces of data filtering was using an open-source LLM (Llama-3-70B-Instruct) to annotate 500k samples for quality.
They then used these samples to create a classifier which could classify an article based on its quality level.
It turns out this approach worked quite well, with the documents classified as higher quality (e.g. educational in nature) producing better end results.
This shows how classifiers trained on LLM-created annotations can be used effectively for data filtering.
How Hugging Face used an LLM (Llama3-70B) to create 500k annotations on the quality of different articles. 0 being poor and 5 being high quality. They then used these annotations to successfully create a classifier to classify future documents at scale. This shows the effectiveness of using an LLM to create annotations. Source: Hugging Face FineWeb paper.
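Here's a toy sketch of that filtering recipe: LLM-assigned quality scores (0-5) on a small sample become labels, a cheap classifier is fit on them, and that classifier then filters documents at scale. The "classifier" below is a trivial keyword scorer with made-up data, not FineWeb's actual model.

```python
# Hypothetical education-flavoured vocabulary for the cheap feature.
EDU_WORDS = {"learn", "theorem", "lesson", "study", "explain", "course"}

def keyword_score(text):
    """Cheap feature: count of education-flavoured words in the text."""
    return sum(1 for w in text.lower().split() if w.strip(".,!?") in EDU_WORDS)

def fit_cutoff(samples, min_llm_score=3):
    """samples: list of (text, llm_score 0-5) pairs, as labelled by the LLM.
    Place the keep/drop cutoff midway between the mean feature value of
    high- and low-scored texts."""
    hi = [keyword_score(t) for t, s in samples if s >= min_llm_score]
    lo = [keyword_score(t) for t, s in samples if s < min_llm_score]
    return (sum(hi) / len(hi) + sum(lo) / len(lo)) / 2

# Tiny made-up "LLM-annotated" sample (FineWeb used 500k real ones).
samples = [
    ("A lesson to explain the theorem step by step.", 5),
    ("We study and learn from this course material.", 4),
    ("Buy now!!! Limited offer!!!", 0),
    ("Random chatter about nothing much.", 1),
]
cutoff = fit_cutoff(samples)
print(keyword_score("Study this lesson to learn more.") > cutoff)  # True
```

The real classifier is of course far richer, but the shape is the same: the LLM annotates once, the cheap model then filters billions of documents for free.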
Depth Anything V2 data engine which starts with purely synthetic data labelled with a high-quality teacher. The teacher model then labels a large real-world image dataset with high-quality pseudo labels. These real-world images and labels are then used to train smaller student models. Source: Depth Anything V2 project page.
Google releases two papers which highlight the potential for LLM use in personal healthcare. Their Personal Health LLMs (PH-LLMs) score on par with or better than experts on sleep and fitness regimes. See the blog post as well as Towards a Personal Health Large Language Model and Transforming Wearable Data Into Health Insights using Large Language Models for more.
Webpage → Markdown text: Jina AI releases their Reader API, which allows you to get an LLM-friendly input from a URL or web search by adding r.jina.ai to the front of it.
Cohere Toolkit is an open-source collection of pre-built components to build your own custom RAG applications.
Tiny Corp prepare to ship their first computer! An AI supercomputer in a box! It comes in two versions, red and green (the red one has AMD GPUs and the green one has NVIDIA GPUs). A highly inspiring story from the company looking to commoditize the petaflop.
What a massive month for the ML world in June!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.