59th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I'm an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
First of all, happy holidays!
Second of all, I recently entered a Kaggle Competition to use Google's Gemini API for a "long context" use case.
Long context means harnessing Gemini's ability to process a large number of input tokens (e.g. Gemini Flash can handle 1M input tokens and Gemini Pro can handle 2M input tokens).
The more input tokens = the more information the model can use to produce an output.
I had an insurance company call me recently and ask how much my home and contents insurance should cover.
And I said, "good question…"
So I recruited Gemini's help.
I took a video lap of my house (Gemini can handle video inputs) and asked Gemini to output every major item, the timestamp it appears at and an estimated worth, all in a structured format.
And it turns out it worked!
As a bonus, Gemini can also create bounding boxes around each of the target items.
You can watch the video breakdown of the project on YouTube and get all of the code/prompts to make it happen on Kaggle.
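If you'd like to try something similar, here's a minimal sketch of the workflow using the google-generativeai Python SDK. The file name, prompt wording and model choice are assumptions; see the Kaggle notebook for the exact code and prompts.

# A minimal sketch: upload a walkthrough video to Gemini and ask for structured outputs.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the walkthrough video (Gemini handles video inputs natively)
video_file = genai.upload_file(path="house_walkthrough.mp4")  # placeholder file name
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

model = genai.GenerativeModel("gemini-1.5-flash")

prompt = (
    "List every major item in this video. For each item return JSON with: "
    "item_name, timestamp, estimated_value."
)

response = model.generate_content([video_file, prompt])
print(response.text)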
On the whole, Gemini is very cool. In the past, the capability to process videos in such a way would've required several machine learning models (e.g. one for vision, another for transcription and another for extracting item names from text)…
Plus a lot more engineering work!
One of the most important things about any AI project is having a good set of evaluations.
This means whenever a new model comes along, you can evaluate it against your own use case and see if it's worth adopting or not.
Good evaluations take time to create.
But so they should.
Evaluations are test cases for your AI product.
Models change but evals last forever.
To bootstrap evaluations, you can use an LLM as a judge.
For example, pass a sample through an LLM and get it to rate how high quality that sample was.
The actual workflow will depend on what your ideal data inputs and outputs are.
Once you've got some samples that have been rated by an LLM as a judge, you can continually refine this procedure to create better and better test cases.
Hamel Husain's guide on creating an LLM as a judge to evaluate your samples walks through this process.
Bonus: See Hamel's blog post Your AI Product Needs Evals for more on evaluations.
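As a rough illustration (a simplified sketch, not Hamel's exact workflow), an LLM-as-a-judge loop can be as simple as a prompt asking a strong model to pass or fail each sample against your criteria. The model name, sample data and criteria below are placeholders.

# A simplified LLM-as-a-judge sketch (not Hamel's exact workflow; model, samples
# and criteria are placeholders).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-pro")

samples = [
    {"input": "Summarise our refund policy.",
     "output": "Refunds are available within 30 days of purchase..."},
]

for sample in samples:
    judge_prompt = (
        "You are judging the quality of an AI assistant's answer.\n"
        f"Question: {sample['input']}\n"
        f"Answer: {sample['output']}\n"
        "Reply with PASS or FAIL for accuracy and helpfulness, then a short critique."
    )
    verdict = judge.generate_content(judge_prompt)
    print(verdict.text)

Reading the judge's critiques (and correcting them by hand) is how you refine these into proper test cases over time.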
One of my favourite things is using machine learning to enhance an existing workflow.
Airbnb have some great use cases for that.
To create their "Room Tour" feature, which enables someone to see the different rooms in a stay, they created a computer vision model to classify rooms into different categories (e.g. "Bedrooms", "Bathrooms", "Kitchen").
To do so, they leveraged their existing database of many different photos of rooms and fine-tuned a Vision Transformer to output their custom classes.
Some cool findings: the error rate dropped by ~5% for every doubling of the dataset size, and ensemble learning (combining models) combined with model distillation (teaching a smaller model to copy a larger model) improved results the most.
Airbnb's computer vision error rate declining based on the amount of data given to the model. Source: Airbnb tech blog.
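Airbnb's exact code isn't public, but fine-tuning a Vision Transformer on custom classes looks roughly like this with Hugging Face transformers (the checkpoint, class names and image path below are placeholders):

# A rough sketch of fine-tuning a Vision Transformer for custom image classes
# (not Airbnb's actual code; checkpoint, labels and image path are placeholders).
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

labels = ["bedroom", "bathroom", "kitchen", "living_room", "other"]
checkpoint = "google/vit-base-patch16-224-in21k"

processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# From here you'd fine-tune on labelled room photos (e.g. with the Trainer API),
# then classify new listing photos like this:
image = Image.open("listing_photo.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(-1).item()])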
Ameru.ai is a company creating smart bins: bins that use computer vision to identify and sort waste into several categories.
The goal?
Make sure the right waste goes to the right place.
To do so, the company uses Label Studio to label data, PyTorch to train their computer vision models, and an NVIDIA Jetson Nano with an 8MP camera mounted to the bin to make predictions on the waste.
I absolutely love this use case!
See the demo video for footage of the bins in action or read the blog post for an expanded explanation of the product.
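The inference side of a setup like this can be surprisingly small. Here's a rough sketch (not Ameru's actual code; the weights file, class names and camera index are placeholders):

# A rough sketch of on-device waste classification (not Ameru's actual code;
# weights file, class names and camera index are placeholders).
import cv2
import torch
from torchvision import transforms

classes = ["landfill", "recycling", "compost"]
model = torch.load("waste_classifier.pt", map_location="cpu")  # a fine-tuned PyTorch model
model.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

camera = cv2.VideoCapture(0)  # the bin-mounted camera
ok, frame = camera.read()
if ok:
    inputs = preprocess(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).unsqueeze(0)
    with torch.no_grad():
        prediction = classes[model(inputs).argmax(-1).item()]
    print(f"Detected waste category: {prediction}")
camera.release()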
Machine learning in the browser is becoming more and more possible!
With Transformers.js v3 you can now directly install it via NPM:
// Install the library
npm i @huggingface/transformers
// Import the library
import { pipeline } from "@huggingface/transformers";
// Use the library
// Create a feature-extraction pipeline
const extractor = await pipeline(
"feature-extraction",
"mixedbread-ai/mxbai-embed-xsmall-v1",
{ device: "webgpu" },
);
// Compute embeddings (runs on the local GPU straight through the browser!!)
const texts = ["Hello world!", "This is an example sentence."];
const embeddings = await extractor(texts, { pooling: "mean", normalize: true });
console.log(embeddings.tolist());
There are many upgrades in v3, including WebGPU support, 120 supported model architectures (including Whisper, Phi-3 and Florence-2) and 25 new example projects and templates.
For example, check out the demo of OpenAI's Whisper (a model for speech-to-text) running directly in the browser (I tried it on a sample of my own audio and it worked incredibly well/fast on my MacBook Pro M1).
Finbarr Timbers writes a short history of Vision-Language Models and how the landscape has converged to a similar structure over the past two years (hint: combine an LLM with a vision encoder).
Steps for creating a modern VLM. Source: Finbarr Timbers AI blog.
A great read for those looking to learn more about VLMs, including recent models such as Mistral's Pixtral.
Ever wonder why the Transformer architecture requires a positional encoding?
Well, Christopher Fleetwood's blog post paints an incredible picture of how it works.
It starts by explaining other forms of positional encoding such as absolute and sinusoidal.
And then talks about where these approaches might fall down.
Before going on to explain how RoPE (Rotary Position Embedding) makes up for the weaknesses of the others and why it's used in many modern Transformer architectures today.
The blog post is full of great visuals and extra resources to learn more about positional embeddings and their history.
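If you'd like to see the core idea in code, here's a minimal, unoptimised sketch of RoPE: each pair of feature dimensions in a query or key vector gets rotated by an angle that depends on its position (an illustration of the technique, not the implementation from the post).

# A minimal, unoptimised RoPE sketch: rotate pairs of dimensions by a position-dependent angle.
import torch

def apply_rope(x, base=10000):
    # x: (seq_len, dim) query or key vectors, dim must be even
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per pair of dimensions
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    # Angle for each (position, pair) combination
    angles = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1) * inv_freq  # (seq_len, half)
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat(
        [x1 * torch.cos(angles) - x2 * torch.sin(angles),
         x1 * torch.sin(angles) + x2 * torch.cos(angles)],
        dim=-1,
    )

q = torch.randn(8, 64)     # 8 tokens, 64-dim queries
q_rotated = apply_rope(q)  # dot products between rotated vectors now encode relative position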
Docling takes in PDF, HTML, Word or other similar documents and outputs the contents as JSON or Markdown.
How?
Using two open-source document analysis models:
If you need a way to process large amounts of PDF documents into more manageable markdown and JSON files, try docling out.
Docling processing pipeline from the Docling technical report.
I've tried it a few times already and it seems to work really well.
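The usage I tried was along these lines (a minimal sketch based on docling's documented Python API; check the docling docs for the current interface, and the file names are placeholders):

# A minimal docling usage sketch (based on the project's documented Python API;
# check the docling docs for the current interface, file names are placeholders).
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")  # also accepts URLs, Word documents, HTML, etc.

markdown = result.document.export_to_markdown()
with open("report.md", "w") as f:
    f.write(markdown)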
Anthropic's new prompt improver goes through a ~6-step process to generate examples, create a Mermaid flowchart of the steps (Mermaid is like a combination of a programming and a diagram language) and then rewrite the overall prompt a few times for instruction-following and clarity.
Anthropicโs new prompt improver tool. Available in the Anthropic developer console.
We're now entering the era where local (local meaning running on your own computer) LLMs and VLMs are going to start being the norm.
These models will have good baseline capabilities, however, they will still have to be guided towards your own specific use case.
In Daniel van Strien's case, he wanted to organize all of the screenshots on his desktop.
A simple task if you've got some spare time, but also a task that would be a good challenge for recent VLMs.
He combined LM Studio with the Pixtral VLM from Mistral and gave it instructions to sort screenshots into various categories as well as write down information from them.
Aside: I wonder how far you could take this with SigLIP (a model that has been trained to contrast vision and language)? E.g. use SigLIP to zero-shot categorise the images into ["meme", "document", "receipt", "diagram", "unknown"], where the "unknown" option could be assigned to images scoring below a certain threshold.
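A quick sketch of what that aside might look like with the Hugging Face zero-shot image classification pipeline (the checkpoint, screenshot path and threshold are assumptions you'd tune):

# A quick sketch of the SigLIP aside above (checkpoint, image path and threshold are assumptions).
from PIL import Image
from transformers import pipeline

classifier = pipeline("zero-shot-image-classification",
                      model="google/siglip-base-patch16-224")
labels = ["meme", "document", "receipt", "diagram"]

image = Image.open("screenshot.png")  # placeholder path
results = classifier(image, candidate_labels=labels)  # list of {"label", "score"} sorted by score

# SigLIP scores each label independently (sigmoid), so a low top score can map to "unknown"
top = results[0]
category = top["label"] if top["score"] > 0.1 else "unknown"
print(category, top)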
Workflow for organizing screenshots with a VLM. First the VLM reads and analyses the screenshot and then it gets moved to the appropriate location based on the outputs of the VLM. Source: Daniel van Strien blog.
RAG = Retrieval Augmented Generation.
In other words, retrieving documents related to a query and using them to improve LLM outputs.
For example, you could ask "what are our policies related to damaged delivery boxes?" and the retrieval system would look up the company documentation on dealing with damaged boxes and then return the highest scoring related resources.
These resources could then be used as input to an LLM to generate a response based on the actual company documentation.
All of this could be processed with text only.
However, what if there were some images showcasing examples of damaged boxes too?
That's where vision-based RAG comes in.
Instead of only looking for text, vision-based RAG looks for both text and images/figures.
It does this by embedding the screenshot/image of the page/resource and then performing similarity search on those embeddings with the query.
The results so far show that vision-based RAG can provide substantial improvements over pure text-based RAG.
But of course, this will all be dependent on your use case, so make sure to always test on your own problems.
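At its core, the retrieval step is a similarity search over embeddings. Here's a minimal sketch assuming you already have a model that can embed both text queries and page images into the same space (e.g. a ColPali/SigLIP-style model); the embeddings below are random placeholders.

# A minimal sketch of the retrieval step in vision-based RAG. It assumes page images
# were embedded offline with a vision-language embedding model (placeholder data here).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_embedding, page_embeddings, top_k=3):
    # page_embeddings: list of (page_id, vector) pairs computed offline from page screenshots
    scored = [(page_id, cosine_similarity(query_embedding, vector))
              for page_id, vector in page_embeddings]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

pages = [("page_1", np.random.rand(128)), ("page_2", np.random.rand(128))]
query = np.random.rand(128)
print(retrieve(query, pages))

# The top-k page images (plus any extracted text) then go to a multimodal LLM
# alongside the original question to generate the final answer.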
Two resources I've found helpful if you'd like to learn more about vision-based RAG systems:
VisRAG (visual retrieval augmented generation) overall workflow. Source: VisRAG GitHub.
Models trained on datasets generated with gemini-1.5-flash-002 performed better in some cases, while models trained on Claude Sonnet-generated datasets performed better in others. Goes to show that not all LLM output data is created equal.
I really liked this note in the Vespa visual RAG in practice blog post:
Building an AI system that works is one thing.
Making sure itโs a nice user experience is the next step.
What a massive month for the ML world in November!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.