48th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Firstly, Merry Christmas and Happy New Year!
I hope the holiday period has been a fun and enjoyable time with food and family for you all.
Now, onto the magical world of ML!
Apple recently released a series of M3 Apple Silicon chips. And naturally, I wondered how they’d perform from a machine learning standpoint.
So I wrote some code and ran some benchmarks with TensorFlow and PyTorch for Mac. I also compared them to my own deep learning PC as well as the free tier of Google Colab.
In short, M3 Macs are sensational for everyday tasks and smaller scale ML workflows.
However, for larger ML tasks, you’ll likely either want a spec’d up laptop or a dedicated NVIDIA GPU.
See the M3 Mac Machine Learning speed test video on YouTube, read the blog post, get the code on GitHub.
My brother and I have been working on a food education app called Nutrify.
And now version 1.0 is live!
Nutrify uses computer vision to allow you to take a photo of food and learn about it.
And now if you have an iPhone with iOS 16+, you can download it from the App Store and start trying it out on foods straight away.
If you’ve done the Food Vision projects in the ZTM TensorFlow or PyTorch course, you can think of Nutrify as a full-fledged production version of those projects.
Nutrify: The Food App live on the App Store — take a photo of food and learn about it! If you’ve got an iPhone with iOS 16.0+, search “nutrify” on the App Store and look for the pineapple.
Google released Gemini, one of the first models to rival and surpass GPT-4 in several large language model (LLM) benchmarks.
It comes in 3 variants: Nano, Pro and Ultra.
Nano is designed to work on mobile devices, Pro is meant for everyday usage and Ultra is for the hardest tasks.
The release video showed some huge capabilities but I heard rumours circulating that it was staged.
Regardless, I’m sure the model is very capable but as it goes with all tech demos, real-world usage may vary.
Best to experiment, experiment, experiment!
Google Colab recently added AI-powered code generation but now it’s also adding AI-powered coding assistance.
Now if you make an error in your code, you might see an “Explain error” button below the output.
Clicking this button will try to explain why you get the error you’re receiving and offer ways to fix it.
The cool thing is that Google says this feature should be rolling out across 175 countries and is free to use (for the time being).
However, in my own usage, I haven’t found out how to set it up or trigger it.
And that's with Google Colab Pro.
Maybe it’ll automatically appear soon!
See the blog post from Google about the release.
Example of writing a small typo in Google Colab and having the Colab AI Assistant help out with the coding problem. Source: Google blog post.
Marker is an open-source repo that contains scripts capable of turning various types of documents into editable markdown files.
For example, if you had a large archive of PDF files but wanted to deploy them as a documentation-like website, you could use Marker to convert the PDF files to markdown and then deploy them to a static website capable of being rendered on many different devices.
Marker PDF to markdown results against Nougat (another popular PDF to markdown model). Source: Marker GitHub.
Ollama is one of the simplest tools to run LLMs on your own computer.
ollama run llama2
Running the line of code above will download the Llama 2 7B model locally and enable you to start running it straight away for experimentation.
Example of using Ollama to run Llama 2 7B locally in a few lines of code. Source: Myself.
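If you'd rather talk to the model from Python than from the terminal, Ollama also serves a local REST API while it's running. Here's a minimal sketch, assuming the default address http://localhost:11434 and that the llama2 model has already been downloaded. The helper names build_payload and ask_llama are my own, made up for illustration.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="llama2"):
    """Encode a non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def ask_llama(prompt, model="llama2"):
    """Send a prompt to the local Ollama server and return the reply text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, calling ask_llama("Explain what a tensor is in one sentence.") should return the model's reply as a string.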
1) StyleDrop: Text-to-Image Generation in Any Style by Google Research allows you to give a few sample images and have an image generated for you in a very similar style.
Example of how StyleDrop can create similar images from text prompts given the style of an existing image. Source: StyleDrop project page.
2) Learning Vision From Models Rivals Learning Vision from Data by Google Research shows that self-supervised learning models trained purely on synthetic images can create visual representations equivalent to those trained on actual images.
Their method, SynCLR, generates both image captions (using Llama 2 7B) and images from those image captions (using Stable Diffusion 1.5) to create an image-text dataset to learn representations on.
Their model outperforms or performs on par with image-specific models such as DINOv2 which are trained on large curated real-image datasets.
This is a very interesting result as it’s the first of its kind I’ve seen where a model trained on purely synthetic data from other models outperforms an existing state-of-the-art model trained on real data.
Is 2024 the year of synthetic data?
Figure showing different data setup paradigms for learning representations from real data (top) to purely synthetic data (bottom). Source: SynCLR paper.
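One way to picture how representations can be learned from this kind of data: images generated from the same caption are treated as positives for each other, and everything else in the batch is a negative. Below is a simplified NumPy sketch of a multi-positive contrastive loss along those lines. It is my own illustration rather than the authors' code, the function name and temperature value are assumptions, and it expects every caption to have at least two images in the batch.

```python
import numpy as np


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Cross-entropy between a softmax over similarities and a
    uniform target over the images that share the same caption."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    caption_ids = np.asarray(caption_ids)

    # L2-normalise so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # An image is never compared with itself, so push the diagonal far down.
    np.fill_diagonal(logits, -1e9)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Target: uniform over the *other* images generated from the same caption.
    positives = caption_ids[:, None] == caption_ids[None, :]
    np.fill_diagonal(positives, False)
    p = positives / positives.sum(axis=1, keepdims=True)

    # Average cross-entropy across the batch.
    return float(-(p * log_q).sum(axis=1).mean())
```

The loss gets smaller as images from the same caption land closer together in embedding space than images from different captions.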
3) MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training by Apple Research sets a new standard for image-text models capable of running on mobile devices.
MobileCLIP runs up to 2.3x faster than the equivalent ViT-B/16 CLIP model from OpenAI while achieving better performance.
This is possible thanks to a restructured architecture as well as a better data setup via Dataset Reinforcement (or DR, covered in AI & Machine Learning Monthly August 2023), a method to improve datasets by adding in various representations of images from larger-scale models.
Read the paper on arXiv.
4) LLM in a flash: Efficient Large Language Model Inference with Limited Memory from Apple shows how you can run LLMs on devices with limited memory (e.g. mobile phones) by storing the model parameters in flash memory and loading them into DRAM (dynamic random access memory) on demand. Flash memory often has far larger capacity than DRAM but lower bandwidth.
Their methods enable them to run models up to twice the size of the available DRAM, with inference 4-5x faster on CPU and 20-25x faster on GPU compared to naive loading approaches.
Read the paper on arXiv.
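To get a feel for the core idea of keeping weights on slower storage and only pulling in what you need, here's a small sketch using NumPy's memory mapping. This is only an analogy under my own assumptions: a file on disk stands in for flash memory, the matrix size and the load_active_rows helper are made up, and the paper's actual techniques are more involved than this.

```python
import os
import tempfile

import numpy as np

rows, cols = 1000, 64
path = os.path.join(tempfile.mkdtemp(), "weights.bin")

# Write a made-up weight matrix to disk, standing in for flash storage.
rng = np.random.default_rng(0)
full = rng.standard_normal((rows, cols)).astype(np.float32)
full.tofile(path)

# Memory-map the file: data is only read from disk when it is indexed.
weights_on_disk = np.memmap(path, dtype=np.float32, mode="r", shape=(rows, cols))


def load_active_rows(active_indices):
    """Copy only the requested rows from disk into an in-memory array."""
    return np.array(weights_on_disk[np.sort(active_indices)])


# Pretend only these three rows are needed for the current token.
active = np.array([3, 42, 512])
in_memory = load_active_rows(active)
print(in_memory.shape)
```

Only the three requested rows end up in memory as a regular array, while the rest of the matrix stays on disk.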
5) SAM (Segment Anything Model) goes mobile. Three new papers I checked out in the past month all revolve around compressing the original SAM model to make it more efficient and able to run on mobile devices.
6) Generative Multimodal Models are In-Context Learners introduces Emu2, a generative multimodal model with 37 billion parameters. Emu2 outperforms the closed-source Flamingo 80B from DeepMind and its weights are open-sourced!
The model has also been trained to follow instructions so it can be used for both generation and question answering on a wide range of multimodal problems.
Another cool feature is that Emu2 was trained in a relatively simple fashion: predict the next token. Just like GPT predicts the next token in text, it does the same for the multimodal space.
7) Species196: A One-Million Semi-supervised Dataset for Fine-grained Species Recognition provides a unique and large-scale dataset for invasive species.
The Species196-L dataset contains 19K images with expert-level annotations whereas the Species196-U dataset contains 1.2M images sourced from LAION-5B using image similarity matching on random images and classes in the Species196-L dataset.
The researchers found that pretraining on the larger scale domain-specific dataset instead of ImageNet achieved better results (81.6% vs 80.5%).
Perhaps my favourite feature of the Species196 paper is the creation of the large-scale unsupervised dataset via filtering specific images from LAION-5B and then using those images for pretraining. Source: Species196 paper.
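The filtering idea can be sketched in a few lines: embed the labelled images and the big unlabelled pool, then keep the pool images that look similar enough to at least one labelled image. This is a rough sketch under my own assumptions (the embeddings would come from some image model, and the function name and threshold value are made up), not the exact procedure from the paper.

```python
import numpy as np


def filter_by_similarity(query_embeddings, pool_embeddings, threshold=0.8):
    """Return indices of pool items whose best cosine similarity
    to any query embedding is at least `threshold`."""
    q = np.asarray(query_embeddings, dtype=np.float64)
    p = np.asarray(pool_embeddings, dtype=np.float64)

    # Normalise rows so the dot product is cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)

    similarities = p @ q.T            # shape: (n_pool, n_query)
    best = similarities.max(axis=1)   # closest labelled image per pool item
    return np.flatnonzero(best >= threshold)
```

The returned indices point at the pool images worth keeping for pretraining.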
Points describing the assumptions of language modelling as a subfield. Source: Colin Raffel blog.
Find the samples where a custom model and an LLM disagree and prioritize those for further annotations/dataset improvement. Source: koaning.io.
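In code, the disagreement trick is about as simple as it sounds. A tiny sketch, assuming you already have a list of predictions from your own model and a matching list from an LLM (the function name and example labels are hypothetical):

```python
def find_disagreements(texts, model_labels, llm_labels):
    """Return (text, model_label, llm_label) for every sample
    where the custom model and the LLM predict different labels."""
    return [
        (text, model_label, llm_label)
        for text, model_label, llm_label in zip(texts, model_labels, llm_labels)
        if model_label != llm_label
    ]


texts = ["great pizza", "cold soup", "average burger"]
model_labels = ["positive", "negative", "positive"]
llm_labels = ["positive", "negative", "neutral"]

to_review = find_disagreements(texts, model_labels, llm_labels)
print(to_review)
```

The samples in to_review are the ones to send to a human annotator first.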
Mistral.ai recently released their largest public model, Mixtral 8x7B, an LLM that outperforms Llama 2 70B and even GPT-3.5 on several benchmarks. A sensational advancement for open-source AI. It does so by leveraging the “Mixture of Experts” (MoE) paradigm, hence the 8x7B, which loosely means there are 8 different 7B parameter “expert” models, with a router picking two of them for each token at inference time. If this sounds confusing, Hugging Face released a blog post explaining what MoE is and different techniques that can be used to construct MoE models.
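If the routing part of MoE is hard to picture, here's a toy NumPy sketch of it: a router scores every expert for a given input, only the top few experts actually run, and their outputs are blended using the router's scores. Each "expert" here is just a single weight matrix, which is a big simplification of a real expert network, and all names and shapes are my own assumptions.

```python
import numpy as np


def moe_layer(x, router_weights, expert_weights, top_k=2):
    """Route one input vector through its top_k experts.

    x:              (d,) input vector
    router_weights: (d, n_experts) router matrix
    expert_weights: (n_experts, d, d) one matrix per toy expert
    """
    logits = x @ router_weights                  # one score per expert
    top = np.argsort(logits)[-top_k:]            # indices of the best-scoring experts

    # Softmax over the selected experts only, so the gates sum to 1.
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()

    # Only the selected experts are evaluated; the rest are skipped entirely.
    output = sum(gate * (x @ expert_weights[i]) for gate, i in zip(gates, top))
    return output, top
```

Skipping the unselected experts is where the savings come from: the model holds many parameters but only uses a fraction of them for each prediction.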
You may have heard of RAG (Retrieval Augmented Generation) for text but what about RAG for text and images or multimodal RAG? LangChain released a blog post discussing two approaches for multimodal RAG: 1. Multimodal embeddings (e.g. turn slides or images into multimodal embeddings with a CLIP-like model). 2. Multivector retriever (e.g. create image captions for the slides/images with a model like GPT-4V and then embed the summary of the image). They found that multimodal RAG performs better than text-only RAG and that image summary generation works better than multimodal embeddings for RAG with presentation slides.
Two approaches to a multimodal RAG pipeline. Source: LangChain blog.
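To make the second approach a little more concrete, here's a toy sketch of the idea: search over text summaries of the images, but hand back the original image that the best-matching summary points to. Word overlap stands in for a real embedding model here, and the summaries, file paths and function names are all made up for illustration.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().split())


def retrieve_image(query, summaries, images):
    """Return the image whose summary shares the most words with the query.

    Word overlap (Jaccard similarity) is a stand-in for embedding similarity.
    """
    query_tokens = tokenize(query)

    def score(doc_id):
        summary_tokens = tokenize(summaries[doc_id])
        return len(query_tokens & summary_tokens) / len(query_tokens | summary_tokens)

    best_id = max(summaries, key=score)
    return images[best_id]


# Hypothetical summaries (e.g. written by a vision-language model) and their source images.
summaries = {
    "slide_1": "bar chart of quarterly revenue growth",
    "slide_2": "diagram of the model training pipeline",
}
images = {
    "slide_1": "slides/slide_1.png",
    "slide_2": "slides/slide_2.png",
}

print(retrieve_image("revenue growth chart", summaries, images))
```

The retrieved image (rather than its summary) is what would then get passed to a multimodal model to answer the question.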
What a massive month for the AI/ML world in December!
As always, let me know if there's anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.