AI & Machine Learning Monthly Newsletter 💻🤖

Daniel Bourke

48th issue! If you missed them, you can read the previous issues of my monthly A.I. & Machine Learning newsletter here.

Hey there, Daniel here.

I’m an A.I. & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:

I also write regularly about A.I. and machine learning on my own blog as well as make videos on the topic on YouTube.

Since there's a lot going on, the utmost care has been taken to keep things to the point.

Enough about me! You're here for this month's A.I. & Machine Learning Monthly Newsletter.

Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.

What you missed in December 2023 as an A.I. & Machine Learning Engineer…

Firstly, Merry Christmas and Happy New Year!

I hope the holiday period has been a fun and enjoyable time with food and family for you all.

Now, onto the magical world of ML!

My Work 👇

M3 Mac Machine Learning Speed Tests 🏎️

Apple recently released a series of M3 Apple Silicon chips. And naturally, I wondered how they’d perform from a machine learning standpoint.

So I wrote some code and ran some benchmarks with TensorFlow and PyTorch for Mac. I also compared them to my own deep learning PC as well as the free tier of Google Colab.

The outcome?

In short, M3 Macs are sensational for everyday tasks and smaller scale ML workflows.

However, for larger ML tasks, you’ll likely either want a spec’d up laptop or a dedicated NVIDIA GPU.

See the M3 Mac Machine Learning speed test video on YouTube, read the blog post, get the code on GitHub.

Nutrify is live on the App Store! 🍍

My brother and I have been working on a food education app called Nutrify.

And now version 1.0 is live!

Nutrify uses computer vision to allow you to take a photo of food and learn about it.

And now if you have an iPhone with iOS 16+, you can download it from the App Store and start trying it out on foods straight away.

If you’ve done the Food Vision projects in the ZTM TensorFlow or PyTorch course, you can think of Nutrify as a full-fledged production version of those projects.


Nutrify: The Food App live on the App Store — take a photo of food and learn about it! If you’ve got an iPhone with iOS 16.0+, search “nutrify” on the App Store and look for the pineapple.

From the Internet 🥅

Releases, Tools and Products 🔨

1) Google Releases Gemini Nano, Pro and Ultra to rival GPT-4

One of the first models to rival and surpass GPT-4 on several large language model (LLM) benchmarks has been released.

Google’s Gemini.

It comes in 3 variants: Nano, Pro and Ultra.

Nano is designed to work on mobile devices, Pro is meant for everyday usage and Ultra is for the hardest tasks.

The release video showed some huge capabilities but I heard rumours circulating that it was staged.

Regardless, I’m sure the model is very capable but as it goes with all tech demos, real-world usage may vary.

Best to experiment, experiment, experiment!

You can find the release details, API links and more, such as Imagen V2 (a new version of Google’s text-to-image model), on Google’s Blog.

2) AI-assisted coding comes to Google Colab

Google Colab recently added AI-powered code generation but now it’s also adding AI-powered coding assistance.

Now if you make an error in your code, you might see an “Explain error” button below the output.

Clicking this button will try to explain why the error occurred and offer ways to fix it.

The cool thing is that Google says this feature should be rolling out across 175 countries and is free to use (for the time being).

However, in my own usage, I haven’t found out how to set it up or trigger it.

And that's even with Google Colab Pro.

Maybe it’ll automatically appear soon!

See the blog post from Google about the release.


Example of writing a small typo in Google Colab and having the Colab AI Assistant help out with the coding problem. Source: Google blog post.

3) Convert PDF, EPUB and MOBI documents to markdown with Marker

Marker is an open-source repo that contains scripts capable of turning various types of documents into editable markdown files.

For example, say you had a large archive of PDF files and wanted to deploy them as a documentation-like website. You could use Marker to convert the PDF files to markdown and then deploy them to a static website capable of being rendered on many different devices.

Marker performs up to 10x faster than Facebook’s Nougat (another PDF to markdown model, mentioned in AI & Machine Learning Monthly September 2023) with on par or competitive accuracy.


Marker PDF to markdown results against Nougat (another popular PDF to markdown model). Source: Marker GitHub.

4) Run Large Language Models Locally with Ollama

Ollama is one of the simplest tools to run LLMs on your own computer.

You can download the app and with a few lines of code on the command line be running models such as Llama 2 and Mistral 7B.

For example:

ollama run llama2

Running the line of code above will download the Llama 2 7B model locally and enable you to start running it straight away for experimentation.


Example of using Ollama to run Llama 2 7B locally in a few lines of code. Source: Myself.
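
Beyond the interactive prompt, Ollama also serves models over a local HTTP API. The sketch below builds a request for the /api/generate endpoint on the default port (11434), assuming the Ollama app is running and the llama2 model has already been downloaded. The helper names build_generate_request and generate are my own for illustration, not part of Ollama.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="llama2"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama2"):
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as response:
        return json.loads(response.read())["response"]
```

With the server running, something like generate("Explain what an LLM is in one sentence.") should return the model's reply as a string.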

Notable Machine Learning Research Papers 📰

1) StyleDrop: Text-to-Image Generation in Any Style by Google Research allows you to give a few sample images and have an image generated for you in a very similar style.

Using Low-Rank Adaptation (LoRA) training, the authors train only 1 million parameters of a 3B parameter Muse model for 1,000 steps to achieve highly transferable style abilities.

Links to StyleDrop: Blog post, paper, create a custom style model in Vertex AI.
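
To get a feel for why so few parameters need training, here's a minimal sketch of the general LoRA idea: keep the original weight matrix frozen and learn a small low-rank update on top. The layer sizes, rank and class name below are toy assumptions for illustration and have nothing to do with StyleDrop's actual setup.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """A frozen weight matrix W plus a trainable low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weights (random stand-ins here).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A starts random, B starts at zero,
        # so the update is a no-op before any training happens.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def frozen_params(self):
        return len(self.W) * len(self.W[0])

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear(d_in=64, d_out=64, rank=4)
print(layer.frozen_params(), layer.trainable_params())  # prints: 4096 512
```

Even in this toy layer, only 512 of the 4,608 total values would be updated during training. Scale the same trick up and you get the "train a tiny fraction of a big model" effect.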


Example of how StyleDrop can create similar images from text prompts given the style of an existing image. Source: StyleDrop project page.

2) Learning Vision From Models Rivals Learning Vision from Data by Google Research shows that self-supervised learning models trained purely on synthetic images can create visual representations equivalent to those trained on actual images.

Their method, SynCLR, generates both image captions (using Llama 2 7B) and images from those image captions (using Stable Diffusion 1.5) to create an image-text dataset to learn representations on.

Their model outperforms or performs on par with image-specific models such as DINOv2 which are trained on large curated real-image datasets.

This is a very interesting result as it’s the first of its kind I’ve seen where a model trained on purely synthetic data from other models outperforms an existing state-of-the-art model trained on real data.

Is 2024 the year of synthetic data?

Read the paper on arXiv, see the code on GitHub.


Figure showing different data setup paradigms for learning representations from real data (top) to purely synthetic data (bottom). Source: SynCLR paper.

3) MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training by Apple Research sets a new standard for image-text models capable of running on mobile devices.

MobileCLIP runs up to 2.3x faster than the equivalent ViT-B/16 CLIP model from OpenAI while achieving better performance.

This is possible thanks to a restructured architecture as well as a better data setup via Dataset Reinforcement (or DR, covered in AI & Machine Learning Monthly August 2023), a method to improve datasets by adding in various representations of images from larger-scale models.

Read the paper on arXiv.

4) LLM in a flash: Efficient Large Language Model Inference with Limited Memory from Apple shows how you can run LLMs on devices with constrained memory (e.g. mobile phones) by leveraging the available flash memory. Flash storage often has far larger capacity than DRAM (dynamic random access memory) but with lower bandwidth.

Their methods enable them to run models up to twice the size of the available DRAM, with 4-5x faster inference on CPU and 20-25x faster inference on GPU compared to naive loading approaches.

Read the paper on arXiv.

5) SAM (Segment Anything Model) goes mobile. Three new papers I checked out in the past month all revolved around compressing the original SAM model to make it more efficient and able to run on mobile devices.

  • MobileSAMv2 introduces a 16x faster decoder of the "segment everything" model, a model which segments everything in an image.
  • EdgeSAM improves the speed of the original SAM by 14x and is able to run at over 30 FPS on an iPhone 14.
  • EfficientSAM by Meta significantly improves throughput of the original SAM (~20x faster) whilst maintaining good performance (44.4 AP vs 46.5 AP).

6) Generative Multimodal Models are In-Context Learners introduces Emu2, a generative multimodal model with 37 billion parameters. Emu2 outperforms the closed-source Flamingo 80B from DeepMind and its weights are open-sourced!

The model has also been trained to follow instructions so it can be used for both generation and question answering on a wide range of multimodal problems.

Another cool feature is that Emu2 was trained in a relatively simple fashion: predict the next token. Just like GPT predicts the next token in text, it does the same for the multimodal space.

See the project page, code on GitHub, read the paper, try the demo on Hugging Face.

7) Species196: A One-Million Semi-supervised Dataset for Fine-grained Species Recognition provides a unique and large-scale dataset for invasive species.

The Species196-L dataset contains 19K images with expert-level annotations whereas the Species196-U dataset contains 1.2M images sourced from LAION-5B using image similarity matching on random images and classes in the Species196-L dataset.

The researchers found that pretraining on the larger scale domain-specific dataset instead of ImageNet achieved better results (81.6% vs 80.5%).

Read the paper on arXiv, download the dataset from the project website.


Perhaps my favourite feature of the Species196 paper is the creation of the large-scale unsupervised dataset via filtering specific images from LAION-5B and then using those images for pretraining. Source: Species196 paper.
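
The filtering idea can be sketched in a few lines: embed the labelled reference images, embed a big pool of web images, and keep the web images that sit close to any reference. This is a rough illustration only. The toy 3-dimensional vectors, the 0.8 threshold and the function names are my assumptions, not the paper's actual pipeline (which would use embeddings from a real image model).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_similarity(reference_embeddings, candidates, threshold=0.8):
    """Keep candidate ids whose best match against any reference clears the threshold."""
    kept = []
    for candidate_id, embedding in candidates.items():
        best = max(cosine_similarity(embedding, ref) for ref in reference_embeddings)
        if best >= threshold:
            kept.append(candidate_id)
    return kept


# Toy stand-ins for embeddings of labelled images and a pool of web images.
references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pool = {
    "web_image_a": [0.9, 0.1, 0.0],
    "web_image_b": [0.0, 0.0, 1.0],
    "web_image_c": [0.1, 0.95, 0.05],
}
print(filter_by_similarity(references, pool))  # prints: ['web_image_a', 'web_image_c']
```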

Blog posts 💻

  • Could language modelling be considered a subfield of its own? In A New Alchemy: Language Model Development as a Subfield?, Colin Raffel, an associate professor at the University of Toronto, poses some arguments on how we could assume a fixed language model architecture (e.g. Transformer + some form of Adam optimizer, because this combination has proven time and time again to get pretty good results) and then see how far it can go. Much more in the blog post.


Points describing the assumptions of language modelling as a subfield. Source: Colin Raffel blog.

  • Language Disagreement Modelling introduces a simple concept for improving your own custom models. Have an LLM predict on a sample and predict on the same sample with your own custom model. Then see if the two disagree. If so, send that sample to annotation to focus on. Later on, you can fine-tune your own model with the improved samples and repeat the process.


Find the samples where a custom model and an LLM disagree and prioritize those for further annotations/dataset improvement. Source:
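
The disagreement check itself is about as simple as it sounds. Here's a minimal sketch, assuming you already have two lists of predicted labels for the same samples. The example texts and labels are made up, and find_disagreements is just a name I picked.

```python
def find_disagreements(samples, llm_labels, model_labels):
    """Return the samples where the LLM and the custom model predict different labels."""
    return [
        {"sample": sample, "llm": llm, "model": model}
        for sample, llm, model in zip(samples, llm_labels, model_labels)
        if llm != model
    ]


# Made-up predictions standing in for real LLM and custom model outputs.
samples = ["great pizza", "cold fries", "ok salad"]
llm_labels = ["positive", "negative", "neutral"]
model_labels = ["positive", "positive", "neutral"]

to_annotate = find_disagreements(samples, llm_labels, model_labels)
print(to_annotate)
```

Here only "cold fries" ends up in to_annotate, so that's the sample you'd send for a human to look at first.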

  • Mistral AI recently released their largest public model, Mixtral 8x7B, an LLM that outperforms Llama 2 70B and even GPT-3.5 on several benchmarks. A sensational advancement for open-source AI. It does so by leveraging the “Mixture of Experts” (MoE) paradigm, hence the 8x7B: there are 8 “expert” sub-networks and a router picks which ones to use for each token at inference time. If this sounds confusing, Hugging Face released a blog post explaining what MoE is and different techniques that can be used to construct MoE models.
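
If a picture of MoE routing helps, here's a toy sketch: a router scores each expert, the top-scoring experts run, and their outputs are blended using softmax weights. The four "experts" below are simple scaling functions and the router scores are hard-coded. Both are assumptions for illustration, since a real model learns the router and uses neural network layers as experts.

```python
import math


def softmax(values):
    """Convert raw scores into weights that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, router_logits, experts, top_k=2):
    """Run only the top_k experts chosen by the router and blend their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    outputs = [experts[i](x) for i in chosen]
    combined = [
        sum(w * out[d] for w, out in zip(weights, outputs)) for d in range(len(x))
    ]
    return combined, chosen


def make_scaling_expert(factor):
    """Toy expert: multiplies every input value by a fixed factor."""
    return lambda x: [factor * v for v in x]


experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hard-coded scores; normally produced by a learned router

output, chosen = moe_layer([1.0, 2.0], router_logits, experts)
print(chosen, output)  # prints: [1, 3] [3.0, 6.0]
```

Only 2 of the 4 experts do any work for this input, which is the whole point: lots of total parameters, but only a fraction of them used per token.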

  • You may have heard of RAG (Retrieval Augmented Generation) for text but what about RAG for text and images or multimodal RAG? LangChain released a blog post discussing two approaches for multimodal RAG: 1. Multimodal embeddings (e.g. turn slides or images into multimodal embeddings with a CLIP-like model). 2. Multivector retriever (e.g. create image captions for the slides/images with a model like GPT-4V and then embed the summary of the image). They found that multimodal RAG performs better than text-only RAG and that image summary generation works better than multimodal embeddings for RAG with presentation slides.


Two approaches to a multimodal RAG pipeline. Source: LangChain blog.
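
The multivector idea (search over text summaries, return the original image) can be sketched without any particular library. Everything below is a toy: the bag-of-words embed function stands in for a real embedding model, the summaries stand in for captions a model like GPT-4V would write, and the class is my own sketch rather than LangChain's API.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a real pipeline would use a text embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SummaryIndex:
    """Search over summary embeddings but hand back the original item (e.g. an image path)."""

    def __init__(self):
        self.entries = []  # list of (summary_embedding, original_item)

    def add(self, summary, original):
        self.entries.append((embed(summary), original))

    def retrieve(self, query, k=1):
        query_embedding = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_embedding, e[0]), reverse=True)
        return [original for _, original in ranked[:k]]


index = SummaryIndex()
# Hypothetical slide images with hand-written stand-ins for model-generated summaries.
index.add("bar chart of quarterly revenue growth", "slides/slide_03.png")
index.add("diagram of the model training pipeline", "slides/slide_07.png")

print(index.retrieve("revenue growth chart"))  # prints: ['slides/slide_03.png']
```

The retrieved image (not its summary) is what you'd then pass to a multimodal LLM along with the question.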

  • LangChain released their round up of the State of AI in 2023 with a particular focus on generative AI. Some notable findings include the top LLM providers being OpenAI, Azure+OpenAI, Anthropic, Hugging Face and Vertex AI. And for the most common open-source providers, we see Hugging Face at number 1, then Ollama, Llama.cpp and Replicate.

Quick Links 🔗

See you next month!

What a massive month for the AI/ML world in December!

As always, let me know if there's anything you think should be included in a future post.

In the meantime, keep learning, keep creating, keep dancing.

See you next month,

Daniel | YouTube

By the way, I'm also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.
