21st issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Daniel here, I'm 50% of the instructors behind Zero To Mastery's Machine Learning and Data Science Bootcamp course and our new TensorFlow for Deep Learning course! I also write regularly about machine learning on my own blog and make videos on the topic on YouTube.
Welcome to this edition of Machine Learning Monthly. A 500ish (+/-1000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
I'm trying a new style this month though. This month's theme is: the value of custom datasets.
Charlie works as a machine learning engineer for the largest tech company in the world, but he wants to be a writer. So during the day he writes code and in the evenings he writes words: letters to his nephew Pauly about what he's discovered in his world-generating computer program XK-1.
I've started working on a full-stack machine learning project to take a photo of food and learn about it. This combines my interests of health and nutrition and machine learning. My goal is to build a data flywheel (collect data, model, improve model, collect data, repeat). Stay tuned for video updates on my YouTube.
Studying online can be hard. So I created a video with the help of my brother Sam that teaches six skills you can use every day to make it fun.
A fantastic example of taking something that interests you and combining it with another skill. In Gerasimos's case, he used the NBA data API and his data science skills to pick the best NBA player transfer. Turns out his prediction was right. If Gerasimos wanted a job at the NBA as a data scientist, this would be the way to start.
A group of doctors created a custom dataset to map mitotic (a type of cell division) figures for more reliable cancer prognosis in animals. The most important takeaway from this article for me was the creation of a custom dataset + iterating on that dataset. A trend for many of the resources in this month's issue.
We've all written Jupyter Notebooks that are all over the place and had to rerun every cell just to get back to where we were (I still do this). But Eduardo's article is here to help change that. By making your code more modular and reproducible, your notebooks will be more understandable to others and (more importantly) your future self. My favourite is number 2, package your project; I'd never even known about this.
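To make that concrete, here's a tiny example of the modular style Eduardo advocates: wrapping a notebook cell's logic in a named function so it can be re-run, imported and tested without replaying the whole notebook (the `clean_text` step is my own stand-in, not from the article):

```python
# Instead of a cell full of loose top-level statements, wrap each step in a
# small function. Now it can be re-run, imported from a package or tested
# on its own, without executing every cell above it.
def clean_text(text: str) -> str:
    """Lowercase and strip whitespace (a stand-in for a real preprocessing step)."""
    return text.strip().lower()

# Each step of the notebook becomes a callable, reproducible unit
cleaned = clean_text("  Hello Notebook  ")
```

Once your steps look like this, packaging the project (Eduardo's tip number 2) is just a matter of moving the functions into a module you can import.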
One of the questions I still get most often is: "How do I learn math for machine learning?" And I usually reply, "replicate the math with Python code and you'll start to understand it better". Well, the math-as-code GitHub repo does just that (there's a JavaScript version too). Take the example below for sigma.
Example of math as code for sigma notation from the math-as-code Python repo.
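If you haven't seen it before, sigma notation in code really is that direct. A quick sketch in the spirit of the repo (not copied from it), summing i from 1 to 100:

```python
# Sigma notation: the sum over i from 1 to 100 of i
# In math: sigma with i=1 below, 100 above, and i as the body
total = sum(i for i in range(1, 101))

# Matches the closed form n * (n + 1) / 2 for n = 100
```

Once you see the summation symbol as a for loop with an accumulator, a lot of ML papers get far less scary.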
The new iPhone launched last week and it's got a new video feature called Cinematic Mode. Cinematic Mode uses machine learning to blur the background of a video whilst keeping a chosen subject in focus. It's an advanced filmmaking technique cinematographers use to tell stories in films and now it's available in your pocket.
Fascinating to read how they created a dataset for this: using plenty of video with different depth measurements to train models able to predict at 30 FPS, all while generating a live preview! Apple's website also speaks about the data collection process they used, though not too much (don't wanna spill all the secrets to the competition).
Going with the trend of this month's issue, if you want a custom machine learning model, you'll need a custom dataset. One can only imagine how this was created. I'm also amazed Apple was able to get such a model running on a mobile phone in real-time...
Google AI developed a new anomaly detection (also called outlier detection or out-of-distribution detection) method to answer the question "what should belong in my data and what shouldn't?"
This technique is very common in manufacturing (detecting poorly made samples with defects) and finance (detecting fraudulent transactions). Google's new self-supervised method achieves state-of-the-art results on several datasets, some by a long margin. If you need to find strange samples in your datasets, check this out. I plan on using similar methods in my Nutrify project to detect different and abnormal kinds of food images.
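For a feel of the problem (nothing like Google's self-supervised method, just the classic baseline idea of "what doesn't belong?"), here's a minimal median-based outlier sketch:

```python
import statistics

def find_anomalies(values, threshold=3.5):
    """Flag values far from the median, measured in units of the median
    absolute deviation (MAD). A classic robust baseline, assuming MAD > 0."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) / mad > threshold]

# Mostly normal transaction amounts with one obvious outlier
amounts = [10, 12, 11, 9, 13, 10, 11, 500]
```

The median-based version is worth knowing because a big enough outlier can drag the mean and standard deviation towards itself and hide from a plain z-score check.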
Given a certain sample, find the samples in the dataset that differ from it. For example, with a dataset of cat and dog images, if given an image of a dog, what's the anomalous version?
In contrast to the above, what if you wanted to find similar examples in your dataset? TensorFlow's new Similarity module allows you to build models to do just that. Using contrastive learning, the (current) models in TensorFlow Similarity learn to pull embeddings (representations of data) of similar samples closer together whilst pushing embeddings of different samples further away from each other. For example, say you had a dataset of images of 100 different dog breeds but didn't know which breeds were which; you could use TensorFlow Similarity models to group similar images together and then label the groups. This is another module I'm planning to explore for Nutrify. See the code on GitHub.
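The core idea, independent of TensorFlow Similarity's actual API, is that similar things end up close together in embedding space, so finding "similar examples" becomes a nearest-neighbour search. A toy sketch with hand-made embeddings (a real model would produce these):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(query, embeddings):
    """Return the name of the stored embedding closest to the query vector."""
    return max(embeddings, key=lambda name: cosine_similarity(query, embeddings[name]))

# Toy 3-D embeddings; the labradors sit close together, the poodle far away
embeddings = {
    "labrador_1": [0.9, 0.1, 0.0],
    "labrador_2": [0.8, 0.2, 0.1],
    "poodle_1": [0.1, 0.9, 0.2],
}
```

A new dog photo whose embedding lands near the two labradors would get grouped (and labelled) with them.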
We've spoken about creating custom datasets for custom models. But how should one create such datasets? How should you label your data? Eugene Yan's latest post explores some of the most powerful techniques for labelling data in from-scratch, mid-stage and late-stage machine learning projects, such as semi-supervised, active and weakly supervised learning. Eugene's blog posts are some of the highest quality in the data space on the internet.
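Of those, active learning is perhaps the simplest to sketch: label the samples the model is least confident about first, since those labels teach it the most. A minimal, hypothetical version for a binary classifier (the function name and numbers are mine, not Eugene's):

```python
def pick_most_uncertain(probabilities, k=2):
    """Return the indices of the k samples whose predicted probability of the
    positive class is closest to 0.5, i.e. where the model is least sure.
    These are the most valuable samples to send for labelling next."""
    by_uncertainty = sorted(range(len(probabilities)),
                            key=lambda i: abs(probabilities[i] - 0.5))
    return by_uncertainty[:k]

# Model confidences for 5 unlabelled samples
probs = [0.98, 0.52, 0.03, 0.47, 0.91]
```

Labelling samples 1 and 3 here (the 0.52 and 0.47 predictions) moves the model far more than relabelling the ones it already nails.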
As I dive deeper into the world of full-stack machine learning, Chip Huyen's writing is like a shining light guiding me where I need to go. In this post she breaks down many of the new directions the field of MLOps is going, such as workflow abstraction (tools to orchestrate the steps in your project) and infrastructure abstraction (setting up the hardware where the code in your project runs, e.g. cloud, local, etc). If you feel like the ML tooling space is the wild wild west right now (it is), reading Chip's posts (sometimes more than once) will clear things up.
A modelling workflow as a DAG (directed acyclic graph, a fancy word for flowchart). Drawing these is fun but where does each step run? Chip's post speaks more on this point.
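If you want to play with the DAG idea itself, Python's standard library can already represent one and compute a valid run order (the step names below are hypothetical, not from Chip's post):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps it depends on
workflow = {
    "collect_data": set(),
    "preprocess": {"collect_data"},
    "train_model": {"preprocess"},
    "evaluate": {"train_model"},
    "deploy": {"evaluate"},
}

# A topological sort gives an order where every step runs after its dependencies
run_order = list(TopologicalSorter(workflow).static_order())
```

Workflow orchestration tools essentially do this same bookkeeping, plus the harder part Chip's post digs into: deciding where each step actually runs.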
This one's close to home. My Dad has Alzheimer's and I'm sure many people reading this know someone with some form of dementia too. Well, turns out a research team has constructed a dataset of fMRI (functional magnetic resonance imaging) scans (again, custom datasets) and have built an algorithm based on ResNet-18 that's able to classify the presence of several stages of MCI (mild cognitive impairment) vs. AD (Alzheimer's disease) at over 99% accuracy each. The obvious potential here is: earlier diagnosis, earlier potential for treatment.
As the saying goes, "everybody wanna build state of the art computer vision models but nobody wanna build datasets of fake dog sh*t made of Play-Doh". But truly, this is wild. The iRobot team constructed a dataset of "many, many thousands" of samples of pet mess to guarantee their new Roomba vacuums no longer roll over something you'd rather not have on your rug.
World-class dog poo avoidance, now available in your living room.
What a massive month for the ML world in September!
As always, let me know if there's anything you think should be included in a future post.
Liked something here? Tell a friend!
In the meantime, keep learning, keep creating, keep dancing.
See you next month, Daniel
PS. Video update: I've decided to spend time working on different kinds of YouTube videos, so the text-based issues of Machine Learning Monthly (what you just read) will continue but the video versions will stop. Plenty more to come.
By the way, I'm a full-time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a couple of our courses below or see all Zero To Mastery courses by visiting the courses page.