9th issue! If you missed them, you can read the previous issues of the Machine Learning Monthly newsletter here.
Hey everyone, Daniel here, I'm 50% of the instructors behind the Complete Machine Learning and Data Science: Zero to Mastery course. I also write regularly about machine learning on my own blog and make videos on the topic on YouTube.
Welcome to the ninth edition of machine learning monthly. A 500ish (+/-1000ish, usually +) word post detailing some of the most interesting things on machine learning I've found in the last month.
Since there's a lot going on, the utmost care has been taken to keep things to the point.
What it is: Ines, one of the core developers of the popular and industrial-grade NLP library spaCy, takes you through a series of tutorials on how to use spaCy for NLP. You'll start with text processing (preparing text so patterns can be found in it), then move into large-scale data analysis, how to create text preprocessing pipelines, a bunch of tips and tricks for your own NLP projects, and finally you'll learn how to write a neural network training loop from scratch.
Best of all, it's completely free!
Why it matters: With the rise of models like GPT-3, many other methods of NLP haven't gotten as much airtime as they deserve. This tutorial series gives you a solid foundation in many of the traditional rule-based and machine learning-based NLP solutions spaCy provides.
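To get a feel for what working with spaCy looks like, here's a minimal sketch (my own example, not from the course) that loads a pretrained English pipeline and pulls out named entities and token attributes, assuming you've installed spaCy and the `en_core_web_sm` model:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

# Load a small pretrained English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Google released a tiny NLP model called pQRNN in September 2020.")

# Named entities found by the pretrained NER component
for ent in doc.ents:
    print(ent.text, ent.label_)

# Token-level attributes: text, part-of-speech tag, stop word or not
for token in doc:
    print(token.text, token.pos_, token.is_stop)
```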
What it is: A phenomenal article and tutorial walkthrough on different text preprocessing methods for text classification.
Mauro Di Pietro walks you through everything from more traditional methods of text preprocessing, such as TF-IDF (term frequency–inverse document frequency), to a newer technique known as the language model, which powers many state-of-the-art NLP architectures (like GPT-3). Not only does Mauro explain what these techniques are, he also provides code examples so you can try them for yourself.
Why it matters: Following on from the last resource, the field of NLP has been flooded by the power of language models. But sometimes these kinds of models are too large to be used in production settings. So if you're looking at building production NLP systems, it's worth knowing what other options are available and how they perform compared to newer techniques.
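If you want to see what the more traditional end of the spectrum looks like in code, here's a minimal TF-IDF text classification sketch using scikit-learn (my own toy example, not Mauro's): turn documents into TF-IDF vectors, then fit a logistic regression classifier on top of them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data (replace with your own documents and labels)
texts = ["I loved this movie", "terrible acting and plot",
         "what a great film", "worst film I have ever seen"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# TfidfVectorizer turns raw text into weighted word-count vectors,
# LogisticRegression learns a classifier on top of them
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["what an absolutely great movie"]))  # -> [1]
```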
What it is: Move over Transformers, Google just released an NLP architecture powered by RNNs (recurrent neural networks) which performs almost as well as BERT but with 300x fewer parameters.
They're calling the new architecture pQRNN because it combines PRADO (a previous iteration which achieved state-of-the-art results and was recently open-sourced) with a quasi-RNN encoder for fast parallel processing.
Why it matters: Despite their incredible performance, many NLP architectures are infeasible to run without large-scale compute. I remember trying to implement BERT on a previous project and having to abandon it because of how large the model was; our compute constraints just couldn't handle it. pQRNN performing at almost the same level as BERT but with a 300x smaller footprint opens the door for many use cases where compute power is restricted, such as on mobile devices.
What it is: From ordinary linear regression to logistic regression to classifiers to decision trees to neural networks, this phenomenal work from Danny Friedman goes through the concepts (theory), construction (building blocks) and implementation (using the building blocks) of many of the most important techniques in machine learning.
Why it matters: I'm a big fan of resources which teach the whole picture, ground up and top down at the same time. So if you're after a one-stop shop for theory, math and code explanations of different machine learning techniques, the Machine Learning from Scratch: Derivations in Concept and Code book is it.
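As a taste of the "from scratch" style, here's a minimal sketch (my own, not from the book) of ordinary linear regression solved with the normal equation, beta_hat = (XᵀX)⁻¹Xᵀy, using nothing but NumPy:

```python
import numpy as np

# Toy data: y = 3x + 2 plus a little noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(0, 0.5, size=100)

# Design matrix with a column of ones for the intercept term
X = np.column_stack([np.ones_like(x), x])

# Normal equation: beta_hat = (X^T X)^-1 X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # roughly [2, 3] -> intercept of 2, slope of 3
```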
What it is: There are plenty of resources out there for what to learn but not so many for how to learn. This piece fills that gap for learning the math behind machine learning.
If you've decided to dip your toe into the world of machine learning but are wondering whether or not your math skills are up to scratch, Learning Math for Machine Learning is worth reading.
Why it matters: One of my favorite quotes from the article talked about the simple reason why people are good at math: they've had plenty of practice doing math. And so when they get stuck, much like you get stuck on a coding problem, they're comfortable trying to figure things out. In other words, it's often not that you can't learn something, it's just that you haven't practiced it enough.
I also related closely to another quote about "playful exploration" when learning. It made the point that you solve problems by dancing from one perspective to another, and that math is just one of the many perspectives you can view a problem through.
What it is: Oh, I loved this one! An incredible piece by Eugene Yan about how both individual data scientists and data science teams can benefit from having end-to-end skillsets. By end-to-end, Eugene means being able to take a raw problem, figure out a solution and build it yourself.
Why it matters: How often do projects get stopped because of a roadblock down the line? As in, you've done your part and you're waiting on someone else to do theirs? Or God forbid, the reverse, someone else waiting on you to do your part.
In Data Scientists Should Be More End-to-End, Eugene Yan argues that building machine learning solutions doesn't necessarily require overspecialization in one thing but rather generalization across a wide range of things (e.g. data processing, modeling, data engineering, deployment, monitoring).
Be sure to check out the great set of practical steps in the "Do your own projects end-to-end" section.
What it is: Speaking of going more end-to-end, GitHub Actions (automatic workflows based on actions you perform on GitHub) might be a tool you can use for your experiment running and tracking pipelines.
In Using GitHub Actions for MLOps & Data Science, Hamel Husain, a machine learning engineer at GitHub, walks you through a series of example use cases for machine learning and data science-focused GitHub Actions.
Updated your model's hyperparameters and saved the file to GitHub? Why not kick off a modeling experiment with Weights & Biases?
Saved an annotated Jupyter Notebook to your GitHub repo? Why not create a blog post out of it with fastpages?
Why it matters: If you find yourself repeating the same processes over and over again (a common practice in machine learning), you'll probably want to automate them. GitHub Actions gives you the flexibility to do so, all in the same location as the rest of your code.
What it is: Ever wanted to train AI models with the efficiency of a startup but with the power of a technology giant? Well, you're in luck. That's where Microsoft's DeepSpeed comes in.
DeepSpeed is an open-source PyTorch-compatible deep learning optimization library to help you train larger models faster.
More specifically, it offers four major features:
Why it matters: Regardless of whether you're in a large company or at a small startup, if you were offered the ability to train your machine learning models more efficiently, you probably wouldn't say no. So it's worth looking into whether DeepSpeed can step your model training up a gear.
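For reference, wrapping an existing PyTorch model with DeepSpeed looks roughly like the sketch below. This is a simplified example based on the library's documented `deepspeed.initialize` API; the model and config values are placeholders, so treat it as a starting point rather than a drop-in recipe.

```python
import torch
import deepspeed

# Any ordinary PyTorch model works here (placeholder architecture)
model = torch.nn.Sequential(torch.nn.Linear(784, 128),
                            torch.nn.ReLU(),
                            torch.nn.Linear(128, 10))

# Placeholder config: batch size, mixed precision, ZeRO memory optimization
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# DeepSpeed returns an engine which handles the optimizer, mixed precision,
# gradient accumulation and distributed details for you
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)

# A training step then looks almost like vanilla PyTorch:
#   outputs = model_engine(inputs)
#   loss = loss_fn(outputs, targets)
#   model_engine.backward(loss)
#   model_engine.step()
```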
What it is: You might've built an image classification model before and thought, hey that's pretty cool.
But what if you wanted to get a little more fine-grained and instead of classifying a whole image, find individual objects within an image?
And more importantly, where those objects appear?
That's where object detection comes in. And the team from Roboflow (an amazing computer vision startup I've become a big fan of) have created an end-to-end guide for your object detection needs. From use cases to data labelling to data augmentation to model building for object detection, the guide has you covered.
Why it matters: It can be tricky to navigate a problem without the right tool set. Roboflow not only creates amazing computer vision tools (they can host, preprocess and model your image data), they also practice what they preach by creating world-class guides on how you can use computer vision for your own problems.
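To get a sense of what object detection output looks like in practice (bounding boxes, class labels and confidence scores rather than a single class per image), here's a minimal sketch using a pretrained torchvision model (my own example, not from the Roboflow guide):

```python
import torch
import torchvision

# Pretrained Faster R-CNN, trained on the COCO dataset (~90 object categories)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# A dummy 3-channel image tensor; replace with a real image scaled to [0, 1]
image = torch.rand(3, 480, 640)

with torch.no_grad():
    predictions = model([image])  # the model takes a list of images

# Each prediction is a dict of bounding boxes, class labels and scores
print(predictions[0]["boxes"].shape)  # (num_detections, 4) -> [xmin, ymin, xmax, ymax]
print(predictions[0]["labels"])       # COCO class indices
print(predictions[0]["scores"])       # confidence per detection
```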
What it is: Have you been to the supermarket lately? Chances are, you bought something which was grown on a large farm. You might've already known this but did you know that farm might've been using computer vision models powered by PyTorch?
In AI for AG: Production machine learning for agriculture, Chris Padwick, Director of Computer Vision and Machine Learning at Blue River Technology (a company which builds smart farming technology) goes through how they built their See & Spray device.
See & Spray uses computer vision models powered by PyTorch to identify weeds from plants and then spray them if necessary.
Why it matters: It can be hard to see how the deep learning models you build relate to the real world, but the See & Spray use case is a prime example. Not only does it solve the problem of weeds growing where they shouldn't, but because it can identify weeds from plants, it also uses far fewer chemicals than spraying the whole field, in turn saving costs and avoiding unnecessary contamination.
Phew!
Looks like September was another massive month for the ML world.
As always, let me know if there's anything you think should be included in a future post. Liked something here? Send us a tweet.
In the meantime, keep learning, keep creating.
See you next month,
Daniel www.mrdbourke.com | YouTube
By the way, I'm a full time instructor with Zero To Mastery Academy teaching people Machine Learning in the most efficient way possible. You can see a couple of our courses below or see all Zero To Mastery courses by visiting the courses page.