
Deep Learning With TensorFlow: From Setup To Data Preparation

Daniel Bourke

Welcome to Part 1 of my brand new 3-part series on TensorFlow and Deep Learning.

Sidenote: Technically this 'mini series' is part of my larger 'Introduction to Machine Learning' series, but I went so deep on this particular section, I needed to make it into 3 parts!

Be sure to check out the other parts in the series, as they all lead into each other:

  • Part 1: Data Acquisition, Exploration, Preparation, and turning data into a TensorFlow dataset (This is the guide you’re reading right now)
  • Part 2: Building and training TensorFlow models, and creating a neural network (Coming soon)
  • Part 3: How to evaluate a model - create, compile, and make predictions (Coming soon)

Are you looking to learn the basics of TensorFlow for Deep Learning?

Well, you’re in the right place!

The goal of this series is to give you an overview of deep learning (and more specifically, transfer learning) when using TensorFlow and Keras.

Even better still?

Rather than just tell you what this all means, I’m also going to walk you through a project that you can follow along with, so you can learn as you go.

The project we’re going to build is called ‘Dog Vision’. It’s a neural network capable of identifying different dog breeds via images.

How this will work:

In this first part of the series (this guide), we’ll set up the project, get our data, explore it, create a training set, and then turn our data into a TensorFlow dataset.

This is an essential skill for any machine learning project, and it's why we're dedicating an entire post to it.

I'll show you how to do this from start to finish in just a second.

In the second part of the series, we’ll take the dataset that we create in this guide and use it to build a neural network and train (fit) it on the data.

The third part of this series covers evaluating that model, making predictions with it, and deploying it, all of which are crucial for understanding how to assess and use trained models effectively.

As you can see, we’re going to work all the way from dataset preparation to model building, training, and evaluation, so this is a complete project walkthrough from start to finish.

Not only will it help you to understand TensorFlow better, but you’ll have hands-on experience and a project for your portfolio by the end of it!

Why listen to me?

My name is Daniel Bourke, and I'm the resident Machine Learning instructor here at Zero To Mastery.

Originally self-taught, I worked for one of Australia's fastest-growing artificial intelligence agencies, Max Kelsen, and have worked on Machine Learning and data problems across a wide range of industries including healthcare, eCommerce, finance, retail, and more.

I'm also the author of Machine Learning Monthly, write my own blog on my experiments in ML, and run my own YouTube channel - which has hit over 8 Million views.

Sidenote: If you want to deep dive into Machine Learning and learn how to use these tools even further, then check out my complete Machine Learning and Data Science course or watch the first few videos for free.

It’s one of the most popular, highly rated Machine Learning and Data Science bootcamps online, as well as the most modern and up-to-date. Guaranteed.

You'll go from a complete beginner with no prior experience to getting hired as a Machine Learning Engineer this year, so it’s helpful for ML Engineers of all experience levels.

If you already have a good grasp of Machine Learning and just want to focus on TensorFlow for Deep Learning, I also have a course on that.

With that out of the way, let’s get into this guide, and cover some fundamentals right up top.

What is TensorFlow?

TensorFlow is an open-source machine learning and deep learning framework originally developed by Google.

It allows you to manipulate data and write deep learning algorithms using Python code.

As you can imagine, this is incredibly handy, but it gets better because TensorFlow also has several built-in capabilities to leverage accelerated computing hardware such as GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units).

It’s so popular that many of the world's largest companies power their machine learning workloads with TensorFlow.

What is Deep Learning?

Deep Learning is a form of machine learning where data passes through a series of progressive layers which all contribute to learning an overall representation of that data.

The "deep" in deep learning comes from the number of layers used in the neural network.

Each layer performs a pre-defined operation, and the series of progressive layers combine to form what's referred to as a neural network.

For example

A photo may be turned into numbers (e.g. red, green, and blue pixel values) and those numbers are then manipulated mathematically through each progressive layer to learn patterns in the photo.
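To make that a little more concrete, here's a minimal sketch in plain Python (no TensorFlow needed). The tiny 2x2 "photo", the `dense_layer` helper and all of the weights are made up for illustration; a real network has far more layers and learns its weights from data rather than having them typed in.

```python
# A tiny 2x2 "photo": each pixel is a (red, green, blue) triple in the range 0-255.
photo = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

# Step 1: turn the photo into a flat list of numbers scaled to the range 0-1.
pixels = [channel / 255 for row in photo for pixel in row for channel in pixel]


def dense_layer(inputs, weights, biases):
    """One layer: a weighted sum of the inputs per output unit, followed by ReLU."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(max(0.0, total))  # ReLU keeps positives and zeroes out negatives
    return outputs


# Step 2: pass the numbers through two layers, one after the other.
# These weights are hypothetical; a real network learns them during training.
layer_1_weights = [[0.1] * 12, [-0.2] * 12, [0.05] * 12]  # 12 inputs -> 3 units
layer_1_biases = [0.0, 0.5, 0.1]
layer_2_weights = [[1.0, -1.0, 0.5], [0.3, 0.3, 0.3]]  # 3 inputs -> 2 units
layer_2_biases = [0.0, 0.0]

hidden = dense_layer(pixels, layer_1_weights, layer_1_biases)
output = dense_layer(hidden, layer_2_weights, layer_2_biases)

print(len(pixels), len(hidden), len(output))  # 12 3 2
```

Each layer takes the previous layer's numbers as its input, which is the "series of progressive layers" idea in miniature.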

So when someone says deep learning or artificial neural networks, they're typically referring to the same thing.

Note: Artificial intelligence (AI), machine learning (ML) and deep learning are all broad terms. You can think of AI as the overall technology, machine learning as a type of AI, and deep learning as a type of machine learning.

So if someone refers to AI, you can often assume they are talking about machine learning or deep learning, although not always.

What can Deep Learning be used for?

Deep learning is such a powerful technique that new use cases are being discovered every day. In fact, most of the modern artificial intelligence (AI) applications you see are powered by deep learning.

Two of the most useful types of this are predictive and generative AI.

Predictive AI learns the relationship between data and labels, such as photos of dogs and their breeds (supervised learning), so that when it sees a new photo of a dog, it can predict its breed based on what it has learned so far. This is what we’re going to build.

Generative AI generates something new given an input such as creating new text given input text. The most common example of this is ChatGPT.

Some examples of Predictive AI problems include:

  • Tesla's self-driving cars use deep learning and object detection models to power their computer vision systems
  • Apple's Photos app uses deep learning to recognize faces in images and create Photo Memories
  • Siri and Google Assistant use deep learning to transcribe speech and understand voice commands
  • Nutrify (an app my brother and I built) uses predictive AI to recognize food in images
  • Magika uses deep learning to classify a file into what type it is (e.g. .jpeg, .py, .txt)
  • Text classification models such as DeBERTa use deep learning to classify text into different categories such as "positive" and "negative" or "spam" or "not spam"

Some examples of Generative AI problems include:

  • Stable Diffusion uses generative AI to generate images given a text prompt
  • ChatGPT and other large language models (LLMs) such as Llama, Claude, Gemini, and Mistral use deep learning to process text and return a response
  • GitHub Copilot uses generative AI to generate code snippets given the surrounding context

All of these AI use cases are powered by deep learning.

However, more often than not, whenever you get started on a deep learning problem, you'll start with transfer learning.

What is Transfer Learning?

Transfer learning is one of the most powerful and useful techniques in modern AI and machine learning.

It involves taking what one model (or neural network) has learned in a similar domain and then applying it to your own, i.e. reusing the knowledge it has already picked up so you can apply it to your problem.

For example

In our case, we're going to use transfer learning to take the patterns a neural network has learned from the 1 million+ images and over 1000 classes in ImageNet (a gold standard computer vision benchmark) and apply them to our own problem of recognizing dog breeds.

However, this same concept can be applied to many different domains.

For example

You could take a large language model (LLM) that has been pre-trained on most of the text on the internet, and so has learned the patterns in natural language very well, and customize it for your own specific chat use case.

The biggest benefit of transfer learning is that it often allows you to get outstanding results with less data and time.

Let's get started!

Now that you understand all this, let’s get started on our deep learning project.

We’ll set up the project, get our data, explore it, create a training set, and then turn our data into a TensorFlow dataset, ready for Part 2 of this series.

Grab a coffee and let's dive in...

Step #1. Getting set up

This project is designed to run in Google Colab, an online Jupyter Notebook that provides free access to GPUs (Graphics Processing Units).

Why use a GPU?

Since neural networks perform a large number of calculations behind the scenes (the main one being matrix multiplication), you need a computer chip that can perform these calculations quickly. Otherwise, you'll be waiting all day just for a model to train.

Without getting into the complex reasons why, generally GPUs are much faster at performing matrix multiplications than CPUs.
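To get a feel for how much work a matrix multiplication involves, here's a small pure-Python sketch that multiplies two matrices and counts the individual multiplications along the way. The `matmul` helper is written just for this illustration (in practice you'd let TensorFlow handle it). The point to notice is that each multiplication is independent of the others, which is what lets hardware built to do many of them at once pull ahead.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows, counting the multiplications."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions must match")
    multiplications = 0
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                result[i][j] += a[i][k] * b[k][j]
                multiplications += 1
    return result, multiplications


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product, count = matmul(a, b)
print(product)  # [[19.0, 22.0], [43.0, 50.0]]
print(count)    # 8 multiplications for a 2x2 by 2x2 product

# The count grows as rows * inner * cols, so it climbs quickly with size.
print(1024 * 1024 * 1024)  # 1073741824 multiplications for two 1024x1024 matrices
```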

Getting set up with Google Colab

For a quick rundown on how to use Google Colab, see their introductory guide (it's quite similar to a Jupyter Notebook with a few different options).

Google Colab also comes with many data science and machine learning libraries pre-installed, including TensorFlow/Keras, so it works perfectly for our needs.

Getting a GPU on Google Colab

Before running any code, we'll need to make sure that our Google Colab instance is connected to a GPU.

You can do this by going to Runtime -> Change runtime type -> and then select a GPU (this may restart your existing runtime).

Importing TensorFlow

Let's import TensorFlow using the common abbreviation tf.

import tensorflow as tf
tf.__version__

You’ll know it’s worked when the cell outputs the version that has been imported.

Check for a GPU on your Colab runtime environment

Now let's check to see if TensorFlow has access to a GPU (this isn't 100% required to complete this project but will speed things up dramatically).

We can do so with the method tf.config.list_physical_devices().

# Do we have access to a GPU?
device_list = tf.config.list_physical_devices()
if "GPU" in [device.device_type for device in device_list]:
  print(f"[INFO] TensorFlow has GPU available to use. Woohoo!! Computing will be sped up!")
  print(f"[INFO] Accessible devices:\n{device_list}")
else:
  print(f"[INFO] TensorFlow does not have GPU available to use. Models may take a while to train.")
  print(f"[INFO] Accessible devices:\n{device_list}")

The response in colab should look like this:

[INFO] TensorFlow has GPU available to use. Woohoo!! Computing will be sped up!
[INFO] Accessible devices:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Now that we have this set up, it’s time to add some data to use.

Step #2. Getting our data

All machine learning (and deep learning) projects start with data.

The good news is that there are several options and locations to get data for a deep learning project, such as Kaggle, Hugging Face, or even direct from TensorFlow.

In our case, the dataset we're going to use is called the Stanford Dogs dataset (sometimes referred to as the ‘ImageNet dogs’, as the images are dogs separated from ImageNet).

Our overall project goal is to build a computer vision model that performs better than the original Stanford Dogs paper, which had an average of 22% accuracy per class across 120 classes of dogs.

Because the Stanford Dogs dataset has been around for a while (since 2011, which as of writing this in 2024 is like a lifetime in deep learning), it's available from several resources, including the original Stanford website and TensorFlow Datasets.

Why is this important?

When you're starting out with practicing deep learning projects, there's usually no shortage of datasets available.

However, when you start work on your own projects or within a company environment, you'll likely start to work on custom datasets i.e. datasets you build yourself or aren't available publicly online.

The main difference between the two is that existing datasets often come preformatted and ready to use, whereas custom datasets often require some preprocessing before they're ready to use within a machine learning project.

So let’s start with best practices and practice formatting a dataset.

First of all though, we need the data. You can use any option, but for this example we'll download the raw files from the Stanford website (the TARGET_URL in the download code further down), since they give us something to practice formatting.

For reference, the commented-out code below shows how you could load a ready-made version from TensorFlow Datasets instead.

# Download the dataset into train and test split using TensorFlow Datasets
# import tensorflow_datasets as tfds
# ds_train, ds_test = tfds.load('stanford_dogs', split=['train', 'test'])

The data on the Stanford website comes in three main files, like so:

  1. Images (757MB) - images.tar
  2. Annotations (21MB) - annotation.tar
  3. Lists with train/test splits (0.5MB) - lists.tar

Our goal is to get a file structure where the images are split into training and testing folders, with one subfolder per dog breed inside each.

Important: Set up a Google Colab fail safe!

If you're using Google Colab for this project, remember that any data uploaded to the Google Colab session gets deleted if the session disconnects!

To make sure that you don't have to keep redownloading the data every time you leave and come back to Google Colab, you're going to:

  1. Download the data if it doesn't already exist on Google Drive
  2. Copy it to Google Drive (because Google Colab connects nicely with Google Drive) if it isn't already there
  3. If the data already exists on Google Drive (we've been through steps 1 & 2), we'll import it instead

So here’s how to do this:

  1. Mount Google Drive
  2. Set up constants such as our base directory to save files to, the target files we'd like to download and the target URL we'd like to download from
  3. Set up our target local path to save to
  4. Check if the target files all exist in Google Drive and if they do, copy them locally
  5. If the target files don't exist in Google Drive, download them from the target URL with the !wget command
  6. Create a folder on Google Drive to store the downloaded files
  7. Copy the downloaded files to Google Drive for use later if needed

A fair few steps, but nothing we can't handle. Plus, this is all good practice for dealing with and manipulating data, which is a very important skill in the machine learning engineer's toolbox.

Note: The following data download section is designed to run in Google Colab. If you are running locally, feel free to modify the code to save to a local directory instead of Google Drive.

Input:

from pathlib import Path
from google.colab import drive

# 1. Mount Google Drive (this will bring up a pop-up to sign-in/authenticate)
# Note: This step is specifically for Google Colab, if you're working locally, you may need a different setup
drive.mount("/content/drive")

# 2. Setup constants
# Note: For constants like this, you'll often see them created as variables with all capitals
TARGET_DRIVE_PATH = Path("drive/MyDrive/tensorflow/dog_vision_data")
TARGET_FILES = ["images.tar", "annotation.tar", "lists.tar"]
TARGET_URL = "http://vision.stanford.edu/aditya86/ImageNetDogs"

# 3. Setup local path
local_dir = Path("dog_vision_data")

# 4. Check if the target files exist in Google Drive, if so, copy them to Google Colab
if all((TARGET_DRIVE_PATH / file).is_file() for file in TARGET_FILES):
  print(f"[INFO] Copying Dog Vision files from Google Drive to local directory...")
  print(f"[INFO] Source dir: {TARGET_DRIVE_PATH} -> Target dir: {local_dir}")
  !cp -r {TARGET_DRIVE_PATH} .
  print("[INFO] Good to go!")

else:
  # 5. If the files don't exist in Google Drive, download them
  print(f"[INFO] Target files not found in Google Drive.")
  print(f"[INFO] Downloading the target files... this shouldn't take too long...")
  for file in TARGET_FILES:
    # wget is short for "world wide web get", as in "get a file from the web"
    # -nc or --no-clobber = don't download files that already exist locally
    # -P = save the target file to a specified prefix, in our case, local_dir
    !wget -nc {TARGET_URL}/{file} -P {local_dir} # the "!" means to execute the command on the command line rather than in Python

  print(f"[INFO] Saving the target files to Google Drive, so they can be loaded later...")

  # 6. Ensure target directory in Google Drive exists
  TARGET_DRIVE_PATH.mkdir(parents=True, exist_ok=True)

  # 7. Copy downloaded files to Google Drive (so we can use them later and not have to re-download them)
  !cp -r {local_dir}/* {TARGET_DRIVE_PATH}/

As this runs, you should get a pop-up window asking if you want to connect your Google Drive.

Click yes and then let it finish running.

The final output in colab should look like this:

Output:

Mounted at /content/drive
[INFO] Copying Dog Vision files from Google Drive to local directory...
[INFO] Source dir: drive/MyDrive/tensorflow/dog_vision_data -> Target dir: dog_vision_data
[INFO] Good to go!

And just like that, we now have our data downloaded and a failsafe in place!

This may seem like a bit of work but it's an important step with any deep learning project, and in fairness, all you really need to do is copy the code 😜.
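If you're running locally rather than in Google Colab, the same "only download what's missing" idea can be sketched with just the Python standard library. Treat this as one possible approach rather than the way to do it: the `download_missing_files` helper and its `fetch` parameter are names made up for this example, and the file names and URL are the same constants used above.

```python
from pathlib import Path
from urllib.request import urlretrieve

TARGET_FILES = ["images.tar", "annotation.tar", "lists.tar"]
TARGET_URL = "http://vision.stanford.edu/aditya86/ImageNetDogs"


def download_missing_files(target_dir, files=None, base_url=TARGET_URL, fetch=urlretrieve):
    """Download each file into target_dir unless it is already there.

    `fetch` takes (url, destination_path). It is a parameter so the download
    step can be swapped out, for example when testing without a network.
    Returns the list of file names that were downloaded.
    """
    files = TARGET_FILES if files is None else files
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for file_name in files:
        destination = target_dir / file_name
        if destination.is_file():
            print(f"[INFO] {destination} already exists, skipping download.")
            continue
        print(f"[INFO] Downloading {file_name}...")
        fetch(f"{base_url}/{file_name}", str(destination))
        downloaded.append(file_name)
    return downloaded


# Example usage (uncomment to download roughly 780MB into ./dog_vision_data):
# download_missing_files("dog_vision_data")
```

Running it a second time should skip every file, which is the local equivalent of the Google Drive failsafe.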

Initial checks of our data

OK so let’s check this data has downloaded properly to local_dir (dog_vision_data).

We can first make sure it exists with Path.exists() and then we can iterate through its contents with Path.iterdir() and print out the .name attribute of each file, like so:

Input:

if local_dir.exists():
  print(str(local_dir) + "/")
  for item in local_dir.iterdir():
    print("  ", item.name)

Output:

dog_vision_data/
   lists.tar
   images.tar
   annotation.tar

Excellent! That's exactly the format we wanted.

What is a .tar file?

Now you might've noticed that each file ends in .tar but what is this?

I’ll be honest, I wasn’t sure off the top of my head, so I did what every great engineer and coder does, and I googled some solutions!

Searching for "what is .tar?", I found:

In computing, [tar](https://en.wikipedia.org/wiki/Tar_(computing)) is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes.

Exploring a bit more, I found that the .tar format is similar to a .zip file. However, .zip offers compression, whereas .tar mostly combines many files into one.

How to ‘untar’ the .tar files

So how do we "untar" the files in images.tar, annotation.tar and lists.tar?

Well, we can use the !tar command (or just tar from outside of a Jupyter Cell)!

As an added bonus, by doing this, we will also expand all of the files within each of the .tar archives.

However, rather than just use that command, we'll also use a couple of flags to help us out:

  • The -x flag tells tar to extract files from an archive
  • The -f flag specifies that the following argument is the name of the archive file
  • You can combine flags by putting them together -xf

So let's try it out!

# Untar images, annotations and lists
# -x = extract files from the archive
# -f = tell tar which archive file to deal with
# (optional flags not used here: -v = verbose output, -z = decompress a gzip-compressed archive)
!tar -xf dog_vision_data/images.tar
!tar -xf dog_vision_data/annotation.tar
!tar -xf dog_vision_data/lists.tar

And boom, it’s as easy as that.
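If you'd rather stay in Python (for example, when running outside of a notebook where the ! prefix isn't available), the standard library's tarfile module can do the same job. Here's a minimal sketch; the `extract_tar` helper is a name made up for this example.

```python
import tarfile
from pathlib import Path


def extract_tar(archive_path, destination="."):
    """Extract every member of a .tar archive into destination and return their names."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    # Mode "r" lets tarfile work out whether the archive is compressed or not
    with tarfile.open(archive_path, "r") as archive:
        names = archive.getnames()
        # Newer Python versions offer a "data" filter that blocks unsafe paths;
        # fall back to the plain call where the filter isn't available.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(path=destination, filter="data")
        else:
            archive.extractall(path=destination)
    return names


# Example usage (assumes the archives were downloaded to dog_vision_data/):
# for name in ["images.tar", "annotation.tar", "lists.tar"]:
#     extract_tar(Path("dog_vision_data") / name)
```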

The question now, of course, is what new files did we get after we extracted the .tar files?

Well, we can check in Google Colab by inspecting the "Files" tab on the left, or with Python by using os.listdir(".") where "." means "the current directory".

Input:

import os

os.listdir(".") # "." stands for "here" or "current directory"

Output:

['.config',
 'dog_vision_data',
 'file_list.mat',
 'drive',
 'train_list.mat',
 'Images',
 'Annotation',
 'test_list.mat',
 'sample_data']

Ooooh!

Looks like we've got some new files:

  • train_list.mat - a list of all the training set images
  • test_list.mat - a list of all the testing set images
  • Images/ - a folder containing all of the images of dogs
  • Annotation/ - a folder containing all of the annotations for each image
  • file_list.mat - a list of all the files (training and test list combined)

Our next step is to go through them and see what we've got.

Step #3. Exploring the data

Before building a model, it's always a good idea to explore your dataset for a bit to see what kind of data you're working with.

Exploring a dataset can mean different things depending on which engineer you ask, but personally I like to do the following checks:

  • View at least 100 random samples for a "vibe check". For example, if you have a large dataset of images, randomly sample 10 images at a time and view them. Or if you have a large dataset of texts, what do some of them say? The same with audio. It will often be impossible to view all samples in your dataset, but you can start to get a good idea of what's inside by randomly inspecting samples
  • Visualize, visualize, visualize! This is the data explorer's motto. Use it often. As in, it's good to get statistics about your dataset but it's often even better to view 100s of samples with your own eyes (see the point above)
  • Check the distributions and other various statistics. How many samples are there? If you're dealing with classification, how many classes and labels per class are there? Which classes don't you understand? If you don't have labels, investigate clustering methods to put similar samples close together

Before we get into these checks though, here’s a quick overview of what our data should be like by the end of this guide.

A quick reminder on our target data format

Since our goal is to build a computer vision model to classify dog breeds, we need a way to tell our model what breed of dog is in what image.

A common data format for a classification problem like this, is to have samples stored in folders named after their class name.

For example:

In the case of dog images, we'd put all of the images labeled "chihuahua" in a folder called chihuahua/ (and so on for all the other classes and images).

Sidenote: This folder structure doesn't just work for images, it can also work for text, audio, and other kinds of classification data too.

We could even split these folders so that training images go in train/chihuahua/ and testing images go in test/chihuahua/, and that’s exactly what we'll be working towards creating.
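As a rough sketch of what working toward that structure could look like, suppose each image path has the form ID-breed/filename.jpg (the format the dataset's file lists use). The helper below maps one of those paths to a target location. Note that `get_target_path`, the `images_split` root folder and the choice to lowercase class names are all assumptions made for this illustration.

```python
from pathlib import Path


def get_target_path(image_path, split, target_root="images_split"):
    """Map '<id>-<breed>/<file>.jpg' to '<target_root>/<split>/<breed>/<file>.jpg'."""
    image_path = Path(image_path)
    folder_name = image_path.parent.name  # e.g. "n02085620-Chihuahua"
    # Drop the ID before the first "-" only, so names such as
    # "flat-coated_retriever" keep their own hyphen, then lowercase the rest
    class_name = folder_name.split("-", 1)[1].lower()
    return Path(target_root) / split / class_name / image_path.name


print(get_target_path("n02085620-Chihuahua/n02085620_5927.jpg", split="train").as_posix())
# images_split/train/chihuahua/n02085620_5927.jpg
print(get_target_path("n02108915-French_bulldog/n02108915_9457.jpg", split="test").as_posix())
# images_split/test/french_bulldog/n02108915_9457.jpg
```

The function only builds the path; actually copying the image there (for example with shutil.copy2) would be a separate step.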

With that recap out of the way, let’s look at the data we have so far.

Exploring the file list types

As I went through the downloaded dog image files, I noticed some called train_list.mat, test_list.mat and file_list.mat.

But what are these types of files?

Well, before Python became the default language for machine learning and deep learning, many models and datasets were built in MATLAB.

This means that we’re going to need to be able to open these also.

How to open Matlab files in colab

The good news is that we can use the scipy library (a scientific computing library) to open these.

But even better news is that Google Colab comes with scipy preinstalled, and we can use the scipy.io.loadmat() function to open a .mat file, like so:

Input:

import scipy.io  # import the io submodule explicitly, as a plain "import scipy" doesn't always load it

# Open lists of train and test .mat
train_list = scipy.io.loadmat("train_list.mat")
test_list = scipy.io.loadmat("test_list.mat")
file_list = scipy.io.loadmat("file_list.mat")

# Let's inspect the output and type of the train_list
train_list, type(train_list)

This should then create the following output.

Output:

({'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct  9 08:36:13 2011',
  '__version__': '1.0',
  '__globals__': [],
  'file_list': array([[array(['n02085620-Chihuahua/n02085620_5927.jpg'], dtype='<U38')],
         [array(['n02085620-Chihuahua/n02085620_4441.jpg'], dtype='<U38')],
         [array(['n02085620-Chihuahua/n02085620_1502.jpg'], dtype='<U38')],
         ...,
         [array(['n02116738-African_hunting_dog/n02116738_6754.jpg'], dtype='<U48')],
         [array(['n02116738-African_hunting_dog/n02116738_9333.jpg'], dtype='<U48')],
         [array(['n02116738-African_hunting_dog/n02116738_2503.jpg'], dtype='<U48')]],
        dtype=object),
  'annotation_list': array([[array(['n02085620-Chihuahua/n02085620_5927'], dtype='<U34')],
         [array(['n02085620-Chihuahua/n02085620_4441'], dtype='<U34')],
         [array(['n02085620-Chihuahua/n02085620_1502'], dtype='<U34')],
         ...,
         [array(['n02116738-African_hunting_dog/n02116738_6754'], dtype='<U44')],
         [array(['n02116738-African_hunting_dog/n02116738_9333'], dtype='<U44')],
         [array(['n02116738-African_hunting_dog/n02116738_2503'], dtype='<U44')]],
        dtype=object),
  'labels': array([[  1],
         [  1],
         [  1],
         ...,
         [120],
         [120],
         [120]], dtype=uint8)},
 dict)

Alright, it looks like we get a dictionary with several fields that we may be interested in.

First off, let's check out the keys of the dictionary.

Input:

train_list.keys()

Output:

dict_keys(['__header__', '__version__', '__globals__', 'file_list', 'annotation_list', 'labels'])

My guess is that the file_list key is what we're after, as this looks like a large array of image names (the files all end in .jpg).

So how about we see how many files are in each file_list key, to help us get an idea if this is correct?

Input:

# Check the length of the file_list key
print(f"Number of files in training list: {len(train_list['file_list'])}")
print(f"Number of files in testing list: {len(test_list['file_list'])}")
print(f"Number of files in full list: {len(file_list['file_list'])}")

Output:

Number of files in training list: 12000
Number of files in testing list: 8580
Number of files in full list: 20580

Beautiful! It looks like these lists contain our training and test splits, and not only that, but the full list has a list of all the files in the dataset.

So let's inspect the train_list['file_list'] a little further.

Input:

train_list['file_list']

Output:

array([[array(['n02085620-Chihuahua/n02085620_5927.jpg'], dtype='<U38')],
       [array(['n02085620-Chihuahua/n02085620_4441.jpg'], dtype='<U38')],
       [array(['n02085620-Chihuahua/n02085620_1502.jpg'], dtype='<U38')],
       ...,
       [array(['n02116738-African_hunting_dog/n02116738_6754.jpg'], dtype='<U48')],
       [array(['n02116738-African_hunting_dog/n02116738_9333.jpg'], dtype='<U48')],
       [array(['n02116738-African_hunting_dog/n02116738_2503.jpg'], dtype='<U48')]],
      dtype=object)

OK, it looks like we've got an array of arrays, so how about we turn them into a Python list for easier handling?

How to turn an array of arrays into a Python list

We can do so by extracting each individual item via indexing and list comprehension.

For example

Let's see what it's like to get a single file name.

Input:

# Get a single filename
train_list['file_list'][0][0][0]

Output:

'n02085620-Chihuahua/n02085620_5927.jpg'

That worked out ok, so now let's get a Python list of all the individual file names (e.g. n02097130-giant_schnauzer/n02097130_2866.jpg) so we can use them later.

Input:

# Get a Python list of all file names for each list
train_file_list = [item[0][0] for item in train_list["file_list"]]
test_file_list = [item[0][0] for item in test_list["file_list"]]
full_file_list = [item[0][0] for item in file_list["file_list"]]

len(train_file_list), len(test_file_list), len(full_file_list)

Output:

(12000, 8580, 20580)

Wonderful!

Now just to double check this, why don't we view a random sample of the filenames we extracted?

Important: If you remember from before, one of my favorite things to do whilst exploring data is to continually view random samples of it, whether they're file names, images, or text snippets. That way I can make sure the data is correct at a glance, without having to hope for the best or check every single item.

That’s also why I use random samples.

Sure, you can always view the first X number of samples in your data, however, I find that continually viewing random samples of the data gives you a better overview of the different kinds of data you're working with.

Not only that, but it also gives you the small chance of stumbling upon a potential error.

How to view random samples in our data

We can view random samples of the data using Python's random.sample() method.

Input:

import random

random.sample(train_file_list, k=10)

Our output should look something like this.

Output:

['n02094258-Norwich_terrier/n02094258_439.jpg',
 'n02113624-toy_poodle/n02113624_3624.jpg',
 'n02102973-Irish_water_spaniel/n02102973_3635.jpg',
 'n02102318-cocker_spaniel/n02102318_2048.jpg',
 'n02098286-West_Highland_white_terrier/n02098286_1261.jpg',
 'n02088238-basset/n02088238_10095.jpg',
 'n02108915-French_bulldog/n02108915_9457.jpg',
 'n02098286-West_Highland_white_terrier/n02098286_5979.jpg',
 'n02109047-Great_Dane/n02109047_31274.jpg',
 'n02095889-Sealyham_terrier/n02095889_760.jpg']

As you can see, it’s exactly what we wanted, with various dog breed names being shown.
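Since each path starts with the breed folder, we can also use these file names for the "check the distributions" step from earlier. Here's a small sketch using collections.Counter on a handful of the file names shown above. The `count_images_per_class` helper and the short `sample_file_list` are stand-ins for this example; on the full training list you'd pass train_file_list instead and expect to see 120 classes.

```python
from collections import Counter

# A small stand-in for train_file_list (the real list has 12,000 entries)
sample_file_list = [
    "n02085620-Chihuahua/n02085620_5927.jpg",
    "n02085620-Chihuahua/n02085620_4441.jpg",
    "n02113624-toy_poodle/n02113624_3624.jpg",
    "n02098286-West_Highland_white_terrier/n02098286_1261.jpg",
    "n02098286-West_Highland_white_terrier/n02098286_5979.jpg",
    "n02088238-basset/n02088238_10095.jpg",
]


def count_images_per_class(file_list):
    """Count how many file paths belong to each breed folder."""
    # The class folder is everything before the "/" in each path
    return Counter(path.split("/")[0] for path in file_list)


class_counts = count_images_per_class(sample_file_list)
print(f"Number of classes: {len(class_counts)}")
for class_name, count in class_counts.most_common():
    print(f"{class_name}: {count}")
```

If one class had far more or far fewer images than the rest, this is where it would show up.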

How to make sure that your test set and training set are separate

Now let's do a quick check to make sure none of the training image file names appear in the testing image file names list.

This is important because the number 1 rule in machine learning is to always keep the test set separate from the training set.

We can check that there are no overlaps by turning train_file_list into a Python set() and then using the intersection() method.

Input:

# How many files in the training set intersect with the testing set?
len(set(train_file_list).intersection(test_file_list))

Ideally, our output should show zero.

Output:

0

Excellent! It looks like there are no overlaps.

We could even put an assert check to raise an error if there are any overlaps (e.g. the length of the intersection is greater than 0).

(assert works in the fashion: assert expression, message_if_expression_fails).

If the assert check doesn't output anything, then we're good to go!

# Make an assertion statement to check there are no overlaps (try changing test_file_list to train_file_list to see how it works)
assert len(set(train_file_list).intersection(test_file_list)) == 0, "There are overlaps between the training and test set files, please check them."

Woohoo!

Looks like there are no overlaps, so let's keep exploring the data.
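If you're curious what the check looks like when it does catch something, here's a toy sketch with made-up file names where one test file has leaked in from the training list. The try/except is only there so the example prints the message instead of stopping; in a real notebook you'd usually let the assertion raise.

```python
# Made-up file names for illustration only
toy_train_files = ["breed_a/img_1.jpg", "breed_a/img_2.jpg", "breed_b/img_3.jpg"]
toy_test_files = ["breed_b/img_3.jpg", "breed_b/img_4.jpg"]  # img_3 also appears in train

overlap = set(toy_train_files).intersection(toy_test_files)
print(f"Overlapping files: {sorted(overlap)}")

try:
    assert len(overlap) == 0, f"Found {len(overlap)} overlapping file(s) between train and test."
except AssertionError as error:
    print(f"[INFO] Assertion failed: {error}")
```

Printing the overlapping names (rather than only the count) makes it much easier to track down where the leak came from.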

Exploring the Annotation folder

How about we look at the Annotation folder next.

We can click the folder on the file explorer on the left to see what's inside, but we can also explore the contents of the folder with Python.

Let's use os.listdir() to see what's inside.

Input:

os.listdir("Annotation")[:10]

Output:

['n02111129-Leonberg',
 'n02102973-Irish_water_spaniel',
 'n02110806-basenji',
 'n02105251-briard',
 'n02093991-Irish_terrier',
 'n02099267-flat-coated_retriever',
 'n02110627-affenpinscher',
 'n02112137-chow',
 'n02094114-Norfolk_terrier',
 'n02095570-Lakeland_terrier']

Looks like there are folders, each named after a dog breed, with several numbered files inside, which is exactly what we want.

Each of these files contains an XML annotation relating to an image. For example, here's Annotation/n02085620-Chihuahua/n02085620_10074:

<annotation>
	<folder>02085620</folder>
	<filename>n02085620_10074</filename>
	<source>
		<database>ImageNet database</database>
	</source>
	<size>
		<width>333</width>
		<height>500</height>
		<depth>3</depth>
	</size>
	<segment>0</segment>
	<object>
		<name>Chihuahua</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<bndbox>
			<xmin>25</xmin>
			<ymin>10</ymin>
			<xmax>276</xmax>
			<ymax>498</ymax>
		</bndbox>
	</object>
</annotation>

The fields include the name of the image, the size of the image, the label of the object, and where it is (i.e. its bounding box coordinates).

If we were performing object detection (finding the location of a thing in an image), we'd pay attention to the <bndbox> coordinates. However, because we're focused on classification, our main consideration is the mapping of image name to class name.
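If you're curious how those fields could be read with code, here's a minimal sketch using Python's built-in xml.etree.ElementTree. It works on a trimmed copy of the annotation above stored as a string, and parse_annotation() is a hypothetical helper name. It also assumes one <object> per annotation and only reads the first one it finds.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the annotation shown above, stored as a string for illustration
annotation_xml = """
<annotation>
    <filename>n02085620_10074</filename>
    <size>
        <width>333</width>
        <height>500</height>
        <depth>3</depth>
    </size>
    <object>
        <name>Chihuahua</name>
        <bndbox>
            <xmin>25</xmin>
            <ymin>10</ymin>
            <xmax>276</xmax>
            <ymax>498</ymax>
        </bndbox>
    </object>
</annotation>
"""

def parse_annotation(xml_text: str) -> dict:
    """Extract the image name, class label and bounding box from an annotation string."""
    root = ET.fromstring(xml_text.strip())
    obj = root.find("object")  # assumption: only the first <object> is read
    box = obj.find("bndbox")
    return {
        "image_name": root.findtext("filename"),
        "class_name": obj.findtext("name"),
        "bndbox": tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")),
    }

parsed = parse_annotation(annotation_xml)
print(parsed)
```

For our classification problem, the "image_name" -> "class_name" pair is the part that matters.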

So what else can we do with this information in this folder?

Well, because we're dealing with 120 classes of dog breeds, let's write a function to check the number of subfolders in the Annotation directory (there should be 120 subfolders, one for each breed of dog).

To do so, we can use Python's pathlib.Path class, along with Path.iterdir() to loop over the contents of Annotation and then use Path.is_dir() to check if the target item is a directory.

Input:

from pathlib import Path

def count_subfolders(directory_path: str) -> int:
    """
    Count the number of subfolders in a given directory.

    Args:
    directory_path (str): The path to the directory in which to count subfolders.

    Returns:
    int: The number of subfolders in the specified directory.

    Examples:
    >>> count_subfolders('/path/to/directory')
    3  # if there are 3 subfolders in the specified directory
    """
    return len([name for name in Path(directory_path).iterdir() if name.is_dir()])

directory_path = "Annotation"
folder_count = count_subfolders(directory_path)
print(f"Number of subfolders in {directory_path} directory: {folder_count}")

Ideally we should get an output of 120 dog breeds, like so.

Output:

Number of subfolders in Annotation directory: 120

Perfect! There are 120 subfolders of annotations, one for each class of dog we'd like to identify.

However, on further inspection of our file lists, it looks like the class name is already in the filepath.

Input:

# View a single training file pathname
train_file_list[0]

Output:

'n02085620-Chihuahua/n02085620_5927.jpg'

With this information, we now know that image n02085620_5927.jpg should contain a Chihuahua, so let's double check.

How to display an image in Google Colab

Because Colab comes with IPython built in, you can use IPython.display.Image() to show the image, like so.

Input:

from IPython.display import Image
Image(Path("Images", train_file_list[0]))

Output:

dog

Hooray, it's our doggo!

Exploring the Images folder

We've explored the Annotation folder, so now let's check out our Images folder.

We know that the image file names come in the format class_name/image_name, for example, n02085620-Chihuahua/n02085620_5927.jpg.

To make things a little simpler, let's create the following:

  1. A mapping from folder name -> class name in dictionary form, for example, {'n02113712-miniature_poodle': 'miniature_poodle', 'n02092339-Weimaraner': 'weimaraner', 'n02093991-Irish_terrier': 'irish_terrier'...}. This will help us when visualizing our data from its original folder
  2. A list of all unique dog class names with simple formatting, for example, ['affenpinscher', 'afghan_hound', 'african_hunting_dog', 'airedale', 'american_staffordshire_terrier'...]

Let's start by getting a list of all the folders in the Images directory with os.listdir().

Input:

# Get a list of all image folders
image_folders = os.listdir("Images")
image_folders[:10]

Output:

['n02111129-Leonberg',
 'n02102973-Irish_water_spaniel',
 'n02110806-basenji',
 'n02105251-briard',
 'n02093991-Irish_terrier',
 'n02099267-flat-coated_retriever',
 'n02110627-affenpinscher',
 'n02112137-chow',
 'n02094114-Norfolk_terrier',
 'n02095570-Lakeland_terrier']

Excellent!

Now let's make a dictionary that maps from the folder name to a simplified version of the class name, for example:

{'n02085782-Japanese_spaniel': 'japanese_spaniel',
'n02106662-German_shepherd': 'german_shepherd',
'n02093256-Staffordshire_bullterrier': 'staffordshire_bullterrier',
...}

Here’s what that code might look like.

# Create folder name -> class name dict
folder_to_class_name_dict = {}
for folder_name in image_folders:
  # Turn folder name into class_name
  # E.g. "n02089078-black-and-tan_coonhound" -> "black_and_tan_coonhound"
  # We'll split on the first "-" and join the rest of the string with "_" and then lower it
  class_name = "_".join(folder_name.split("-")[1:]).lower()
  folder_to_class_name_dict[folder_name] = class_name

# Make sure there are 120 entries in the dictionary
assert len(folder_to_class_name_dict) == 120
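To see the split-and-join step on its own, here's a small sketch that wraps the same string logic in a function and runs it on a few folder names from this guide. The function name folder_name_to_class_name() is just for illustration.

```python
def folder_name_to_class_name(folder_name: str) -> str:
    """Drop the ID prefix before the first "-", join the rest with "_" and lowercase it."""
    return "_".join(folder_name.split("-")[1:]).lower()

# Folder names taken from the examples in this guide
example_folders = ["n02085620-Chihuahua",
                   "n02099267-flat-coated_retriever",
                   "n02089078-black-and-tan_coonhound"]

for folder in example_folders:
    print(f"{folder} -> {folder_name_to_class_name(folder)}")
```

Notice that the hyphens inside a breed name (like "flat-coated") also become underscores, because we split on every "-" and only throw away the first piece.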

Now with the folder name to class name mapping created, let's view the first 10 to make sure it works.

Input:

list(folder_to_class_name_dict.items())[:10]

Output:

[('n02111129-Leonberg', 'leonberg'),
 ('n02102973-Irish_water_spaniel', 'irish_water_spaniel'),
 ('n02110806-basenji', 'basenji'),
 ('n02105251-briard', 'briard'),
 ('n02093991-Irish_terrier', 'irish_terrier'),
 ('n02099267-flat-coated_retriever', 'flat_coated_retriever'),
 ('n02110627-affenpinscher', 'affenpinscher'),
 ('n02112137-chow', 'chow'),
 ('n02094114-Norfolk_terrier', 'norfolk_terrier'),
 ('n02095570-Lakeland_terrier', 'lakeland_terrier')]

Perfect!

We can also get a list of unique dog names by getting the values() of the folder_to_class_name_dict and turning it into a sorted list, like so:

Input:

dog_names = sorted(list(folder_to_class_name_dict.values()))
dog_names[:10]

Output:

['affenpinscher',
 'afghan_hound',
 'african_hunting_dog',
 'airedale',
 'american_staffordshire_terrier',
 'appenzeller',
 'australian_terrier',
 'basenji',
 'basset',
 'beagle']

Now we've got:

  1. folder_to_class_name_dict - a mapping from the folder name to the class name
  2. dog_names - a list of all the unique dog breeds we're working with

How to visualize a group of random images

How about we follow the data explorers' motto of visualize, visualize, visualize, and view some random images to make sure they are correct?

To help us visualize, let's create a function that takes in a list of image paths and then randomly selects 10 of those paths to display.

The function will:

  1. Take in a select list of image paths
  2. Create a grid of matplotlib plots (e.g. 2x5 = 10 plots to plot on)
  3. Randomly sample 10 image paths from the input image path list (using random.sample())
  4. Iterate through the flattened axes via axes.flat which is a reference to the attribute numpy.ndarray.flat
  5. Extract the sample path from the list of samples
  6. Get the sample title from the parent folder of the path using Path.parent.stem and then extract the formatted dog breed name by indexing folder_to_class_name_dict
  7. Read the image with plt.imread() and show it on the target ax with ax.imshow()
  8. Set the title of the plot to the parent folder name with ax.set_title() and turn the axis marks off with ax.axis("off") (this makes for pretty plots)
  9. Show the plot with plt.show()

That’s a lot of steps, but again it’s nothing we can't handle so let's do it.

Input:

import random

from pathlib import Path
from typing import List

import matplotlib.pyplot as plt

# 1. Take in a select list of image paths
def plot_10_random_images_from_path_list(path_list: List[Path],
                                         extract_title: bool=True) -> None:
  # 2. Set up a grid of plots
  fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))

  # 3. Randomly sample 10 paths from the list
  samples = random.sample(path_list, 10)

  # 4. Iterate through the flattened axes and corresponding sample paths
  for i, ax in enumerate(axes.flat):

    # 5. Get the target sample path (e.g. "Images/n02087394-Rhodesian_ridgeback/n02087394_1161.jpg")
    sample_path = samples[i]

    # 6. Extract the parent directory name to use as the title (if necessary)
    # (e.g. n02087394-Rhodesian_ridgeback/n02087394_1161.jpg -> n02087394-Rhodesian_ridgeback -> rhodesian_ridgeback)
    if extract_title:
      sample_title = folder_to_class_name_dict[sample_path.parent.stem]
    else:
      sample_title = sample_path.parent.stem

    # 7. Read the image file and plot it on the corresponding axis
    ax.imshow(plt.imread(sample_path))

    # 8. Set the title of the axis and turn off the axis (for pretty plots)
    ax.set_title(sample_title)
    ax.axis("off")

  # 9. Display the plot
  plt.show()

plot_10_random_images_from_path_list(path_list=[Path("Images") / Path(file) for file in train_file_list])

Output:

dog image selection

Those are some nice looking dogs! Not only that, but each title matches the breed of dog in its image, so we know this works.

(As a rule of thumb, I would probably repeat this maybe 10 times, so that I can examine 100 or so images, and make sure it's still giving me the correct data. This way, I know it's working before moving too far down the line).

How to explore the distribution of our data

After visualization, another valuable way to explore the data is by checking the data distribution.

What do I mean by this?

Well, distribution refers to the "spread" of data, and in our case, it’s ‘how many images of dogs do we have per breed?’. So, a balanced distribution would mean having roughly the same number of images for each breed (e.g. 100 images per dog breed).

Right now there are 120 breeds and 12,000 training images (100 per breed on average), so it's a good guess that this might be the case, but we’re better off checking.

Also, it's worth noting that there's also a deeper level of distribution than just images per dog breed that we want to check.

For example, ideally the images within each breed are well distributed as well.

What I mean by this is that we wouldn't want to have 100 identical copies of the same image per dog breed, as this wouldn’t be as useful to train our model with.

Instead we'd like the images of each particular breed to be in different scenarios, different lighting, different angles. This way our model is able to recognize the correct dog breed no matter what angle the photo is taken from.

So here’s how to do this.

To figure out how many images we have per class, let's write a function to count the number of images per subfolder in a given directory.

Specifically, we'll want the function to:

  1. Take in a target directory/folder
  2. Create a list of all the subdirectories/subfolders in the target folder
  3. Create an empty list, image_class_counts to append subfolders and their counts to
  4. Iterate through all of the subdirectories
  5. Get the class name of the target folder as the name of the folder
  6. Count the number of images in the target folder using the length of the list of image paths (we can get these with Path().rglob("*.jpg") where "*.jpg" means "all files with the extension .jpg")
  7. Append a dictionary of {"class_name": class_name, "image_count": image_count} to the image_class_counts list (we create a list of dictionaries so we can turn this into a pandas DataFrame)
  8. Return the image_class_counts list

Here’s what that code would look like.

# Create a dictionary of image counts
from pathlib import Path
from typing import List, Dict

# 1. Take in a target directory
def count_images_in_subdirs(target_directory: str) -> List[Dict[str, int]]:
    """
    Counts the number of JPEG images in each subdirectory of the given directory.

    Each subdirectory is assumed to represent a class, and the function counts
    the number of '.jpg' files within each one. The result is a list of
    dictionaries with the class name and corresponding image count.

    Args:
        target_directory (str): The path to the directory containing subdirectories.

    Returns:
        List[Dict[str, int]]: A list of dictionaries with 'class_name' and 'image_count' for each subdirectory.

    Examples:
        >>> count_images_in_subdirs('/path/to/directory')
        [{'class_name': 'beagle', 'image_count': 50}, {'class_name': 'poodle', 'image_count': 60}]
    """
    # 2. Create a list of all the subdirectories in the target directory (these contain our images)
    images_dir = Path(target_directory)
    image_class_dirs = [directory for directory in images_dir.iterdir() if directory.is_dir()]

    # 3. Create an empty list to append image counts to
    image_class_counts = []

    # 4. Iterate through all of the subdirectories
    for image_class_dir in image_class_dirs:

        # 5. Get the class name from image directory (e.g. "Images/n02116738-African_hunting_dog" -> "n02116738-African_hunting_dog")
        class_name = image_class_dir.stem

        # 6. Count the number of images in the target subdirectory
        image_count = len(list(image_class_dir.rglob("*.jpg")))  # get the number of files with the .jpg file extension

        # 7. Append a dictionary of class name and image count to count list
        image_class_counts.append({"class_name": class_name,
                                   "image_count": image_count})

    # 8. Return the list
    return image_class_counts

Ho ho, what a function!

Now let’s go ahead and run it on our Images directory and view the first few indexes to make sure it works.

Input:

image_class_counts = count_images_in_subdirs("Images")
image_class_counts[:3]

Output:

[{'class_name': 'n02111129-Leonberg', 'image_count': 210},
 {'class_name': 'n02102973-Irish_water_spaniel', 'image_count': 150},
 {'class_name': 'n02110806-basenji', 'image_count': 209}]

It works!
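As a cross-check that doesn't touch the file system, the same kind of per-class count could also be worked out from a list of relative file paths, since each path starts with the class folder. Here's a rough sketch using collections.Counter on a tiny stand-in list. The helper name count_images_per_folder() and the second and third file names are made up for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, List

def count_images_per_folder(file_list: List[str]) -> Dict[str, int]:
    """Count how many file paths fall under each top-level folder."""
    return dict(Counter(PurePosixPath(file).parts[0] for file in file_list))

# Toy stand-in for a file list (the second and third file names are made up)
toy_file_list = ["n02085620-Chihuahua/n02085620_5927.jpg",
                 "n02085620-Chihuahua/n02085620_0001.jpg",
                 "n02110806-basenji/n02110806_0001.jpg"]

toy_counts = count_images_per_folder(toy_file_list)
print(toy_counts)
```

If the counts from the file lists and the counts from the folders ever disagreed, that would be a hint that some files are missing or misplaced.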

Better yet, since our image_class_counts variable is in the form of a list of dictionaries, we can turn it into a pandas DataFrame.

Why do this?

Mainly because a pandas DataFrame helps provide a structured and tabular format that is easy to manipulate and analyze. Not only that, but it also has built-in methods for sorting, filtering, and aggregating data, which simplifies data handling and enhances readability.

Alright, so now let's sort the DataFrame by "image_count" so the classes with the most images appear at the top. To do this we can use DataFrame.sort_values().

Input:

# Create a DataFrame
import pandas as pd
image_counts_df = pd.DataFrame(image_class_counts).sort_values(by="image_count", ascending=False)
image_counts_df.head()

Output:

initial dataframe

Let's also clean up the "class_name" column to be more readable by mapping the values to our folder_to_class_name_dict, like so:

Input:

# Make class name column easier to read
image_counts_df["class_name"] = image_counts_df["class_name"].map(folder_to_class_name_dict)
image_counts_df.head()

Output:

updated dataframe

That’s much easier to read, but we can go one step better, right?

How to plot our DataFrame

Because we've got a DataFrame of image counts per class, we can make them more visual by turning them into a plot.

To do so, we can use image_counts_df.plot(kind="bar", ...) along with some other customization, like so.

Input:

# Turn the image counts DataFrame into a graph
import matplotlib.pyplot as plt
plt.figure(figsize=(14, 7))
image_counts_df.plot(kind="bar",
                     x="class_name",
                     y="image_count",
                     legend=False,
                     ax=plt.gca()) # plt.gca() = "get current axes", get the axes of the figure we set up above and put the data there

# Add customization
plt.ylabel("Image Count")
plt.title("Total Image Counts by Class")
plt.xticks(rotation=90, # Rotate the x labels for better visibility
           fontsize=8) # Make the font size smaller for easier reading
plt.tight_layout() # Ensure things fit nicely
plt.show()

Output:

total image counts by class

Beautiful!

It also looks like our classes are quite balanced, with each breed of dog having ~150 or more images.

We can even find out some other quick stats about our data with DataFrame.describe().

Input:

# Get various statistics about our data distribution
image_counts_df.describe()

Output:

stats about our data distribution

The table above shows a similar story to the plot, in that we can see the minimum number of images per class is 148, whereas the maximum number of images is 252.

(If one class had 10x fewer images than another class, we may look into collecting more data to improve the balance. Our main concern here is that there's enough of each type).
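To put a number on "balanced", here's a quick bit of arithmetic using the minimum and maximum counts reported above. The 10x threshold in the sketch is only the rule of thumb from the previous paragraph, not a hard limit.

```python
# Minimum and maximum images per class, as reported by describe() above
min_images_per_class = 148
max_images_per_class = 252

# Ratio of the largest class to the smallest class
imbalance_ratio = max_images_per_class / min_images_per_class
print(f"Largest class has {imbalance_ratio:.2f}x the images of the smallest class")

# Rule-of-thumb threshold for illustration: flag the dataset if the ratio reaches 10x
is_heavily_imbalanced = imbalance_ratio >= 10
print(f"Heavily imbalanced? {is_heavily_imbalanced}")
```

A ratio of about 1.7x is a long way from 10x, which backs up what the plot showed.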

The main takeaways here are:

  • When working on a classification problem, ideally, all classes have a similar number of samples (however, in some problems this may be unattainable, such as fraud detection, where you may have 1000x more "not fraud" samples than "fraud" samples)
  • If you wanted to add a new class of dog breeds to the existing 120, ideally, you'd have at least ~150 images for it (though as we'll see with transfer learning, the number of required images could be less as long as they're high quality)

Step #4. Creating training and test data split directories

After exploring the data, one of the next best things you can do is create experimental data splits.

This includes:

data splits

The good news is that our dog dataset already comes with specified training and test set splits, so we'll stick with those.

But we'll also create a smaller training set (a random 10% of the training data) so we can stick to the machine learning engineers' motto of experiment, experiment, experiment! and run quicker experiments.

Important: If you don’t know this already, one of the most important things in machine learning is being able to experiment quickly. As in, try a new model, try a new set of hyperparameters, or try a new training setup.

This means that when you start out, you want the time between your experiments to be as small as possible. That way you can quickly figure out what doesn't work, and spend more time on (and run larger experiments with) what does work.

As previously discussed, we're working towards a directory structure of:

ideal data structure layout

So let's write some code to create this:

  • images_split/train/ directory to hold all of the training images
  • images_split/test/ directory to hold all of the testing images
  • And make a directory inside each of images_split/train/ and images_split/test/ for each of the dog breed classes

We can make each of the directories we need using Path.mkdir().
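If you haven't used Path.mkdir() before, here's a small sketch of what the parents and exist_ok arguments do. It runs inside a temporary directory (via Python's tempfile module) so it won't leave anything behind, and the "beagle" folder is just an example name.

```python
import tempfile
from pathlib import Path

# Work inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    nested_dir = Path(tmp) / "images_split" / "train" / "beagle"

    # parents=True creates the missing "images_split" and "train" folders too
    nested_dir.mkdir(parents=True, exist_ok=True)

    # exist_ok=True makes a second call a no-op instead of raising FileExistsError
    nested_dir.mkdir(parents=True, exist_ok=True)

    created = nested_dir.is_dir()
    print(f"Directory created: {created}")

    # Without exist_ok=True, creating an existing directory raises an error
    try:
        nested_dir.mkdir()
        raised = False
    except FileExistsError:
        raised = True
    print(f"Plain mkdir() on an existing directory raised FileExistsError: {raised}")
```

This is why we can safely re-run our directory creation code as many times as we like.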

As for the dog breed directories, we can simply loop through the list of dog_names and create a folder for each inside the images_split/train/ and images_split/test/ directories, like so:

from pathlib import Path

# Define the target directory for image splits to go
images_split_dir = Path("images_split")

# Define the training and test directories
train_dir = images_split_dir / "train"
test_dir = images_split_dir / "test"

# Using Path.mkdir with exist_ok=True ensures the directory is created only if it doesn't exist
train_dir.mkdir(parents=True, exist_ok=True)
test_dir.mkdir(parents=True, exist_ok=True)
print(f"Directory {train_dir} exists.")
print(f"Directory {test_dir} exists.")

# Make a folder for each dog name
for dog_name in dog_names:
  # Make training dir folder
  train_class_dir = train_dir / dog_name
  train_class_dir.mkdir(parents=True, exist_ok=True)
  # print(f"Making directory: {train_class_dir}")

  # Make testing dir folder
  test_class_dir = test_dir / dog_name
  test_class_dir.mkdir(parents=True, exist_ok=True)
  # print(f"Making directory: {test_class_dir}")

# Make sure there are 120 subfolders in each
assert count_subfolders(train_dir) == len(dog_names)
assert count_subfolders(test_dir) == len(dog_names)

Output:

Directory images_split/train exists.
Directory images_split/test exists.

Excellent!

Now we can check out the data split directories/folders we created by inspecting them in the files panel in Google Colab.

Alternatively, we can check the names of each by listing the subdirectories inside them.

Input:

# See the first 10 directories in the training split dir
sorted([str(dir_name) for dir_name in train_dir.iterdir() if dir_name.is_dir()])[:10]

Output:

['images_split/train/affenpinscher',
 'images_split/train/afghan_hound',
 'images_split/train/african_hunting_dog',
 'images_split/train/airedale',
 'images_split/train/american_staffordshire_terrier',
 'images_split/train/appenzeller',
 'images_split/train/australian_terrier',
 'images_split/train/basenji',
 'images_split/train/basset',
 'images_split/train/beagle']

Our directory layout is looking good, but you might've noticed that all of our dog breed directories are empty, so let's change that by getting some images in there.

To do so, we'll create a function called copy_files_to_target_dir() which will copy images from the Images directory into their respective directories inside images_split/train and images_split/test.

More specifically, it will:

  1. Take in a list of source files to copy (e.g. train_file_list) and a target directory to copy files to
  2. Iterate through the list of source files to copy (we'll use tqdm which comes installed with Google Colab to create a progress bar of how many files have been copied)
  3. Convert the source file path to a Path object
  4. Split the source file path and create a Path object for the destination folder (e.g. "n02112018-Pomeranian" -> "pomeranian")
  5. Get the target file name (e.g. "n02112018-Pomeranian/n02112018_6208.jpg" -> "n02112018_6208.jpg")
  6. Create a destination path for the source file to be copied to (e.g. images_split/train/pomeranian/n02112018_6208.jpg).
  7. Ensure the destination directory exists, similar to the step we took in the previous section (you can't copy files to a directory that doesn't exist)
  8. Print out the progress of copying (if necessary)
  9. Copy the source file to the destination using Python's shutil.copy2(src, dst)

Here’s what that code looks like.

from pathlib import Path
from shutil import copy2
from tqdm.auto import tqdm

# 1. Take in a list of source files to copy and a target directory
def copy_files_to_target_dir(file_list: list[str],
                             target_dir: str,
                             images_dir: str = "Images",
                             verbose: bool = False) -> None:
    """
    Copies a list of files from the images directory to a target directory.

    Parameters:
    file_list (list[str]): A list of file paths to copy.
    target_dir (str): The destination directory path where files will be copied.
    images_dir (str, optional): The directory path where the images are currently stored. Defaults to 'Images'.
    verbose (bool, optional): If set to True, the function will print out the file paths as they are being copied. Defaults to False.

    Returns:
    None
    """
    # 2. Iterate through source files
    for file in tqdm(file_list):

      # 3. Convert file path to a Path object
      source_file_path = Path(images_dir) / Path(file)

      # 4. Split the file path and create a Path object for the destination folder
      # e.g. "n02112018-Pomeranian" -> "pomeranian"
      file_class_name = folder_to_class_name_dict[Path(file).parts[0]]

      # 5. Get the name of the target image
      file_image_name = Path(file).name

      # 6. Create the destination path
      destination_file_path = Path(target_dir) / file_class_name / file_image_name

      # 7. Ensure the destination directory exists (this is a safety check, can't copy an image to a directory that doesn't exist)
      destination_file_path.parent.mkdir(parents=True, exist_ok=True)

      # 8. Print out copy message if necessary
      if verbose:
        print(f"[INFO] Copying: {source_file_path} to {destination_file_path}")

      # 9. Copy the original path to the destination path
      copy2(src=source_file_path, dst=destination_file_path)

And just like that, our copying function is created!
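If you'd like to check where a file would end up before copying anything, steps 4 to 6 can be pulled out into a small path-building helper. This is only a sketch: build_destination_path() is a hypothetical name, and toy_mapping is a one-entry stand-in for folder_to_class_name_dict.

```python
from pathlib import Path
from typing import Dict

def build_destination_path(file: str,
                           target_dir: str,
                           folder_to_class: Dict[str, str]) -> Path:
    """Map a source path like "<folder>/<image>.jpg" to "<target_dir>/<class_name>/<image>.jpg"."""
    file_path = Path(file)
    class_name = folder_to_class[file_path.parts[0]]  # e.g. "n02112018-Pomeranian" -> "pomeranian"
    return Path(target_dir) / class_name / file_path.name

# One-entry stand-in for folder_to_class_name_dict
toy_mapping = {"n02112018-Pomeranian": "pomeranian"}

destination = build_destination_path(file="n02112018-Pomeranian/n02112018_6208.jpg",
                                     target_dir="images_split/train",
                                     folder_to_class=toy_mapping)
print(destination.as_posix())
```

The copy_files_to_target_dir() function above does this same path building and then performs the copy.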

Let's test it out by copying the files in the train_file_list to train_dir.

Input:

# Copy training images from Images to images_split/train/...
copy_files_to_target_dir(file_list=train_file_list,
                         target_dir=train_dir,
                         verbose=False) # set this to True to get an output of the copy process
                                        # (warning: this will output a large amount of text)

Output:

0%|          | 0/12000 [00:00<?, ?it/s]

Woohoo!

Looks like our copying function copied 12,000 training images from Images into their respective directories inside images_split/train/.

How about we do the same for test_file_list and test_dir?

Input:

copy_files_to_target_dir(file_list=test_file_list,
                         target_dir=test_dir,
                         verbose=False)

Output:

 0%|          | 0/8580 [00:00<?, ?it/s]

Nice! We’ve now got 8,580 testing images copied from Images to images_split/test/.

So now let's write some code to check that the number of files in the train_file_list is the same as the number of images files in train_dir (and the same for the test files).

Input:

# Get a list of all .jpg paths in train and test image directories
train_image_paths = list(train_dir.rglob("*.jpg"))
test_image_paths = list(test_dir.rglob("*.jpg"))

# Make sure the number of images in the training and test directories equals the number of files in their original lists
assert len(train_image_paths) == len(train_file_list)
assert len(test_image_paths) == len(test_file_list)

print(f"Number of images in {train_dir}: {len(train_image_paths)}")
print(f"Number of images in {test_dir}: {len(test_image_paths)}")

Output:

Number of images in images_split/train: 12000
Number of images in images_split/test: 8580

It seems to have worked, but let’s make sure to adhere to the data explorers' motto of visualize, visualize, visualize!, and plot some random images from the train_image_paths list, to double check.

Input:

# Plot 10 random images from the train_image_paths
plot_10_random_images_from_path_list(path_list=train_image_paths,
                                     extract_title=False) # don't need to extract the title since the image directories are already named simply

Output:

more dog images

How to make a 10% training dataset split

Now we're going to make another training split that contains a random 10% (approximately 1,200 images, since the original training set has 12,000 images) of the data from the original training split.

Why do this?

Well, although it's true that machine learning models generally perform better with more data, having more data often means longer computation times, and longer computation times means the time between our experiments gets longer, which is not what we want in the beginning.

In the beginning of any new machine learning project, your focus should be to reduce the amount of time between experiments as much as possible.

Why?

Because like I said earlier, running more experiments means you can figure out what doesn't work. And if you figure out what doesn't work, you can start working closer towards what does.

Once you find something that does work, you can then start to scale up your experiments with more data, bigger models, longer training times, etc.

So here’s how to do this.

To make our 10% training dataset, let's copy a random 10% of the existing training set to a new folder called images_split/train_10_percent, so that we've got the layout:

10 percent training split

Let's start by creating that folder first.

# Create train_10_percent directory
train_10_percent_dir = images_split_dir / "train_10_percent"
train_10_percent_dir.mkdir(parents=True, exist_ok=True)

Now we should have 3 split folders inside images_split.

Input:

os.listdir(images_split_dir)

Output:

['test', 'train_10_percent', 'train']

And it works! So now let's create a list of random training sample file paths using Python's random.sample() function.

We'll want the total length of the list to equal 10% of the original training split. And to make things reproducible, we'll set a random seed (this is not 100% necessary, it just makes it so we get the same 10% of training image paths each time).

Input:

import random

# Set a random seed
random.seed(42)

# Get a 10% sample of the training image paths
train_image_paths_random_10_percent = random.sample(population=train_image_paths,
                                                    k=int(0.1*len(train_image_paths)))

# Check how many image paths we got
print(f"Original number of training image paths: {len(train_image_paths)}")
print(f"Number of 10% training image paths: {len(train_image_paths_random_10_percent)}")
print("First 5 random 10% training image paths:")
train_image_paths_random_10_percent[:5]

Output:

Original number of training image paths: 12000
Number of 10% training image paths: 1200
First 5 random 10% training image paths:

[PosixPath('images_split/train/miniature_pinscher/n02107312_2706.jpg'),
 PosixPath('images_split/train/irish_wolfhound/n02090721_272.jpg'),
 PosixPath('images_split/train/greater_swiss_mountain_dog/n02107574_3274.jpg'),
 PosixPath('images_split/train/italian_greyhound/n02091032_3763.jpg'),
 PosixPath('images_split/train/bloodhound/n02088466_7962.jpg')]

Random 10% training image paths acquired!
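As a quick check of the seeding idea, here's a minimal sketch on a made-up list of file names. It makes two choices that differ from the code above, so treat them as assumptions: it uses a separate random.Random(seed) generator rather than the global random.seed(), and it sorts the paths first, since the order that files are listed in can depend on the file system.

```python
import random

# Stand-in for a list of image paths (made-up names for illustration)
toy_paths = [f"image_{i}.jpg" for i in range(100)]

def sample_10_percent(paths, seed):
    """Return a reproducible random 10% of paths for a given seed."""
    rng = random.Random(seed)  # a separate generator, so the global random state is untouched
    # Sorting first means the result doesn't depend on the order the paths were listed in
    return rng.sample(sorted(paths), k=int(0.1 * len(paths)))

first_run = sample_10_percent(toy_paths, seed=42)
second_run = sample_10_percent(toy_paths, seed=42)

print(f"Sample size: {len(first_run)}")
print(f"Same seed gives same sample: {first_run == second_run}")
```

For this guide we'll carry on with the train_image_paths_random_10_percent list we created above.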

So now let's copy them to the images_split/train_10_percent directory using code similar to our copy_files_to_target_dir() function.

Input:

# Copy training 10% split images from images_split/train/ to images_split/train_10_percent/...
for source_file_path in tqdm(train_image_paths_random_10_percent):

  # Create the destination file path
  destination_file_and_image_name = Path(*source_file_path.parts[-2:]) # "images_split/train/yorkshire_terrier/n02094433_2223.jpg" -> "yorkshire_terrier/n02094433_2223.jpg"
  destination_file_path = train_10_percent_dir / destination_file_and_image_name # "yorkshire_terrier/n02094433_2223.jpg" -> "images_split/train_10_percent/yorkshire_terrier/n02094433_2223.jpg"

  # If the target directory doesn't exist, make it
  target_class_dir = destination_file_path.parent
  if not target_class_dir.is_dir():
    # print(f"Making directory: {target_class_dir}")
    target_class_dir.mkdir(parents=True,
                           exist_ok=True)

  # print(f"Copying: {source_file_path} to {destination_file_path}")
  copy2(src=source_file_path,
        dst=destination_file_path)

Output:

 0%|          | 0/1200 [00:00<?, ?it/s]

1200 images copied!

Now let's check our training 10% set distribution and make sure we've got some images for each class.

We can use our count_images_in_subdirs() function to count the images in each of the dog breed folders in the train_10_percent_dir.

Input:

# Count images in train_10_percent_dir
train_10_percent_image_class_counts = count_images_in_subdirs(train_10_percent_dir)
train_10_percent_image_class_counts_df = pd.DataFrame(train_10_percent_image_class_counts).sort_values("image_count", ascending=True)
train_10_percent_image_class_counts_df.head()

Output:

10 percent training dataframe

Hmm okay…

It looks like a few classes have only a handful of images, which might not work great.

So let's make sure there are 120 subfolders by checking the length of the train_10_percent_image_class_counts_df.

Input:

# How many subfolders are there?
print(len(train_10_percent_image_class_counts_df))

Output:

120

Beautiful, our training 10% dataset split has a folder for each of the dog breed classes.

Note: Ideally our random 10% training set would have the same distribution per class as the original training set.

However, for this example, we've taken a global random 10% rather than a random 10% per class. This is okay for now, however for more fine-grained tasks, you may want to make sure your smaller training set is better distributed.
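
If you did want a random 10% per class instead, here's one way it could be sketched. This is a minimal example with made-up file names, not the code we used above: sample_fraction_per_class() and the class_to_files dictionary are hypothetical names for illustration.

```python
import random

def sample_fraction_per_class(class_to_files, fraction=0.1, seed=42):
    """Randomly pick `fraction` of the files from each class (at least 1 per class)."""
    rng = random.Random(seed)  # local random generator so the sample is reproducible
    sampled = {}
    for class_name, files in class_to_files.items():
        num_to_sample = max(1, round(len(files) * fraction))
        sampled[class_name] = rng.sample(sorted(files), k=num_to_sample)
    return sampled

# Hypothetical example: 2 classes with 100 images each
class_to_files = {
    "boston_bull": [f"boston_bull/image_{i}.jpg" for i in range(100)],
    "rottweiler": [f"rottweiler/image_{i}.jpg" for i in range(100)],
}
sampled_files = sample_fraction_per_class(class_to_files, fraction=0.1)
print({class_name: len(files) for class_name, files in sampled_files.items()})
# {'boston_bull': 10, 'rottweiler': 10}
```

Because the sampling happens inside each class, every class keeps the same share of its images, so the smaller set follows the original class distribution.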

How to plot the distribution of your training set

For one last check, let's plot the distribution of our train 10% dataset.

Input:

# Plot distribution of train 10% dataset.
plt.figure(figsize=(14, 7))
train_10_percent_image_class_counts_df.plot(kind="bar",
                     x="class_name",
                     y="image_count",
                     legend=False,
                     ax=plt.gca()) # plt.gca() = "get current axis", get the plt we setup above and put the data there

# Add customization
plt.title("Train 10 Percent Image Counts by Class")
plt.ylabel("Image Count")
plt.xticks(rotation=90, # Rotate the x labels for better visibility
           fontsize=8) # Make the font size smaller for easier reading
plt.tight_layout() # Ensure things fit nicely
plt.show()

Output:

10 percent image training data plot

Excellent! Our 10% training dataset distribution looks similar to the original training set distribution.

Step #5. Turning datasets into TensorFlow Datasets

Alright, we've spent a bunch of time getting our dog images into different folders, but how do we get the images from different folders into a machine learning model?

Well, we need a way to turn our images into numbers.

Or more specifically, we're going to turn our images into tensors, which is where the "Tensor" comes from in "TensorFlow".

where tensorflow comes from

So what are tensors exactly?

A tensor is simply a way to represent something using numbers.

Tensors can represent almost anything you can think of (text, images, audio, rows and columns of a table), and there are several different ways to load data into TensorFlow.

However, the basic formula is the same across each data type. You have data -> use TensorFlow to turn it into tensors.

This is the main reason why we spent time getting our data into the standard image classification format (where the class name is the folder name), because TensorFlow includes several utility functions to load data from this directory format.

tensorflow utility functions
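
To make the "data -> tensors" formula a little more concrete, here's a small sketch using tf.constant() and tf.convert_to_tensor(). The pixel values and the sentence are made-up placeholders, not data from our dog dataset.

```python
import tensorflow as tf

# A tiny made-up 2x2 "image" with 3 colour channels (red, green, blue)
tiny_image = [[[255, 0, 0], [0, 255, 0]],
              [[0, 0, 255], [255, 255, 255]]]
image_tensor = tf.constant(tiny_image, dtype=tf.float32)

# Text can be stored in a tensor too (as a string tensor)
text_tensor = tf.convert_to_tensor("a photo of a dog")

print(image_tensor.shape, image_tensor.dtype)
print(text_tensor.dtype)
```

The image tensor has the shape (2, 2, 3), which is the same [height, width, colour_channels] layout our dog images will use, just much smaller.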

What is a tf.data.Dataset?

A tf.data.Dataset is TensorFlow's way of efficiently storing and iterating over a potentially large set of elements.

Because it's such an efficient method, it’s what we’re going to use for our dog images.

Since we're working with images, we'll create ours with tf.keras.utils.image_dataset_from_directory(), passing in the following parameters:

  • directory = the target directory we'd like to turn into a tf.data.Dataset
  • label_mode = the kind of labels we'd like to use, in our case it's "categorical" since we're dealing with a multi-class classification problem (we would use "binary" if we were working with binary classification problem)
  • batch_size = the number of images we'd like our model to see at a time (due to computation limitations, our model won't be able to look at every image at once so we split them into small batches and the model looks at each batch individually), generally 32 is a good value to start, as this means our model will look at 32 images at a time (this number is flexible)
  • image_size = the size we'd like to shape our images to before we feed them to our model (height x width)
  • shuffle = whether we'd like our dataset to be shuffled to randomize the order.
  • seed = a number that makes the shuffle order reproducible (the same seed gives the same order)

Note: Values such as batch_size and image_size are known as hyperparameters, meaning they're values you choose yourself rather than values the model learns.

As for the best value for a given hyperparameter, that depends highly on the data you're working with, problem space and compute capabilities you've got available. Best to experiment!
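
To get a feel for what batch_size does, here's a quick back-of-the-envelope sketch. It uses the image counts of our three splits (1,200, 12,000 and 8,580) and assumes the final, smaller batch is kept rather than dropped. batches_per_epoch() is a hypothetical helper for illustration.

```python
import math

BATCH_SIZE = 32

def batches_per_epoch(num_images, batch_size=BATCH_SIZE):
    """Return (number of batches, size of the last batch) for one pass over the data."""
    num_batches = math.ceil(num_images / batch_size)
    last_batch_size = num_images - (num_batches - 1) * batch_size
    return num_batches, last_batch_size

for split_name, num_images in [("train_10_percent", 1200), ("train", 12000), ("test", 8580)]:
    num_batches, last_batch_size = batches_per_epoch(num_images)
    print(f"{split_name}: {num_batches} batches (last batch has {last_batch_size} images)")
# train_10_percent: 38 batches (last batch has 16 images)
# train: 375 batches (last batch has 32 images)
# test: 269 batches (last batch has 4 images)
```

A bigger batch_size means fewer batches per pass over the data, but each batch needs more memory.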

With all this being said, let's see it in practice!

Creating our own tf.data.Dataset

We'll make 3 tf.data.Datasets:

  • train_10_percent_ds
  • train_ds, and
  • test_ds

Here’s the code to make this happen:

Input:

import tensorflow as tf

# Create constants
IMG_SIZE = (224, 224)
BATCH_SIZE = 32
SEED = 42

# Create train 10% dataset
train_10_percent_ds = tf.keras.utils.image_dataset_from_directory(
    directory=train_10_percent_dir,
    label_mode="categorical", # turns labels into one-hot representations (e.g. [0, 0, 1, ..., 0, 0])
    batch_size=BATCH_SIZE,
    image_size=IMG_SIZE,
    shuffle=True, # shuffle training datasets to prevent learning of order
    seed=SEED
)

# Create full train dataset
train_ds = tf.keras.utils.image_dataset_from_directory(
    directory=train_dir,
    label_mode="categorical",
    batch_size=BATCH_SIZE,
    image_size=IMG_SIZE,
    shuffle=True,
    seed=SEED
)

# Create test dataset
test_ds = tf.keras.utils.image_dataset_from_directory(
    directory=test_dir,
    label_mode="categorical",
    batch_size=BATCH_SIZE,
    image_size=IMG_SIZE,
    shuffle=False, # don't need to shuffle the test dataset (this makes evaluations easier)
    seed=SEED
)

Output:

Found 1200 files belonging to 120 classes.
Found 12000 files belonging to 120 classes.
Found 8580 files belonging to 120 classes.

Note: If you're working with similar styles of data (e.g. all dog photos), it's best practice to shuffle training datasets to prevent the model from learning any order in the data. Sneaky machines eh!

Alright, so now that we have our tf.data.Datasets created, let's check out one of them.

Input:

train_10_percent_ds

Output:

<_PrefetchDataset element_spec=(TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None), TensorSpec(shape=(None, 120), dtype=tf.float32, name=None))>

You'll notice a few things going on here.

Essentially, each element of the dataset is a tuple of (images, labels):

  1. The image tensor(s) - TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None) where (None, 224, 224, 3) is the shape of the image tensor (None is the batch size, (224, 224) is the IMG_SIZE we set and 3 is the number of color channels, as in, red, green, blue or RGB since our images are in color)
  2. The label tensor(s) - TensorSpec(shape=(None, 120), dtype=tf.float32, name=None) where None is the batch size and 120 is the number of classes (dog breeds) we're using

A couple of other things to note:

  • The batch size often appears as None since it's flexible and can change on the fly (for example, the last batch may be smaller than the rest)
  • Each batch of images is associated with a batch of labels

Instead of just talking about these, let's check out what a single batch looks like.

We can do so by turning the tf.data.Dataset into an iterable with Python's built-in iter() and then getting the "next" batch with next().

Input:

# What does a single batch look like?
image_batch, label_batch = next(iter(train_ds))
image_batch.shape, label_batch.shape

Output:

(TensorShape([32, 224, 224, 3]), TensorShape([32, 120]))

As you can see, we get back a single batch of images and labels.

Not only that, but it also looks like we have a single image_batch with a shape of [32, 224, 224, 3] ([batch_size, height, width, colour_channels]), and our labels have a shape of [32, 120] ([batch_size, labels]).

So what does this all mean?

Simply put, these are the numerical representations of our data images and labels!

Note: The shape of a tensor does not necessarily reflect the values inside a tensor. The shape only reflects the dimensionality of a tensor. For example, [32, 224, 224, 3] is a 4-dimensional tensor. Values inside a tensor can be any number (positive, negative, 0, float, integer, etc) representing almost any kind of data.

We can further inspect our data by looking at a single sample.

Input:

# Get a single sample from a single batch
print(f"Single image tensor:\n{image_batch[0]}\n")
print(f"Single label tensor: {label_batch[0]}") # notice the 1 is the index of the target label (our labels are one-hot encoded)
print(f"Single sample class name: {dog_names[tf.argmax(label_batch[0])]}")

Output:

Single image tensor:
[[[196.61607  174.61607  160.61607 ]
  [197.84822  175.84822  161.84822 ]
  [200.       178.       164.      ]
  ...
  [ 60.095097  79.75804   45.769207]
  [ 61.83293   71.22575   63.288315]
  [ 77.65755   83.65755   81.65755 ]]

 [[196.       174.       160.      ]
  [197.83876  175.83876  161.83876 ]
  [199.07945  177.07945  163.07945 ]
  ...
  [ 94.573715 110.55229   83.59694 ]
  [125.869865 135.26268  127.33472 ]
  [122.579605 128.5796   126.579605]]

 [[195.73691  173.73691  159.73691 ]
  [196.896    174.896    160.896   ]
  [199.       177.       163.      ]
  ...
  [ 26.679413  38.759026  20.500835]
  [ 24.372307  31.440136  26.675896]
  [ 20.214453  26.214453  24.214453]]

 ...

 [[ 61.57369   70.18976  104.72547 ]
  [189.91965  199.61607  213.28572 ]
  [247.26637  255.       252.70387 ]
  ...
  [113.40158   83.40158   57.40158 ]
  [110.75214   78.75214   53.752136]
  [107.37048   75.37048   50.370483]]

 [[ 61.27007   69.88614  104.42185 ]
  [188.93079  198.62721  212.29686 ]
  [246.33257  255.       251.77007 ]
  ...
  [110.88623   80.88623   54.88623 ]
  [102.763245  70.763245  45.763245]
  [ 99.457634  67.457634  42.457638]]

 [[ 60.25893   68.875    103.41071 ]
  [188.58261  198.27904  211.94868 ]
  [245.93112  254.6097   251.36862 ]
  ...
  [105.02222   75.02222   49.022217]
  [109.11186   77.11186   52.111866]
  [106.56936   74.56936   49.56936 ]]]

Single label tensor: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Single sample class name: schipperke

And boom! We've got a numerical representation of a dog image (in the form of red, green, blue pixel values)!

This is exactly the kind of format we want for our model.

But can we reverse this process to check that it worked correctly? (So instead of going from image to numbers, can we go from numbers back to an image?)

You bet we can!

How to visualize images from our TensorFlow Dataset

OK, so now let's turn our single sample from a tensor format, back to an image format.

We can do so by passing the single sample image tensor to matplotlib's plt.imshow() (we'll also need to convert its datatype from float32 to uint8 to avoid matplotlib color range issues).

Input:

plt.imshow(image_batch[0].numpy().astype("uint8")) # convert tensor to uint8 to avoid matplotlib colour range issues
plt.title(dog_names[tf.argmax(label_batch[0])])
plt.axis("off");

Output:

Visualizing images from our TensorFlow Dataset

It works!

We went from dog image and classification to numbers (a tensor), and then back again.
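
If you're wondering why the uint8 conversion matters: plt.imshow() expects RGB values to be either floats between 0 and 1 or integers between 0 and 255. Our tensors are floats between 0 and 255, so they need adjusting first. Here's a small NumPy sketch of the two common fixes, using a few pixel values copied from the tensor output above.

```python
import numpy as np

# Three float32 pixels (red, green, blue) in the 0-255 range
pixels = np.array([[196.61607, 174.61607, 160.61607],
                   [60.095097, 79.75804, 45.769207],
                   [247.26637, 255.0, 252.70387]], dtype=np.float32)

# Option 1: cast to uint8 (drops the decimals, values stay in 0-255)
pixels_uint8 = pixels.astype("uint8")

# Option 2: keep floats but rescale to the 0-1 range
pixels_scaled = pixels / 255.0

print(pixels_uint8[0])
print(pixels_scaled.min(), pixels_scaled.max())
```

Either option should display the same picture. We used the uint8 route above since it's a one-liner.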

As you know by now though, we don’t want to accept just one result. We want to check a larger sample for peace of mind, so let’s do that now.

How to plot multiple images from our tensor data

We can do this by first setting up a plot with multiple subplots, and then iterating through our dataset with tf.data.Dataset.take(count=1).

This will "take" 1 batch of data (in our case, one batch is 32 samples) which we can then index on for each subplot.

Input:

# Create multiple subplots
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))

# Iterate through a single batch and plot images
for images, labels in train_ds.take(count=1): # note: because our training data is shuffled, each "take" will be different
  for i, ax in enumerate(axes.flat):
    ax.imshow(images[i].numpy().astype("uint8"))
    ax.set_title(dog_names[tf.argmax(labels[i])])
    ax.axis("off")

Output:

How to plot multiple images from our tensor data

Aren't those good looking dogs!

How to get labels from our TensorFlow Dataset

Since our data is now in the tf.data.Dataset format, there are a couple of important attributes we can pull from it if we wanted to.

The first one I recommend is the collection of file paths associated with a tf.data.Dataset created by image_dataset_from_directory(). These are accessible via the .file_paths attribute.

Note: You can often see a list of associated methods and attributes of a variable/class in Google Colab (or other IDEs) by pressing TAB afterwards (e.g. type variable_name. + TAB).

Input:

# Get the first 5 file paths of the training dataset
train_ds.file_paths[:5]

Output:

['images_split/train/boston_bull/n02096585_1753.jpg',
 'images_split/train/kerry_blue_terrier/n02093859_855.jpg',
 'images_split/train/border_terrier/n02093754_2281.jpg',
 'images_split/train/rottweiler/n02106550_11823.jpg',
 'images_split/train/airedale/n02096051_5884.jpg']

We can also get the class names associated with a dataset by using .class_names.

TensorFlow reads these from the names of our target folders in the images_split directory, like so.

Input:

# Get the class names TensorFlow has read from the target directory
class_names = train_ds.class_names
class_names[:5]

Output:

['affenpinscher',
 'afghan_hound',
 'african_hunting_dog',
 'airedale',
 'american_staffordshire_terrier']

Finally, we can also make sure that the class names are the same across our datasets by comparing them, with the following code:

assert set(train_10_percent_ds.class_names) == set(train_ds.class_names) == set(test_ds.class_names)
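
One thing to keep in mind: comparing sets only checks that the same names are present, not that they're in the same order. Order matters because a label's index comes from the position of its class name (by default, image_dataset_from_directory() sorts the folder names alphanumerically). Here's a pure Python sketch of the idea, with a few made-up class lists standing in for the .class_names attributes.

```python
# Hypothetical stand-ins for train_ds.class_names and test_ds.class_names
train_class_names = ["affenpinscher", "afghan_hound", "airedale"]
test_class_names = ["afghan_hound", "affenpinscher", "airedale"]  # same names, different order

# Sets ignore order, so this check passes...
print(set(train_class_names) == set(test_class_names))  # True

# ...but a list comparison catches the mismatch
print(train_class_names == test_class_names)  # False

# Why it matters: the same one-hot label decodes to different names
one_hot_label = [1.0, 0.0, 0.0]
label_index = one_hot_label.index(max(one_hot_label))  # pure Python version of argmax
print(train_class_names[label_index], "vs", test_class_names[label_index])
# affenpinscher vs afghan_hound
```

If you'd like the stricter check, comparing the .class_names lists directly (rather than as sets) would also confirm the order matches.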

Configuring our datasets for performance

There's one last step we're going to do before we build our first TensorFlow model, and that's to configure our datasets for performance.

Why?

Simply because data loading is one of the biggest bottlenecks in machine learning!

Sure, modern GPUs can perform calculations (matrix multiplications) to find patterns in data quite quickly. However, for the GPU to perform such calculations, the data needs to be there to be used, and that’s where the time sink is.

Good news for us is that if we follow the TensorFlow best practices for tf.data, TensorFlow will take care of all these optimizations and hardware acceleration for us.

We're going to call three methods on our dataset to optimize it for performance:

  • cache() - Cache the elements in the dataset in memory or a target folder (speeds up loading),
  • shuffle() - Shuffle a set number of samples in preparation for loading (this will mean our samples and batches of samples will be shuffled), for example, setting shuffle(buffer_size=1000) will prepare and shuffle 1000 elements of data at a time, and
  • prefetch() - Prefetch the next batch of data and prepare it for computation whilst the previous one is being computed on (can scale to multiple prefetches depending on hardware availability). TensorFlow can also automatically configure how many elements/batches to prefetch by setting prefetch(buffer_size=tf.data.AUTOTUNE)
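
To build some intuition for buffer_size, here's a simplified pure Python model of a shuffle buffer. It's a sketch of the general idea (fill a buffer, emit a random element, refill the gap), not TensorFlow's actual implementation, and buffered_shuffle() is a hypothetical name.

```python
import random

def buffered_shuffle(elements, buffer_size, seed=42):
    """Yield elements in a shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for element in elements:
        if len(buffer) < buffer_size:
            buffer.append(element)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new one in its place
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = element
    # Source is exhausted: emit whatever is left in a random order
    rng.shuffle(buffer)
    yield from buffer

samples = list(range(10))
print(list(buffered_shuffle(samples, buffer_size=1)))   # order unchanged
print(list(buffered_shuffle(samples, buffer_size=3)))   # only local shuffling
print(list(buffered_shuffle(samples, buffer_size=10)))  # every element can land anywhere
```

The takeaway is that a small buffer can only mix elements that are near each other, whereas a buffer as large as the dataset gives a full shuffle (at the cost of holding more in memory).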

Resource: For more performance tips on loading datasets in TensorFlow, see the Datasets Performance tips guide.

For our needs, let's start by calling cache() on our datasets to save the loaded samples to memory.

We'll then shuffle() the training splits with buffer_size=10*BATCH_SIZE for the training 10% split and buffer_size=100*BATCH_SIZE for the full training set.

Why use these numbers?

Only because that's how many I decided to use via my own personal experimentation. Please feel free to figure out a different number that may work better for your own needs.

Ideally, if your dataset isn't too large, you would shuffle all possible samples (TensorFlow has a method for finding the number of elements in a dataset called tf.data.Dataset.cardinality()).
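
Here's a small sketch of cardinality() on a toy dataset (not our dog images). One thing to watch: once a dataset is batched, each element is a batch, so cardinality() counts batches rather than individual samples.

```python
import tensorflow as tf

# Toy dataset: the numbers 0-9, grouped into batches of 3
toy_ds = tf.data.Dataset.range(10).batch(3)

num_batches = toy_ds.cardinality().numpy()
print(f"Number of batches: {num_batches}")  # Number of batches: 4

# Shuffle with a buffer big enough to hold every element (here, every batch)
toy_ds_shuffled = toy_ds.shuffle(buffer_size=num_batches, seed=42)
for batch in toy_ds_shuffled:
    print(batch.numpy())
```

Notice the numbers inside each batch stay together and only the order of the batches changes. Our dog datasets are already batched by image_dataset_from_directory(), so calling shuffle() on them works at the batch level too.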

However, we won't call shuffle() on the testing dataset since it isn't required. But, we will call prefetch(buffer_size=tf.data.AUTOTUNE) on each of our datasets to automatically load and prepare a number of data batches.

Input:

AUTOTUNE = tf.data.AUTOTUNE # let TensorFlow find the best values to use automatically

# Shuffle and optimize performance on training datasets
# Note: these methods can be chained together and will have the same effect as calling them individually
train_10_percent_ds = train_10_percent_ds.cache().shuffle(buffer_size=10*BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
train_ds = train_ds.cache().shuffle(buffer_size=100*BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)

# Don't need to shuffle test datasets (for easier evaluation)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

And just like that, our dataset is performance optimized!

So what’s next?

Congratulations on sticking with me this far, as I know this was a fairly large guide dedicated to just one aspect of our project process (and every ML project to be honest).

This task really is so important when it comes to machine learning and deep learning workflows. Not only in terms of making sure that your data is accurate and set up correctly, but also optimized. (Any time you can make it faster, you 100% should!)

You now have a solid understanding of data acquisition, exploration, preparation, and turning data into a TensorFlow dataset. Great work 😉.

In the next part of this guide, we’ll create our first neural network with TensorFlow, so make sure to subscribe via the link below so you don’t miss it. (Coming soon!)

P.S.

Remember also what I said earlier.

If you want to deep dive into Machine Learning and learn how to use these tools even further, then check out my complete Machine Learning and Data Science course or watch the first few videos for free.

learn machine learning ai and data science

It’s one of the most popular, highly rated Machine Learning and Data Science bootcamps online, as well as the most modern and up-to-date. Guaranteed.

You'll go from a complete beginner with no prior experience to getting hired as a Machine Learning Engineer this year, so it’s helpful for ML Engineers of all experience levels.

If you already have a good grasp of Machine Learning, and just want to focus on Tensorflow for Deep Learning, I have a course on that also that you can check out here.

learn tensorflow

When you join as a Zero To Mastery Academy member, you’ll have access to both of these courses, as well as every other course in our training library!

Not only that, but you will also be able to ask me questions, as well as chat to other students and machine learning professionals via our private Discord community.

So go ahead and check those out, and don’t forget to subscribe below so you don’t miss Parts 2 and 3 of this series on Tensorflow and deep learning!
