Building and Training TensorFlow Models to Create a Neural Network

Daniel Bourke

Welcome to Part 2 in my brand new 3-Part series on TensorFlow and Deep Learning.

Sidenote: Technically this 'mini series' is part of my larger 'Introduction to Machine Learning' series, but I went so deep on this particular section, I needed to make it into 3 parts!

Be sure to check out the other parts in this TensorFlow series, as they all lead into each other.

So a quick recap of the series so far.

The goal of this series is to give you an overview of deep learning (and more specifically, transfer learning) when using TensorFlow and Keras.

Even better still?

Rather than just tell you what this all means, I’m also going to walk you through a project that you can follow along with, so you can learn as you go.

The project we’re going to build is called ‘Dog Vision’. It’s a neural network capable of identifying different dog breeds via images.

[Image: dog project outline]

How this series works

In the first part of the series, we took the time to set up the project, get our data, explore it, create a training set, and then turn our data into a TensorFlow dataset.

I highly recommend you read Part 1 first if you haven’t already, so that you understand what’s happening.

In this new part of the series (this article you’re reading right now), we’ll take the dataset that we created in Part 1, and use it to build a neural network and then train (fit) the model on the data.

Then, in the third and final part of this series, we’ll evaluate our model, make predictions, and work through the deployment phases, which are crucial for understanding how to assess and utilize the trained models effectively.

So as you can see, we’re going to work all the way from dataset preparation to model building, training, and evaluation, so this is a complete project walkthrough from start to finish.

Not only will it help you to understand TensorFlow better, but you’ll have hands-on experience and a project for your portfolio by the end of it!

Why listen to me?

My name is Daniel Bourke, and I'm the resident Machine Learning instructor here at Zero To Mastery.

Originally self-taught, I worked for one of Australia's fastest-growing artificial intelligence agencies, Max Kelsen, and have worked on Machine Learning and data problems across a wide range of industries including healthcare, eCommerce, finance, retail, and more.

I'm also the author of Machine Learning Monthly, write my own blog on my experiments in ML, and run my own YouTube channel - which has hit over 8 Million views.

Sidenote: If you want to deep dive into Machine Learning and learn how to use these tools even further, then check out my complete Machine Learning and Data Science course or watch the first few videos for free.


It’s one of the most popular, highly rated Machine Learning and Data Science bootcamps online, as well as the most modern and up-to-date. Guaranteed.

You'll go from a complete beginner with no prior experience to getting hired as a Machine Learning Engineer this year, so it’s helpful for ML Engineers of all experience levels.


If you already have a good grasp of Machine Learning, and just want to focus on TensorFlow for Deep Learning, I also have a course on that, which you can check out here.


With that out of the way, let’s get into this guide.

What is a neural network?

Neural networks are one of the most flexible and customizable ‘deep learning’ machine learning models available, and you can create a neural network to fit almost any kind of data.

In fact, the "deep" in deep learning refers to the many layers that can be contained inside a neural network.

How neural networks work

A neural network will often follow the structure of:

Input layer -> Middle layers -> Output layer.

[Image: anatomy of a neural network]

The main premise is that data goes in one end, gets manipulated by many small functions in an attempt to learn patterns/weights which represent the data to produce useful outputs.

  • The input layer takes in the data
  • The middle layers perform calculations on the data and hopefully learn patterns (also called weights/biases) to represent the data
  • And the output layer performs a final transformation on the learned patterns to make them usable in human applications
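To make the three stages a little more concrete, here's a minimal NumPy sketch of data flowing through an input layer, one middle layer and an output layer. The layer sizes (4 inputs, 8 hidden units, 3 classes) and the random weights are made up purely for illustration. A real network learns its weights during training.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Input layer: one sample with 4 features (hypothetical sizes throughout)
x = rng.normal(size=(1, 4))

# Middle layer: weights and biases hold the "patterns" (random here, learned in practice)
W1 = rng.normal(size=(4, 8))
b1 = np.zeros(8)
hidden = np.maximum(0, x @ W1 + b1)  # ReLU keeps positive values and zeroes the rest

# Output layer: turn the hidden representation into one score per class
W2 = rng.normal(size=(8, 3))
b2 = np.zeros(3)
logits = hidden @ W2 + b2

# Softmax converts the raw scores into probabilities that sum to 1
exp_logits = np.exp(logits - logits.max(axis=-1, keepdims=True))
probabilities = exp_logits / exp_logits.sum(axis=-1, keepdims=True)

print(probabilities.shape)  # (1, 3)
```

The final softmax step is an example of the "final transformation" the output layer performs: raw numbers become class probabilities a human application can use.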

What goes into the middle layers of a neural network?

That's an excellent question, and it’s hard to answer fully because there are so many different options.

However, for the interest of keeping this guide fairly simple, we’re going to focus on two of the most popular modern kinds of neural networks:

  • Convolutional Neural Networks (CNNs), and
  • Transformers (the Transformer is the "T" in GPT, Generative Pretrained Transformer, such as in ChatGPT)

[Image: neural network middle layers]

Because our problem is in the computer vision space, we're going to use a CNN.

But instead of crafting our own CNN from scratch, we're going to take an existing CNN model and apply it to our own problem, by using a method called ‘transfer learning’.

So what is transfer learning?

Transfer learning is the process of getting an existing working model and adjusting it to your own problem. This means you can get better results in less time with less data, without having to build something from scratch.

Existing models often have the following advantages:

  • They are trained on lots of data. In computer vision, existing models are often pretrained on ImageNet, a dataset of 1M+ images, which means they've already learned patterns across many different kinds of images
  • They are crafted by expert researchers (large universities and companies such as Google and Meta often open-source their best models for others to try and use)
  • They are already trained on lots of computing hardware, so you don't need the same resources. The larger the model and the larger the dataset, the more compute power you need, and not everyone has access to 10s, 100s or 1000s of GPUs
  • They are proven to perform well on a given task through several studies (this means they have a good chance of performing well on your task if it's similar)

You may be thinking, ok this all sounds incredible, so where can I get pretrained models?

Well the good news is, there are plenty of places to find pretrained models!

[Image: where to find pretrained models]

How do you choose which to use?

Well, for most new machine learning problems, if you're looking to get good results quickly, you should generally look for a pretrained model similar to your problem and use transfer learning to adapt it to your own domain.

Building our neural network

With that in mind, and since we're focused on TensorFlow/Keras, we're going to be using a pretrained model from tf.keras.applications.

More specifically, we're going to take the tf.keras.applications.efficientnet_v2.EfficientNetV2B0() model from the 2021 machine learning paper EfficientNetV2: Smaller Models and Faster Training from Google Research and apply it to our own problem.

This model has been trained on ImageNet1k (1M+ images across 1000 diverse classes; there is also a version called ImageNet22k with 14M+ images across 22,000 categories), so it has a good baseline understanding of patterns in images across a wide domain.

ImageNet is also where the images for our dataset in Part 1 came from, so it’s a win-win situation.

Creating our base model

Let’s see if we can adjust those patterns slightly to our dog images.

To do this, we’ll create an instance of it and call it base_model.

Input:

# Create the input shape to our model
INPUT_SHAPE = (*IMG_SIZE, 3)

base_model = tf.keras.applications.efficientnet_v2.EfficientNetV2B0(
    include_top=True, # do we want to include the top layer? (yes for now: ImageNet has 1000 classes and the top layer is built for them, later we'll create our own top layer)
    include_preprocessing=True, # do we want the network to preprocess our data into the right format for us? (yes)
    weights="imagenet", # do we want the network to come with pretrained weights? (yes)
    input_shape=INPUT_SHAPE # what is the input shape of our data we're going to pass to the network? (224, 224, 3) -> (height, width, colour_channels)
)

Output:

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/efficientnet_v2/efficientnetv2-b0.h5
29403144/29403144 [==============================] - 0s 0us/step

And that’s our base model created! Easy right?

We can find out information about our base model by calling base_model.summary().

Input:

# Note: Uncomment to see full output
# base_model.summary()

Here’s a truncated output of base_model.summary():

[Image: truncated base_model.summary() output for EfficientNetV2B0]

Woah! Look at all those layers... this is what the "deep" in deep learning means! A deep number of layers.

How about we count the number of layers?

Input:

# Count the number of layers
print(f"Number of layers in base_model: {len(base_model.layers)}")

Output:

Number of layers in base_model: 273

273 layers!

Wow, there's a lot going on.

Rather than step through each layer and explain what's happening in each layer, I'll leave that for the curious mind to research on their own.

Just know that when starting out in deep learning, you don't need to know what's happening in every layer of a model to be able to use it.

For now, let's pay attention to a few things:

  • The input layer (the first layer) input shape, this will tell us the shape of the data the model expects as input
  • The output layer (the last layer) output shape, this will tell us the shape of the data the model will output.
  • The number of parameters of the model, these are "learnable" numbers (also called weights) that a model will use to derive patterns out of and represent the data. Generally, the more parameters a model has, the more learning capacity it has
  • The number of layers a model has. Generally, the more layers a model has, the more learning capacity it has (each layer will learn progressively deeper patterns from the data). However, this caps out at a certain range

So let’s break these down.

Understanding model input and output shapes

One of the most important practical steps in using a deep learning model is checking its input and output shapes.

Two questions to ask:

  • What is the shape of my input data?
  • What is the ideal shape of my output data?

We ask about shapes because in all deep learning models input and output data comes in the form of tensors.

[Image: where TensorFlow comes from]

This goes for text, audio, images and more.

The raw data gets converted to a numerical representation first before being passed to a model.

For example, in our case, our input data has the shape (32, 224, 224, 3) or (batch_size, height, width, colour_channels).

And our ideal output shape will be (32, 120) or (batch_size, number_of_dog_classes).

Your input and output shapes will differ depending on the problem and data you're working with. But as you get deeper into the world of machine learning (and deep learning), you'll find mismatched input and output shapes are one of the most common sources of errors.

We can check our model's input and output shapes with the .input_shape and .output_shape attributes.

Input:

# Check the input shape of our model
base_model.input_shape

Output:

(None, 224, 224, 3)

Nice! It looks like our model's input shape is where we want it.

This is because we passed input_shape=INPUT_SHAPE when creating it, and the model we chose, tf.keras.applications.efficientnet_v2.EfficientNetV2B0, has been trained on images the same size as our images.

Remember, None in this case is equivalent to a wildcard dimension, meaning it could be any value. Here it stands for the batch size, which we've set to 32 for our data.

If our model had a different input shape, we'd have to make sure we processed our images to be the same shape.
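To build some intuition for how the None wildcard behaves, here's a small plain-Python sketch. The shape_matches helper is hypothetical (it's not a Keras function), it just mimics the idea that every dimension must match except the ones marked None.

```python
def shape_matches(expected, actual):
    """Return True if `actual` fits `expected`, treating None as 'any size'."""
    # A different number of dimensions can never match
    if len(expected) != len(actual):
        return False
    # Every dimension must be equal unless the expected one is a None wildcard
    return all(e is None or e == a for e, a in zip(expected, actual))


model_input_shape = (None, 224, 224, 3)

print(shape_matches(model_input_shape, (32, 224, 224, 3)))  # True (batch of 32)
print(shape_matches(model_input_shape, (1, 224, 224, 3)))   # True (batch of 1)
print(shape_matches(model_input_shape, (224, 224, 3)))      # False (no batch dimension)
print(shape_matches(model_input_shape, (32, 128, 128, 3)))  # False (wrong image size)
```

Any batch size passes, but a missing batch dimension or a different image size does not.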

So now let's check the output shape.

Input:

# Check the model's output shape
base_model.output_shape

Output:

(None, 1000)

Hmm, is this what we're after?

No, not really. You see, since we have 120 dog classes in our dataset, we'd ideally like an output shape of (None, 120).

So then why is it by default (None, 1000)?

Well, this is because the model has been trained already on ImageNet, a dataset of 1,000,000+ images with 1000 classes (hence the 1000 in the output shape).

Changing the output shape

So how can we change this?

Well, let’s recreate a base_model instance, except this time we'll change the classes parameter to 120, like so:

Input:

# Create a base model with 120 output classes
base_model = tf.keras.applications.efficientnet_v2.EfficientNetV2B0(
    include_top=True,
    include_preprocessing=True,
    weights="imagenet",
    input_shape=INPUT_SHAPE,
    classes=len(dog_names)
)

base_model.output_shape

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-62-5e9b29e6f858> in <cell line: 2>()
      1 # Create a base model with 120 output classes
----> 2 base_model = tf.keras.applications.efficientnet_v2.EfficientNetV2B0(
      3     include_top=True,
      4     include_preprocessing=True,
      5     weights="imagenet",

/usr/local/lib/python3.10/dist-packages/keras/src/applications/efficientnet_v2.py in EfficientNetV2B0(include_top, weights, input_tensor, input_shape, pooling, classes, classifier_activation, include_preprocessing)
   1128     include_preprocessing=True,
   1129 ):
-> 1130     return EfficientNetV2(
   1131         width_coefficient=1.0,
   1132         depth_coefficient=1.0,

/usr/local/lib/python3.10/dist-packages/keras/src/applications/efficientnet_v2.py in EfficientNetV2(width_coefficient, depth_coefficient, default_size, dropout_rate, drop_connect_rate, depth_divisor, min_depth, bn_momentum, activation, blocks_args, model_name, include_top, weights, input_tensor, input_shape, pooling, classes, classifier_activation, include_preprocessing)
    932 
    933     if weights == "imagenet" and include_top and classes != 1000:
--> 934         raise ValueError(
    935             "If using `weights` as `'imagenet'` with `include_top`"
    936             " as true, `classes` should be 1000"

ValueError: If using `weights` as `'imagenet'` with `include_top` as true, `classes` should be 1000Received: classes=120

Oh no, we get an error!

If we look closer at the error, we’ll see this section:

ValueError: If using weights as 'imagenet' with include_top as true, classes should be 1000 Received: classes=120

So what does this mean?

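Well, the error is saying that the pretrained 'imagenet' weights combined with include_top=True only work with classes=1000, because the pretrained top layer was built for ImageNet's 1000 classes. Since we want to keep the pretrained weights (so that we can leverage the visual patterns/features that the model has already learned on ImageNet), we need to change a different parameter of the base_model.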

How to change our base model parameters

So, what we're going to do is create our own top layers, and we can do this by setting include_top=False.

What this means is that we'll use most of the model's existing layers to extract features and patterns out of our images, but then customize the final few layers to our own problem.

This kind of transfer learning is called feature extraction.

It’s a setup where you use an existing model's pretrained weights to extract features (or patterns) from your own custom data. You can then use those extracted features and further tailor them to your own use case.

Let's go ahead and create an instance of base_model without a top layer.

Input:

# Create a base model with no top
base_model = tf.keras.applications.efficientnet_v2.EfficientNetV2B0(
    include_top=False, # don't include the top layer (we want to make our own top layer)
    include_preprocessing=True,
    weights="imagenet",
    input_shape=INPUT_SHAPE,
)

# Check the output shape
base_model.output_shape

Output:

(None, 7, 7, 1280)

Hmm, so what's going on here with this new output shape?

This still isn't what we want, because we're after (None, 120) for our number of dog classes.

So how about we check the number of layers again?

Input:

# Count the number of layers
print(f"Number of layers in base_model: {len(base_model.layers)}")

Output:

Number of layers in base_model: 270

Looks like our new base_model has fewer layers than our previous one.

This is because we used include_top=False.

This means we've still got 270 base layers to extract features and patterns from our images, however, it also means we get to customize the output layers to our liking.

We'll come back to this shortly.
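As a rough preview of what "customizing the output layers" can look like, here's a NumPy-only sketch of one common approach: average over the 7x7 spatial grid (global average pooling) and then apply a dense layer with 120 outputs. This is an assumption for illustration, not necessarily the exact head we'll build, and the features and weights below are random stand-ins rather than real model outputs.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for base_model outputs on a batch of 32 images (random values, not real features)
features = rng.normal(size=(32, 7, 7, 1280)).astype(np.float32)

# Global average pooling: average over the 7x7 grid to get one 1280-long vector per image
pooled = features.mean(axis=(1, 2))

# Dense output layer mapping 1280 features to 120 dog classes (untrained random weights)
W = rng.normal(scale=0.01, size=(1280, 120)).astype(np.float32)
b = np.zeros(120, dtype=np.float32)
logits = pooled @ W + b

# Softmax turns the 120 scores into probabilities for each image
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(pooled.shape)  # (32, 1280)
print(probs.shape)   # (32, 120)
```

The shapes are the point here: (32, 7, 7, 1280) goes in and (32, 120) comes out, which is the (batch_size, number_of_dog_classes) shape we're after.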

Understanding model parameters

In traditional programming, the developer explicitly defines the rules or algorithms that manipulate the input data to produce the desired output. However, in machine learning, the process is different and essentially reversed.

In machine learning:

  1. Inputs and Outputs: You provide the model with a set of inputs (e.g., images of dogs) and the corresponding ideal outputs (e.g., labels indicating that the images are of dogs)
  2. Training: The model then uses this data to learn the relationship between the inputs and the outputs. During training, the model makes predictions and adjusts its internal parameters based on the difference between its predictions and the actual outputs
  3. Learning: Through iterative adjustments (using techniques such as gradient descent), the model identifies patterns and develops its own rules for mapping inputs to outputs. These learned rules are not explicitly programmed but are derived from the data itself
  4. Inference: Once trained, the model can take new, unseen inputs and predict the outputs based on the rules it has learned

In essence, machine learning involves providing the model with examples of the desired output for given inputs, allowing it to then learn the underlying patterns and rules needed to replicate that relationship.

This process allows the model to generalize from the training data to new data it hasn't seen before.
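Here's a tiny sketch of those four steps using plain NumPy and gradient descent. The "hidden rule" (y = 3x + 0.5), the learning rate and the number of steps are all made up for illustration. A neural network does the same kind of thing with millions of parameters instead of two.

```python
import numpy as np

# 1. Inputs and outputs: examples that follow a rule we never tell the model (y = 3x + 0.5)
x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x + 0.5

# The model's parameters start with no knowledge of the rule
w, b = 0.0, 0.0
learning_rate = 0.1

# 2 + 3. Training and learning: predict, measure the error, nudge the parameters
for step in range(500):
    predictions = w * x + b
    error = predictions - y
    grad_w = 2.0 * np.mean(error * x)  # gradient of mean squared error w.r.t. w
    grad_b = 2.0 * np.mean(error)      # gradient of mean squared error w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"Learned w={w:.3f}, b={b:.3f}")  # Learned w=3.000, b=0.500

# 4. Inference: use the learned parameters on an input the model hasn't seen
new_input = 2.0
print(f"Prediction for {new_input}: {w * new_input + b:.2f}")  # Prediction for 2.0: 6.50
```

Notice we never wrote the numbers 3 and 0.5 into the model. It recovered them from the examples alone.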

A model's parameters are the learned rules, and learned is the key word here.

In an ideal setup, we never tell the model what parameters to learn. Instead, it learns them itself by connecting input data to labels in supervised learning and by grouping together similar samples in unsupervised learning.

Note: Parameters are values learned by a model whereas hyperparameters (e.g. batch size) are values set by a human.

Parameters also get referred to as "weights" or "patterns" or "learned features" or "learned representations".

Generally, the more parameters a model has, the more capacity it has to learn. Each layer in a deep learning model has a specific number of parameters (these vary depending on which layer you use).
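To see where per-layer parameter counts come from, here's a short sketch using the standard formulas for a fully-connected (dense) layer and a 2D convolutional layer. The helper names and layer sizes are hypothetical examples, they aren't taken from EfficientNetV2B0's actual layers.

```python
def dense_params(inputs, units, use_bias=True):
    """One weight per input-unit pair, plus one bias per unit."""
    return inputs * units + (units if use_bias else 0)


def conv2d_params(kernel_height, kernel_width, in_channels, filters, use_bias=True):
    """One small kernel per (input channel, filter) pair, plus one bias per filter."""
    weights = kernel_height * kernel_width * in_channels * filters
    return weights + (filters if use_bias else 0)


# A dense layer going from 1280 features to 120 classes
print(dense_params(1280, 120))     # 153720

# A 3x3 convolution over an RGB image (3 channels) with 32 filters
print(conv2d_params(3, 3, 3, 32))  # 896
```

A model's total parameter count is just the sum of counts like these across all of its layers.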

The benefit of using a preconstructed model and transfer learning is that someone else has done the hard work in finding what combination of layers leads to a good set of parameters (a big thank you to these wonderful people).

How to count the parameters in our pretrained model

We can count the number of parameters in a model/layer via the .count_params() method.

Input:

# Check the number of parameters in our model
base_model.count_params()

Output:

5919312

Wow! Our model has 5,919,312 parameters.

That means each time an image goes through our model, it will be influenced in some small way by 5,919,312 numbers.

Each one of these is a potential learning opportunity (except for parameters that are non-trainable but we'll get to that soon too).

Now, you may be thinking, 5 million+ parameters sounds like a lot, and it is.

However, many modern large scale models, such as GPT-3 (175B) and GPT-4 (200B+? the actual number of parameters was never released) deal in the billions of parameters (Note: this is written in 2024, so if you're reading this in future, parameter counts may be in the trillions).

Generally, more parameters lead to better models. However, there are always trade-offs.

More parameters means more compute power to run the models.

In practice, if you have limited compute power (e.g. a single GPU on Google Colab which is what we used), then it's best to start with smaller models and gradually increase the size when necessary.
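For a rough feel of the trade-off, here's a back-of-the-envelope sketch of how much memory the weights alone take up. It assumes each parameter is stored as a 32-bit float (4 bytes) and ignores everything else a model needs at runtime (activations, gradients, optimizer state), so treat the numbers as a lower bound.

```python
BYTES_PER_FLOAT32 = 4  # assumption: every parameter is stored as a 32-bit float


def weights_size_mb(num_parameters, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Approximate size of a model's weights in megabytes (1 MB = 1,000,000 bytes)."""
    return num_parameters * bytes_per_parameter / 1_000_000


# Our base model without its top layers
print(f"{weights_size_mb(5_919_312):.1f} MB")         # 23.7 MB

# A 175 billion parameter model under the same assumption
print(f"{weights_size_mb(175_000_000_000):,.0f} MB")  # 700,000 MB
```

That's roughly 24 MB versus 700 GB, which is why a small model fits comfortably on a single Colab GPU and a very large one doesn't.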

We can get the trainable and non-trainable parameters from our model with the trainable_weights and non_trainable_weights attributes. (Remember, parameters are also referred to as weights).

Note: Trainable weights are parameters of the model which are updated by backpropagation during training (they are changed to better match the data), whereas non-trainable weights are parameters which are not updated by backpropagation during training (they are fixed in place).

How to count the trainable and non trainable parameters in your model

Let's write a function to count the non-trainable and trainable parameters of our model.

Input:

import numpy as np

def count_parameters(model, print_output=True):
  """
  Counts the number of trainable, non-trainable and total parameters of a given model.

  Prints the counts if print_output=True, otherwise returns them as a tuple.
  """
  # Each weight tensor contributes the product of its dimensions to the count
  trainable_parameters = np.sum([np.prod(weight.shape) for weight in model.trainable_weights])
  non_trainable_parameters = np.sum([np.prod(weight.shape) for weight in model.non_trainable_weights])
  total_parameters = trainable_parameters + non_trainable_parameters
  if print_output:
    print(f"Model {model.name} parameter counts:")
    print(f"Total parameters: {total_parameters}")
    print(f"Trainable parameters: {trainable_parameters}")
    print(f"Non-trainable parameters: {non_trainable_parameters}")
  else:
    return total_parameters, trainable_parameters, non_trainable_parameters

count_parameters(model=base_model, print_output=True)

Output:

Model efficientnetv2-b0 parameter counts:
Total parameters: 5919312
Trainable parameters: 5858704
Non-trainable parameters: 60608

Nice! It looks like our function worked, and most of our model's parameters are trainable.

This means they will be tweaked as they see more images of dogs.

However, a standard practice in transfer learning is to freeze the base layers of a model and only train the custom top layers to suit your problem.

[Image: our Dog Vision model]

Here you can see an example of how we can take a pretrained model and customize it to our own use case.

This kind of transfer learning workflow is often referred to as a feature extracting workflow as the base layers are frozen (not changed during training) and only the top layers are trained.

Note: In this image the EfficientNetB0 architecture is being demonstrated, however we're going to be using the EfficientNetV2B0 architecture which is slightly different. I've simply used the older architecture image from the research paper as a newer one wasn't available.

In other words, keep the patterns an existing model has learned on a similar problem (if they're good) to form a base representation of an input sample and then manipulate that base representation to suit our needs.

So why do this?

Simply because it's faster. The fewer trainable parameters, the faster your model training will be, and the faster your experiments will be.

But how will we know this works?

Well, we're going to run experiments to test it…

How to freeze the parameters of our base model

Okay, so how do we freeze the parameters of our base_model?

We can set its .trainable attribute to False.

Input:

# Freeze the base model
base_model.trainable = False
base_model.trainable

Output:

False

This means that our base_model is now frozen, so let's check how this affected the number of trainable and non-trainable parameters.

Input:

count_parameters(model=base_model, print_output=True)

Output:

Model efficientnetv2-b0 parameter counts:
Total parameters: 5919312.0
Trainable parameters: 0.0
Non-trainable parameters: 5919312

Beautiful!

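All of the parameters in our base_model are now non-trainable (frozen), which means they won't be updated during training. (The trailing .0 on the first two counts appears because np.sum over the now-empty list of trainable weights returns the float 0.0.)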
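To see how freezing plays out once a new top layer is added, here's a small sketch using a toy stand-in for a pretrained base. The toy_base and toy_model names and all of the layer sizes are hypothetical, and nothing is downloaded. The idea is the same as with our real base_model: the frozen part contributes only non-trainable parameters, while the new head stays trainable.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for a pretrained base (hypothetical sizes, randomly initialized)
toy_base = tf.keras.Sequential(
    [tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(16, activation="relu")],
    name="toy_base",
)
toy_base.trainable = False  # freeze every layer inside the base

# Stack a new, trainable output layer on top of the frozen base
toy_model = tf.keras.Sequential(
    [tf.keras.Input(shape=(8,)), toy_base, tf.keras.layers.Dense(3, activation="softmax")],
    name="toy_model",
)

# Count parameters the same way as before: product of each weight tensor's dimensions
trainable = sum(int(np.prod(w.shape)) for w in toy_model.trainable_weights)
frozen = sum(int(np.prod(w.shape)) for w in toy_model.non_trainable_weights)

print(f"Trainable parameters: {trainable}")   # 51  (16 * 3 weights + 3 biases in the head)
print(f"Non-trainable parameters: {frozen}")  # 144 (8 * 16 weights + 16 biases in the base)
```

Only the 51 head parameters would be updated during training, while the 144 base parameters stay exactly as they are.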

Sidenote: If you're struggling to follow along, or feeling a little overwhelmed, then make sure to check out my complete Machine Learning and Data Science course. I cover this exact project inside that course.

How to pass data through our model

We've spoken a couple of times about how our base_model is a "feature extractor" or "pattern extractor", but what does this mean?

It means that when a data sample goes through the base_model, its numbers get manipulated into a compressed set of features.

In other words, the layers of the model will each perform a calculation on the sample eventually leading to an output tensor with patterns the model has deemed most important.

This is often referred to as a ‘compressed feature space’, and it's one of the central ideas of deep learning.

For example, we can take a large input such as an image tensor of shape [224, 224, 3] and compress it into a smaller output such as a feature vector of shape [1280] that captures a useful representation of the input.

[Image: feature vector extraction]

Note: A feature vector is also referred to as an embedding. This is basically a compressed representation of a data sample that makes it useful.

The concept of embeddings is not limited to images either. It stretches across all data types (text, images, video, audio + more).

We can see this in action by passing a single image through our base_model.

Input:

# Extract features from a single image using our base model
feature_extraction = base_model(image_batch[0])
feature_extraction

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-957d897dc1dc> in <cell line: 2>()
      1 # Extract features from a single image using our base model
----> 2 feature_extraction = base_model(image_batch[0])
      3 feature_extraction

/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py in error_handler(*args, **kwargs)
     68             # To get the full stack trace, call:
     69             # `tf.debugging.disable_traceback_filtering()`
---> 70             raise e.with_traceback(filtered_tb) from None
     71         finally:
     72             del filtered_tb

/usr/local/lib/python3.10/dist-packages/keras/src/engine/input_spec.py in assert_input_compatibility(input_spec, inputs, layer_name)
    296                 if spec_dim is not None and dim is not None:
    297                     if spec_dim != dim:
--> 298                         raise ValueError(
    299                             f'Input {input_index} of layer "{layer_name}" is '
    300                             "incompatible with the layer: "

ValueError: Input 0 of layer "efficientnetv2-b0" is incompatible with the layer: expected shape=(None, 224, 224, 3), found shape=(224, 224, 3)

Oh no, another error!

If we look closer, we can see this section:

Output:

ValueError: Input 0 of layer "efficientnetv2-b0" is incompatible with the layer: expected shape=(None, 224, 224, 3), found shape=(224, 224, 3)

We've stumbled upon one of the most common errors in machine learning: a shape error.

In our case, the shape of the data we're trying to put into the model doesn't match the input shape the model is expecting.

Our input data shape is (224, 224, 3), or (height, width, colour_channels). However, our model is expecting (None, 224, 224, 3), or (batch_size, height, width, colour_channels).

Fixing our shape error

We can fix this error by adding a singular batch_size dimension to our input and thus make it (1, 224, 224, 3) (a batch_size of 1 for a single sample).

To do so, we can use tf.expand_dims(input=target_sample, axis=0), where target_sample is our input tensor and axis=0 means we want to add the new dimension at the front (the first position).

Input:

# Current image shape
shape_of_image_without_batch = image_batch[0].shape

# Add a batch dimension to our single image
shape_of_image_with_batch = tf.expand_dims(input=image_batch[0], axis=0).shape

print(f"Shape of image without batch: {shape_of_image_without_batch}")
print(f"Shape of image with batch: {shape_of_image_with_batch}")

Output:

Shape of image without batch: (224, 224, 3)
Shape of image with batch: (1, 224, 224, 3)

Perfect!

Now let's pass this image with a batch dimension to our base_model.

Input:

# Extract features from a single image using our base model
feature_extraction = base_model(tf.expand_dims(image_batch[0], axis=0))
feature_extraction

Output:

<tf.Tensor: shape=(1, 7, 7, 1280), dtype=float32, numpy=
array([[[[-2.19177201e-01, -3.44185606e-02, -1.40321642e-01, ...,
          -1.44454449e-01, -2.73809850e-01, -7.41252452e-02],
         [-8.69670734e-02, -6.48750067e-02, -2.14546964e-01, ...,
          -4.57209721e-02, -2.77900100e-01, -8.20885971e-02],
         [-2.76872963e-01, -8.26781020e-02, -3.85153107e-02, ...,
          -2.72128999e-01, -2.52802134e-01, -2.28105962e-01],
         ...,
         [-1.01604000e-01, -3.55145968e-02, -2.23027021e-01, ...,
          -2.26227805e-01, -8.61771777e-02, -1.60450727e-01],
         [-5.87608740e-02, -4.65543661e-03, -1.06193267e-01, ...,
          -2.87548676e-02, -9.06914026e-02, -1.82624385e-01],
         [-6.27618432e-02, -1.38620799e-03,  1.52704502e-02, ...,
          -7.85450079e-03, -1.84584558e-01, -2.62404829e-01]],

        [[-2.17334151e-01, -1.10280879e-01, -2.74605274e-01, ...,
          -2.22405165e-01, -2.74738282e-01, -1.01998925e-01],
         [-1.40700653e-01, -1.66820198e-01, -2.77449101e-01, ...,
           2.40375683e-01, -2.77627349e-01, -9.07808691e-02],
         [-2.40916476e-01, -2.00582087e-01, -2.38370374e-01, ...,
          -8.27576742e-02, -2.78428614e-01, -1.23056054e-01],
         ...,
         [-2.67296195e-01, -5.43131726e-03, -6.44061863e-02, ...,
          -3.34720500e-02, -1.55141622e-01, -3.23073938e-02],
         [-2.66513556e-01, -2.09966358e-02, -1.50375053e-01, ...,
          -6.29274473e-02, -2.69798309e-01, -2.74081439e-01],
         [-8.39830115e-02, -1.58605091e-02, -2.78447241e-01, ...,
          -1.43555822e-02, -2.77474761e-01,  1.37483165e-01]],

        [[-2.15840712e-01,  4.50323820e-01, -7.51058161e-02, ...,
          -2.43637279e-01, -2.75048614e-01, -6.00421876e-02],
         [-2.39066556e-01, -2.25066260e-01, -4.89832312e-02, ...,
          -2.77957618e-01, -1.14677951e-01, -2.69968715e-02],
         [-1.60943881e-01, -2.12972730e-01, -1.08622171e-01, ...,
          -2.78464079e-01, -1.95970193e-01, -2.92074662e-02],
         ...,
         [-2.67642140e-01, -7.13412274e-10, -2.47387841e-01, ...,
          -1.27752789e-03,  1.69062471e+00, -1.07747754e-02],
         [-2.69456387e-01, -3.02123808e-05, -2.19904676e-01, ...,
          -1.19841937e-02,  6.54936790e-01,  4.92877871e-01],
         [-1.83339473e-02, -9.84105989e-02, -2.77752399e-01, ...,
          -9.53171253e-02, -2.76987553e-01, -1.81873620e-01]],

        ...,

        [[-6.59235120e-02, -1.64803467e-03, -1.58951283e-01, ...,
          -1.34164095e-01, -6.30896613e-02, -7.77927637e-02],
         [-1.83377475e-01, -4.98497509e-04, -1.57654762e-01, ...,
          -4.48885784e-02, -1.06884383e-01, -2.78372377e-01],
         [-2.45749369e-01, -9.95399058e-03, -1.79216102e-01, ...,
          -1.02837617e-02, -1.84168354e-01, -1.70697242e-01],
         ...,
         [ 2.22050592e-01, -2.04384560e-04, -1.46467671e-01, ...,
          -2.65387502e-02, -1.85434178e-01, -9.71652716e-02],
         [ 1.52228832e+00, -3.39617883e-03, -3.22414264e-02, ...,
          -1.19287046e-02, -1.46435276e-01, -8.73169452e-02],
         [-1.89164400e-01, -5.49114570e-02, -2.05218419e-01, ...,
          -1.32163316e-01, -1.48950770e-01, -1.18042991e-01]],

        [[-2.16520607e-01, -7.84920622e-03, -1.43650264e-01, ...,
          -1.73660204e-01, -4.83706780e-02, -3.76228467e-02],
         [-2.78293848e-01, -6.24539470e-03, -2.28590608e-01, ...,
          -2.06465453e-01, -1.93291768e-01, -9.23046917e-02],
         [-2.40500003e-01, -2.73558766e-01, -1.58736348e-01, ...,
          -4.13209312e-02, -2.64240265e-01, -3.26484852e-02],
         ...,
         [-2.31358394e-01, -2.72292078e-01, -6.80670887e-02, ...,
          -2.16453914e-02, -2.71368980e-01, -3.88960652e-02],
         [-2.45319903e-01, -2.78179497e-01, -6.18890636e-02, ...,
          -1.86282583e-02, -2.23804727e-01, -2.72233319e-02],
         [-2.31111392e-01, -2.37449735e-01, -5.13911694e-02, ...,
          -4.55225781e-02, -2.74753064e-01, -3.51530202e-02]],

        [[-3.96142267e-02, -1.39998682e-02, -9.56050456e-02, ...,
          -2.33392462e-01, -1.83407709e-01, -4.99856956e-02],
         [-2.60713607e-01, -3.96164991e-02, -1.29626304e-01, ...,
          -2.78417081e-01, -2.78285533e-01, -7.70441368e-02],
         [-8.02241415e-02, -2.30456606e-01, -1.13508031e-01, ...,
          -5.45607917e-02, -2.71063268e-01, -2.75666509e-02],
         ...,
         [-9.41052362e-02, -2.42691532e-01, -5.48249595e-02, ...,
          -2.13044193e-02, -2.63691694e-01, -9.28506851e-02],
         [-9.08804908e-02, -2.40457997e-01, -7.88932368e-02, ...,
          -3.80579121e-02, -2.71065891e-01, -4.05692160e-02],
         [-1.26358300e-01, -2.17053503e-01, -7.44825602e-02, ...,
          -5.66985942e-02, -2.75216103e-01, -6.91162944e-02]]]],
      dtype=float32)>

Woah! Look at all those numbers!

After passing through ~270 layers, this is the numerical representation our model has created of our input image.

You might be thinking, okay, there's a lot going on here, how can I possibly understand all of them?

Well, with enough effort, you might. However, these numbers are more for a model/computer to understand than for a human to understand, so don’t worry about them.

How to check the shape of your feature_extraction

Let's not stop there, let's check the shape of our feature_extraction.

Input:

# Check shape of feature extraction
feature_extraction.shape

Output:

TensorShape([1, 7, 7, 1280])

Ok, it looks like our model has compressed our input image into a lower dimensional feature space.

Note: Feature space (or latent space or embedding space) is a numerical region where pieces of data are represented by tensors of various dimensions. Feature space is hard for humans to imagine because it could be 1000s of dimensions (humans are only good at imagining 3-4 dimensions at max).

But you can think of feature space as an area where numerical representations of similar items will be close together. If the feature space was a grocery store, one breed of dogs may be in one aisle (similar numbers) whereas another breed of dogs may be in the next aisle. You can see an example of a large embedding space representation of 8M Stack Overflow questions on Nomic Atlas.
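To make "close together" a bit more concrete, here's a minimal sketch that measures closeness with cosine similarity. The three vectors are made-up stand-ins invented for illustration (the real feature vectors from our model have 1280 values), and cosine similarity is just one common way to compare embeddings, not something this project depends on.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional feature vectors (invented for illustration)
labrador_photo_1 = [0.9, 0.1, 0.2]
labrador_photo_2 = [0.8, 0.2, 0.1]
poodle_photo = [0.1, 0.9, 0.7]

same_breed = cosine_similarity(labrador_photo_1, labrador_photo_2)
different_breed = cosine_similarity(labrador_photo_1, poodle_photo)

print(f"Same breed similarity: {same_breed:.3f}")
print(f"Different breed similarity: {different_breed:.3f}")
```

The two "labrador" vectors score much closer to 1 than the labrador/poodle pair, which is the "same aisle" idea in numbers.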

Let's compare the new shape to the input shape.

Input:

num_input_features = 224*224*3
feature_extraction_features = 1*7*7*1280

# Calculate the compression ratio
num_input_features / feature_extraction_features

Output:

2.4

Looks like our model has compressed the numerical representation of our input image by 2.4x so far.

But you might've noticed our feature_extraction is still a tensor, so how about we turn it into a vector and compress the representation even further?

How to turn our tensor into a vector

We can do this by taking our feature_extraction tensor and pooling together the inner dimensions.

By pooling, I mean taking the average or the maximum values.

Why?

Well, because a neural network often outputs a large amount of learned feature values but many of them can be insignificant compared to others.

So taking the average or the max across them helps us to compress the representation further while still preserving the most important features.

This process is often referred to as:

  • Average pooling - Take the average across given dimensions of a tensor, can perform with tf.keras.layers.GlobalAveragePooling2D()
  • Max pooling - Take the maximum value across given dimensions of a tensor, can perform with tf.keras.layers.GlobalMaxPooling2D() (the global counterpart to the windowed tf.keras.layers.MaxPooling2D())
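If you'd like to see what pooling does to the numbers themselves, here's a small NumPy sketch. The tiny (1, 2, 2, 3) array is a made-up stand-in for a feature map, and I'm assuming the usual (batch, height, width, channels) layout, so pooling happens over axes 1 and 2.

```python
import numpy as np

# A tiny stand-in for a feature map: (batch, height, width, channels)
feature_map = np.arange(1 * 2 * 2 * 3, dtype=np.float32).reshape(1, 2, 2, 3)

# Global average pooling: average over the height and width axes
avg_pooled = feature_map.mean(axis=(1, 2))

# Global max pooling: take the maximum over the height and width axes
max_pooled = feature_map.max(axis=(1, 2))

print(f"Original shape: {feature_map.shape}")
print(f"Pooled shape: {avg_pooled.shape}")
print(f"Average pooled values: {avg_pooled}")
print(f"Max pooled values: {max_pooled}")
```

Each channel keeps a single number, so the height and width dimensions disappear and only the batch and channel dimensions remain.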

Let's try to apply average pooling to our feature extraction and see what happens.

Input:

# Turn feature extraction into a feature vector
feature_vector = tf.keras.layers.GlobalAveragePooling2D()(feature_extraction) # pass feature_extraction to the pooling layer
feature_vector

Output:

<tf.Tensor: shape=(1, 1280), dtype=float32, numpy=
array([[-0.11521906, -0.04476562, -0.12476546, ..., -0.09118073,
        -0.08420841, -0.07769417]], dtype=float32)>

As you can see, we've compressed our feature_extraction tensor into a feature vector. (Notice the new shape of (1, 1280)).

Now if you're not sure what all these numbers mean, that's okay. I don't either. All you need to know is that a feature vector (also called an embedding) is supposed to be a numerical representation that's meaningful to computers.

We'll need to perform a few more transforms on it before it's recognizable to us, so let's check out its shape.

How to check our vector shape

Input:

# Check out the feature vector shape
feature_vector.shape

Output:

TensorShape([1, 1280])

We've reduced the shape of feature_extraction from (1, 7, 7, 1280) to (1, 1280).

What this means is we've gone from a tensor with multiple dimensions to a vector with one dimension of size 1280. Our neural network has performed calculations on our image and it is now represented by 1280 numbers.

This is one of the main goals of deep learning, to reduce higher dimensional information into a lower dimensional but still representative space.

Let's calculate how much we've reduced the dimensionality of our single input image.

Input:

# Compare the reduction
num_input_features = 224*224*3
feature_extraction_features = 1*7*7*1280
feature_vector_features = 1*1280

print(f"Input -> feature extraction reduction factor: {num_input_features / feature_extraction_features}")
print(f"Feature extraction -> feature vector reduction factor: {feature_extraction_features / feature_vector_features}")
print(f"Input -> feature extraction -> feature vector reduction factor: {num_input_features / feature_vector_features}")

Output:

Input -> feature extraction reduction factor: 2.4
Feature extraction -> feature vector reduction factor: 49.0
Input -> feature extraction -> feature vector reduction factor: 117.6

That’s a 117.6x reduction from our original image to its feature vector representation!

But why compress the representation like this?

Because representing our data in a compressed format but still with meaningful numbers (to a computer) means that less computation is required to reuse the patterns.

For example, imagine you had to relearn how to spell words every time you wanted to use them.

Would this be efficient?

Not at all. Instead, you take a while to learn them at the start and then continually reuse this knowledge over time. This is the same with a deep learning model.

It learns representative patterns in data, figures out the ideal connections between inputs and outputs and then reuses them over time in the form of numerical weights.

How to go from image to feature vector (practice time!)

We've covered a fair bit in the past few sections, so let's practice.

The important takeaway here is that one of the main goals of deep learning is to create a model that is able to take some kind of high dimensional data (e.g. an image tensor, a text tensor, an audio tensor) and extract meaningful patterns in it whilst compressing it to a lower dimensional form (e.g. a feature vector or embedding).

We can then use this lower dimensional form for our specific use cases, and one of the most powerful ways to do this is with transfer learning: taking an existing model from a similar domain to yours and applying it to your own problem.

To practice turning a data sample into a feature vector, let's start by recreating a base_model instance.

This time, we can also add in a pooling layer automatically using pooling="avg" or pooling="max".

Note: I demonstrated the use of the tf.keras.layers.GlobalAveragePooling2D() layer because not all pretrained models have the functionality of a pooling layer being built-in.

Input:

# Create a base model with no top and a pooling layer built-in
base_model = tf.keras.applications.efficientnet_v2.EfficientNetV2B0(
    include_top=False,
    weights="imagenet",
    input_shape=INPUT_SHAPE,
    pooling="avg", # can also use "max"
    include_preprocessing=True,
)

# Check the summary (optional)
# base_model.summary()

# Check the output shape
base_model.output_shape

Output:

(None, 1280)

Boom! We get the same output shape from the base_model as we did when using it with a pooling layer thanks to using pooling="avg".

Let's now freeze these base weights, so they're not trainable.

Input:

# Freeze the base weights
base_model.trainable = False

# Count the parameters
count_parameters(model=base_model, print_output=True)

Output:

Model efficientnetv2-b0 parameter counts:
Total parameters: 5919312.0
Trainable parameters: 0.0
Non-trainable parameters: 5919312

And now we can pass an image through our base model and get a feature vector from it.

Input:

# Get a feature vector of a single image (don't forget to add a batch dimension)
feature_vector_2 = base_model(tf.expand_dims(image_batch[0], axis=0))
feature_vector_2

Output:

<tf.Tensor: shape=(1, 1280), dtype=float32, numpy=
array([[-0.11521906, -0.04476562, -0.12476546, ..., -0.09118073,
        -0.08420841, -0.07769417]], dtype=float32)>

Wonderful!

Now is this the same as our original feature_vector?

Well, we can find out by comparing feature_vector and feature_vector_2 and seeing if all of the values are the same with np.all().

Input:

# Compare the two feature vectors
np.all(feature_vector == feature_vector_2)

Output:

True

Perfect, it worked!
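One thing to keep in mind: an exact == comparison on floating point numbers can be brittle, because two calculations that should agree can differ by a tiny rounding error. Here's a small sketch (with made-up numbers, not our feature vectors) showing how np.allclose() compares values within a tolerance instead.

```python
import numpy as np

# Two arrays that are "the same" apart from a tiny floating point rounding error
values_a = np.array([0.1 + 0.2, 0.5])
values_b = np.array([0.3, 0.5])

exact_match = bool(np.all(values_a == values_b))    # strict equality
close_match = bool(np.allclose(values_a, values_b))  # equality within a small tolerance

print(f"Exact match: {exact_match}")
print(f"Close match: {close_match}")
```

So if a comparison like the one above ever comes back False on your machine, trying np.allclose() is a reasonable next step before assuming something is broken.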

So now let's put it all together and create a full model for our dog vision problem.

Creating a custom model for our dog vision problem

The main steps when creating any kind of deep learning model from scratch are:

  1. Define the input layers
  2. Define the middle layers
  3. Define the output layers

These sound broad because they are. Deep learning models are almost infinitely customizable.

Good news is, thanks to transfer learning, all of our middle layers are defined by base_model (you could argue the input layer is created too).

So now it's up to us to define our input and output layers.

TensorFlow/Keras have two main ways of connecting layers to form a model.

  1. The Sequential model (tf.keras.Sequential) - Useful for making simple models with one tensor in and one tensor out, not suited for complex models
  2. The Functional API - Useful for making more complex and multi-step models but can also be used for simple models

Let's start with the Sequential model.

It takes a list of layers and will pass data through them sequentially.

Our base_model will be the input and middle layers and we'll use a tf.keras.layers.Dense() layer as the output (we'll discuss this shortly).

Creating a model with the Sequential API

The Sequential API is the most straightforward way to create a model.

And because your model comes in the form of a list of layers from input to middle layers to output, each layer is executed sequentially.

Input:

# Create a sequential model
tf.random.set_seed(42)
sequential_model = tf.keras.Sequential([base_model, # input and middle layers
                                        tf.keras.layers.Dense(units=len(dog_names), # output layer
                                                              activation="softmax")])
sequential_model.summary()

Output:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 efficientnetv2-b0 (Functio  (None, 1280)              5919312   
 nal)                                                            

 dense (Dense)               (None, 120)               153720    

=================================================================
Total params: 6073032 (23.17 MB)
Trainable params: 153720 (600.47 KB)
Non-trainable params: 5919312 (22.58 MB)
_________________________________________________________________

Wonderful!

We've now got a model with 6,073,032 parameters. However, only 153,720 of them (the ones in the dense layer) are trainable.

Our dense layer (also called a fully-connected layer or feed-forward layer) takes the outputs of the base_model and performs further calculations on them to map them to our required number of classes (120 for the number of dog breeds).
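If you're wondering where the 153,720 comes from, here's the arithmetic as a quick sketch. It assumes a standard dense layer with one weight per input/output pair plus one bias per output, and the numbers are taken from the model summary above.

```python
# Numbers taken from the model summary above
input_features = 1280        # size of the feature vector coming out of base_model
num_classes = 120            # number of dog breeds
base_model_params = 5919312  # frozen parameters in the base model

# One weight for every (input feature, output class) pair, plus one bias per class
weights = input_features * num_classes
biases = num_classes
dense_params = weights + biases

total_params = base_model_params + dense_params

print(f"Dense layer parameters: {dense_params}")
print(f"Total parameters: {total_params}")
```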

We can now use activation="softmax" (the Softmax function) to get prediction probabilities. These are values between 0 and 1 which represent how much our model "thinks" a specific image relates to a certain class.

Sidenote: There's another common activation function called Sigmoid, that we could use if we only had two classes, for example, "dog" or "cat".

Confusing, yes, but you'll get used to different functions with practice.
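To take some of the mystery out of these two functions, here's a minimal plain Python sketch of both. The input scores are made up, and this is a simplified illustration of the math rather than the exact TensorFlow implementation.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [value / total for value in exps]

def sigmoid(score):
    """Squash a single raw score into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical raw scores for three classes
probabilities = softmax([2.0, 1.0, 0.1])
print(f"Softmax probabilities: {probabilities}")
print(f"Sum of probabilities: {sum(probabilities)}")

# A single raw score for a two-class problem
print(f"Sigmoid of 0.0: {sigmoid(0.0)}")
```

Softmax spreads one probability "budget" of 1 across all of the classes, whereas sigmoid gives one standalone probability per score.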

The following table summarizes a few use cases.

sigmoid and softmax

Now that our model is built, let's check our input and output shapes.

Input:

# Check the input shape
sequential_model.input_shape

Output:

(None, 224, 224, 3)

Input:

# Check the output shape
sequential_model.output_shape

Output:

(None, 120)

Our sequential model takes in an image tensor of size [None, 224, 224, 3] and outputs a vector of shape [None, 120], where None is a placeholder for the batch size (it can be any number of images).

Now let's try our sequential model out with a single image input.

Input:

# Get a single image with a batch size of 1
single_image_input = tf.expand_dims(image_batch[0], axis=0)

# Pass the image through our model
single_image_output_sequential = sequential_model(single_image_input)

# Check the output
single_image_output_sequential

Output:

<tf.Tensor: shape=(1, 120), dtype=float32, numpy=
array([[0.00783153, 0.01119391, 0.00476165, 0.0072348 , 0.00766934,
        0.00753752, 0.00522398, 0.02337082, 0.00579716, 0.00539333,
        0.00549823, 0.01011768, 0.00610076, 0.0109506 , 0.00540159,
        0.0079683 , 0.01227358, 0.01056393, 0.00507148, 0.00996652,
        0.00604106, 0.00729022, 0.0155036 , 0.00745004, 0.00628229,
        0.00796217, 0.00905823, 0.00712278, 0.01243507, 0.006427  ,
        0.00602891, 0.01276839, 0.00652441, 0.00842482, 0.01247454,
        0.00749902, 0.01086363, 0.007803  , 0.0058652 , 0.00474356,
        0.00902809, 0.00715358, 0.00981051, 0.00444271, 0.01031628,
        0.00691859, 0.00699083, 0.0065892 , 0.00966169, 0.01177148,
        0.00908043, 0.00729699, 0.00496712, 0.00509035, 0.00584058,
        0.01068885, 0.00817651, 0.00602052, 0.00901201, 0.01008151,
        0.00495409, 0.01285929, 0.00480146, 0.0108622 , 0.01421483,
        0.00814719, 0.00910061, 0.00798947, 0.00789293, 0.00636969,
        0.00656019, 0.01309155, 0.00754355, 0.00702062, 0.00485884,
        0.00958675, 0.01086809, 0.00682202, 0.00923016, 0.00856321,
        0.00482627, 0.01234931, 0.01140433, 0.00771413, 0.01140642,
        0.00382939, 0.00891482, 0.00409833, 0.00771865, 0.00652135,
        0.00668143, 0.00935989, 0.00784146, 0.00751913, 0.00785116,
        0.00794632, 0.0079146 , 0.00798953, 0.01011222, 0.01318719,
        0.00721227, 0.00736159, 0.01369175, 0.01087009, 0.00510072,
        0.00843218, 0.00451756, 0.00966478, 0.01013771, 0.00715721,
        0.00367131, 0.00825834, 0.00832634, 0.01225684, 0.00724481,
        0.00670675, 0.00536995, 0.01070637, 0.00937007, 0.00998812]],
      dtype=float32)>

Nice!

Our model has output a tensor of prediction probabilities in shape [1, 120], with one value for each of our dog classes.

And thanks to the softmax function, all of these values are between 0 and 1 and they should all add up to 1 (or close to it).

Input:

# Sum the output
np.sum(single_image_output_sequential)

Output:

1.0

Beautiful!

So how do we figure out which of the values our model thinks is most likely?

Well, we take the index of the highest value!

How to find the index of the highest value

We can find the index of the highest value using tf.argmax() or by using np.argmax().

We'll get the highest value (not the index) alongside it.

Note that these values may change every time you run the code because the weights of the new dense layer are randomly initialized. Don't worry too much about them being different. In machine learning, randomness is a good thing.

So let's try.

Input:

# Find the index with the highest value
highest_value_index_sequential_model_output = np.argmax(single_image_output_sequential)
highest_value_sequential_model_output = np.max(single_image_output_sequential)

print(f"Highest value index: {highest_value_index_sequential_model_output} ({dog_names[highest_value_index_sequential_model_output]})")
print(f"Prediction probability: {highest_value_sequential_model_output}")

Output:

Highest value index: 7 (basenji)
Prediction probability: 0.023370817303657532

Hmm. This prediction probability value is quite low.

In this example, the model predicts "basenji" with a very low confidence of about 2.34%. With the highest potential value being 1.0, this indicates that the model is not very confident in its prediction.
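For context, a model that spread its guess evenly over 120 classes would give each class a probability of about 0.0083, so 0.0234 is not far off an even guess. It can also help to look at the top few predictions rather than only the highest one. Here's a small sketch of both ideas, where the class labels and probabilities are made up for illustration.

```python
import numpy as np

# Hypothetical prediction probabilities for 4 classes (invented for illustration)
class_labels = ["beagle", "basenji", "pug", "schipperke"]
pred_probs = np.array([0.05, 0.60, 0.10, 0.25])

# Sort indexes from highest to lowest probability and keep the top 3
top_k_indexes = np.argsort(pred_probs)[::-1][:3]
for index in top_k_indexes:
    print(f"{class_labels[index]}: {pred_probs[index]:.2f}")

# Probability per class if a model spread its guess evenly over 120 classes
uniform_baseline = 1 / 120
print(f"Even-guess baseline for 120 classes: {uniform_baseline:.4f}")
```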

Checking the original label

Next, let's verify the actual label for our single image to see if the model's prediction was accurate.

Input:

# Check the original label value
print(f"Predicted value: {highest_value_index_sequential_model_output}")
print(f"Actual value: {tf.argmax(label_batch[0]).numpy()}")

Output:

Predicted value: 7
Actual value: 95

Unfortunately, the model predicted the wrong label.
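In case the tf.argmax() call on the label looks odd: our labels are one-hot encoded, so each label is a vector of zeros with a single 1 at the index of the correct class. Here's a small NumPy sketch of that idea, using the index 95 from the output above.

```python
import numpy as np

num_classes = 120
true_index = 95  # the "Actual value" from the output above

# Build a one-hot label: all zeros except a 1 at the true class index
one_hot_label = np.zeros(num_classes, dtype=np.float32)
one_hot_label[true_index] = 1.0

# argmax returns the position of the largest value, which recovers the class index
recovered_index = int(np.argmax(one_hot_label))
print(f"Recovered class index: {recovered_index}")
```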

This discrepancy is expected because, although our model has pretrained parameters from ImageNet, the dense layer added at the end is initialized with random parameters. Therefore, initially, our model is essentially guessing the labels.

Verifying predicted vs. actual labels

To complete the analysis, let's compare the text-based labels from the model's prediction and the original ground truth.

Input:

# Index on class_names with our model's highest prediction probability
sequential_model_predicted_label = class_names[tf.argmax(sequential_model(tf.expand_dims(image_batch[0], axis=0)), axis=1).numpy()[0]]

# Get the truth label
single_image_ground_truth_label = class_names[tf.argmax(label_batch[0])]

# Print predicted and ground truth labels
print(f"Sequential model predicted label: {sequential_model_predicted_label}")
print(f"Ground truth label: {single_image_ground_truth_label}")

Output:

Sequential model predicted label: basenji
Ground truth label: schipperke

Here, the model predicted "basenji," whereas the actual label was "schipperke."

This result confirms that our model's initial predictions are not reliable due to the random initialization of the dense layer's parameters.

So what can we do?

Training the model will fix this, and we'll get to that soon. But first, let's try another method for creating a model!

How to create a model with the Functional API

As mentioned before, the Keras Functional API is another method for creating more complex models.

It can include multiple different modeling steps, but it can also be used for simple models, and it's the way we'll construct our Dog Vision models going forward.

Let's recreate our sequential_model using the Functional API.

We'll follow the same process as mentioned before:

  1. Define the input layers.
  2. Define the middle/hidden layers.
  3. Define the output layers.
  4. Bonus: Connect the inputs and outputs within an instance of tf.keras.Model().

Input:

# 1. Create input layer
inputs = tf.keras.Input(shape=INPUT_SHAPE)

# 2. Create hidden layer
x = base_model(inputs, training=False)

# 3. Create the output layer
outputs = tf.keras.layers.Dense(units=len(class_names), # one output per class
                                activation="softmax",
                                name="output_layer")(x)

# 4. Connect the inputs and outputs together
functional_model = tf.keras.Model(inputs=inputs,
                                  outputs=outputs,
                                  name="functional_model")

# Get a model summary
functional_model.summary()

Output:

Model: "functional_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_4 (InputLayer)        [(None, 224, 224, 3)]     0         

 efficientnetv2-b0 (Functio  (None, 1280)              5919312   
 nal)                                                            

 output_layer (Dense)        (None, 120)               153720    

=================================================================
Total params: 6073032 (23.17 MB)
Trainable params: 153720 (600.47 KB)
Non-trainable params: 5919312 (22.58 MB)
_________________________________________________________________

Our functional model is now created so let's try it out.

It works in the same fashion as our sequential_model (but hopefully a little better!).

Input:

# Pass a single image through our functional_model
single_image_output_functional = functional_model(single_image_input)

# Find the index with the highest value
highest_value_index_functional_model_output = np.argmax(single_image_output_functional)
highest_value_functional_model_output = np.max(single_image_output_functional)

highest_value_index_functional_model_output, highest_value_functional_model_output

Output:

(69, 0.017855722)

Looks like we got a slightly different value to our sequential_model (or they may be the same if randomness wasn't so random).

Why is this?

Because our functional_model was initialized with a random tf.keras.layers.Dense layer, so the outputs of our functional_model are essentially random as well (neural networks start with random numbers and adjust them to better represent patterns in data).

Not to fear, we'll fix this soon when we train our model.

Right now we've created our model with a few scattered lines of code, so how about we functionize the model creation so we can repeat it later on?

Functionizing model creation

We've created two different kinds of models so far. Both use the same layers, but one was built with the Keras Sequential API and the other with the Keras Functional API.

However, it would be quite tedious to rewrite that modeling code every time we wanted to create a new model, right?

So let's create a function called create_model() to replicate the model creation step with the Functional API.

Note: We're focused on the Functional API in this example, since it takes a bit more practice than the Sequential API.

Input:

def create_model(include_top: bool = False,
                 num_classes: int = 1000,
                 input_shape: tuple[int, int, int] = (224, 224, 3),
                 include_preprocessing: bool = True,
                 trainable: bool = False,
                 dropout: float = 0.2,
                 model_name: str = "model") -> tf.keras.Model:
  """
  Create an EfficientNetV2 B0 feature extractor model with a custom classifier layer.

  Args:
      include_top (bool, optional): Whether to include the top (classifier) layers of the model.
      num_classes (int, optional): Number of output classes for the classifier layer.
      input_shape (tuple[int, int, int], optional): Input shape for the model's images (height, width, channels).
      include_preprocessing (bool, optional): Whether to include preprocessing layers for image normalization.
      trainable (bool, optional): Whether to make the base model trainable.
      dropout (float, optional): Dropout rate for the optional dropout layer (the layer is commented out below, so this is currently unused).
      model_name (str, optional): Name for the created model.

  Returns:
      tf.keras.Model: A TensorFlow Keras model with the specified configuration.
  """
  # Create base model
  base_model = tf.keras.applications.efficientnet_v2.EfficientNetV2B0(
    include_top=include_top,
    weights="imagenet",
    input_shape=input_shape,
    include_preprocessing=include_preprocessing,
    pooling="avg" # Can use this instead of adding tf.keras.layers.GlobalAveragePooling2D() to the model
    # pooling="max" # Can use this instead of adding tf.keras.layers.GlobalMaxPooling2D() to the model
  )

  # Freeze the base model (if necessary)
  base_model.trainable = trainable

  # Create input layer
  inputs = tf.keras.Input(shape=input_shape, name="input_layer")

  # Create model backbone (middle/hidden layers)
  x = base_model(inputs, training=trainable)
  # x = tf.keras.layers.GlobalAveragePooling2D()(x) # note: you should include pooling here if not using `pooling="avg"`
  # x = tf.keras.layers.Dropout(dropout)(x) # optional regularization layer (search "dropout" for more)

  # Create output layer (also known as "classifier" layer)
  outputs = tf.keras.layers.Dense(units=num_classes,
                                  activation="softmax",
                                  name="output_layer")(x)

  # Connect input and output layer
  model = tf.keras.Model(inputs=inputs,
                         outputs=outputs,
                         name=model_name)

  return model

What a beautiful function!

Let's try it out.

Input:

# Create a model
model_0 = create_model(num_classes=len(class_names))
model_0.summary()

Output:

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_layer (InputLayer)    [(None, 224, 224, 3)]     0         

 efficientnetv2-b0 (Functio  (None, 1280)              5919312   
 nal)                                                            

 output_layer (Dense)        (None, 120)               153720    

=================================================================
Total params: 6073032 (23.17 MB)
Trainable params: 153720 (600.47 KB)
Non-trainable params: 5919312 (22.58 MB)
_________________________________________________________________

Woohoo! Looks like it worked!

Now how about we inspect each of the layers and whether they're trainable?

Input:

for layer in model_0.layers:
  print(layer.name, layer.trainable)

Output:

input_layer True
efficientnetv2-b0 False
output_layer True

Nice, looks like our base_model (efficientnetv2-b0) is frozen and not trainable, while our output_layer is trainable.

This means we'll be reusing the patterns learned in the base_model to feed into our output_layer and then customizing those parameters to suit our own problem.

How to train a model on 10% of the training data (Model 0)

We've seen our model make a couple of predictions on our data, and so far it hasn't done so well. This is expected though, as our model is essentially predicting random class values given an image.

So let's change that by training the final layer of our model so it's customized to recognizing the dog breeds in our project.

We can do so via five steps:

  1. Creating the model - We've done this ✅
  2. Compiling the model - Here's where we'll tell the model how to improve itself and how to measure its performance
  3. Fitting the model - Here's where we'll show the model examples of what we'd like it to learn (e.g. batches of samples containing pairs of dog images and their breed)
  4. Evaluating the model - Once our model is trained on the training data, we can evaluate it on the testing data (data the model has never seen)
  5. Making a custom prediction - Finally, the best way to test a machine learning model is by seeing how it goes on custom data. This is where we'll try to make a prediction on our own custom images of dogs

We'll work through each of these over the next few sections.

Recreating our model (for clarity)

We’ve done this already, but in the interest of having this task all in one section, let's create our model using the create_model() function that we made earlier.

Input:

# 1. Create model
model_0 = create_model(num_classes=len(class_names),
                       model_name="model_0")

model_0.summary()

Output:

Model: "model_0"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_layer (InputLayer)    [(None, 224, 224, 3)]     0         

 efficientnetv2-b0 (Functio  (None, 1280)              5919312   
 nal)                                                            

 output_layer (Dense)        (None, 120)               153720    

=================================================================
Total params: 6073032 (23.17 MB)
Trainable params: 153720 (600.47 KB)
Non-trainable params: 5919312 (22.58 MB)
_________________________________________________________________

Sidenote: Remember, if you're struggling to follow along, or feeling a little overwhelmed, then make sure to check out my complete Machine Learning and Data Science course, as I cover this exact project inside that course.

Compiling our model

After we've created our model, the next step is to compile it.

We can compile our model_0 using the tf.keras.Model.compile() method.

There are many options we can pass to the compile() method, however, the main ones we'll be focused on are:

  1. The optimizer - this tells the model how to improve based on the loss value
  2. The loss function - this measures how wrong the model is (e.g. how far off are its predictions from the truth, an ideal loss value is 0, meaning the model is perfectly predicting the data)
  3. The metric(s) - this is a human-readable value that shows how your model is performing, for example, accuracy is often used as an evaluation metric

These three settings work together to help improve a model.
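To show what a metric like accuracy boils down to, here's a minimal plain Python sketch. The label indexes are made up for illustration, and this is the basic idea rather than how TensorFlow computes it internally.

```python
# Hypothetical true and predicted class indexes for five images
true_labels = [95, 7, 12, 3, 95]
predicted_labels = [7, 7, 12, 5, 95]

# Accuracy = number of correct predictions / total number of predictions
correct = sum(truth == prediction for truth, prediction in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)

print(f"Correct predictions: {correct} out of {len(true_labels)}")
print(f"Accuracy: {accuracy}")
```

Notice that accuracy only counts right or wrong. The loss function, on the other hand, also cares about how confident each prediction was, which is why the model learns from the loss and we read the metric.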

Which optimizer should I use?

An optimizer tells a model how to improve its internal parameters (weights) to hopefully improve a loss value.

In most cases, improving the loss means to minimize it (a loss value is a measure of how wrong your model's predictions are, a perfect model will have a loss value of 0).

It does this through a process called gradient descent.

The gradients needed for gradient descent are calculated through backpropagation, a method that computes the gradient of the loss function with respect to each weight in the model.

Once the gradients have been calculated, the optimizer then tries to update the model weights so that they move in the opposite direction of the gradient (if you go down the gradient of a function, you reduce its value).

If you've never heard of the above processes, that's okay. TensorFlow implements many of them behind the scenes.

For now, the main takeaway is that neural networks learn in the following fashion:

  1. Start with random patterns/weights
  2. Look at data (forward pass)
  3. Try to predict data (with current weights)
  4. Measure performance of predictions (loss function, backpropagation calculates gradients of loss with respect to weights)
  5. Update patterns/weights (optimizer, gradient descent adjusts weights in the opposite direction of the gradients to minimize loss)
  6. Look at data (forward pass)
  7. Try to predict data (with updated weights)
  8. Measure performance (loss function)
  9. Update patterns/weights (optimizer)
  10. Repeat all of the above X times

how a neural network learns on dog image classification

Here’s an example of how a neural network learns.

Note the cyclical nature of the learning. You can think of it as a big game of guess and check, where the guess (hopefully) gets better over time.
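Here's the guess and check loop as a minimal sketch in plain Python. To keep it tiny, I'm assuming a "model" with a single weight and a made-up loss function, (weight - 3)², whose gradient we can write by hand. A real network has millions of weights and TensorFlow calculates the gradients for us, but the cycle is the same.

```python
def loss_fn(weight):
    """A made-up loss that is lowest (0) when weight equals 3."""
    return (weight - 3.0) ** 2

def gradient_fn(weight):
    """Gradient (slope) of the loss with respect to the weight."""
    return 2.0 * (weight - 3.0)

weight = 0.0          # starting guess (a real network starts with random weights)
learning_rate = 0.1   # how big a step to take on each update
loss_history = []

for step in range(50):
    loss_history.append(loss_fn(weight))         # measure how wrong the current guess is
    gradient = gradient_fn(weight)               # which direction increases the loss?
    weight = weight - learning_rate * gradient   # step in the opposite direction

print(f"Starting loss: {loss_history[0]}")
print(f"Final loss: {loss_history[-1]}")
print(f"Final weight: {weight}")
```

The loss shrinks on every pass and the weight ends up very close to 3, the value that minimizes our made-up loss.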

I'll leave the intricacies of gradient descent and backpropagation to your own extracurricular research.

For now, we're going to focus on using the tools TensorFlow has to offer to implement this process instead.

As for optimizer functions, there are two main options to get started:

optimizer functions

Why these two?

Because they're the most often used in practice (you can see this via the number of machine learning papers referencing each one on paperswithcode.com).

There are many other optimizers available in the tf.keras.optimizers module too.

The good thing about using a premade optimizer from tf.keras.optimizers is that it usually comes with good starting settings, one of the main ones being the learning_rate value.

The learning_rate is one of the most important hyperparameters to set in a neural network training setup.

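It determines how big a step the optimizer takes when adjusting your model's weights on each iteration. Too low and the model will learn very slowly (or barely at all). Too high and the model will try to take steps that are too big, which can overshoot good weight values and make training unstable.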
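Here's a small sketch of that trade-off in plain Python. It assumes a made-up one-weight problem (minimizing f(w) = w**2, whose best value is w = 0) and three arbitrary learning rates, so the exact numbers don't carry over to a real neural network, but the behaviour does.

```python
def run_gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * w
        w = w - learning_rate * grad
    return w

too_low = run_gradient_descent(learning_rate=0.0001)
reasonable = run_gradient_descent(learning_rate=0.1)
too_high = run_gradient_descent(learning_rate=1.1)

print(f"too low:    w = {too_low:.4f}  (barely moved from 1.0)")
print(f"reasonable: w = {reasonable:.4f}  (close to the minimum at 0)")
print(f"too high:   w = {too_high:.4f}  (overshoots further every step)")
```

With the same number of steps, the tiny learning rate hardly gets anywhere, the middle one gets close to the minimum and the large one ends up further away than where it started.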

By default, TensorFlow sets the learning rate of the Adam optimizer to 0.001 (tf.keras.optimizers.Adam(learning_rate=0.001)) which is a good setting for many problems to get started with.

We can also set this default with the shortcut optimizer="adam".

Input:

# Create optimizer (short version)
optimizer = "adam"

# The above line is the same as below
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
optimizer

Output:

<keras.src.optimizers.adam.Adam at 0x7f3bb4107040>
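If you're curious what Adam does with that learning_rate, below is a rough sketch of the standard Adam update rule for a single weight in plain Python. The default values (beta_1=0.9, beta_2=0.999, epsilon=1e-7) mirror the defaults of tf.keras.optimizers.Adam, but the function itself (adam_step) is a hypothetical helper written for this example. Keras's actual implementation differs in its details, so don't read this as the library's code.

```python
import math

def adam_step(w, grad, m, v, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One Adam update for a single weight. Returns (new_w, new_m, new_v)."""
    m = beta_1 * m + (1 - beta_1) * grad        # running average of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1 - beta_1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta_2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v

# Minimize f(w) = w**2 (gradient 2*w), starting from w = 1.0
w, m, v = 1.0, 0.0, 0.0
w_after_first_step, m, v = adam_step(w, grad=2 * w, m=m, v=v, t=1)
print(f"first step moved w by {w - w_after_first_step:.6f}")
```

On the very first step the weight moves by roughly the learning rate itself, regardless of how large the gradient is. That scaling is one reason a single default like 0.001 can work reasonably well across many different problems.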

Which loss function should I use?

A loss function measures how wrong your model's predictions are.

Why care?

Well, a model with poor predictions in comparison to the truth data will have a high loss value, whereas a model with perfect predictions (e.g. it gets every prediction correct with full confidence) will have a loss value of 0.

Different problems have different loss functions.

Some of the most common ones include:

different loss functions

In our case, since we're working with multi-class classification (multiple different dog breeds) and our labels are one-hot encoded, we'll be using tf.keras.losses.CategoricalCrossentropy.

We can leave all of the default parameters as they are as well.

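However, if we didn't have activation="softmax" in the final layer of our model, we'd have to change from_logits=False to from_logits=True. With softmax in place, the model already converts its raw outputs (logits) into prediction probabilities; without it, the loss function has to do that conversion itself.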
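For a feel of what the softmax conversion does, here's a minimal plain Python version. The three logit values are made up for illustration (our real model outputs 120 values, one per dog breed), and TensorFlow's own implementation is more involved than this.

```python
import math

def softmax(logits):
    """Convert raw model outputs (logits) into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw outputs for 3 classes
probs = softmax(logits)

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))          # 1.0
```

The order of the values stays the same (the largest logit gets the largest probability), but they're now all between 0 and 1 and add up to 1, which is what CategoricalCrossentropy expects when from_logits=False.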

There are more loss functions than the ones we've discussed and you can see many of them on paperswithcode.com. TensorFlow also has many more loss function implementations available in tf.keras.losses.

For now though, let's check out a single sample of our labels to make sure they're one-hot encoded.

Input:

# Check that our labels are one-hot encoded
label_batch[0]

Output:

<tf.Tensor: shape=(120,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0.], dtype=float32)>

Excellent! Looks like our labels are indeed one-hot encoded.
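One-hot labels pair nicely with categorical cross-entropy because the loss only ends up looking at the probability the model gave to the correct class. Here's a rough plain Python sketch of the calculation for a single sample. It assumes 4 hypothetical classes instead of our 120 and made-up predictions. The clipping value is also an assumption for this sketch, and tf.keras.losses.CategoricalCrossentropy handles these details in its own way.

```python
import math

def categorical_crossentropy(y_true, y_pred, epsilon=1e-7):
    """Cross-entropy for one sample: -sum(true * log(predicted))."""
    # Clip predictions away from 0 and 1 so log() is always defined
    clipped = [min(max(p, epsilon), 1 - epsilon) for p in y_pred]
    return -sum(t * math.log(p) for t, p in zip(y_true, clipped))

y_true = [0.0, 0.0, 1.0, 0.0]  # one-hot label: the correct class is index 2

confident_right = [0.01, 0.01, 0.97, 0.01]
unsure = [0.25, 0.25, 0.25, 0.25]
confident_wrong = [0.97, 0.01, 0.01, 0.01]

for name, y_pred in [("confident and right", confident_right),
                     ("unsure", unsure),
                     ("confident and wrong", confident_wrong)]:
    print(f"{name}: loss = {categorical_crossentropy(y_true, y_pred):.4f}")
```

The confident and correct prediction gets a loss close to 0, the unsure one sits in the middle and the confident but wrong prediction gets punished the most.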

Now let's create our loss function as tf.keras.losses.CategoricalCrossentropy(from_logits=False) or "categorical_crossentropy" for short.

We set from_logits=False (this is the default) because our model uses activation="softmax" in the final layer, so it's outputting prediction probabilities rather than logits. (Without activation="softmax", the outputs of our model would be referred to as logits. I'll leave this for your own extracurricular investigation.)

Input:

# Create our loss function
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False) # use from_logits=False if using an activation function in final layer of model (default)
loss

Output:

<keras.src.losses.CategoricalCrossentropy at 0x7f3bb4107430>

Which metrics should I use?

The evaluation metric is a human-readable value which is used to see how well your model is performing.

(A slightly confusing concept is that the evaluation metric and loss function can be the same equation).

However, the main difference between a loss function and an evaluation metric is that the loss function will typically be differentiable (so gradients can be calculated from it), whereas the evaluation metric does not have to be. There are some exceptions to this rule.

In the case of regression (predicting a number), your loss function and evaluation metric could be mean squared error (MSE).

Whereas in the case of classification, your loss function will generally be binary cross entropy (for two classes) or categorical cross entropy (for multiple classes) and your evaluation metric(s) could be accuracy, F1-score, precision and/or recall.

TensorFlow provides many pre-built metrics in the tf.keras.metrics module.

built-in tensorflow metrics

The tf.keras.Model.compile() method expects the metrics parameter input as a list.

TL;DR

Since we're working with a classification problem, let's set up our evaluation metric as accuracy.

Input:

# Create list of evaluation metrics
metrics = ["accuracy"]
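To see how accuracy differs from the loss, here's a small plain Python sketch. The labels and predictions are made up (4 samples, 3 classes) and the helper functions are hypothetical stand-ins for what the "accuracy" metric and the categorical cross-entropy loss calculate. It isn't TensorFlow's implementation.

```python
import math

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(y_true, y_pred):
    """Fraction of samples whose highest-probability class matches the label."""
    correct = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_crossentropy(y_true, y_pred):
    """Average categorical cross-entropy over all samples."""
    losses = [-sum(t * math.log(p) for t, p in zip(ts, ps))
              for ts, ps in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

# Hypothetical one-hot labels for 4 samples and 3 classes
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

# Two sets of predictions that pick the same classes with different confidence
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4], [0.4, 0.3, 0.3]]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9], [0.4, 0.3, 0.3]]

for name, preds in [("hesitant", hesitant), ("confident", confident)]:
    print(f"{name}: accuracy = {accuracy(y_true, preds):.2f}, "
          f"loss = {mean_crossentropy(y_true, preds):.4f}")
```

Both sets of predictions get 3 out of 4 correct, so the accuracy is identical, but the loss is lower for the more confident predictions. Accuracy is easy for us to read, while the loss gives the optimizer a smoother signal to improve on.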

Bonus: Want to learn more on how a model learns?

We've briefly touched on optimizers, loss functions, gradient descent and backpropagation, which are the backbone of neural network learning.

However, for a more in-depth look at each of these, I recommend checking out the following:

  • 3Blue1Brown's series on Neural Networks - a fantastic 4 part video series covering everything from how neural networks are built to how they learn through gradient descent and backpropagation.
  • The Little Book of Deep Learning by François Fleuret - a free ~150 page booklet on the ins and outs of deep learning. The notation may be intimidating at first but with practice you will begin to understand it.

Putting it all together and compiling our model 0

Phew!

We've been through all the main steps in compiling a model:

  1. Creating the optimizer
  2. Creating the loss function
  3. Creating the evaluation metrics

Now let's put everything we've done together and compile our model_0.

First we'll do it with shortcuts (e.g. "accuracy") then we'll do it with specific classes.

Input:

# Compile model with shortcuts (faster to write code but less customizable)
model_0.compile(optimizer="adam",
                loss="categorical_crossentropy",
                metrics=["accuracy"])

# Compile model with classes (will do the same as above)
model_0.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
                metrics=["accuracy"])

Fitting a model on the data

Now that we have our model created and compiled, it’s time to fit it to the data.

This means we're going to pass all of the data that we have from Part 1 in this series (dog images and their assigned labels) through our model and ask it to try and learn the relationship between the images and the labels.

Fitting the model is step 3 in our list:

  1. Creating the model - We've done this ✅
  2. Compiling the model - We've done this ✅
  3. Fitting the model - Here's where we'll show the model examples of what we'd like it to learn (e.g. the relationship between an image of a dog and its breed)
  4. Evaluating the model - Once our model is trained on the training data, we can evaluate it on the testing data (data the model has never seen)
  5. Making a custom prediction - Finally, the best way to test a machine learning model is by seeing how it goes on custom data. This is where we'll try to make a prediction on our own custom images of dogs

We can fit our model_0 instance with the tf.keras.Model.fit() method.

The main parameters of the fit() method we'll be paying attention to are:

  • x = What data do you want the model to train on?
  • y = What labels do you want your model to learn to map your data to?
  • batch_size = The number of samples your model will look at per gradient update (e.g. 32 samples at a time before updating its internal patterns)
  • epochs = How many times do you want the model to go through all samples (e.g. epochs=5 means looking at all of the data 5 times)?
  • validation_data = What data do you want to evaluate your model's learning on?

There are plenty more options in the TensorFlow/Keras documentation for the fit() method. However, these options will be more than enough for us.

In our case, let's keep our experiments quick and set the following:

  • x=train_10_percent_ds - Since we've crafted a tf.data.Dataset, our x and y values are combined into one. We'll also start by training on 10% of the data for quicker experimentation (if things work on a smaller subset of the data, we can always increase it).
  • epochs=5 - The more epochs you do, the more opportunities your model has to learn patterns, however, it also prolongs training.
  • validation_data=test_ds - We'll evaluate the model's learning on the test dataset (samples it's never seen before).
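As a quick sanity check on how batch_size and epochs interact, here's a short sketch. The numbers are assumptions for illustration: a 10% training subset of 1,200 images and a batch size of 32. Neither is stated in this section, though together they would be consistent with the 38 steps per epoch shown in the training log further down.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches (gradient updates) needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

num_samples = 1200  # assumed size of the 10% training subset
batch_size = 32     # assumed batch size
epochs = 5

steps = steps_per_epoch(num_samples, batch_size)
total_updates = steps * epochs

print(f"{steps} steps per epoch, {total_updates} weight updates over {epochs} epochs")
```

So under these assumptions, 5 epochs works out to 190 weight updates in total, and the final batch of each epoch is a little smaller than the rest.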

Let's do it!

Time to train our first neural network and bring Dog Vision 🐶👁️ to life!

Note: If you don't have a GPU here, training will likely take a considerably long time.

You can activate a GPU in Google Colab by going to Runtime -> Change runtime type -> Hardware accelerator -> GPU. Note that changing a runtime type will mean you will have to restart your runtime and rerun all of the cells above, but it will take far less time to run.

Input:

# Fit model_0 for 5 epochs
epochs = 5
history_0 = model_0.fit(x=train_10_percent_ds,
                        epochs=epochs,
                        validation_data=test_ds)

Output:

Epoch 1/5
38/38 [==============================] - 27s 482ms/step - loss: 3.9758 - accuracy: 0.3000 - val_loss: 3.0500 - val_accuracy: 0.5415
Epoch 2/5
38/38 [==============================] - 14s 379ms/step - loss: 2.0531 - accuracy: 0.8008 - val_loss: 1.8650 - val_accuracy: 0.7041
Epoch 3/5
38/38 [==============================] - 14s 375ms/step - loss: 1.0491 - accuracy: 0.9025 - val_loss: 1.3060 - val_accuracy: 0.7548
Epoch 4/5
38/38 [==============================] - 14s 373ms/step - loss: 0.6138 - accuracy: 0.9483 - val_loss: 1.0317 - val_accuracy: 0.7910
Epoch 5/5
38/38 [==============================] - 14s 373ms/step - loss: 0.4157 - accuracy: 0.9683 - val_loss: 0.8927 - val_accuracy: 0.8044

It looks like our model performed outstandingly well, and achieved a validation accuracy of ~80% after just 5 epochs of training!

This is far better than the 22% accuracy reported in the original Stanford Dogs paper (the paper that this project is based on).

How did it perform so much better?

Well, that's the power of transfer learning combined with a series of modern updates to neural network architectures, hardware and training regimes.
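The numbers in the training log can also be inspected with code. After fit() finishes, Keras stores the per-epoch values in the history attribute of the returned object (history_0.history in our case). So that the sketch below runs without TensorFlow, I've copied the values from the output above into a plain dictionary instead. The summary it prints is one possible way to look at them, not the only one.

```python
# Values copied from the training output above. After model.fit(), Keras keeps
# the same per-epoch numbers in the `history` attribute of the returned object.
history = {
    "loss":         [3.9758, 2.0531, 1.0491, 0.6138, 0.4157],
    "accuracy":     [0.3000, 0.8008, 0.9025, 0.9483, 0.9683],
    "val_loss":     [3.0500, 1.8650, 1.3060, 1.0317, 0.8927],
    "val_accuracy": [0.5415, 0.7041, 0.7548, 0.7910, 0.8044],
}

# Compare training and validation accuracy epoch by epoch
for epoch, (train_acc, val_acc) in enumerate(
        zip(history["accuracy"], history["val_accuracy"]), start=1):
    gap = train_acc - val_acc
    print(f"epoch {epoch}: train acc = {train_acc:.4f}, "
          f"val acc = {val_acc:.4f}, gap = {gap:+.4f}")

# Find the epoch with the highest validation accuracy (epochs start at 1)
best_epoch = max(range(len(history["val_accuracy"])),
                 key=lambda i: history["val_accuracy"][i]) + 1
print(f"best validation accuracy at epoch {best_epoch}")
```

Validation accuracy improves every epoch and is highest at epoch 5, while training accuracy pulls ahead of it from epoch 2 onwards.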

So what’s next?

And just like that, our neural network has been built and trained.

However, we have to remember that these are just numbers on a page, and we’ll need to evaluate our results in more detail to be sure.

We'll get more in-depth on evaluations in Part 3 of this series. (Coming soon).

In the next part of this guide, we’ll evaluate our trained model and make predictions, so that we can get this project finished and up and running, and you can add this huge project to your portfolio!

Be sure to subscribe via the link below so you don’t miss it.

P.S.

If you want to deep dive into Machine Learning and learn how to use these tools even further, then check out my complete Machine Learning and Data Science course or watch the first few videos for free.

It’s one of the most popular, highly rated Machine Learning and Data Science bootcamps online, as well as the most modern and up-to-date. Guaranteed.

You'll go from a complete beginner with no prior experience to getting hired as a Machine Learning Engineer this year, so it’s helpful for ML Engineers of all experience levels.

Or, if you already have a good grasp of Machine Learning and just want to focus on Tensorflow for Deep Learning, I also have a course on that, which you can check out here.

When you join as a Zero To Mastery Academy member, you’ll have access to both of these courses, as well as every other course in our training library!

Not only that, but you will also be able to ask me questions, as well as chat to other students and machine learning professionals via our private Discord community.

So go ahead and check those out, and don’t forget to subscribe below so you don’t miss Part 3 of this series on Tensorflow and deep learning!
