Model Evaluation, Prediction, and Deployment with TensorFlow

Daniel Bourke

Welcome to Part 3 in my brand new 3-part series on TensorFlow and Deep Learning.

Sidenote: Technically this 'mini series' is part of my larger 'Introduction to Machine Learning' series, but I went so deep on this particular section, I needed to make it into 3 parts!

Be sure to check out the other parts in this TensorFlow series, as they all lead into each other.

So a quick recap of the series so far.

The goal of this series is to give you an overview of deep learning (and more specifically, transfer learning) when using TensorFlow and Keras.

Even better still?

Rather than just tell you what this all means, I’m also going to walk you through a project that you can follow along with, so you can learn as you go.

The project we’re going to build is called ‘Dog Vision’. It’s a neural network capable of identifying different dog breeds via images.

dog project outline

How this series works

In the first part of the series, we took the time to set up the project, get our data, explore it, create a training set, and then turn our data into a TensorFlow dataset. This is an essential skill for any machine learning project, but a fairly large task, which is why it got a part of its own.

In the second part of the series, we took the dataset that we created in Part 1 and used it to build a neural network, compile it, and then fit it on the data.

Finally, in the third part of this series (which you’re reading right now), we’ll evaluate our model, make predictions, and work through the deployment phases, all of which are crucial for understanding how to assess and use a trained model effectively.

So let’s finish up this project!

Why listen to me?

My name is Daniel Bourke, and I'm the resident Machine Learning instructor here at Zero To Mastery.

Originally self-taught, I worked for one of Australia's fastest-growing artificial intelligence agencies, Max Kelsen, and have worked on Machine Learning and data problems across a wide range of industries including healthcare, eCommerce, finance, retail, and more.

I'm also the author of Machine Learning Monthly, write my own blog on my experiments in ML, and run my own YouTube channel - which has hit over 8 Million views.

Sidenote: If you want to deep dive into Machine Learning and learn how to use these tools even further, then check out my complete Machine Learning and Data Science course or watch the first few videos for free.

learn machine learning ai and data science

It’s one of the most popular, highly rated Machine Learning and Data Science bootcamps online, as well as the most modern and up-to-date. Guaranteed.

You'll go from a complete beginner with no prior experience to getting hired as a Machine Learning Engineer this year, so it’s helpful for ML Engineers of all experience levels.

Want a sample of the course? Well, check out the video below:

If you already have a good grasp of Machine Learning and just want to focus on TensorFlow for Deep Learning, I also have a course on that, which you can check out here.

learn tensorflow

With that out of the way, let’s get into this guide.

How to evaluate Model 0 on the test data

The next step in our journey is to evaluate our trained model.

There are several ways to do this:

  • Look at the metrics (such as accuracy)
  • Plot the loss curves
  • Make predictions on the test set and compare them to the truth labels
  • Make predictions on custom samples (not contained in the training or test sets)

We've done the first one, as these metrics were the outputs of our model training. So now we're going to focus on the next two - plotting loss curves and making predictions on the test set.

(Don’t worry as we’ll get to custom images later on also).

So what are loss curves?

Loss curves visualize how your model's loss value changes over time. An ideal loss curve will start high and move towards zero. A perfect model will have a loss value of zero.

loss curves

We say loss "curves" as a plural because you can have a loss curve for each dataset: training, validation, and test.

How do we get a loss curve?

We have a few options.

  • We could manually plot the loss values output from our model training
  • We could programmatically get the values thanks to the History object, which is returned by the fit() method of tf.keras.Model instances

The good news is that we've already got one, from the work we did in Part 2 of this series. It should be saved to history_0. (The model history for model_0).

The History.history attribute contains a record of the training loss values and evaluation metrics for each epoch.

So let's check it out.


# Inspect History.history attribute for model_0
history_0.history


{'loss': [3.926330089569092, ...],
 'accuracy': [0.32249999046325684, ...],
 'val_loss': [2.996889591217041, ...],
 'val_accuracy': [0.5548951029777527, ...]}

It works and we've got a history of our model training over time.

Not only that, but it looks like everything is moving in the right direction. Loss is going down whilst accuracy is going up, which is the ideal outcome for our loss curves.
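Because History.history is a plain Python dictionary of lists (one value per epoch), you can also query it directly. Here's a minimal sketch of that idea. Note that the values below are made up for illustration (they are not the real history_0 values) and summarize_history() is a hypothetical helper, not part of Keras.

```python
# Hypothetical stand-in for history_0.history (values are illustrative only)
example_history = {
    "loss": [3.9, 2.1, 1.2, 0.8, 0.6],
    "accuracy": [0.32, 0.61, 0.74, 0.81, 0.85],
    "val_loss": [3.0, 1.9, 1.3, 1.0, 0.9],
    "val_accuracy": [0.55, 0.68, 0.75, 0.79, 0.81],
}


def summarize_history(history_dict):
    """Return the final value of each metric and the epoch with the lowest validation loss."""
    final_values = {name: values[-1] for name, values in history_dict.items()}
    val_loss = history_dict["val_loss"]
    best_epoch = val_loss.index(min(val_loss)) + 1  # epochs are counted from 1
    return final_values, best_epoch


final_values, best_epoch = summarize_history(example_history)
print(final_values)
print(f"Lowest validation loss at epoch {best_epoch}")
```

With a real History object, you'd pass in history_0.history instead of the example dictionary.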

So what now?

Well, how about we adhere to the data explorer's motto of visualize, visualize, visualize! and write a function to plot this data so it's easier to understand.

How to plot your loss curves

We'll call the function plot_model_loss_curves(). It will take a History object as input and then plot loss and accuracy curves using matplotlib, like so:


def plot_model_loss_curves(history: tf.keras.callbacks.History) -> None:
  """Takes a History object and plots loss and accuracy curves."""

  # Get the accuracy values
  acc = history.history["accuracy"]
  val_acc = history.history["val_accuracy"]

  # Get the loss values
  loss = history.history["loss"]
  val_loss = history.history["val_loss"]

  # Get the number of epochs
  epochs_range = range(len(acc))

  # Create accuracy curves plot
  plt.figure(figsize=(14, 7))
  plt.subplot(1, 2, 1)
  plt.plot(epochs_range, acc, label="Training Accuracy")
  plt.plot(epochs_range, val_acc, label="Validation Accuracy")
  plt.legend(loc="lower right")
  plt.title("Training and Validation Accuracy")

  # Create loss curves plot
  plt.subplot(1, 2, 2)
  plt.plot(epochs_range, loss, label="Training Loss")
  plt.plot(epochs_range, val_loss, label="Validation Loss")
  plt.legend(loc="upper right")
  plt.title("Training and Validation Loss")

  plt.show()

# Plot the loss curves of model_0
plot_model_loss_curves(history=history_0)

loss curves on our data

Woohoo! Now those are some nice-looking curves.

Our model is doing exactly what we'd like it to do. The accuracy is moving up while the loss is going down. However, you might be wondering why there's a gap between the training and validation loss curves, as ideally, the two lines would closely follow each other.

Well, in our case, the validation loss doesn't decrease as low as the training loss.

This is known as overfitting, which is a common problem in machine learning where a model learns the training data very well but doesn't generalize to other unseen data.

So let me explain...

Overfitting and underfitting (for when your model doesn't perform how you'd like)

You can imagine overfitting as an athlete who excels at running on a specific track with consistent conditions.

This athlete can achieve outstanding times as long as the track and weather conditions remain the same. However, when asked to run on a different track with varying conditions, their performance drops significantly because they haven't adapted to diverse scenarios, such as heat or cold, or traction on the track.

On the other hand, underfitting is like an athlete who performs poorly regardless of the track or conditions. They haven't trained adequately, so they can't achieve good results in any situation.

Or in even simpler terms: one is great as long as the conditions are ideal, while the other performs poorly regardless of conditions.

The good news is that our model isn't underfitting. In fact, it's performing at ~80% accuracy on unseen data. Combined with the gap between the training and validation loss curves, this points to overfitting rather than underfitting.

Now, there are a lot of different ways to fix overfitting. But one of the best ways is to use more data, and guess what - we've got plenty more!

(Remember, these results were achieved using only 10% of the training data).

However, before we train a model with more data, there's another way to quickly evaluate our model on a given dataset just to confirm these results, and that's by using the tf.keras.Model.evaluate() method.

So how about we try it on our model_0?

Using the TensorFlow Keras evaluate() method

We'll save the outputs to a model_0_results variable so we can use them later.


# Evaluate model_0 on the test data
model_0_results = model_0.evaluate(x=test_ds)

269/269 [==============================] - 13s 47ms/step - loss: 0.8792 - accuracy: 0.8107


[0.8792150616645813, 0.8107225894927979]

As you can see, evaluating our model still shows it's performing at ~80% accuracy despite only seeing 10% of the training data.

We can also get the metrics used by our model with the metrics_names attribute.


# Get our model's metrics names
model_0.metrics_names


['loss', 'accuracy']
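Since evaluate() returns the values in the same order as metrics_names, the two can be paired up to make the results easier to read later on. Here's a small sketch using the numbers shown above, copied by hand into plain Python lists (the variable names are just for illustration):

```python
# Values copied from the metrics_names and evaluate() outputs shown above
example_metric_names = ["loss", "accuracy"]
example_results = [0.8792150616645813, 0.8107225894927979]

# Pair each metric name with its value
example_results_dict = dict(zip(example_metric_names, example_results))

for name, value in example_results_dict.items():
    print(f"{name}: {value:.4f}")
```

In the notebook itself, the same pattern should work with model_0.metrics_names and model_0_results.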

Model 1: How to train a model on 100% of the training data

Time to step it up a notch.

We've trained a model on 10% of the training data to see if it works and it did, so now let's train a model on 100% of the training data and see what happens.

But before we do, what do you think will happen?

If our model was able to perform well on only 10% of the data, how do you think it will go on 100% of the data?

These types of questions are good to think about in the world of machine learning. After all, that's why the machine learner's motto is experiment, experiment, experiment!

So let's follow our three steps from before:

  1. Create a model using our create_model() function
  2. Compile our model (selecting our optimizer, loss function, and evaluation metric)
  3. Fit our model (this time on 100% of the data for 5 epochs)

Note: Fitting our model on such a large amount of data will take a long time without a GPU. But, if you're using Google Colab, you can access a GPU via Runtime -> Change runtime type -> Hardware accelerator -> GPU. (See Part 2 again for a full walkthrough of this).

So let's get training!


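# 1. Create model_1 (the next iteration of model_0)
model_1 = create_model(num_classes=len(class_names))

# 2. Compile model (assumed settings: reuse the ones from model_0 in Part 2)
model_1.compile(optimizer=tf.keras.optimizers.Adam(),
                loss="categorical_crossentropy",
                metrics=["accuracy"])

# 3. Fit model on 100% of the training data (train_ds is an assumed variable name)
history_1 = model_1.fit(x=train_ds,
                        epochs=5,
                        validation_data=test_ds)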


Epoch 1/5
375/375 [==============================] - 43s 84ms/step - loss: 1.2725 - accuracy: 0.7607 - val_loss: 0.4849 - val_accuracy: 0.8756
Epoch 2/5
375/375 [==============================] - 30s 80ms/step - loss: 0.3667 - accuracy: 0.9013 - val_loss: 0.4041 - val_accuracy: 0.8770
Epoch 3/5
375/375 [==============================] - 30s 79ms/step - loss: 0.2641 - accuracy: 0.9287 - val_loss: 0.3731 - val_accuracy: 0.8832
Epoch 4/5
375/375 [==============================] - 30s 80ms/step - loss: 0.2043 - accuracy: 0.9483 - val_loss: 0.3708 - val_accuracy: 0.8819
Epoch 5/5
375/375 [==============================] - 30s 80ms/step - loss: 0.1606 - accuracy: 0.9633 - val_loss: 0.3753 - val_accuracy: 0.8767

Woah! It looks like all that extra data helped our model quite a bit. It's now performing at ~88% accuracy on the test set.

The question now, of course, is how many epochs should I fit for? Well, how about we evaluate our model_1?

How to evaluate our Model 1 on the test data

Let's start by plotting loss curves first with the data contained within history_1.


# Plot model_1 loss curves
plot_model_loss_curves(history=history_1)


evaluate model 1 on our test data

Hmm, looks like our model performed well. However, the validation accuracy and loss seemed to flatten out, whereas the training accuracy and loss seemed to keep improving.

Once again, this is a sign of overfitting (i.e. the model is performing much better on the training set than on the validation/test set). However, since our model looks to be performing quite well I'll leave this overfitting problem as a research project for later.

For now, let's evaluate our model on the test dataset using the evaluate() method.


# Evaluate model_1
model_1_results = model_1.evaluate(test_ds)


269/269 [==============================] - 12s 46ms/step - loss: 0.3753 - accuracy: 0.8767


Looks like that extra data boosted our model's performance from ~81% on the test set to ~88% on the test set. (Exact numbers here may vary due to the inherent randomness in machine learning models).
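Coming back to the earlier question of how many epochs to fit for, the training log can give us a hint. Here's a minimal sketch that uses the per-epoch loss values copied by hand from the model_1 training output above, and checks how the gap between training and validation loss changes and where validation loss was lowest.

```python
# Per-epoch values copied from the model_1 training log above
train_loss = [1.2725, 0.3667, 0.2641, 0.2043, 0.1606]
val_loss = [0.4849, 0.4041, 0.3731, 0.3708, 0.3753]

# Gap between validation and training loss (a growing gap hints at overfitting)
loss_gaps = [round(v - t, 4) for t, v in zip(train_loss, val_loss)]

# Epoch (counted from 1) where validation loss was lowest
best_epoch = val_loss.index(min(val_loss)) + 1

print(f"Loss gaps per epoch: {loss_gaps}")
print(f"Lowest validation loss: {min(val_loss)} at epoch {best_epoch}")
```

In this run, the validation loss bottoms out at epoch 4 while the training loss keeps falling, so training for many more epochs probably wouldn't help much without other changes. If you'd like to automate this check, the tf.keras.callbacks.EarlyStopping callback is one option worth looking into.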

How to make and evaluate predictions of the best model

Now that we've trained a model, it's time to make predictions with it.

Because that's the whole goal of machine learning: train a model on existing data to make predictions on new data.

Our test data is supposed to simulate new data, data our model has never seen before.

We can make predictions with the tf.keras.Model.predict() method, passing it our test_ds (short for test dataset) variable.


# This will output logits (as long as softmax activation isn't in the model)
test_preds = model_1.predict(test_ds)

# Note: If not using activation="softmax" in last layer of model, may need to turn them into prediction probabilities (easier to understand)
# test_preds = tf.keras.activations.softmax(tf.constant(test_preds), axis=-1)


269/269 [==============================] - 13s 44ms/step

So now let's inspect our test_preds by first checking its shape.

# Check the shape of the test predictions
test_preds.shape


(8580, 120)

Okay, looks like our test_preds variable contains 8580 rows (one for each test sample), each with 120 elements (one value for each dog class).

Let's inspect a single test prediction and see what it looks like.


import random

# Get a random index from across all of the test samples
random_test_index = random.randint(0, test_preds.shape[0] - 1)
print(f"[INFO] Random test index: {random_test_index}")

# Inspect a single test prediction sample
random_test_pred_sample = test_preds[random_test_index]

print(f"[INFO] Random test pred sample shape: {random_test_pred_sample.shape}")
print(f"[INFO] Random test pred sample argmax: {tf.argmax(random_test_pred_sample)}")
print(f"[INFO] Random test pred sample label: {dog_names[tf.argmax(random_test_pred_sample)]}")
print(f"[INFO] Random test pred sample max prediction probability: {tf.reduce_max(random_test_pred_sample)}")
print(f"[INFO] Random test pred sample prediction probability values:\n{random_test_pred_sample}")


[INFO] Random test index: 1824
[INFO] Random test pred sample shape: (120,)
[INFO] Random test pred sample argmax: 24
[INFO] Random test pred sample label: brittany_spaniel
[INFO] Random test pred sample max prediction probability: 0.9248308539390564
[INFO] Random test pred sample prediction probability values:
[3.0155065e-06 4.2946940e-05 3.2878995e-06 3.1306336e-05 1.7298260e-06
 1.3368123e-05 2.8498230e-06 6.8758955e-06 2.6828552e-06 4.6089318e-04
 9.8374185e-06 1.9263330e-06 7.6487186e-07 6.1217276e-04 1.2198443e-06
 5.9309714e-06 2.4797799e-05 2.5847612e-06 4.9912862e-05 3.1809162e-07
 1.0326848e-06 2.7293386e-06 2.1035332e-06 5.2793930e-06 9.2483085e-01
 2.6070888e-06 1.6410323e-06 1.4008251e-06 2.0515323e-05 2.1309786e-05
 1.4602327e-06 3.8456672e-04 7.4974610e-05 4.4831428e-05 5.5091264e-06
 2.1345174e-07 2.9732748e-06 5.5520386e-06 8.7954652e-07 1.6277906e-03
 5.3978354e-02 9.6090174e-05 9.6672220e-06 4.4037843e-06 2.5557700e-05
 6.3994042e-07 1.6738920e-06 4.6715216e-04 4.1448075e-06 6.4118845e-05
 2.0398900e-06 3.6135450e-06 4.4963690e-05 2.8406910e-05 3.4689847e-07
 6.2964758e-04 9.1336078e-05 5.2363583e-05 1.2731762e-06 2.4212743e-06
 1.5872080e-06 6.3476455e-06 6.2880179e-07 6.6757898e-06 1.6635622e-06
 4.3550008e-07 2.3698403e-05 1.4149221e-05 3.8156581e-05 1.0464001e-05
 5.0107906e-06 1.7395665e-06 2.8848885e-07 4.2622072e-05 3.2712339e-07
 1.8591476e-07 2.2874669e-05 7.9814470e-07 2.3121322e-05 1.6275973e-06
 4.6186727e-07 7.6188849e-07 3.2468931e-06 3.1449999e-05 2.9600946e-05
 3.8992380e-06 2.8564186e-06 4.1459539e-06 6.0877244e-07 2.5443229e-05
 5.4467969e-06 5.4184858e-07 2.8361776e-04 9.0548929e-05 8.8840829e-07
 9.1714105e-07 1.9990568e-07 1.7958368e-05 7.7042150e-06 2.4126435e-05
 1.9759838e-05 8.2941342e-06 2.5857928e-05 6.1904398e-06 1.4601937e-06
 1.5800337e-05 6.0928446e-06 5.0209674e-05 1.4067524e-05 2.3544631e-05
 1.4134421e-06 9.8844721e-05 9.1535941e-05 2.4448002e-03 5.8540131e-06
 1.2547853e-02 1.3779800e-05 8.0164841e-07 2.5093528e-05 3.7180773e-05]

Okay, looks like each individual sample of our test predictions is a tensor of prediction probabilities.

What does that mean?

Well, in essence, each element is a probability between 0 and 1 representing how confident our model is that the sample belongs to the corresponding class.

  • A prediction probability of 1 means the model is 100% confident the given sample belongs to that class
  • While a prediction probability of 0 means the model isn't assigning any value to that class at all
  • And then all the other values fill in between.

Note: Just because a model's prediction probability for a particular sample is closer to 1 on a certain class (e.g. 0.9999) that doesn't mean that it’s correct.

A prediction can have a high probability but still be incorrect, which we’ll see later on, when I add my own face into the images.
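If you're wondering where these prediction probabilities come from, they're typically the result of a softmax function applied to the model's raw outputs (logits). Here's a minimal NumPy sketch of the idea. The logits are made up, and this is an illustration of the math rather than the exact code TensorFlow runs internally.

```python
import numpy as np


def softmax(logits):
    """Convert raw model outputs (logits) to probabilities that sum to 1."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)


# Hypothetical logits for a single sample over 4 classes (not 120, to keep it readable)
example_logits = np.array([2.0, 0.5, -1.0, 4.0])
example_probs = softmax(example_logits)

print(example_probs)
print(f"Sum of probabilities: {example_probs.sum():.4f}")
print(f"Most likely class index: {example_probs.argmax()}")
```

Notice that every value lands between 0 and 1 and that the values for a single sample add up to 1, which is why a high value on one class forces the others to be small.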

The maximum value of our prediction probabilities tensor is what the model considers the most likely prediction given the specific sample.

So, we can take the index of the maximum value (using tf.argmax) and index on the list of dog names to get the predicted class name.

Note: tf.argmax or "argmax" for short gets the index of where the maximum value occurs in a tensor along a specified dimension.

We can use tf.reduce_max to get the maximum value itself.
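To make that concrete, here's a tiny sketch using NumPy's np.argmax and np.max (which behave like tf.argmax and tf.reduce_max for this purpose). The class names and probabilities are hypothetical and cut down to four classes so they're easy to read.

```python
import numpy as np

# Hypothetical class names and prediction probabilities for one sample
example_class_names = ["affenpinscher", "beagle", "boxer", "pug"]
example_pred_probs = np.array([0.05, 0.10, 0.80, 0.05])

pred_index = int(np.argmax(example_pred_probs))  # index of the highest probability
pred_prob = float(np.max(example_pred_probs))    # the highest probability itself
pred_class = example_class_names[pred_index]     # look up the class name

print(f"Predicted: {pred_class} (index {pred_index}) with probability {pred_prob:.2f}")
```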

To make our predictions easier to compare to the test dataset, let's unbundle our test_ds object into two separate arrays called test_ds_images and test_ds_labels.

We can do this by looping through the samples in our test_ds object and appending each to a list (we'll do this with a list comprehension).

Then we can join those lists together into an array with np.concatenate, like so:


import numpy as np

# Extract test images and labels from test_ds
test_ds_images = np.concatenate([images for images, labels in test_ds], axis=0)
test_ds_labels = np.concatenate([labels for images, labels in test_ds], axis=0)

# How many images and labels do we have?
len(test_ds_images), len(test_ds_labels)


(8580, 8580)

Perfect! Now we've got a way to compare our predictions on a given image (in test_ds_images) to its appropriate label in test_ds_labels.

This is one of the main reasons we didn't shuffle the test dataset, because now our predictions tensor has the same indexes as our test_ds_images and test_ds_labels arrays.

This means that if we chose to compare sample number 42, everything would (or at least should) line up.
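If the unbundling step feels a bit abstract, here's a toy version of it. This sketch assumes a made-up "dataset" of three small batches (standing in for test_ds), joins them with np.concatenate, and checks that a given index still points at the sample we expect.

```python
import numpy as np

# Hypothetical "dataset": 3 batches of (images, labels), the last batch is smaller
example_batches = [
    (np.zeros((4, 8, 8, 3)), np.eye(5)[[0, 1, 2, 3]]),
    (np.ones((4, 8, 8, 3)), np.eye(5)[[4, 0, 1, 2]]),
    (np.full((2, 8, 8, 3), 2.0), np.eye(5)[[3, 4]]),
]

# Join the batches along the first (sample) axis
all_images = np.concatenate([images for images, labels in example_batches], axis=0)
all_labels = np.concatenate([labels for images, labels in example_batches], axis=0)

print(all_images.shape, all_labels.shape)

# Index 9 is the second sample of the third batch (pixel value 2.0, class index 4)
print(all_images[9, 0, 0, 0], all_labels[9].argmax())
```

Because the batches are joined in order, index 9 of the images and index 9 of the labels refer to the same sample. That would no longer be guaranteed if the dataset were reshuffled between passes.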

In fact, let's try just that.


# Set target index
target_index = 42 # try changing this to another value and seeing how the model performs on other samples

# Get test image
test_image = test_ds_images[target_index]

# Get truth label (index of max in test label)
test_image_truth_label = class_names[tf.argmax(test_ds_labels[target_index])]

# Get prediction probabilities
test_image_pred_probs = test_preds[target_index]

# Get index of class with highest prediction probability
test_image_pred_class = class_names[tf.argmax(test_image_pred_probs)]

# Plot the image (cast to uint8, assuming pixel values are in the range 0-255)
plt.figure(figsize=(5, 4))
plt.imshow(test_image.astype("uint8"))

# Create sample title with prediction probability value
title = f"""True: {test_image_truth_label}
Pred: {test_image_pred_class}
Prob: {np.max(test_image_pred_probs):.2f}"""

# Colour the title based on correctness of pred
plt.title(title,
          color="green" if test_image_truth_label == test_image_pred_class else "red")
plt.axis("off")


Woohoo! Looks like our model got the prediction right. According to the test data, sample number 42 is in fact an Affenpinscher.

So, our model is working for sample 42 at least, but let’s check some others. In fact, let’s write some code to test a number of different samples at a time.
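Beyond checking samples one at a time, the same "compare the argmax of the prediction to the argmax of the label" idea can be applied to every sample at once. Here's a minimal sketch with made-up predictions and one-hot labels for 5 samples and 3 classes:

```python
import numpy as np

# Hypothetical prediction probabilities (5 samples, 3 classes) and one-hot truth labels
example_preds = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.3, 0.3, 0.4],
                          [0.6, 0.3, 0.1],
                          [0.2, 0.5, 0.3]])
example_labels = np.array([[1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1],
                           [0, 1, 0],
                           [0, 1, 0]])

pred_classes = example_preds.argmax(axis=-1)   # predicted class index per sample
true_classes = example_labels.argmax(axis=-1)  # true class index per sample
correct = pred_classes == true_classes         # True where the prediction matches
accuracy = correct.mean()

print(f"Correct: {correct}")
print(f"Accuracy: {accuracy:.2f}")
```

With the real arrays, you'd swap in test_preds and test_ds_labels for the example values.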

How to visualize predictions from our best-trained model

Alright, let's check multiple images at the same time to see if the model is correct.


# Choose a random 10 indexes from the test data and compare the values
import random

random.seed(42) # try changing the random seed or commenting it out for different values
random_indexes = random.sample(range(len(test_ds_images)), 10)

# Create a plot with multiple subplots
fig, axes = plt.subplots(2, 5, figsize=(15, 7))

# Loop through the axes of the plot
for i, ax in enumerate(axes.flatten()):
  target_index = random_indexes[i] # get a random index (this is another reason we didn't shuffle the test set)

  # Get relevant target image, label, prediction and prediction probabilities
  test_image = test_ds_images[target_index]
  test_image_truth_label = class_names[tf.argmax(test_ds_labels[target_index])]
  test_image_pred_probs = test_preds[target_index]
  test_image_pred_class = class_names[tf.argmax(test_image_pred_probs)]

  # Plot the image (cast to uint8, assuming pixel values are in the range 0-255)
  ax.imshow(test_image.astype("uint8"))

  # Create sample title
  title = f"""True: {test_image_truth_label}
  Pred: {test_image_pred_class}
  Prob: {np.max(test_image_pred_probs):.2f}"""

  # Colour the title based on correctness of pred
  ax.set_title(title,
               color="green" if test_image_truth_label == test_image_pred_class else "red")
  ax.axis("off")


Visualizing predictions from our best trained model

Looks like our model does quite well, but some breeds don’t seem to be predicted as accurately as others, so let’s look into that.

How to find the accuracy per class

Our model's overall accuracy is ~88%, which is an outstanding result, but what about the accuracy per class?

As in:

  • How did the boxer class perform?
  • Or the australian_terrier?
  • Heck, how accurate is it compared to the original dataset?

If we take a look on the original Stanford Dogs Dataset website, the authors reported the accuracy per class of each of the dog breeds.

Their best-performing class was the african_hunting_dog, which achieved close to 60% accuracy (about ~58% if I'm reading the graph correctly).

stanford dogs original test

How about we try and replicate the same plot with our own results? Then we can see the accuracy of each dog breed, as well as compare it to the original.

So, first of all, let's create a DataFrame with information about our test predictions and test samples.

We'll start by:

  • Getting the argmax of the test predictions as well as the test labels
  • Then we'll get the maximum prediction probabilities for each sample
  • And then we'll put it all into a DataFrame!

Like so:


import pandas as pd

# Get argmax labels of test predictions and test ground truth
test_preds_labels = test_preds.argmax(axis=-1)
test_ds_labels_argmax = test_ds_labels.argmax(axis=-1)

# Get highest prediction probability of test predictions
test_pred_probs_max = tf.reduce_max(test_preds, axis=-1).numpy() # extract NumPy since pandas doesn't handle TensorFlow Tensors

# Create DataFrame of test results
test_results_df = pd.DataFrame({"test_pred_label": test_preds_labels,
                                "test_pred_prob": test_pred_probs_max,
                                "test_pred_class_name": [class_names[test_pred_label] for test_pred_label in test_preds_labels],
                                "test_truth_label": test_ds_labels_argmax,
                                "test_truth_class_name": [class_names[test_truth_label] for test_truth_label in test_ds_labels_argmax]})

# Create a column whether or not the prediction matches the label
test_results_df["correct"] = test_results_df["test_pred_class_name"] == test_results_df["test_truth_class_name"]

# View the first few rows
test_results_df.head()



dog accuracy dataframe

Now that we have our DataFrame we can perform some further analysis, such as getting the accuracy per class.

We can do so by grouping the test_results_df via the "test_truth_class_name" column and then taking the mean of the "correct" column.

We can then create a new DataFrame based on this view and sort the values by correctness (e.g. the classes with the highest performance should be up the top).


# Calculate accuracy per class
accuracy_per_class = test_results_df.groupby("test_truth_class_name")["correct"].mean()

# Create new DataFrame to sort classes by accuracy
accuracy_per_class_df = pd.DataFrame(accuracy_per_class).reset_index().sort_values("correct", ascending=False)


accuracy of dog dataframe

Woah! Looks like we've got a fair few dog classes that are 100% accurate or close to it.

That's outstanding!
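If the groupby-then-mean step feels like magic, here's roughly what it's doing under the hood, sketched in plain Python. The (true class, predicted class) pairs below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (true class, predicted class) pairs
example_results = [
    ("boxer", "boxer"), ("boxer", "bull_mastiff"), ("boxer", "boxer"),
    ("pug", "pug"), ("pug", "pug"),
    ("beagle", "basset"), ("beagle", "beagle"),
]

totals = defaultdict(int)    # number of samples per true class
corrects = defaultdict(int)  # number of correct predictions per true class
for true_class, pred_class in example_results:
    totals[true_class] += 1
    corrects[true_class] += int(true_class == pred_class)

# Accuracy per class = correct predictions / total samples for that class
accuracy_by_class = {name: corrects[name] / totals[name] for name in totals}

# Sort so the best-performing classes come first
sorted_accuracy = sorted(accuracy_by_class.items(), key=lambda item: item[1], reverse=True)
for name, acc in sorted_accuracy:
    print(f"{name}: {acc:.2f}")
```

Taking the mean of a True/False column works the same way, since True counts as 1 and False counts as 0.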

Now let's recreate the horizontal bar plot used on the original Stanford Dogs research paper page.


# Let's create a horizontal bar chart to replicate a similar plot to the original Stanford Dogs page
plt.figure(figsize=(10, 17))
plt.barh(y=accuracy_per_class_df["test_truth_class_name"],
         width=accuracy_per_class_df["correct"])
plt.xlabel("Accuracy")
plt.ylabel("Class Name")
plt.title("Dog Vision Accuracy per Class")
plt.ylim(-0.5, len(accuracy_per_class_df["test_truth_class_name"]) - 0.5)  # Adjust y-axis limits to reduce white space
plt.gca().invert_yaxis()  # This will display the first class at the top


dog vision accuracy per class

It looks like our model performs incredibly well across the vast majority of all dog classes.

In fact, when we compare it to the original Stanford Dogs horizontal bar graph, we can see that their best-performing class got close to 60% accuracy. In our case, it's only when we look at our worst-performing classes that we see a handful of classes with just under 60% accuracy.

Not bad at all!


# Inspecting our worst performing classes (note how only a couple of classes perform at ~55% accuracy or below)
accuracy_per_class_df.tail()


dog test accuracy inspection

What an awesome result! We've now replicated and even vastly improved on the results of a Stanford research paper.

So now that we've seen how well our model performs, how about we check where it performed poorly, and try to figure out why.

How to find the most wrong examples

A great way to inspect your model's errors is to find the examples where the prediction had a high probability but the prediction was wrong.

These are often called the "most wrong" samples. The model was very confident in its prediction, but was wrong.

So, let's filter for the top 100 most wrong by sorting the incorrect predictions by the "test_pred_prob" column.


# Get most wrong
top_100_most_wrong = test_results_df[test_results_df["correct"] == 0].sort_values("test_pred_prob", ascending=False)[:100]


finding the most wrong samples
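The same filter-then-sort idea can be sketched without pandas too. Here's a small NumPy version with made-up probabilities and correctness flags for 6 samples, just to show the logic:

```python
import numpy as np

# Hypothetical max prediction probabilities and correctness flags for 6 samples
example_pred_probs = np.array([0.99, 0.95, 0.60, 0.91, 0.55, 0.97])
example_correct = np.array([True, False, False, True, False, False])

# 1. Keep only the indexes of the wrong predictions
wrong_indexes = np.where(~example_correct)[0]

# 2. Sort those indexes by prediction probability, highest first
order = np.argsort(example_pred_probs[wrong_indexes])[::-1]
most_wrong_indexes = wrong_indexes[order]

print(f"Most wrong sample indexes: {most_wrong_indexes}")
print(f"Their probabilities: {example_pred_probs[most_wrong_indexes]}")
```

Samples 0 and 3 have high probabilities but are left out because they were correct. We only care about the confident mistakes.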

One way to inspect these most wrong predictions would be to go through the different breeds one by one and see why the model would've confused them, such as comparing miniature_pinscher to doberman (two quite similar-looking dog breeds).

That’s a lot of manual work, so instead, let’s get a random 10 samples and plot them to see what they look like.


# Get 10 random indexes of "most wrong" predictions
top_100_most_wrong.sample(n=10).index


Index([2001, 1715, 8112, 1642, 5480, 6383, 7363, 4155, 7895, 4105], dtype='int64')

How about we plot these indexes?


# Choose a random 10 indexes from the "most wrong" predictions and compare the values
import random

random_most_wrong_indexes = top_100_most_wrong.sample(n=10).index

# Iterate through test results and plot them
# Note: This is why we don't shuffle the test data, so that it's in original order when we evaluate it.
fig, axes = plt.subplots(2, 5, figsize=(15, 7))
for i, ax in enumerate(axes.flatten()):
  target_index = random_most_wrong_indexes[i]

  # Get relevant target image, label, prediction and prediction probabilities
  test_image = test_ds_images[target_index]
  test_image_truth_label = class_names[tf.argmax(test_ds_labels[target_index])]
  test_image_pred_probs = test_preds[target_index]
  test_image_pred_class = class_names[tf.argmax(test_image_pred_probs)]

  # Plot the image (cast to uint8, assuming pixel values are in the range 0-255)
  ax.imshow(test_image.astype("uint8"))

  # Create sample title
  title = f"""True: {test_image_truth_label}
  Pred: {test_image_pred_class}
  Prob: {np.max(test_image_pred_probs):.2f}"""

  # Colour the title based on correctness of pred
  ax.set_title(title,
               color="green" if test_image_truth_label == test_image_pred_class else "red")
  ax.axis("off")


Finding the most wrong examples

Inspecting the "most wrong" examples, it's easy to see where the model got confused. Some of these breeds look very similar - with some of them being miniature versions.

These samples can show us where we might want to collect more data or correct our data's labels.

Before that though, how about we make a confusion matrix for further evaluation?

How to create a confusion matrix

A confusion matrix helps to visualize the performance of a classification algorithm by comparing the predicted classes to the actual classes (truth vs. predictions).

We can create one using Scikit-Learn's sklearn.metrics.confusion_matrix by passing in our y_true and y_pred values.

Then we can display it using sklearn.metrics.ConfusionMatrixDisplay.

Note: Because we have 120 different classes, running the code below to show the confusion matrix plot may take a minute or so to load, as it's quite a big plot. So be warned!


from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Create a confusion matrix
confusion_matrix_dog_preds = confusion_matrix(y_true=test_ds_labels_argmax, # requires all labels to be in same format (e.g. not one-hot)
                                              y_pred=test_preds_labels)

# Create a confusion matrix plot
confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix_dog_preds,
                                                  display_labels=class_names)
fig, ax = plt.subplots(figsize=(25, 25))
confusion_matrix_display.plot(ax=ax, cmap="Blues", xticks_rotation="vertical")
ax.set_title("Dog Vision Confusion Matrix")


Create a confusion matrix

Now that's one big confusion matrix!

It looks like most of the darker blue boxes are down the middle diagonal (where we'd like them to be).

But there are a few instances where the model confuses classes such as scottish_deerhound and irish_wolfhound.

And looking up those two breeds we can see that they look visually similar, and are actually a common source of confusion.

dog confusion

Honestly, if it wasn’t for the height difference, I would think this was the same dog with photos from different angles!
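With 120 classes, spotting confused pairs by eye can be tricky, so here's a sketch of how you could find the most confused pair programmatically. It builds a small confusion matrix by hand with NumPy (rows are truth, columns are predictions, the same layout Scikit-Learn uses). The labels and predictions are made up, and only three classes are used to keep it readable.

```python
import numpy as np

# Hypothetical class names, truth labels and predictions (as class indexes)
example_class_names = ["scottish_deerhound", "irish_wolfhound", "pug"]
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 0, 1, 2, 2])

# Build the confusion matrix: rows = truth, columns = prediction
num_classes = len(example_class_names)
cm = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

# Zero out the diagonal (correct predictions) so only the mistakes remain
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Find the (truth, prediction) pair with the most mistakes
true_idx, pred_idx = np.unravel_index(errors.argmax(), errors.shape)

print(cm)
print(f"Most confused: {example_class_names[true_idx]} predicted as "
      f"{example_class_names[pred_idx]} ({errors[true_idx, pred_idx]} times)")
```

With the real results, the same approach should work on the matrix returned by confusion_matrix().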

How to save and load the best model

We've covered a lot of ground from loading data to training and evaluating a model. But what if you wanted to use that model somewhere else, such as on a website or in an application?

The first step is saving it to a file.

We can save our model using the tf.keras.Model.save() method and then specify the filepath as well as the save_format parameters.

We'll use filepath="dog_vision_model.keras" as well as save_format="keras" to save our model to the new and versatile .keras format.

Let's save our best performing model_1.

Note: You may also see models being saved with the SavedModel format as well as HDF5 formats. However, it's recommended to use the newer .keras format. See the TensorFlow documentation on saving and loading a model for more.


# Save the model to .keras
model_save_path = "dog_vision_model.keras"
model_1.save(filepath=model_save_path,
             save_format="keras")
print("Model saved!")


Model saved!

And we can check it worked by loading it back in using the tf.keras.models.load_model() method.


# Load the model
loaded_model = tf.keras.models.load_model(filepath=model_save_path)

With the model loaded, we can now evaluate our loaded_model to make sure it performs well on the test dataset.


# Evaluate the loaded model
loaded_model_results = loaded_model.evaluate(test_ds)


269/269 [==============================] - 15s 47ms/step - loss: 0.3753 - accuracy: 0.8767

How about we check if the loaded_model_results are the same as the model_1_results?


assert model_1_results == loaded_model_results

Our trained model and loaded model results are the same!
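One thing to keep in mind is that comparing floating point numbers with == can be fragile. If a model is reloaded on different hardware or a different library version, the results could differ by a tiny amount and the assert would fail even though the model is effectively the same. Here's a sketch of a more forgiving comparison using math.isclose. The reloaded values are hypothetical, with a tiny difference added on purpose, and the tolerance is an arbitrary choice.

```python
import math

# Values as displayed in the evaluate() outputs above
original_results = [0.3753, 0.8767]
# Hypothetical reloaded results with a tiny floating point difference
reloaded_results = [0.3753000001, 0.8767]

# Exact comparison fails on the tiny difference
exact_match = original_results == reloaded_results

# Tolerant comparison treats values within a small relative tolerance as equal
close_match = all(math.isclose(a, b, rel_tol=1e-6)
                  for a, b in zip(original_results, reloaded_results))

print(f"Exact match: {exact_match}")
print(f"Close match: {close_match}")
```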

We could now use our dog_vision_model.keras file in an application to predict a dog breed based on an image.

Note: If you're using Google Colab, remember that if your Google Colab instance gets disconnected, it will delete all local files.

So if you want to keep your dog_vision_model.keras, be sure to download it or copy it to Google Drive.

How to make predictions on custom images with the best model

So how about we see how our model goes on real-world images? After all, that's the whole goal of machine learning, right? To see how your model goes in the real world.

So let’s make that happen.

More specifically, let's try our best model on images of my dogs (Bella 🐶 and Seven 7️⃣, yes, Seven is her actual name) and an extra wildcard image of me!

You can download the photos from my GitHub here.


# Download a set of custom images from GitHub and unzip them
!wget -nc


--2024-04-26 01:43:26--
Resolving (
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: [following]
--2024-04-26 01:43:26--
Resolving (,,, ...
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1091355 (1.0M) [application/zip]
Saving to: ‘’      100%[===================>]   1.04M  --.-KB/s    in 0.05s   

2024-04-26 01:43:27 (21.6 MB/s) - ‘’ saved [1091355/1091355]

  inflating: dog-photo-4.jpeg        
  inflating: dog-photo-1.jpeg        
  inflating: dog-photo-2.jpeg        
  inflating: dog-photo-3.jpeg 

We can also inspect our images in the file browser and see that they're under the name dog-photo-*.jpeg.
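If you'd rather not type the file names by hand, the same pattern can be used to collect them in code. This is a small sketch using Python's pathlib. The find_custom_images helper is hypothetical and assumes the images were unzipped into the current working directory.

```python
from pathlib import Path

def find_custom_images(directory=".", pattern="dog-photo-*.jpeg"):
    """Return a sorted list of file paths (as strings) matching the pattern."""
    return sorted(str(path) for path in Path(directory).glob(pattern))

custom_image_paths_example = find_custom_images()
print(custom_image_paths_example)
```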

How about we iterate through them and visualize each one?


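# Create list of paths for custom dog images
custom_image_paths = ["dog-photo-1.jpeg", "dog-photo-2.jpeg",
                      "dog-photo-3.jpeg", "dog-photo-4.jpeg"]

# Iterate through list of dog images and plot each one
fig, axes = plt.subplots(1, 4, figsize=(15, 7))
for i, ax in enumerate(axes.flatten()):
  ax.imshow(plt.imread(custom_image_paths[i])) # read and show the image
  ax.axis("off")
  ax.set_title(custom_image_paths[i])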


using custom images


The first three photos look well and good but we can see dog-photo-4.jpeg is a photo of me in a black hoodie pulling a blue steel face.

Why include a non dog photo?

I’ll tell you why in just a second. For now, let's use our loaded_model to try and make a prediction on the first dog image, dog-photo-1.jpeg.

We can do so with the predict() method.


# Try and make a prediction on the first dog image
loaded_model.predict("dog-photo-1.jpeg")


IndexError                                Traceback (most recent call last)
<ipython-input-129-336b90293288> in <cell line: 2>()
      1 # Try and make a prediction on the first dog image
----> 2 loaded_model.predict("dog-photo-1.jpeg")

/usr/local/lib/python3.10/dist-packages/keras/src/utils/ in error_handler(*args, **kwargs)
     68             # To get the full stack trace, call:
     69             # `tf.debugging.disable_traceback_filtering()`
---> 70             raise e.with_traceback(filtered_tb) from None
     71         finally:
     72             del filtered_tb

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ in __getitem__(self, key)
    960       else:
    961         if self._v2_behavior:
--> 962           return self._dims[key]
    963         else:
    964           return self.dims[key]

IndexError: tuple index out of range

Oh no, we get an error:

IndexError: tuple index out of range

Why is this?

Well, we can see that the code is trying to get the shape of our image.

However, we didn't pass an image to the predict() method. We only passed a filepath, and our model expects inputs in the same format it was trained on - hence the issue.

So let's load our image and resize it.

We can do so with tf.keras.utils.load_img().


# Load the image (into PIL format)
custom_image = tf.keras.utils.load_img(
  path="dog-photo-1.jpeg",
  color_mode="rgb",
  target_size=IMG_SIZE, # (224, 224) or (img_height, img_width)
)

type(custom_image), custom_image


(PIL.Image.Image, <PIL.Image.Image image mode=RGB size=224x224>)

Excellent, we've loaded our first custom image.

But now let's turn our image into a tensor, as our model was trained on image tensors, so it expects image tensors as input.

We can convert our image from PIL format to array format with tf.keras.utils.img_to_array().


# Turn the image into a tensor
custom_image_tensor = tf.keras.utils.img_to_array(custom_image)

custom_image_tensor.shape


(224, 224, 3)

Nice! We've got an image tensor of shape (224, 224, 3).

So how about we make a prediction on it?


# Make a prediction on our custom image tensor
loaded_model.predict(custom_image_tensor)


ValueError                                Traceback (most recent call last)
<ipython-input-132-bd82d1e41fed> in <cell line: 2>()
      1 # Make a prediction on our custom image tensor
----> 2 loaded_model.predict(custom_image_tensor)

/usr/local/lib/python3.10/dist-packages/keras/src/utils/ in error_handler(*args, **kwargs)
     68             # To get the full stack trace, call:
     69             # `tf.debugging.disable_traceback_filtering()`
---> 70             raise e.with_traceback(filtered_tb) from None
     71         finally:
     72             del filtered_tb

/usr/local/lib/python3.10/dist-packages/keras/src/engine/ in tf__predict_function(iterator)
     13                 try:
     14                     do_return = True
---> 15                     retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
     16                 except:
     17                     do_return = False

ValueError: in user code:

    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/", line 2440, in predict_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/", line 2425, in step_function  **
        outputs =, args=(data,))
    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/", line 2413, in run_step  **
        outputs = model.predict_step(data)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/", line 2381, in predict_step
        return self(x, training=False)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/", line 298, in assert_input_compatibility
        raise ValueError(

    ValueError: Input 0 of layer "model_1" is incompatible with the layer: expected shape=(None, 224, 224, 3), found shape=(32, 224, 3)

We get another error…

ValueError: Input 0 of layer "model_1" is incompatible with the layer: expected shape=(None, 224, 224, 3), found shape=(32, 224, 3)

So what went wrong?

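Well, it looks like our model also expects a batch size dimension on our input tensor. Without one, predict() treats the first axis of our (224, 224, 3) image as the sample axis and slices it into batches of 32 rows (the default batch_size), which is why the error reports a shape of (32, 224, 3).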

We can get this by either turning the input tensor into a single element array or by using tf.expand_dims(input, axis=0) to expand the dimension of the tensor on the 0th axis.


# Option 1: Add batch dimension by wrapping custom_image_tensor in an array
print(f"Shape of custom image tensor: {np.array([custom_image_tensor]).shape}")

# Option 2: Add batch dimension with tf.expand_dims()
print(f"Shape of custom image tensor: {tf.expand_dims(custom_image_tensor, axis=0).shape}")


Shape of custom image tensor: (1, 224, 224, 3)
Shape of custom image tensor: (1, 224, 224, 3)

Wonderful! We've now got a custom image tensor of shape (1, 224, 224, 3) ((batch_size, img_height, img_width, colour_channels)).

Let's try and predict with it again now.


# Get prediction probabilities from our model
pred_probs = loaded_model.predict(tf.expand_dims(custom_image_tensor, axis=0))
pred_probs


1/1 [==============================] - 2s 2s/step

array([[1.83611644e-06, 3.09535017e-06, 3.86047805e-06, 3.19048486e-05,
        1.66974694e-03, 1.27542022e-04, 7.03033629e-06, 1.19856362e-04,
        1.01050091e-05, 3.87266744e-04, 6.44192414e-06, 1.67636438e-06,
        8.94749770e-04, 5.01931618e-06, 1.60283549e-03, 9.41093604e-05,
        4.67637838e-05, 8.51367513e-05, 5.67736897e-05, 6.14693909e-06,
        2.67342989e-06, 1.47549901e-04, 4.17501433e-05, 3.90995192e-05,
        9.50478498e-05, 1.47656752e-02, 3.08718845e-05, 1.58209339e-04,
        8.39364156e-03, 1.17800606e-03, 2.69454729e-04, 1.02170045e-04,
        7.42143384e-05, 8.22680071e-04, 1.73064705e-04, 8.98789040e-06,
        6.77722392e-06, 2.46034167e-03, 1.21447938e-05, 3.06540052e-04,
        1.12927992e-04, 1.30907722e-06, 1.19819895e-04, 3.28008295e-03,
        4.22435085e-04, 2.56334723e-04, 6.35078293e-04, 6.96951101e-05,
        1.82968670e-05, 6.66733533e-02, 1.65604251e-06, 4.85742465e-04,
        3.82422912e-03, 4.36909148e-04, 1.34899176e-06, 4.04351122e-05,
        2.30197293e-05, 7.29483800e-05, 1.31009811e-05, 1.30437169e-04,
        1.27625071e-05, 3.21804691e-06, 6.78410470e-06, 3.72191658e-03,
        9.23305777e-07, 4.05427454e-06, 1.32554891e-02, 8.34832132e-01,
        1.84010264e-06, 5.39118366e-04, 2.44915718e-05, 1.35658804e-04,
        9.53144918e-04, 3.80869096e-05, 3.43683018e-06, 3.57066506e-06,
        2.41459438e-05, 2.93612948e-06, 1.27533756e-04, 2.15716864e-05,
        3.21038242e-05, 7.87725276e-06, 1.70349504e-05, 4.27997729e-05,
        5.72475437e-06, 1.81680916e-05, 1.28094471e-04, 7.12008550e-05,
        8.24760180e-04, 6.14038622e-03, 4.27179504e-03, 3.55221750e-03,
        1.20739173e-03, 4.15856484e-04, 1.61429329e-04, 1.58363022e-04,
        3.78229856e-06, 1.03004022e-05, 2.00551622e-05, 1.21213234e-04,
        2.68000053e-06, 1.00253812e-04, 4.04065868e-05, 9.84299404e-05,
        1.29673525e-03, 3.07669543e-05, 1.62672077e-05, 1.17529435e-05,
        3.74953932e-04, 4.74653389e-05, 1.00191637e-05, 1.36496616e-04,
        3.76833777e-05, 1.55215133e-02, 2.33796614e-04, 1.01105807e-05,
        8.56942424e-05, 1.37508148e-04, 3.79100857e-06, 1.04301716e-05]],

It worked! Our model outputs a tensor of prediction probabilities.

We can find the predicted label by taking the argmax of the pred_probs tensor. And we get the predicted class name by indexing on the class_names list using the predicted label.


# Get the predicted class label
pred_label = tf.argmax(pred_probs, axis=-1).numpy()[0]

# Get the predicted class name
pred_class_name = class_names[pred_label]

print(f"Predicted class label: {pred_label}")
print(f"Predicted class name: {pred_class_name}")


Predicted class label: 67
Predicted class name: labrador_retriever
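The argmax only gives the single most likely class. If you'd also like to see the runners-up, one option is to rank the probabilities. Here's a minimal sketch in plain Python, where the top_k_predictions helper, the class names and the probabilities are all made up for illustration and are not real model output.

```python
def top_k_predictions(pred_probs, class_names, k=3):
    """Return the k (class_name, probability) pairs with the highest probability."""
    ranked = sorted(zip(class_names, pred_probs),
                    key=lambda pair: pair[1],
                    reverse=True)
    return ranked[:k]

# Hypothetical probabilities for four made-up classes (not real model output)
example_class_names = ["beagle", "boxer", "labrador_retriever", "pug"]
example_pred_probs = [0.05, 0.10, 0.80, 0.05]

for name, prob in top_k_predictions(example_pred_probs, example_class_names, k=2):
    print(f"{name}: {prob:.2f}")
```

With the real model output you'd pass pred_probs[0] (the first and only row of the batch) along with class_names.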

It’s looking good and the errors are all gone.

But why did we have to go through all of those preprocessing steps?

Simply because our model wants to make predictions on data in the same shape and format it was trained on.

So if you trained a model on image tensors with a certain shape and datatype, your model will want to make predictions on the same kind of image tensors with the same shape and datatype.
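A small guard function can catch shape mismatches before they reach the model. The sketch below uses NumPy only, and the prepare_image_batch helper is hypothetical. It assumes a single (224, 224, 3) image and casts to float32, which is the default dtype returned by img_to_array().

```python
import numpy as np

def prepare_image_batch(image_array, expected_shape=(224, 224, 3)):
    """Validate a single image array and return it as a float32 batch of one."""
    image_array = np.asarray(image_array)
    if image_array.shape != tuple(expected_shape):
        raise ValueError(
            f"Expected image of shape {tuple(expected_shape)}, got {image_array.shape}")
    # Cast to float32 and add a batch axis at position 0
    return np.expand_dims(image_array.astype(np.float32), axis=0)

# A stand-in image of zeros, used only to demonstrate the shapes
dummy_image = np.zeros((224, 224, 3), dtype=np.uint8)
batch = prepare_image_batch(dummy_image)
print(batch.shape, batch.dtype)
```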

Now that it’s all set up correctly, how about we try and make predictions on multiple images?

To do so, let's make a function that replicates the workflow from above.


def pred_on_custom_image(image_path: str,  # Path to the image file
                         model,  # Trained TensorFlow model for prediction
                         target_size: tuple[int, int] = (224, 224),  # Desired size of the image for input to the model
                         class_names: list = None,  # List of class names (needed to look up the predicted class)
                         plot: bool = True): # Whether to plot the image and predicted class
  """
  Loads an image, preprocesses it, makes a prediction using a provided model,
  and optionally plots the image with the predicted class.

  Args:
      image_path (str): Path to the image file.
      model: Trained TensorFlow model for prediction.
      target_size (tuple[int, int], optional): Desired (height, width) of the image for input to the model. Defaults to (224, 224).
      class_names (list, optional): List of class names used to look up the predicted class. Defaults to None.
      plot (bool, optional): Whether to plot the image and predicted class. Defaults to True.

  Returns:
      tuple: The predicted class and prediction probabilities when plot=False.
      When plot=True, the image is plotted and nothing is returned.
  """
  # Prepare and load image
  custom_image = tf.keras.utils.load_img(
    path=image_path,
    color_mode="rgb",
    target_size=target_size,
  )

  # Turn the image into a tensor
  custom_image_tensor = tf.keras.utils.img_to_array(custom_image)

  # Add a batch dimension to the target tensor (e.g. (224, 224, 3) -> (1, 224, 224, 3))
  custom_image_tensor = tf.expand_dims(custom_image_tensor, axis=0)

  # Make a prediction with the target model
  pred_probs = model.predict(custom_image_tensor)

  # pred_probs = tf.keras.activations.softmax(tf.constant(pred_probs))
  pred_class = class_names[tf.argmax(pred_probs, axis=-1).numpy()[0]]

  # Plot if we want, otherwise return the prediction
  if not plot:
    return pred_class, pred_probs
  else:
    plt.figure(figsize=(5, 3))
    plt.imshow(plt.imread(image_path))
    plt.title(f"pred: {pred_class}\nprob: {tf.reduce_max(pred_probs):.3f}")
    plt.axis("off")

What a good-looking function!

So now let's try it out on dog-photo-2.jpeg.


# Make prediction on custom dog photo 2
pred_on_custom_image(image_path="dog-photo-2.jpeg", model=loaded_model, class_names=class_names)


1/1 [==============================] - 0s 27ms/step
custom image prediction correct

Woohoo!!! Our model got it right!

Let's repeat the process for our other custom images.


# Predict on multiple images
fig, axes = plt.subplots(1, 4, figsize=(15, 7))
for i, ax in enumerate(axes.flatten()):
  image_path = custom_image_paths[i]
  pred_class, pred_probs = pred_on_custom_image(image_path=image_path,
                                                model=loaded_model,
                                                class_names=class_names,
                                                plot=False)
  ax.imshow(plt.imread(image_path))
  ax.set_title(f"pred: {pred_class}\nprob: {tf.reduce_max(pred_probs):.3f}")
  ax.axis("off")


1/1 [==============================] - 0s 28ms/step
1/1 [==============================] - 0s 26ms/step
1/1 [==============================] - 0s 25ms/step
1/1 [==============================] - 0s 28ms/step
final custom image predictions


Our Dog Vision 🐶👁 model has come to life.

It looks like our model got it right for 3 of our 4 custom photos (my dogs Bella and Seven are labrador retrievers, with a potential mix of something else).

But the model seemed to also think the photo of me was a Boston bulldog!

Note: Due to the randomness of machine learning, your result may be different here. If so, please let me know, I'd love to see what other kinds of dogs the model thinks I am 😆.

You might be wondering, why does our model do this? Why does it think I’m a dog?

It's because our model has been strictly trained to always predict a dog breed, no matter what image it receives. So no matter what image we pass to our model, it will always try to predict a dog from the image.
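To see why, here's a minimal sketch that assumes the model's final layer is a softmax (an assumption based on the probability-like outputs above). Softmax turns any set of scores into probabilities that sum to 1, so some class always comes out on top, even for an image that has nothing to do with dogs. The softmax helper and the scores below are made up for illustration.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three classes, standing in for a non-dog image
probs = softmax([0.2, 1.5, -0.3])
top_index = max(range(len(probs)), key=probs.__getitem__)

print(f"sum of probabilities: {sum(probs):.3f}")
print(f"top class index: {top_index}")
```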

So how could we fix this?

One solution would be to create a filter system.

  • First, train another model to predict whether or not the input image is of a dog
  • Then, only let our Dog Vision 🐶👁 model predict on the images that are of dogs

Is it a dog? If no then skip. If yes, then what type of dog?
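In code, that two-step workflow could look something like the sketch below. The classify_with_filter function and the two stand-in functions are hypothetical. In a real app, each would wrap a trained model.

```python
def classify_with_filter(image, is_dog_fn, breed_fn):
    """Run the breed model only when the filter model says the image is a dog."""
    if not is_dog_fn(image):
        return None  # skip: not a dog
    return breed_fn(image)

# Stand-in functions for illustration; real versions would wrap trained models
def fake_is_dog(image):
    return image.get("contains_dog", False)

def fake_breed(image):
    return "labrador_retriever"

print(classify_with_filter({"contains_dog": True}, fake_is_dog, fake_breed))
print(classify_with_filter({"contains_dog": False}, fake_is_dog, fake_breed))
```

Returning None for non-dog images lets the calling code decide what to show the user, such as a "no dog detected" message.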

For example

For my food prediction app Nutrify, I combined multiple machine learning models to create a workflow.


One model is set up for detecting food (Food Not Food), and another model is set up for identifying what food is in the image (FoodVision, similar to Dog Vision).

This creates a much better user experience if an app is customer facing.

In my Nutrify app for example, taking photos of objects that aren't food and having them identified as food can be a poor customer experience. So it filters them first and doesn’t allow non food items to be added.

These are some of the workflows you'll have to think about when you eventually deploy your own machine-learning models.

Because although machine learning models are often very powerful, they aren't perfect. This is why implementing guidelines and checks around them is still a very active area of research.

Key takeaways from this project

Some final thoughts to end this project with.

#1. Data, data, data!

In any machine learning problem, getting a dataset and preparing it so that it is in a usable format will likely be the first and often most important step. That's why we spent so much time getting the data ready in Part 1 of this series.

It will also always be an ongoing process, as although we've worked with thousands of dog images, our models could still be improved. As we saw going from training with 10% of the data to 100% of the data, one of the best ways to improve a model is with more data.

Also, explore your data early and often.

#2. Use transfer learning where possible - especially when starting out

For most new problems, you should generally look to see if a pre-trained model exists and see if you can adapt it to your use case.

Ask yourself:

  • What format is my data in?
  • What are my ideal inputs and outputs?
  • Is there a pre-trained model for my use case?

#3. Utilize neural networks

TensorFlow and Keras provide building blocks for neural networks which are powerful machine learning models capable of learning patterns in a wide range of data from text to audio to images and more.

Make sure to take advantage of them when you can.

#4. Experiment, experiment, experiment!

It's highly unlikely you'll ever get the best-performing model on your first try, and this is ok. Machine learning is very experimental by nature.

It’s the scientific method of finding out all the ways things don’t work, so that we can find the method that does. We just use ML to help us find this out faster.

So always keep this front of mind in any machine learning project.

Your results are never stationary and can almost always be improved. This includes experimenting on the data, the model, the training setup and the outputs (how does your model work in practice?).

What’s next?

So that concludes our neural network, deep learning, and transfer learning project!

Great work on getting this far - especially if you followed along and built your own project for your portfolio. It’s always important you don’t just read these guides, but put this information into action so that you can better learn from it.

As for what’s next?

Well, as I mentioned up top, technically this 'mini series' is part of my larger 'Introduction to Machine Learning' series. (I just went so deep on this particular section that I needed to make it into 3 parts).

There is one more part to this overall series coming soon, that's incredibly relevant to every project, and that’s how to communicate and share your work as a Machine Learning Engineer / Data Scientist.

Be sure to subscribe via the link below so you don’t miss it.


If you want to deep dive into Machine Learning and learn how to use these tools even further, then check out my complete Machine Learning and Data Science course or watch the first few videos for free.

learn machine learning ai and data science

It’s one of the most popular, highly rated Machine Learning and Data Science bootcamps online, as well as the most modern and up-to-date. Guaranteed.

You'll go from a complete beginner with no prior experience to getting hired as a Machine Learning Engineer this year, so it’s helpful for ML Engineers of all experience levels.

Or, if you already have a good grasp of Machine Learning, and just want to focus on Tensorflow for Deep Learning, I have a course on that also that you can check out here.

learn tensorflow

When you join as a Zero To Mastery Academy member, you’ll have access to both of these courses, as well as every other course in our training library!

Not only that, but you will also be able to ask me questions, as well as chat to other students and machine learning professionals via our private Discord community.

So go ahead and check those out, and don’t forget to subscribe below so you don’t miss the final part on this larger series on Machine Learning!
