Beginner’s Guide to PyTorch Loss Functions

Daniel Bourke

If you’ve ever trained a model in PyTorch, then you’ve no doubt seen a loss function in the code.  The thing is, it’s one of those lines that shows up in every tutorial, but it rarely ever gets explained fully.

  • What do loss functions actually do? 

  • What are the different types?

  • Why are there so many options?

  • Do they matter?

And so most people will either copy what others are doing or use whichever comes along with their model, but the reality is that using the wrong option can tank your model's results.

Which is why in this guide I’ll break down what loss functions are, how they work, and how to choose the right one for your project.

Sidenote: Want to learn more about PyTorch and how to use it effectively? Check out my complete PyTorch course:

You'll learn Deep Learning with PyTorch by building a massive 3-part real-world milestone project, so that by the end, you'll have the skills and portfolio to get hired as a Deep Learning Engineer!

Students of this course have gone on to work at Google, Tesla, NVIDIA, and more. There’s no reason you can’t do the same. So make sure to check it out.

With that out of the way, let’s get into this guide...

What is a loss function?

In simple terms, loss functions are how our ML models learn.

We give the model a goal, and then every time it makes a prediction, the loss function checks how far off it was and gives it a score. This is a single number that says, “Here’s how wrong that was”. The model then uses that number to adjust itself and try again, so that over time, as the loss goes down, the predictions get better.

(It can be a little confusing for beginners because a lower number means our model is performing better, so a loss of 0 means the model got it exactly right.)

This score or ‘loss’ is the foundation of the training loop. It’s how the model knows what to fix, which direction to go, and how big of a step to take. Without it, your model has no way to improve because it has no feedback on how it’s doing.

For example

Say the correct values your model should be predicting are:

Target: [3.0, -0.5, 2.0]

But your model predicts this instead:

Predicted: [2.5, 0.0, 2.1]

Clearly it’s close but not quite right, and that’s where the loss function steps in. It takes both sets of values and compares them.

One common way to do this is with mean squared error, which calculates the average of the squared differences.

Here’s how that looks in code:

import torch
import torch.nn as nn

predicted = torch.tensor([2.5, 0.0, 2.1], requires_grad=True)
target = torch.tensor([3.0, -0.5, 2.0])

loss_fn = nn.MSELoss()
loss = loss_fn(predicted, target)

print(loss.item())  # Prints a single number: about 0.17

That number then becomes the signal PyTorch uses to update the model in its training loop.
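To make that “signal” a little more concrete, here is a minimal sketch that reuses the tensors from the example above, recomputes the loss by hand, and then calls .backward() to see the gradient PyTorch derives from it. The name manual_loss is just for illustration.

```python
import torch
import torch.nn as nn

predicted = torch.tensor([2.5, 0.0, 2.1], requires_grad=True)
target = torch.tensor([3.0, -0.5, 2.0])

loss = nn.MSELoss()(predicted, target)

# Same calculation by hand: average of the squared differences
manual_loss = ((predicted - target) ** 2).mean()
print(torch.isclose(loss, manual_loss).item())  # True

# backward() fills predicted.grad with d(loss)/d(predicted)
loss.backward()
print(predicted.grad)  # roughly [-0.33, 0.33, 0.07]
```

The sign of each gradient entry is the “which direction to go” part: the first prediction (2.5) is below its target (3.0), so its gradient is negative, meaning that raising it would lower the loss.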

I’ll cover how that all works later when we get into using loss functions in practice. But first, let’s take a closer look at the types of loss functions PyTorch provides, how they work, and why you'd pick one over another.

The different types of loss functions in PyTorch

Not all problems are the same so you need different ways to measure mistakes.

  • Some models predict numbers

  • Some pick from a list

  • Others tag multiple things at once

That’s why PyTorch gives you different loss functions depending on the kind of prediction you’re making, and they fall into one of three groups:

Regression loss functions

The word “regression” here refers to regression models in statistics, where the output is a number rather than a category. These are used when your model is predicting continuous numbers such as prices, weights, or durations. 

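Loss functions in this group (like MSELoss or L1Loss) calculate the difference between each predicted number and the actual value, then combine those differences into a single score.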

MSELoss (Mean Squared Error Loss)

This is one of the most common loss functions used when your model is trying to predict numbers, and you want your predictions to be as close as possible to the correct values.

So how does MSELoss help with that?

Like all loss functions, it compares each predicted number to the actual number and measures the difference. Then it squares that difference and averages all those squared values across the dataset. That final number becomes the signal PyTorch uses to guide the model during training.

Why square the difference? Two reasons.

First, it makes sure that positive and negative errors don’t cancel each other out. Without squaring, if your model overestimates one prediction by 3 and underestimates another by 3, you’d end up with a total error of zero which would falsely suggest the model is doing fine.

Second, squaring makes big mistakes stand out more. An error of 5 becomes 25, while an error of 1 stays small. This pushes the model to focus on fixing the larger mistakes first which often has the biggest impact on performance.
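Both reasons are easy to check with a few lines of tensor arithmetic. This is only a small sketch with made-up error values, not part of any training code:

```python
import torch

# One prediction is 3 too high, another is 3 too low
errors = torch.tensor([3.0, -3.0])

print(errors.mean().item())         # 0.0 -> the errors cancel out
print((errors ** 2).mean().item())  # 9.0 -> squaring keeps both mistakes visible

# Squaring also makes large errors count for much more
small, large = torch.tensor(1.0), torch.tensor(5.0)
print((small ** 2).item(), (large ** 2).item())  # 1.0 25.0
```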

Here’s a real-world example. Say you’re building a model to predict student test scores. You feed in some features and get back:

Predicted scores: [78, 85, 92]
Actual scores:    [80, 87, 90]

The model was a bit off: sometimes too low, sometimes too high. MSELoss will calculate how far off each prediction was, square it, and then average the results:

  • (78 - 80)² = 4

  • (85 - 87)² = 4

  • (92 - 90)² = 4

Average of [4, 4, 4] = 4

So your loss is 4. That number becomes the feedback PyTorch uses to help the model adjust and do better on the next try.

Here’s what that looks like in code:

import torch
import torch.nn as nn

predicted = torch.tensor([78.0, 85.0, 92.0])
actual = torch.tensor([80.0, 87.0, 90.0])

loss_fn = nn.MSELoss()
loss = loss_fn(predicted, actual)

print(loss.item())  # Output: 4.0

TL;DR

MSELoss tells the model how far off it was on average, and it punishes bigger errors more heavily to help the model fix them fast.

You’d use this whenever your model is solving a problem where the answer is a number and being “almost right” is still valuable.

L1Loss (Mean Absolute Error)

L1Loss is another option for models that predict numbers. Like MSELoss, it compares your model’s predictions to the actual values, but this time instead of squaring the errors, it just takes the absolute difference between them.

That sounds like a small change, but it shifts the way your model learns.

Where MSELoss puts more weight on bigger errors, L1Loss counts every mistake in proportion to its size, so if a prediction is off by 5, that counts as 5, no matter if it was too high or too low. And since there’s no squaring involved, large errors don’t explode in size.

Why do this? 

Well it makes L1Loss more resistant to outliers. So if your dataset includes a few unusual values that don’t follow the general pattern, L1Loss won’t let them take over the training process. Basically, it’s a more balanced measure that keeps the model from overreacting to noisy or extreme data.

For example

Imagine you’re predicting delivery times and most days the model’s pretty close. However, every once in a while, there’s a massive delay due to traffic. 

With MSELoss, those rare delays would get squared, making them look far more important than they really are. With L1Loss, they still count, but they don’t overwhelm everything else.
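Here is a rough sketch of that delivery scenario. The times are invented purely for illustration: three ordinary deliveries plus one that ran an hour late.

```python
import torch
import torch.nn as nn

mse_fn, l1_fn = nn.MSELoss(), nn.L1Loss()

# Made-up delivery times in minutes (illustrative values only)
predicted = torch.tensor([30.0, 25.0, 40.0, 35.0])
actual = torch.tensor([32.0, 24.0, 41.0, 95.0])  # last delivery hit heavy traffic

# Losses on the three ordinary days only
mse_normal = mse_fn(predicted[:3], actual[:3]).item()
l1_normal = l1_fn(predicted[:3], actual[:3]).item()

# Losses once the delayed delivery is included
mse_all = mse_fn(predicted, actual).item()
l1_all = l1_fn(predicted, actual).item()

print(f"MSELoss: {mse_normal:.2f} -> {mse_all:.2f}")  # MSELoss: 2.00 -> 901.50
print(f"L1Loss:  {l1_normal:.2f} -> {l1_all:.2f}")    # L1Loss:  1.33 -> 16.00
```

One late delivery multiplies the MSELoss value by roughly 450, while L1Loss grows by a factor of 12. The outlier still shows up in both, but it dominates MSELoss far more.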

Here’s what that looks like in code:

import torch
import torch.nn as nn

predicted = torch.tensor([78.0, 85.0, 92.0])
actual = torch.tensor([80.0, 87.0, 90.0])

loss_fn = nn.L1Loss()
loss = loss_fn(predicted, actual)

print(loss.item())  # Output: 2.0

This tells us that, on average, the model was off by 2 points across these predictions. That number becomes the feedback PyTorch uses to adjust the model weights.

TL;DR

Use L1Loss when you care about being close to the correct value but don’t want your model to panic over rare big mistakes. It gives you a steadier, more balanced way to measure how far off the model was without making outliers too important.

Classification loss functions

When your model’s goal is to choose the right label from a set of options, you’re dealing with a classification problem. Unlike regression, where the output is a number, classification is about picking one class from a list.

You’ll use this kind of loss function for tasks like: 

  • Image classification

  • Sentiment analysis

  • Or anything where each input has exactly one correct answer.

Since the output isn’t a number to match directly, the model handles things differently. It assigns a score to each possible class, with higher scores meaning that it has higher confidence that it’s chosen correctly. The loss function’s job is to help the model adjust those scores over time so that the correct label ends up with the highest score.

Just like with regression loss functions, there are a few options for this.

CrossEntropyLoss

CrossEntropyLoss (from torch.nn.CrossEntropyLoss) is the go-to loss function for classification problems where each input has exactly one correct label. You’ll see it used in tasks like image classification, sentiment analysis, or intent detection — anything where the model needs to choose the right option from a fixed list.

Here’s how it works.

The model starts by giving a raw score to each class — numbers called logits — which reflect how confident it is in each option. But these scores aren’t probabilities yet, and they can be any value, positive or negative.

So PyTorch runs those scores through a function called softmax. This takes all the logits and turns them into probabilities between 0 and 1 that add up to 1. The higher the original score, the higher the probability for that class. At this point, the model is essentially saying, “Out of all the options, here’s how likely I think each one is.”

Then CrossEntropyLoss looks at how much probability the model gave to the correct answer. If it gave most of the probability to the right class, the loss is low. If it leaned toward a wrong answer, the loss is higher. That number becomes the feedback PyTorch uses to help the model do better next time.

Here’s what that looks like in code:

import torch
import torch.nn as nn

# These are the raw scores (logits) the model predicted for 3 classes
# Try changing these values to get a different loss output
predicted_logits = torch.tensor([[4.0, 1.0, 0.1]])

# The correct class is the first one (index 0)
target = torch.tensor([0])

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(predicted_logits, target)

print(loss.item())  # Outputs a single loss value

# >>> 0.06768565624952316 # lower = better, 0 = perfect prediction

In this example, the model gave the highest score to the correct class, so the loss is relatively low. If it had picked the wrong one, the number would be higher.
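If you want to see the softmax step for yourself, the two stages described earlier can be sketched by hand and compared with what nn.CrossEntropyLoss returns. The names probs, manual_loss, and wrong_loss are only for illustration.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[4.0, 1.0, 0.1]])
target = torch.tensor([0])

# Step 1: softmax turns the logits into probabilities that sum to 1
probs = torch.softmax(logits, dim=1)
print(probs)  # roughly [[0.93, 0.05, 0.02]]

# Step 2: the loss is the negative log of the probability given to the correct class
manual_loss = -torch.log(probs[0, target[0]])

loss_fn = nn.CrossEntropyLoss()
print(manual_loss.item(), loss_fn(logits, target).item())  # both about 0.0677

# If the correct class had been the last one instead, the loss is much higher
wrong_loss = loss_fn(logits, torch.tensor([2]))
print(wrong_loss.item())  # about 3.97
```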

TL;DR

CrossEntropyLoss is what you’ll use most of the time when training models that output a single class per input. Just make sure your model outputs raw logits, not probabilities. CrossEntropyLoss applies softmax internally, so applying it yourself as well would give you the wrong loss.

NLLLoss (Negative Log Likelihood Loss)

NLLLoss is another loss function used for classification problems, but it gives you a bit more control over how the model handles its predictions.

Most of the time, you’ll use CrossEntropyLoss because it handles everything behind the scenes — it takes the model’s raw scores, turns them into probabilities, and calculates how far off the prediction was. NLLLoss, on the other hand, gives you control over that final step. It’s used when you want to shape or reuse the model’s prediction scores in more specific ways.

So when would you actually need that?

Imagine you’re building a language model that predicts the next word or character in a sentence. These models often need to generate sequences one token at a time and score each one carefully. In those cases, you might want to adjust how confident the model is in its predictions, reuse those confidence scores later during generation, or share weights between layers to save memory. NLLLoss lets you do that by letting you handle the final scoring step yourself.

Another example is custom classification architectures. Maybe you're building a model that makes predictions in a less typical way, like adjusting how certainty is measured or applying temperature scaling to smooth out predictions. Again, NLLLoss gives you more flexibility.

Here’s how it works.

Instead of taking raw scores and converting them into probabilities automatically, NLLLoss expects you to give it log-probabilities, which means applying log_softmax to the raw scores yourself first.

Here’s what that looks like in code:

import torch
import torch.nn as nn

# Log-probabilities for 3 classes (already processed through log_softmax)
log_probs = torch.log_softmax(torch.tensor([[4.0, 1.0, 0.1]]), dim=1)

# Correct class is index 0
target = torch.tensor([0])

loss_fn = nn.NLLLoss()
loss = loss_fn(log_probs, target)

print(loss.item())

# >>> 0.06768565624952316 # same output as CrossEntropyLoss above

The idea is the same as with CrossEntropyLoss: if the model was confident and correct, the loss is low. If it was wrong or unsure, the loss goes up. The only difference is that here, you apply log_softmax yourself before calling the loss function.
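As one possible illustration of that extra control, here is a sketch of the temperature scaling idea mentioned earlier. The temperature value of 2.0 is a hypothetical choice, and this is only one way such scaling could be wired in, not a recommended recipe.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[4.0, 1.0, 0.1]])
target = torch.tensor([0])
loss_fn = nn.NLLLoss()

# Hypothetical temperature value; 1.0 would leave the logits unchanged
temperature = 2.0

# Plain log_softmax: same result as CrossEntropyLoss on the raw logits
plain = loss_fn(torch.log_softmax(logits, dim=1), target)

# Dividing the logits by a temperature above 1 flattens the probabilities
scaled_log_probs = torch.log_softmax(logits / temperature, dim=1)
scaled = loss_fn(scaled_log_probs, target)

print(plain.item())            # about 0.0677
print(scaled.item())           # higher, because the model now looks less confident
print(scaled_log_probs.exp())  # the softened probabilities, available for reuse
```

Because you compute the log-probabilities yourself, they are available as a normal tensor that you can inspect or reuse before the loss is ever calculated.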

TL;DR

Use NLLLoss when you're building a more custom model and want to control how confident the model is in its predictions. It’s often used in language models and custom classification architectures like the ones described above.

If you're working with a standard classification model, CrossEntropyLoss is usually the better choice — it's simpler and does the scoring for you.

Multi-label loss functions

In standard classification, you're asking the model to choose the single best option. But not all problems work that way because sometimes, there’s more than one correct answer. For example, a news article could belong to “politics,” “economy,” and “health” at the same time. 

We call these multi-label classification problems, because instead of picking one class, the model needs to decide which labels apply and then treat each one as a separate yes-or-no question. 

And that’s what multi-label loss functions are built for. 

Unlike single-label classification, where the model compares all classes at once, multi-label loss functions evaluate each label independently. The model gives a separate score for every possible label, and the loss function checks how close each prediction was to the truth.

You’ll use this setup when:

  • Each input can have more than one correct label

  • You want the model to treat each label prediction independently

  • You care about how confident the model is in each decision

You can combine methods to achieve this same effect, such as combining sigmoid with BCELoss manually. However, we can just use BCEWithLogitsLoss instead, as this is the most reliable and beginner-friendly option. It does the heavy lifting for you and is what you’ll see used in most real-world PyTorch projects.

Let’s look at how it works.

BCEWithLogitsLoss

BCEWithLogitsLoss is designed for models that make binary decisions: whether a label applies or not.

You’ll see it used in:

  • Image tagging (e.g. “tree,” “sky,” “person”)

  • Multi-topic document classification

  • Detecting multiple objects or traits in the same input

Your model outputs a logit for each possible label. PyTorch then runs each one through a sigmoid function, which squashes it into a value between 0 and 1. That becomes the model’s confidence that a label is present.

Then the loss function compares those probabilities to the correct answers. If the model was confident and correct, the loss is small. If it was confident but wrong, the loss increases. 

Here’s what it looks like in code:

import torch
import torch.nn as nn

# Raw scores for two labels: "beach" and "sunset"
predicted_logits = torch.tensor([0.8, -1.2])

# Ground truth: beach = yes, sunset = no
target = torch.tensor([1.0, 0.0])

loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(predicted_logits, target)

print(loss.item())

# >>> 0.31719154119491577 # Lower loss is better, 0 = perfect prediction

In this case, the model leaned toward "beach" being present and "sunset" being absent. Since both predictions were right, the loss will be relatively low.
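To see what happens under the hood, here is a small sketch comparing BCEWithLogitsLoss with the manual sigmoid-plus-BCELoss route mentioned earlier, using the same two labels. It also uses reduction="none" to show that each label gets its own loss before averaging.

```python
import torch
import torch.nn as nn

predicted_logits = torch.tensor([0.8, -1.2])
target = torch.tensor([1.0, 0.0])

# The combined version used above
combined = nn.BCEWithLogitsLoss()(predicted_logits, target)

# The manual two-step version: sigmoid first, then BCELoss
probs = torch.sigmoid(predicted_logits)
manual = nn.BCELoss()(probs, target)

print(probs)                           # roughly [0.69, 0.23]
print(combined.item(), manual.item())  # both about 0.3172

# reduction="none" keeps one loss per label instead of averaging
per_label = nn.BCEWithLogitsLoss(reduction="none")(predicted_logits, target)
print(per_label)                       # roughly [0.37, 0.26]
```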

TL;DR

Use BCEWithLogitsLoss for any binary classification problem, or any multi-label classification problem where each item can have more than one correct label and each label is judged separately. Just remember that BCEWithLogitsLoss applies the sigmoid step to the raw logits internally, so don’t do it yourself or you’ll get incorrect results.

Custom loss functions

Most of the time, you’ll use PyTorch’s built-in loss functions because they cover the most common tasks and they work well out of the box. But sometimes, your project won’t fit into those standard molds.

  • Maybe you’re solving a hybrid problem that blends multiple objectives

  • Or maybe you want the model to treat some mistakes as more serious than others

When the built-in options don’t quite capture what “wrong” should mean for your task, that’s when it’s time to write your own. And the good news is that PyTorch makes this surprisingly simple to do. 

You create a custom class, define how your loss should behave, and return a single number that represents how well the model is doing. That number becomes the signal PyTorch uses to update the model just like with any other loss function.

For example

Here you can see a combination of loss functions. It starts with mean squared error, then adds a small extra penalty based on the absolute size of each mistake:

import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def forward(self, prediction, target):
        mse = torch.mean((prediction - target) ** 2)
        penalty = torch.sum(torch.abs(prediction - target))
        return mse + 0.1 * penalty

This version says, “Use mean squared error as usual, but also gently push the model to avoid large jumps away from the target.”

What does that mean?

Well, the mean squared error does what you’d expect. It measures the average squared distance between the predicted and actual values, which punishes large errors more than small ones. 

The second part is where we add something extra. It looks at the absolute size of each mistake (without squaring it), sums them up, and adds that total to the loss, scaled by a small weight. This gives the model an extra nudge so that even if the overall average is fine, it doesn’t let any individual prediction drift too far off.
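To check the arithmetic, here is a short usage sketch that runs the CustomLoss class on the test-score numbers from the MSELoss example. The class is repeated so the snippet runs on its own.

```python
import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def forward(self, prediction, target):
        mse = torch.mean((prediction - target) ** 2)
        penalty = torch.sum(torch.abs(prediction - target))
        return mse + 0.1 * penalty

predicted = torch.tensor([78.0, 85.0, 92.0], requires_grad=True)
actual = torch.tensor([80.0, 87.0, 90.0])

loss_fn = CustomLoss()
loss = loss_fn(predicted, actual)

# mse = 4.0, penalty = 2 + 2 + 2 = 6.0, so loss = 4.0 + 0.1 * 6.0
print(loss.item())  # about 4.6

# Because the loss is built from tensor operations, backward() works as usual
loss.backward()
print(predicted.grad)
```

One thing to keep in mind with this particular version: the penalty uses torch.sum rather than torch.mean, so its size grows with the number of predictions in the batch, while the MSE part does not.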

This kind of adjustment is useful when:

  • You want the model to make safer, more stable predictions

  • You care about consistency, not just the average performance

  • You’re trying to prevent outliers from throwing off the whole training process

And this is just one idea. You can build loss functions around domain-specific rules, fairness constraints, risk tradeoffs, or whatever matters to your problem. 

This lets you shape it to reflect your goals rather than forcing your goals to fit someone else’s defaults.

Advanced loss functions

It’s worth noting that PyTorch also includes a few other loss functions like KLDivLoss, PoissonNLLLoss, and CTCLoss. However, these are more specialized and you probably won’t need them right away unless you’re working on things like speech recognition, probabilistic models, or sequence prediction. I’ll cover them more in a future guide. For now though, we have enough options to use for most tasks.

And so the question of course is how do we decide which one to use…

How to choose the right loss function

The best way to choose the right loss function for your model is to simply step back and look at your task. 

That means asking a few key questions:

  • What is the model trying to predict?

  • How is that prediction structured?

  • How many answers are there for each input?

  • And what kind of mistakes actually matter?

Once you can answer those, the right choice usually becomes obvious.

For example

Imagine you're working for Uber Eats, and your team is building a model to predict how long a delivery will take. When someone places an order, the app wants to give them a realistic estimate like “Your food will arrive in 36 minutes.”

So let’s go through the questions to help decide which loss function to use.

Question #1. What is the model predicting? 

In this case, it’s a number: the delivery time, in minutes. That tells us we’re dealing with a regression problem, which narrows our options to something like MSELoss or L1Loss.

We can then answer further questions to decide which to use.

Question #2. What mistakes matter and what are we trying to improve?

If the app is off by a few minutes, that’s fine. But if it says 20 minutes and the food takes over an hour, that’s a major problem. So we want the model to avoid big errors. That makes MSELoss a good starting point because it punishes larger mistakes more heavily, which pushes the model to correct them faster.

But there are still more questions we could ask ourselves.

Question #3. How clean is your data? 

In real life, some deliveries will go way off track — maybe because of a snowstorm, a traffic jam, or a restaurant delay. These are outliers. And because MSELoss squares the errors, it can overreact to them and skew the model’s learning. In this case, L1Loss is a safer choice. It treats all errors equally and isn’t thrown off by a few extreme cases.

In practice, a team might use both at different stages.

  • During early training, when the dataset is clean, they might start with MSELoss to fix large prediction errors fast and get accurate average estimates

  • Once the model goes live and retrains on real-world data (which can be messy), they might switch to L1Loss or use a hybrid like Huber loss. That way, they still fix big errors when they matter but don’t let outliers dominate the learning process
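For a feel of how the Huber option compares, here is a rough sketch using made-up delivery times with one outlier. It assumes a PyTorch version that includes nn.HuberLoss (added in 1.9), and delta=1.0 is simply the library default rather than a tuned value.

```python
import torch
import torch.nn as nn

# Made-up delivery times in minutes; the last one is an outlier
predicted = torch.tensor([30.0, 25.0, 40.0, 35.0])
actual = torch.tensor([32.0, 24.0, 41.0, 95.0])

mse = nn.MSELoss()(predicted, actual)
l1 = nn.L1Loss()(predicted, actual)

# delta is the error size where the loss switches from squared to linear.
# 1.0 is the library default; the right value depends on your data.
huber = nn.HuberLoss(delta=1.0)(predicted, actual)

print(mse.item())    # 901.5
print(l1.item())     # 16.0
print(huber.item())  # 15.5
```

Huber loss behaves like MSELoss for errors smaller than delta and like L1Loss for larger ones, which is why the outlier here does not blow up the result the way it does with MSELoss.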

That’s the benefit of working through these questions: you can also plan ahead for changes or tweaks like this.

How to set up and use a loss function in PyTorch

Picking the right loss function takes some thought, but using it in code? That part’s easy. Once you’ve chosen the right one for your task, it’s just a single line, and it works the same way across the board.

from torch import nn
# Setup the loss function
loss_fn = nn.MSELoss()  # Or nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss(), etc.

Of course, there’s more to the training loop than just this line, so let’s break it down with a quick example and walk through what’s actually happening.

Here’s the basic training loop flow:

  1. Pass your input through the model to get a prediction

  2. Compare that prediction to the correct answer using a loss function

  3. Use that loss to calculate gradients with .backward()

  4. Use an optimizer (from torch.optim) to update the model’s weights with .step()

  5. Repeat

So let’s look at this in code, using MSELoss for a regression task:

import torch
import torch.nn as nn
import torch.optim as optim

# Example model
model = nn.Linear(1, 1)

# Example data
inputs = torch.tensor([[2.0], [4.0], [6.0]])
targets = torch.tensor([[4.0], [8.0], [12.0]])

# Loss function and optimizer
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(100):
    predictions = model(inputs)             # Step 1
    loss = loss_fn(predictions, targets)    # Step 2
    optimizer.zero_grad()                   # Clear old gradients
    loss.backward()                         # Step 3
    optimizer.step()                        # Step 4

print("Final loss:", loss.item())

# >>> Final loss: 0.0030100170988589525 # Your exact value will vary because the model starts from random weights. A loss of 0 means perfect, so this looks like a good output!

Here’s what’s happening step by step:

  • model(inputs) runs the forward pass and generates predictions

  • loss_fn(predictions, targets) measures how far off the predictions were

  • optimizer.zero_grad() clears any gradients left over from the previous step

  • loss.backward() calculates the gradients for each parameter

  • optimizer.step() updates the model based on those gradients

Easy right!?

This same pattern works no matter what kind of task you're doing. So whether you're doing regression, classification, or multi-label prediction, the loop stays the same. The only thing that changes is which loss function you plug in.
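As a quick sanity check of that claim, here is a sketch of the same loop on a toy classification problem. The data, layer sizes, and learning rate are invented for illustration. Note that the targets are now class indices and the model outputs one logit per class.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # make the run repeatable

# Toy data: 2 features per example, 3 possible classes (illustrative values)
inputs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [1.0, 0.2]])
targets = torch.tensor([0, 1, 2, 0])  # class indices, not one-hot vectors

model = nn.Linear(2, 3)               # outputs one raw logit per class
loss_fn = nn.CrossEntropyLoss()       # the only line that really changes
optimizer = optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(100):
    logits = model(inputs)
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```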

But that only works if you’re using the right one. And that brings us to the most common mistakes.

Common mistakes when using loss functions in PyTorch

Here are some of the most frequent issues and how to avoid them.

Mistake #1. Using the wrong loss function for your task

I’ve said it a few times already, but it's worth listing as it’s one of the most common errors and the easiest to miss when you're starting out. You have to choose the right function for your goal. Otherwise your model won’t improve properly. 

So before picking a loss, always work through the questions above:

  • What the model is predicting

  • How many outputs there are

  • And what kind of mistakes you care about

Mistake #2. Feeding the loss function the wrong input format

Even if you’ve chosen the right loss function, it won’t work correctly if you pass in the wrong kind of input. This happens a lot with classification tasks.

For example

CrossEntropyLoss expects raw scores (logits), not probabilities. It applies softmax internally, so if you apply it yourself before calling the loss function, you’ll get incorrect results. Same goes for BCEWithLogitsLoss, which expects raw scores and applies sigmoid on its own. If you pass in values that have already been normalized, your model won’t learn properly even though you picked the right function.

On the flip side, if you're using NLLLoss, you need to remember to apply log_softmax manually before feeding the output into the loss function.

So choosing the correct loss function isn’t enough. You also need to know what kind of inputs it expects such as raw scores, probabilities, or log-probabilities, and then structure your model’s final layer accordingly.
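What makes this mistake easy to miss is that PyTorch raises no error. A minimal sketch of the difference, using the same logits as the CrossEntropyLoss example:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[4.0, 1.0, 0.1]])
target = torch.tensor([0])
loss_fn = nn.CrossEntropyLoss()

correct = loss_fn(logits, target)                       # raw logits, as intended
double = loss_fn(torch.softmax(logits, dim=1), target)  # softmax applied twice

print(correct.item())  # about 0.0677
print(double.item())   # noticeably larger, even though the prediction is the same

# NLLLoss is the mirror image: it needs log_softmax applied first
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), target)
print(torch.isclose(nll, correct).item())  # True
```

Both calls run without complaint, so the only way to catch the double softmax is to know what input each loss function expects.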

Mistake #3. Skipping optimizer.zero_grad() before calling backward()

When you’re training a model in PyTorch, one critical part of the loop is calculating the gradients. What’s easy to forget, and just as easy to mess up, is that PyTorch accumulates gradients by default.

That means every time you call .backward(), the gradients get added to whatever was already there.

Why care?

Well if you forget to reset the gradients before each training step, your updates won’t reflect the current batch alone. Instead they’ll reflect all the batches before it too, and that completely messes up your training.
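You can watch the accumulation happen on a single tensor. This sketch uses a bare tensor w and its .grad attribute rather than a full model and optimizer, just to keep it small:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(3 * w)/dw = 3
(3 * w).backward()
print(w.grad.item())  # 3.0

# Second backward pass without clearing: the new gradient is added on top
(3 * w).backward()
print(w.grad.item())  # 6.0

# Clearing first gives the gradient for the current pass only
w.grad.zero_()
(3 * w).backward()
print(w.grad.item())  # 3.0
```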

That’s why you always need this line before .backward():

optimizer.zero_grad()

It clears out the gradients from the last step so they don’t interfere with the next one. The correct order is:

optimizer.zero_grad()  # Step 1: clear old gradients
loss.backward()        # Step 2: calculate new gradients
optimizer.step()       # Step 3: update the weights

The placement matters too. If you call zero_grad() after .backward() but before .step(), you wipe out the gradients you just calculated, so the update has nothing useful to work with. And if you never call it, the gradients from previous steps are still there, and your model learns from the wrong signal.

It’s a tiny detail, but forgetting it means your model won’t converge the way it should.

Mistake #4. Watching the loss but ignoring real metrics

It’s easy to assume that if the loss is going down, your model must be doing better. But that’s not always true. The loss only tells you how well the model is minimizing the objective you gave it, and not whether it’s actually solving the problem in a meaningful way.

This is important because you might be optimizing for the wrong thing.

For example

Let’s say you're training an ad-ranking model that’s supposed to boost revenue. If you train it to maximize clicks, your loss function will reward anything that improves click-through rate. Over time, it’ll start targeting users who click a lot. The thing is, some people will click anything, but they don’t always buy.

So although the loss keeps dropping and you’re getting more clicks, your actual performance metric (purchases, revenue, etc.) might stall or drop. You’re basically training the model to get more of the wrong people, focusing on those who click more vs those who click AND buy.

So while loss is useful for optimization, it’s not enough. Always pair it with evaluation metrics that reflect your real goal, such as accuracy, F1, precision/recall, or domain-specific KPIs. 

Otherwise, you might end up with a model that’s “winning” the wrong game.
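Here is a small sketch of how loss and a real metric can disagree. The two “models” are just hand-written logits, invented so the effect is easy to see, and accuracy is a helper defined only for this example.

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
targets = torch.tensor([0, 0, 0])  # the correct class is 0 for all three examples

# Hypothetical model A: very confident on two examples, wrong on the third
logits_a = torch.tensor([[5.0, 0.0], [5.0, 0.0], [-0.5, 0.0]])
# Hypothetical model B: only slightly confident, but right every time
logits_b = torch.tensor([[0.1, 0.0], [0.1, 0.0], [0.1, 0.0]])

def accuracy(logits, targets):
    # Fraction of examples where the highest-scoring class is the correct one
    return (logits.argmax(dim=1) == targets).float().mean().item()

loss_a, acc_a = loss_fn(logits_a, targets).item(), accuracy(logits_a, targets)
loss_b, acc_b = loss_fn(logits_b, targets).item(), accuracy(logits_b, targets)

print(f"A: loss={loss_a:.3f}, accuracy={acc_a:.2f}")
print(f"B: loss={loss_b:.3f}, accuracy={acc_b:.2f}")
```

Model A ends up with the lower loss but also the lower accuracy, which is exactly why it helps to track both numbers during training.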

Time to test these loss functions out for yourself!

So as you can see, loss functions are super important in PyTorch because they’re how our model learns. Sure, we need to pick the right one for our goals, but after that, it’s as easy as adding it to your code and letting it run.

Speaking of which, the best way to learn how to use these is to try them yourself!

Build a small model, plug in a loss function, and watch how it affects learning.

P.S.

Remember, if you want to learn more about PyTorch and how to use it effectively, then check out my complete PyTorch course:

You'll learn Deep Learning with PyTorch by building a massive 3-part real-world milestone project, so that by the end, you'll have the skills and portfolio to get hired as a Deep Learning Engineer!

Better still?

Once you join, you’ll also get access to our private Discord community, where you can get help and ask questions from me, fellow students, and other working tech professionals.
