Let’s Think Step-by-Step: Beginner’s Guide to Chain of Thought Prompting

Scott Kerr

If you've spent any time experimenting with large language models (LLMs) like GPT-4, you've probably noticed that sometimes it absolutely nails a tough question, and then the next time it flubs an easy one...

So what's going on here?

Well, the problem all comes down to how LLMs work under the hood. Because even though they're getting smarter with each new model, there are still issues with how they actually solve prompts. And if you don't understand the issue, you're going to keep getting hit-or-miss results.

The good news is that there's a simple prompting technique you can use that helps AI reason more clearly on logic-based problems! And in this guide I'll break down what it is, how it works, and when to use it.

Side Note: If you’re as fascinated by LLMs as I am, want to go from beginner to pro, (and maybe even land a job as a Prompt Engineer), then check out my Prompt Engineering Bootcamp. 🚀

Instead of memorizing random prompts, you’ll learn how LLMs actually work and how to use them effectively.

It’s a complete course that starts with beginner-friendly information and builds to advanced skills, and will put you at the forefront of the AI field for the next decade – plus, you’ll get to join our private Discord community where you can ask questions 24/7 and get help from me and other experts along the way!

With that out of the way, let’s get into this guide…

What is Chain of Thought Prompting?

Chain-of-Thought (CoT) prompting is just a fancy term for asking the model to show its reasoning step by step.

You simply ask it a prompt as usual, but also add on a note for it to show how it came to the answer.

Super simple right?

The thing is though, it completely changes how the model works. And so instead of immediately spitting out an answer, the model will produce a series of logical steps or thoughts and then arrive at a conclusion - similar to how a human would work through the problem.

For example

Let's suppose that we ask the AI the following question:

“Tom has four apples. He gives one to Sarah and buys two more. How many apples does he have now?”

The AI might quickly answer:

The answer is “5”

And sure, in this example that answer is correct, but the issue is that it won't explain how it got there.

Why does this matter?

2 reasons:

  1. Without seeing how it reached the answer, you can't tell whether it actually worked through the problem correctly

  2. And so the answer might be wrong without you even realizing it

Not great right?

But when you use Chain-of-Thought prompting, it increases the quality of the output. And all you need to do is change that initial prompt slightly:

“Tom has four apples. He gives one to Sarah and buys two more. How many apples does he have now? Let’s think step by step.”

That last line makes all the difference in how it processes the prompt, because now the AI’s answer will look something like:

“Tom starts with 4 apples. He gives 1 to Sarah, leaving him with 3. Then he buys 2 more, so 3 + 2 = 5. Therefore, he has 5 apples now.”

See the difference?

By explicitly telling the model to reason through the question step by step, we got it to explain each stage of the problem instead of jumping straight to the final answer. In this case the final answer is the same, but the reasoning makes it clear why it's 5 and helps ensure the AI didn't just guess.
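If you're calling the model through code rather than a chat window, the tweak is just as small. Here's a minimal sketch assuming the OpenAI Python SDK; the model name is just a placeholder, and any capable chat model should work:

```python
# Minimal zero-shot CoT sketch using the OpenAI Python SDK.
# The model name is a placeholder assumption -- swap in whatever model you use.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "Tom has four apples. He gives one to Sarah and buys two more. "
    "How many apples does he have now?"
)

# Appending the trigger phrase is all zero-shot CoT requires.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": question + " Let's think step by step."}],
)

print(response.choices[0].message.content)
```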

So why does simply adding “let’s think step-by-step” make any difference?

Why thinking step-by-step helps improve LLM performance

I cover it more in this video from my course (which you can watch for free here), but in simple terms, it all comes down to how LLMs work under the hood.

You see, these models don’t truly “think” like humans because they’re not performing deep calculations or reasoning in a conscious way. Instead, an LLM generates text by predicting what words or phrases are most likely to come next, based on the patterns it learned from billions of examples in its training data.

This means that if you ask a question and pause, the model might then recall an answer that often follows similar questions in its training data, without actually working through the logic every time. Basically, it’s trying to sidestep doing the work by looking for the most statistically likely answer, based on the patterns it has already seen.

The issue, of course, is that not all of that training data was correct. And so by prompting the model to show its reasoning, you're essentially steering it toward a different pattern – one where the answer is preceded by an explanation.

Better still?

This can dramatically improve an LLM’s ability to solve complex problems.

For example

In the image below, you can see data from a report done by AI researchers at both Google and the University of Tokyo:

Each plot represents a different type of task, with the x-axis showing model size (smaller to larger) and the y-axis showing accuracy on that task.

The researchers found that chain-of-thought prompting (i.e. asking it to think through things step by step) improved performance substantially. Even more interesting was the fact that performance improved even further as the models got larger.

In fact they found that one 540-billion-parameter model achieved a new state-of-the-art score on a tough math benchmark (GSM8K), just by being prompted with a few step-by-step examples. (More on this more advanced method in a second)

This meant that it outperformed models that were specifically fine-tuned just for those tasks!

Crazy right?

When not to use Chain of Thought prompting

Even though this method can improve performance, it's worth mentioning that chain-of-thought prompting isn’t a magic pill for every query.

In fact there are 2 times when you shouldn't use this method.

#1. Don't use CoT when the question doesn't require reasoning

CoT only helps when the question actually requires reasoning.

So if your question is a simple fact or a one-step problem, asking the model to think step by step won’t improve correctness; it’ll just make the answer unnecessarily long.

For example

If you ask “What’s the capital of France?” and add “Let’s think this through step by step,” the model will likely give you a weirdly verbose response:

“France is a country in Europe. The capital of a country is its main city. For France, that city is Paris. So the capital of France is Paris.”

Sure, it got the right answer (Paris), but all that reasoning wasn’t really needed! In cases like straightforward lookups, the extra steps are overkill and can even introduce more room for error or irrelevant info.

So only use CoT prompting when your task has multiple steps, requires logic or arithmetic, or involves making a decision with justification. If not, keep it simple.

#2. Don't use CoT on older models as it might not work

If you remember the charts from earlier, you’ll notice that the red line improves as the models get larger. This is because chain-of-thought works best with advanced models.

(It’s considered an emergent ability that really shines once a model is large or smart enough to handle complex reasoning).

Any of the frontier models today (like GPT-4 or other top-tier LLMs) should handle CoT just fine because they’ve effectively “learned” this pattern from their training, or via fine-tuning in the case of instruction-following models.

But if you’re using an older or very small model, it might not know how to properly follow a “think step-by-step” instruction (and could just produce nonsense).

How to use Chain of Thought Prompting

Now that we know what chain-of-thought prompting is and when it’s useful, let’s talk about how to use it effectively.

There are basically two flavors of CoT: Zero-shot, and Few-shot.

So let's break down what they are and how they work.

How Zero-Shot CoT works

This is the most basic method and what we've been talking about so far. You ask your question and append a phrase like “Let’s think step by step” to the end of your prompt.

This relies on the model’s ability to follow that instruction on the fly. It’s the simplest way to invoke chain-of-thought reasoning and can get great results. However, there is a better method that we can use.

How Few-Shot CoT works

Just like before, you ask the model to think through the problem step by step. However, this time you also provide one or more examples before asking your actual question. (This is how that model beat the math benchmark we talked about earlier.)

Why bother giving examples?

Well, language models are great pattern mimics. So if you show a couple of examples of the format you want (question → reasoning → answer), the model is more likely to follow that format and less likely to go off track.

Better still?

By showing the model how to reason through similar problems, you prime it to produce a good chain-of-thought for your question. Because you're basically saying:

“Here’s how I want you to think about this kind of problem. Now do the same for this new question.”

To help visualize this, let's look at a basic problem, and how we would solve it with:

  • A basic prompt

  • A zero-shot CoT prompt, and

  • A few-shot CoT prompt

Basic Prompt (No CoT):

“A train leaves the station at 3 PM and travels at 60 miles per hour. It needs to go 180 miles. What time will it arrive?”

A capable model will figure out it takes 3 hours and answer:

“6 PM.”

But another model might misinterpret the question or just guess without explanation, so you’re not 100% sure it reasoned it out.

Zero-Shot CoT Prompt:

“A train leaves at 3 PM and travels 60 miles per hour. It needs to go 180 miles. Let’s think step by step: how long will the trip take? Then, what time will that be?”

Now it will likely say something like:

“180 miles at 60 mph means 180/60 = 3 hours. Starting at 3 PM, 3 hours later is 6 PM. So it will arrive at 6 PM.”

Here we explicitly told the model to break the problem into steps (“how long... then what time”), and it followed suit. We’ve increased our confidence in the answer because we saw the reasoning.

Few-Shot CoT Prompt:

Now we really hold the model’s hand by giving it a couple of examples before asking about our 3 PM train. For instance, we could prep the prompt like this (the first two Q&A pairs are the examples we feed in advance):

Example 1:
Q: A car travels 100 miles at 50 miles per hour. How long does it take?
A: Let’s think step by step. 100 ÷ 50 = 2, so it takes 2 hours.

Example 2:
Q: A train travels 120 miles at 60 miles per hour. How long does it take?
A: Step by step: 120 ÷ 60 = 2, so 2 hours.

Now, the real question:
Q: A train leaves at 3 PM and travels 60 miles per hour. It needs to go 180 miles. Let’s think step by step:

(We don’t write the answer – the model will fill it in.)

When the model sees those examples, it recognizes the pattern, where every question is being answered with a step-by-step solution. And so by the time it gets to our actual question, it’s primed to follow the same approach.

It will likely produce very clear reasoning, just like it did with the examples:

“180 ÷ 60 = 3 hours, starting at 3 PM means arrival at 6 PM”.
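If you want to send the same few-shot setup through code, one common approach is to pass the worked examples as earlier conversation turns, so the model mimics the question → reasoning → answer pattern for the real question. A minimal sketch, again assuming the OpenAI Python SDK with a placeholder model name:

```python
# Minimal few-shot CoT sketch: the worked examples are supplied as earlier
# user/assistant turns, so the model follows the same step-by-step pattern.
# The model name is a placeholder assumption.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "user", "content": "Q: A car travels 100 miles at 50 miles per hour. How long does it take?"},
    {"role": "assistant", "content": "Let's think step by step. 100 ÷ 50 = 2, so it takes 2 hours."},
    {"role": "user", "content": "Q: A train travels 120 miles at 60 miles per hour. How long does it take?"},
    {"role": "assistant", "content": "Step by step: 120 ÷ 60 = 2, so 2 hours."},
    # The real question -- no answer supplied; the model fills it in.
    {"role": "user", "content": (
        "Q: A train leaves at 3 PM and travels 60 miles per hour. "
        "It needs to go 180 miles. Let's think step by step:"
    )},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```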

Now obviously this is a basic problem, so it's hard to see the benefit. But the thing is, using few-shot CoT can lead to even more reliable answers.

In fact, research shows that adding just a couple of examples can increase LLM accuracy by up to 28% on certain tasks. That’s a huge jump for such a simple technique!

Again though, it's not a be-all and end-all super prompt, and you still want to follow similar usage guidelines to the ones I mentioned before:

  • If your question doesn’t require reasoning (e.g. “What’s the capital of X?” or “Who wrote Pride and Prejudice?”), don’t use CoT prompting. In these cases, CoT won’t improve accuracy and it’ll just add unnecessary steps. You’re better off with a straightforward prompt or a different technique for those

  • If your question involves multiple steps or logic and you just need the model to slow down and explain itself, try a zero-shot CoT prompt first (simply add a phrase like “let’s think this through” to your question). This is often enough for clear-cut cases

  • If the model seems confused, makes a mistake, or the task is particularly complex, then switch to a few-shot CoT approach. Provide one or two example Q&A pairs that illustrate the correct reasoning, and then ask your question. This extra guidance can make a big difference in the model’s performance

Even then though, you still might not get the perfect answer on the first try, and that's totally normal.

This is because prompting an AI model is an iterative process. Kind of like having a conversation with someone who doesn’t speak your language perfectly yet. They’re very capable, but you might need to clarify what you mean a few times.

So, what do you do if your carefully crafted CoT prompt still isn’t delivering the result you want?

Common issues when using CoT prompting and how to fix them

Here are a few tips and common scenarios and how to resolve them.

Issue #1. The model skips the reasoning steps or gives a very short answer

This might happen if the model doesn't realize you really wanted detailed reasoning, so try adding more scaffolding.

For example

Explicitly instruct it in the prompt:

“Explain each step of your reasoning before giving the final answer.”

You can even break your prompt into sub-questions:

“First, figure out X. Then determine Y. Finally, give the answer.”

This way, the model is guided through each part.
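To make that concrete, here's a rough sketch of what a scaffolded prompt can look like. The question and the sub-steps are invented purely for illustration:

```python
# Hypothetical example of scaffolding a prompt with explicit sub-questions.
# The question and sub-steps below are made up for illustration only.
question = (
    "A shop sells pens at $2 each and notebooks at $5 each. "
    "I buy 3 pens and 2 notebooks. How much do I spend?"
)

scaffolded_prompt = (
    f"{question}\n"
    "First, figure out the total cost of the pens.\n"
    "Then, figure out the total cost of the notebooks.\n"
    "Finally, add them together and give the final answer.\n"
    "Explain each step of your reasoning before giving the final answer."
)

print(scaffolded_prompt)  # send this string as the user message
```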

Issue #2. The reasoning is there, but it feels too generic or not useful

Sometimes the model’s step-by-step answer might be technically correct but not insightful (or just rephrasing the question). In this case, consider making your few-shot examples more specific to the kind of reasoning you want.

For example

If you’re asking for advice or a decision, maybe craft an example that shows nuanced thinking. The more your examples reflect the context or level of detail you need, the better the model’s response will match your expectations.

Issue #3. The model is still making mistakes or guessing

If you find the answer is wrong but with a seemingly plausible chain-of-thought, you might need to nudge it to be more careful.

For example

You can rephrase your CoT prompt to emphasize accuracy:

“Let’s think this through carefully”

or

“What are the key steps here? Work them out one by one.”

Phrasing your questions like this signals the model to be more meticulous.

It could just be that your question is a little ambiguous and you need to better clarify the problem statement itself so the model doesn’t misinterpret it.

For example

Imagine you asked,

“I need to pick the best laptop for travel. One has better battery life, but the other is lighter. What should I choose? Let’s think step by step.”

If the AI gives a wishy-washy answer, it doesn't mean CoT failed. It might mean the question was too open-ended. In this example, you're not actually clarifying what's important to you, and the only details you gave it are battery life and weight.

Are those the things you care about, or are other features more important?

Heck, it could simply be that you don't even know what kind of features you want, other than that you know you'll be using this to travel for work almost daily.

So in a case like this, try guiding the chain-of-thought more:

“Let’s think step by step: what factors make a laptop good for travel? Consider weight vs battery life, size, durability, and charging convenience. Given those, which laptop is the better choice?”

Now the model has a clearer roadmap of what considerations to weigh, and you’ll likely get a more concrete answer, because we steered the chain-of-thought to focus on specific criteria, making the reasoning more relevant.

TL;DR

Assuming the model is large enough, almost every issue with CoT (or prompting in general) comes down to refining your questions.

This is why you need to treat the interaction like a back-and-forth dialog. If the model’s first attempt isn’t quite right, iterate on your prompt. You might add details, break the task down further, or give an example of what you want.

This kind of refinement is a huge part of effective prompt engineering. Even with powerful reasoning prompts like CoT, you have to experiment with phrasing and structure to get the best results. But don’t be discouraged because playing around and adjusting is how you get to the gold.

Give Chain of Thought Prompting a try for yourself!

So as you can see, chain-of-thought prompting is one of the easiest ways to get an AI model to produce better, more reliable answers when your question involves reasoning. By changing how you ask, you change how the model thinks.

Better yet, it's so simple to use! You just need to give it a try.

Speaking of which, now that you’ve got the rundown, I encourage you to try it out yourself.

Go ahead and take a question you might normally ask an AI, and then rewrite it with a “let’s think step-by-step” kind of prompt (the wording doesn’t have to be exactly the same; you can tweak it for your own purposes).

Run it and see what the answer looks like.

  • Did the explanation help?

  • Is the answer more accurate or clearer than the direct answer you’d get without the reasoning?

And if you’re feeling adventurous, take it a step further and add a made-up example or two (few-shot style) before your real question, and see how that influences the result. Then keep tinkering and refining until you get the answer formatted just right.

By actively experimenting with CoT prompting, you’ll quickly build an intuition for when it helps and how to use it best until it becomes second nature to nudge the AI to show its work whenever you face a tricky problem.

P.S.

Remember, if you want to learn everything you need to know to work with LLMs and AI then check out my complete Prompt Engineering Bootcamp!

It’s completely beginner friendly, but detailed enough that you’ll go from zero experience to being at the forefront of the AI world, and able to get hired in one of the trending roles of the next 10 years!

You’ll learn how Large Language Models (LLMs) actually work, how to use them effectively, testing & evaluation methods, and so, so, so much more!

Better still?

When you join, you get access to our private Discord community:

Here you can chat to me, other students, and other working tech professionals and get help with any questions you might have 24/7.
