The term ‘prompt engineering’ has been part of the AI conversation since ChatGPT first appeared, and it has even grown into a profession of its own. At its core, it’s about crafting inputs to language models so they return the right answer in the desired style and form. Among the many prompting techniques, one especially effective approach is chain-of-thought (CoT) prompting. This method walks the model through a clear reasoning process before presenting it with a new question, and it has a proven record of improving language model performance over standard prompting.
In this post, we’ll dig deeper into chain-of-thought prompting. We’ll look at how it improves the way LLMs work through problems, explore its limitations, and show how to enhance CoT datasets with SuperAnnotate’s GenAI editor.
What is chain-of-thought prompting?
Chain-of-thought prompting is a technique that guides an AI model through a step-by-step reasoning process to arrive at an answer. Rather than posing a single question and expecting an immediate solution, we encourage the model to consider each intermediate step along the way. This approach often leads to more reliable and transparent reasoning, and it's proven especially helpful for logic-driven tasks like solving math problems.
Chain-of-thought toy problem
Take the following math word problem:
"Sara buys 3 shirts at $20 each, 2 pairs of jeans at $50 each, and gets a 10% discount on the total. How much does she spend?"
The chain-of-thought reasoning for this problem will look something like this:
- Calculate shirt cost:
  - 3 shirts × $20 = $60
- Calculate jeans cost:
  - 2 jeans × $50 = $100
- Total before discount:
  - $60 + $100 = $160
- Apply discount:
  - 10% of $160 = $16
  - $160 - $16 = $144
Final answer: Sara spends $144.
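For a quick sanity check, here is the same arithmetic written as plain Python, with each line mirroring one step of the chain above:

```python
# Mirrors the chain-of-thought steps for the toy problem above.
shirt_cost = 3 * 20                 # Calculate shirt cost: 3 shirts × $20 = $60
jeans_cost = 2 * 50                 # Calculate jeans cost: 2 jeans × $50 = $100
subtotal = shirt_cost + jeans_cost  # Total before discount: $60 + $100 = $160
discount = 0.10 * subtotal          # Apply discount: 10% of $160 = $16
total = subtotal - discount         # $160 - $16 = $144

print(f"Sara spends ${total:.0f}")  # Sara spends $144
```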
Without a chain of thought, an AI might give the answer – right or wrong. Even if it gives the right answer, you won't see how it solved the problem. By using chain-of-thought prompting, not only are we more likely to receive correct answers, but we also gain insight into the model's thought process, making it more transparent and easier to verify.
In short, CoT:
- Encourages intermediate reasoning rather than jumping to a final answer.
- Breaks down problems into smaller, manageable steps.
- Pushes the model toward more human-like, step-by-step reasoning.
Chain-of-thought prompting benefits
Chain-of-thought prompting became a widely used technique because it improves model performance without much hassle. Its four main benefits:
- Breaking down complex problems: CoT decomposes multi-step problems into intermediate steps, which increases the likelihood of getting the right answer and also enables models to allocate more compute to problems that require more reasoning steps.
- Transparency in reasoning: CoT provides transparency – it allows the user to see how the model arrived at a particular answer. This makes it easier to identify where the reasoning went wrong and debug the process (although fully characterizing a model’s computations that support an answer remains an open question).
- Versatility in applications: CoT improves model performance on a variety of tasks, including math word problems, commonsense reasoning, and symbolic manipulation. Essentially, it can be used for any problem-solving task that can be tackled through language.
- Easy to implement: You can trigger chain-of-thought reasoning in many of today's large language models just by showing them examples of how to think through problems step by step in the setup or prompt (see the sketch after this list).
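As a rough illustration of that last point, here is what ‘showing examples in the setup’ can look like with a chat-style message list. The format mirrors common chat APIs, the worked example is the toy problem from earlier, and the follow-up question is made up for the demo:

```python
# A chat-style setup that seeds the model with one worked CoT example.
# Pass `messages` to whichever chat completion API you use.
messages = [
    {
        "role": "system",
        "content": (
            "You solve math word problems by reasoning step by step.\n"
            "Example:\n"
            "Q: Sara buys 3 shirts at $20 each, 2 pairs of jeans at $50 each, "
            "and gets a 10% discount on the total. How much does she spend?\n"
            "A: Shirts: 3 × $20 = $60. Jeans: 2 × $50 = $100. "
            "Total before discount: $160. Discount: $16. The answer is $144."
        ),
    },
    {
        "role": "user",
        "content": "A book costs $12 and a pen costs $3. "
                   "How much do 2 books and 4 pens cost?",
    },
]
```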
Chain-of-thought limitations
The main downside of CoT prompting is that it only improves results on complex tasks and works best with larger models. It also comes with a few other limitations:
- Dependence on model size: CoT reasoning works best with very large language models (those with around 100 billion parameters or more). Smaller models often struggle to produce clear and logical reasoning, which can lead to mistakes.
- Misleading reasoning: Sometimes the reasoning a model provides doesn't match how it actually arrived at its answer. This can make it hard to trust the model's conclusions, as the explanation might sound good but be incorrect.
- Slower responses: Using CoT means the model has to work through several steps before giving an answer. This takes more time and resources, which isn't ideal for situations where quick answers are needed.
- Overthinking simple questions: For simple questions, using CoT can make things unnecessarily complicated. In these cases, a straightforward answer might be better and faster.
- Need for good prompts: The success of CoT depends heavily on how well the prompts (the questions or instructions given to the model) are designed. If the prompts are not clear or effective, the reasoning can go off track.
CoT prompting techniques
There are four main CoT prompting techniques that help the model reason in sequential steps: zero-shot CoT, manual CoT, automatic CoT, and multimodal CoT.
Zero-shot chain-of-thought
Zero-shot CoT is a very simple technique that helps LLMs think step by step. And there’s no special trick here – you literally add ‘Let’s think step by step’ or a similar request to your query. This cue tells the model to break the problem into smaller parts before finding the answer. For simpler tasks where the model might miss the right answer without guidance but where detailed CoT prompts feel excessive, you can try zero-shot CoT.
However, zero-shot CoT's big downside is that the model's self-generated step-by-step reasoning can have flaws, and the final answer may not be reliable.
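Here is a minimal zero-shot CoT sketch in Python, assuming an OpenAI-style chat client; the client setup and model name are placeholders rather than a recommendation:

```python
from openai import OpenAI  # assumes the openai SDK is installed and an API key is configured

client = OpenAI()

question = (
    "Sara buys 3 shirts at $20 each, 2 pairs of jeans at $50 each, "
    "and gets a 10% discount on the total. How much does she spend?"
)

# Zero-shot CoT: append the step-by-step cue to the plain question.
prompt = question + "\n\nLet's think step by step."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```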
Manual chain-of-thought
Manual CoT involves creating detailed examples by hand, showing the model how to reason through each step before giving an answer. By exposing it to these carefully crafted examples, the model learns to handle new questions more effectively. The big advantage here is that it often results in more accurate and reliable reasoning.
However, the downside is that manually producing these examples takes a lot of time, effort, and skill, making it hard to scale this approach.
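As a sketch of what those hand-crafted demonstrations can look like in practice, here is a small prompt builder. The exemplar reuses the toy problem from earlier, and the function name and format are illustrative only:

```python
# Hand-written (question, reasoning, answer) demonstrations for manual CoT.
DEMONSTRATIONS = [
    {
        "question": (
            "Sara buys 3 shirts at $20 each, 2 pairs of jeans at $50 each, "
            "and gets a 10% discount on the total. How much does she spend?"
        ),
        "reasoning": (
            "Shirts: 3 × $20 = $60. Jeans: 2 × $50 = $100. "
            "Total before discount: $60 + $100 = $160. "
            "Discount: 10% of $160 = $16. $160 - $16 = $144."
        ),
        "answer": "$144",
    },
]

def build_manual_cot_prompt(new_question: str) -> str:
    """Prepend the worked examples so the model imitates the step-by-step format."""
    parts = []
    for demo in DEMONSTRATIONS:
        parts.append(
            f"Q: {demo['question']}\n"
            f"A: Let's think step by step. {demo['reasoning']} "
            f"The answer is {demo['answer']}.\n"
        )
    parts.append(f"Q: {new_question}\nA: Let's think step by step.")
    return "\n".join(parts)

print(build_manual_cot_prompt(
    "Tom buys 4 notebooks at $3 each and 2 pens at $1.50 each. How much does he spend?"
))
```

The resulting prompt can be sent to any completion or chat endpoint; writing more, and more diverse, demonstrations is exactly the manual effort the next technique tries to automate.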
Automatic chain-of-thought (Auto CoT)
Zero-shot CoT is easy to set up but not always dependable, while manual CoT is more reliable but demands a lot of time and effort. The Auto-CoT strategy was introduced to resolve this trade-off.
Auto-CoT's main idea is to automatically generate the example demonstrations the model relies on, removing the need for manual labor. In theory, this combines the simplicity of zero-shot CoT with the robustness of manual CoT, making it easier to scale up.
Auto-CoT has two main steps. First, there's question clustering, where you group similar questions together. Next comes demonstration sampling, where you pick a representative question from each group and create its reasoning chain with zero-shot CoT, guided by simple heuristics like how long the questions are and how many steps their reasoning requires.
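Here is a rough sketch of those two steps, using TF-IDF plus k-means as a simple stand-in for the original method's question embeddings (an assumption on our part), with `zero_shot_cot` standing in for a call like the one in the zero-shot example above:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def zero_shot_cot(question: str) -> str:
    """Placeholder: send the question plus a 'Let's think step by step' cue to an LLM."""
    return "Step 1: ... Step 2: ... The answer is ..."

def auto_cot_demonstrations(questions: list[str], n_clusters: int = 2) -> list[dict]:
    # Step 1: question clustering -- group similar questions together.
    vectors = TfidfVectorizer().fit_transform(questions)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)

    demos = []
    for cluster in range(n_clusters):
        members = [q for q, label in zip(questions, labels) if label == cluster]
        if not members:
            continue
        # Step 2: demonstration sampling -- pick a representative question per
        # cluster (shortest first, as a simple length heuristic) and generate
        # its reasoning chain with zero-shot CoT.
        representative = min(members, key=len)
        rationale = zero_shot_cot(representative)
        # Keep the chain only if it stays within a reasonable number of steps.
        if rationale.count("Step") <= 5:
            demos.append({"question": representative, "rationale": rationale})
    return demos

questions = [
    "How much do 3 shirts at $20 each cost?",
    "What is 15% of 200?",
    "If a train travels 60 km in 1.5 hours, what is its speed?",
    "How many minutes are there in 2.5 hours?",
]
print(auto_cot_demonstrations(questions))
```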
Multimodal chain-of-thought
Until 2024, chain-of-thought reasoning was mostly applied to language-only models. This changed when researchers at Meta and AWS introduced multimodal CoT, which brings together both language and visual data.
Multimodal CoT works in two stages: rationale generation and answer inference. Both stages use the same model structure, but their inputs and outputs differ. In the first stage, the model processes language and image inputs to create a rationale. In the second stage, the original language input is combined with this rationale, along with the original visual input, so the model can infer the final answer.
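As a skeleton, the two stages might be wired together like this; `generate_rationale` and `infer_answer` are placeholders for calls into whatever vision-language model you use, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str        # the language input (question plus any context)
    image_path: str  # the visual input

def generate_rationale(example: Example) -> str:
    """Stage 1: language + image in, rationale out (placeholder model call)."""
    # e.g. rationale = vision_language_model(example.text, example.image_path)
    return "The image shows ... therefore ..."

def infer_answer(example: Example, rationale: str) -> str:
    """Stage 2: original text + rationale + the same image in, final answer out."""
    fused_text = example.text + "\nRationale: " + rationale
    # e.g. answer = vision_language_model(fused_text, example.image_path)
    return "Final answer: ..."

def multimodal_cot(example: Example) -> str:
    rationale = generate_rationale(example)  # stage 1
    return infer_answer(example, rationale)  # stage 2

print(multimodal_cot(Example(text="Which object is heavier?", image_path="scene.png")))
```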
CoT results: It works better for bigger models
The Brain team at Google ran a series of experiments with chain-of-thought prompting, testing it on five math benchmarks: GSM8K, SVAMP, ASDiv, AQuA, and MAWPS. They also evaluated multiple language models, including different sizes of GPT-3 (InstructGPT), LaMDA, PaLM, UL2, and Codex.
Their paper shows example CoT triples (input, chain of thought, output) for arithmetic, commonsense, and symbolic reasoning benchmarks.
Their findings highlight three key points:
- Size matters: Chain-of-thought prompting delivered the best results with very large models (around 100 billion parameters). Smaller models, while fluent, often produced faulty reasoning and did worse than standard prompting.
- Complexity benefits: CoT excelled at handling tough problems. For example, on GSM8K, the biggest GPT and PaLM models more than doubled their performance. But for simpler tasks that require only one step to solve, improvements were small or even negative.
- State-of-the-art results for the largest models: GPT-3 at 175B parameters and PaLM at 540B matched or exceeded previous best scores on challenging datasets like GSM8K, SVAMP, and MAWPS. On AQuA and ASDiv, they fell just short, by about 2%.
The team's analysis of reasoning paths from LaMDA 137B on GSM8K showed that most were logically sound, with only a few lucky guesses. Incorrect answers often came close but had minor mistakes or, in some cases, bigger issues like misunderstanding the problem. Comparing PaLM models of different sizes suggests that increasing model size helps reduce these errors, leading to fewer gaps in reasoning and better overall understanding.
How to create and enhance CoT datasets with SuperAnnotate
If you’re fine-tuning LLMs with CoT datasets, it’s crucial to ensure data quality. In fact, data quality can make or break the model. When it comes to CoT datasets, that means making sure your chain-of-thought prompts are well-structured and aligned with both the task and the model size. If the dataset is too generic, includes confusing prompts, doesn’t match the model features, or suffers from other weaknesses, these issues will hold your model back.
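As a small, hypothetical illustration of what 'well-structured' can mean, here is a lightweight check you might run over CoT data points before fine-tuning; the field names are made up for the example, not a SuperAnnotate schema:

```python
def validate_cot_item(item: dict) -> list[str]:
    """Return a list of quality issues for one CoT data point (illustrative checks only)."""
    issues = []
    if not item.get("question", "").strip():
        issues.append("missing question")
    steps = item.get("reasoning_steps", [])
    if len(steps) < 2:
        issues.append("fewer than two reasoning steps")
    if any(not step.strip() for step in steps):
        issues.append("empty reasoning step")
    if not item.get("final_answer", "").strip():
        issues.append("missing final answer")
    return issues

item = {
    "question": "Sara buys 3 shirts at $20 each ... How much does she spend?",
    "reasoning_steps": ["3 × $20 = $60", "2 × $50 = $100", "$160 - $16 = $144"],
    "final_answer": "$144",
}
print(validate_cot_item(item) or "looks well-structured")
```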
How can we help?
At SuperAnnotate, we provide infrastructure for enterprises to build and manage large-scale AI training datasets, including those centered around detailed chains of thought. If you’re creating a new CoT dataset from scratch—using your own annotators or tapping into our expert workforce—our platform is designed to support you at every stage. Or, if you’re looking to improve an existing dataset, you can invite annotators on the platform to identify and address pitfalls like missing reasoning steps, unclear instructions, and other quality challenges.
Why we stand out
Other annotation tools often limit you to a single prompt-completion format, making it hard to handle the complexity of chain-of-thought workflows. CoT tasks require the flexibility to visualize and annotate multiple reasoning steps at a time or even explore different possible solution paths in the case of tree-of-thought (ToT) prompts.
SuperAnnotate’s fully customizable annotation UI is the only solution that lets you build out annotation editors perfectly suited for these tasks. By tailoring the workspace to the problem’s complexity, you can give annotators the flexibility they need to structure, review, and refine the reasoning process—something rigid, one-size-fits-all tools simply can’t match.
Our tool allows you to:
- Customize your UI: Quickly create interfaces tailored to the complexity of chain-of-thought prompts so annotators can see and refine each reasoning step.
- Manage teams effectively: Assign tasks to your own staff or our experts, and keep everyone aligned with built-in collaboration tools like commenting and chat.
- Maintain high quality: Employ human reviews, automated checks, and LLM-driven validations to ensure clarity and consistency in your data.
- Streamline your process: Automate tedious parts of the workflow and speed up evaluations, making the entire process more efficient.
Example
Say you’re building a fitness assistant for your app, and it needs to give accurate and detailed advice to your users. Training this agent on CoT datasets with multi-step reasoning data points will help the model deliver personalized, thorough, and actionable advice, which is crucial for such applications.
Suppose you already have a CoT dataset, but training the assistant on this data didn't deliver the needed results. The logical next step is to improve the dataset and its CoT reasoning steps.
With SuperAnnotate’s platform, you can refine these CoT steps during the annotation process. Annotators can review each step, catch flaws, and suggest improvements, ensuring your dataset is clear, realistic, and ready for training.
Here's a basic example of enhancing a single Chain-of-Thought data point with SuperAnnotate’s GenAI editor. You can always adjust the detail and complexity to fit your specific needs – this is just for demonstration purposes.
The best part about the platform is its customizability: you can adapt the UI to whatever your use case is. For this CoT use case, you can add more granularity to review, rate, and annotate multiple reasoning chains at a time. That makes the UI especially helpful for complex tasks that require a lot of detail, so you get the most precise results possible.
To see how our GenAI editor can maximize your models' potential with CoT datasets, request a demo.
Final thoughts
Chain-of-thought prompting changes how language models solve complex problems. Instead of guessing or jumping straight to an answer, they break tasks down into logical, understandable steps. This approach proves especially useful in areas like math reasoning, symbolic manipulation, and commonsense questions. While it's most effective with larger models and can introduce some delays, the boost in accuracy and transparency often outweighs the downsides. New variations like zero-shot CoT and auto-CoT keep making it easier and faster to build strong reasoning paths, and multimodal CoT extends the approach beyond text.
With the right tools and well-structured datasets, you can turn chain-of-thought prompting into a practical advantage. As you refine prompts, carefully select examples, and improve data quality, CoT can help you build AI solutions that are accurate, detailed, and tailored to your domain.