
How to improve dataset quality for LLM fine-tuning [+code guide]


The saying "garbage in, garbage out" is especially true in machine learning. Engineers and data scientists often focus on the more glamorous aspects of algorithm development and model tuning, yet the foundation of any successful model is the quality of the data it is trained on. In this article, we’ll explain why high-quality data matters before you invest any effort in fine-tuning. We’ll also walk through SuperAnnotate’s LLM editor to set up the interface, upload data, and prepare the environment for improving dataset quality.

The danger of bad data

Faulty data can manifest in many ways, including incorrect labels, missing values, biased information, or noisy inputs. Each of these issues can severely compromise the performance of an ML model. For instance, a model trained on data with incorrect labels will learn the wrong patterns and associations. This leads to poor performance, even if the model architecture is sophisticated and well-designed. Data accuracy is essential to avoid these pitfalls and achieve reliable results in LLMs.

Data-centric approach

A data-centric approach prioritizes data quality over model complexity. This strategy involves thoroughly auditing and cleaning the dataset before starting model training. By correcting inaccuracies, filling in missing entries, and ensuring the data represents a broad spectrum, we enhance both the reliability and performance of models.

The benefits of a data-centric approach are:

  • Increased model accuracy: Clean, well-curated data leads to better model performance and broader applicability.
  • Cost efficiency: Reducing time spent on debugging and retraining due to poor data quality lowers overall project costs.
  • Scalability: High-quality datasets create a solid foundation, making it easier to scale model applications with fewer adjustments for data errors.

Example with SuperAnnotate’s platform

In this blog post, we’ll take a dataset from Hugging Face and try to optimize its quality for further fine-tuning. For this example, we’ll use chain-of-thought prompting, a popular technique that improves the reasoning of language models for tasks that need step-by-step thinking.

Chain of thought (CoT) prompting

Chain of thought (CoT) prompting is a technique used to improve how language models solve problems. Instead of asking a model to produce an answer directly, CoT prompting guides the model in explaining the steps or thoughts it takes to reach a conclusion. This method helps the model process complex questions by breaking them down into simpler parts, much like a person would try to solve a complex problem. This approach makes the model's answers more accurate, detailed, and easier to understand, enhancing both performance and reliability.

In the example below, you can see the difference between four approaches: few-shot, zero-shot, few-shot with CoT, and zero-shot with CoT. The question is a simple mathematical problem that nevertheless requires sequential reasoning, so the methods without CoT answer incorrectly. To get the correct answer, you can either provide a worked example that solves a similar problem (few-shot CoT) or simply prompt the model to think step by step (zero-shot CoT).

[Image: zero-shot CoT prompting example]
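
To make the contrast concrete, here is a minimal sketch of the two zero-shot prompt styles. The question is a classic example from the chain-of-thought literature, not one taken from the dataset we use below.

question = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")

# Plain zero-shot prompt: the model is asked for the answer directly
zero_shot_prompt = f"Q: {question}\nA:"

# Zero-shot CoT prompt: the added cue nudges the model to write out its
# intermediate reasoning steps before giving the final answer
zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."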

Overview of the ARC-CoT Dataset

In our example, we’ll use the Augmented ARC-Challenge dataset with chain-of-thought reasoning (ARC-CoT) from the Hugging Face datasets library. This dataset is a modified version of the original ARC dataset, which consists of multiple-choice questions that require text-based reasoning. The ARC-CoT dataset includes additional annotations that outline a chain-of-thought (CoT) reasoning for each question to help enhance model performance on complex reasoning tasks.

Example: Loading dataset from Hugging Face to SuperAnnotate

First, we load the dataset with the Hugging Face datasets library and then explore it to understand its structure and contents.

from datasets import load_dataset

# Load the ARC-CoT dataset from the Hugging Face Hub
dataset = load_dataset("Locutusque/arc-cot")

# Show a couple of samples from the training split
print(dataset['train'][16]['question'])
print(dataset['train'][16]['answer'])

print(dataset['train'][5]['question'])
print(dataset['train'][5]['answer'])

Below, the dataset viewer provides some details about the ARC-CoT dataset:

[Image: Hugging Face dataset viewer for ARC-CoT]
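
You can check the same details programmatically. The snippet below assumes the dataset object loaded earlier.

# Inspect the splits, columns, and size of the loaded dataset
print(dataset)                    # available splits and row counts
print(dataset['train'].features)  # column names and types
print(dataset['train'].num_rows)  # size of the training split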

Setting up the interface

SuperAnnotate’s LLM editor has a customizable user interface designed to support any GenAI task, from supervised fine-tuning (SFT) to retrieval-augmented generation (RAG) and fine-grained RLHF. For this example, we’ll keep the UI straightforward and add only four components: question, answer, approve/disapprove, and correct answer. You can add more components as necessary for your use case.

[Image: building the interface in SuperAnnotate]

When created, each component is assigned a random ID. We will change these IDs to something more meaningful to make data import and export more efficient. Since the Hugging Face dataset has the columns 'question' and 'answer,' it is good practice to give our components matching IDs.

[Image: component properties in SuperAnnotate]

LLM code editor

After customizing the interface, we can add additional functionality using the LLM code editor. One option is to use one of the models from Hugging Face to generate answers. Alternatively, we could use our base model and a partially fine-tuned one.

[Image: the LLM code editor]
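
As a rough sketch of what such a generation step could look like, the snippet below produces a candidate answer with an off-the-shelf Hugging Face model; the model name and prompt format are placeholders rather than anything specific to the editor.

from transformers import pipeline

# Placeholder model; swap in the base or partially fine-tuned model you want to test
generator = pipeline("text-generation", model="gpt2")

def generate_answer(question: str) -> str:
    # Nudge the model to reason step by step before answering
    prompt = f"Question: {question}\nLet's think step by step.\nAnswer:"
    output = generator(prompt, max_new_tokens=128, do_sample=False)
    # Strip the prompt and keep only the generated continuation
    return output[0]["generated_text"][len(prompt):]

print(generate_answer(dataset['train'][16]['question']))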

Data upload

When you’ve set up a project similar to the one shown in the images above, you’ll be able to download a CSV template that you can populate with the data you want to import into SuperAnnotate. To keep things simple in this guide, we will upload the dataset as a CSV file, but you can also upload it as a JSON file through the SDK or use our Snowflake or Databricks integrations.

1. Download the CSV template

2. Add data to the downloaded CSV: Open the template with pandas, append the previously loaded Hugging Face dataset, and save the result to a new CSV file.

import pandas as pd

# Open the downloaded CSV template
data = pd.read_csv("template_chain_of_thought.csv")

# Loop over the ARC-CoT training split and append each question and answer,
# using the 'name' column as a unique item ID
chain_of_thought = dataset['train']
new_rows = []
for i, item in enumerate(chain_of_thought):
    new_rows.append({'name': f'chain_of_thought_{i}', 'question': item['question'], 'answer': item['answer']})

data = pd.concat([data, pd.DataFrame(new_rows)], ignore_index=True)

data.head()

# Remove the template's first (placeholder) row
data = data.iloc[1:]

data.to_csv("chain_of_thought.csv", index=False)

3. Upload the resulting CSV

[Image: uploading the CSV]

Efficiently manage large-scale dataset improvements with SuperAnnotate

Several leading foundational model companies use the SuperAnnotate LLM platform to build the highest quality SFT datasets. It allows you to:

  • Efficiently build a UI suited to your task (as demonstrated above).
  • Distribute tasks to and manage people in the project – be it internal or external experts.
  • Ensure high quality with collaborative tooling and comment/chat functionality.
  • Automate processes and evaluations, and bring LLMs into the loop to make them more efficient.

In this example, the dataset includes about 1000 question-and-answer pairs. By inviting a few of your colleagues to the tool as annotators and QAs, you can streamline the whole process. The platform is built to accommodate an active learning workflow to accelerate the time to model and make model development as agile as possible.

A typical workflow in SuperAnnotate’s LLM platform

Our customers generally use the following setup to accelerate their time to model.

1. Build smaller, high-quality datasets: Start by building a small, really high-quality dataset. In this example, that would involve curating and rewriting the examples into the format that we require.

2. Fine-tune a model on the dataset: Once the small dataset is ready, fine-tune your model on this small but representative subset of your data. This initial step helps the model adapt to the specific features of your dataset (a minimal code sketch follows this list).

3. Predict on a larger set: Gather a set of new prompts and use the model to create completions. This tests how well the model works on a broader scale and how well it generalizes.

4. Human in the loop (QC): Review the model’s predictions with your team to ensure they are accurate. During this stage, focus on:

  • Checking the accuracy of the model’s outputs.
  • Identifying any errors or misclassifications.
  • Correcting annotations where necessary.

[Image: SuperAnnotate LLM platform workflow]

This feedback is crucial as it not only ensures the quality of your dataset but also collects more examples to refine the model further.

5. Iterative refinement: Use the corrections and feedback from the QC phase to train or fine-tune the model further. This process may be repeated multiple times, with each cycle aiming to enhance the model's performance as it learns from an improving dataset.
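
As a rough sketch of step 2, the snippet below fine-tunes a small causal language model on question/answer pairs in the same CSV format we prepared earlier; the base model, prompt format, and hyperparameters are placeholders you would adapt to your own setup.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load curated question/answer pairs (same columns as the CSV prepared earlier)
train_data = load_dataset("csv", data_files="chain_of_thought.csv")["train"]

def tokenize(batch):
    # Format each row as a single training text
    texts = [f"Question: {q}\nAnswer: {a}" for q, a in zip(batch["question"], batch["answer"])]
    return tokenizer(texts, truncation=True, max_length=512)

train_data = train_data.map(tokenize, batched=True, remove_columns=train_data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="arc-cot-sft", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()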

This approach helps streamline the annotation process, reducing the workload and improving the quality of your project with each iteration.

Closing remarks

To wrap up, the quality of your data fundamentally determines the success of your LLMs. This guide has walked you through practical steps to enhance data quality, with a strong focus on data integrity before starting model training. We used SuperAnnotate’s platform to set up the interface, upload data, and prepare for data quality improvements. Good data lays the foundation for effective and reliable models. By maintaining rigorous standards for your data, you ensure that your models operate optimally and deliver valuable insights.
