
AI content detection: Current methods and SuperAnnotate’s leading open-source solution


In this new era of generative AI and large language models (LLMs), distinguishing between human-written and AI-generated text has become increasingly challenging. As LLMs grow more advanced, the lines blur, making it difficult to spot the difference. Yet this task is crucial—being able to identify AI-generated text helps ensure we can trust the information we consume. Because so many fields rely on authentic data, accurately detecting machine-generated text matters more and more, creating an urgent need for reliable AI content detection tools.

How urgent the need for AI text detection is depends on the field. In AI model training, for example, unwanted synthetic content can degrade dataset quality, and filtering it out can save your model. Education, cybersecurity, finance, data science, and dozens of other fields face the same problem, where undetected AI content, or even a false detection, can be very damaging.

Detecting machine-generated text isn't straightforward; it requires complex machine-learning techniques that analyze patterns and nuances in language. There have been several cases where human-written text was falsely flagged as AI, and vice versa, and it's hard to tell which is more dangerous: the false positives or the false negatives. One thing is clear: a growing number of fields need accurate AI content detection tools. This has led us at SuperAnnotate to create an open-source AI detector model that has achieved remarkable results.

In this blog post, we'll explain how AI content detection works and why detecting AI text is challenging, review current approaches and their limitations, and demonstrate SuperAnnotate's top-ranking open-source AI content detector.


Why AI content detection isn’t easy

Large language models can now generate text for an impressive range of tasks, including question answering, conversational systems, image captioning, code generation, and many more. As each new model release gets better at these tasks, it also becomes harder to tell whether a given piece of content was written by an AI or a human.

AI detection is hard in almost every field. Identifying AI-written code, for example, requires solid knowledge of programming syntax. Or take image captioning, which relies on a nuanced human understanding of the image: figuring out whether a caption is AI-generated means recognizing intricate interactions between image and text.

Moreover, as AI content detection solutions are developed, some engineers exploit their weaknesses through techniques known as adversarial attacks. These can include using alternative spellings, homoglyphs, misspellings, or clever paraphrasing to trick detection systems. Such manipulation techniques make the already challenging task of AI content detection even harder and more multi-faceted.

How AI detection works: an overview of approaches

Detecting AI-generated text can be approached in several ways. The simplest is heuristics, such as scanning for specific keywords, though these can be hit or miss. Another option is to generate text with a large language model (LLM) and compare it to the candidate text using a measure like the Levenshtein distance. This method has its downsides, though: it depends on specific prompts and only works with certain model settings.
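
To make the "regenerate and compare" idea concrete, here is a minimal sketch. The `generate_with_llm` call is a hypothetical placeholder for whatever model and prompt you use; the 0.8 similarity cutoff is likewise an assumption, not a recommended value.

```python
# A minimal sketch of the "regenerate and compare" idea: produce text with an LLM for the
# same prompt and measure how close it is to the candidate text via edit distance.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

def similarity(candidate: str, regenerated: str) -> float:
    """Normalized similarity in [0, 1]; higher means the texts are closer."""
    dist = levenshtein(candidate, regenerated)
    return 1.0 - dist / max(len(candidate), len(regenerated), 1)

# regenerated = generate_with_llm(prompt)              # hypothetical LLM call
# if similarity(candidate_text, regenerated) > 0.8:    # threshold chosen for illustration
#     print("Candidate is suspiciously close to the LLM's output")
```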

[Image: How AI detection works]

Watermark technology

A promising technique for identifying AI-generated content is watermark technology. It involves embedding unique marks in the text output of LLMs, which allows for accurate detection without much computational demand. These watermarking algorithms incorporate certain patterns into the text output without significantly reducing its quality. Once the watermarking process is understood, it's possible to determine with high accuracy (over 99%) whether the text has been watermarked, similar to how copyright protections work for images. The main drawback is that, for now, only the developers of LLMs can use these algorithms.
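The toy sketch below illustrates the statistical side of this idea, loosely inspired by published "green-list" watermarking schemes. It is not any vendor's actual algorithm: real implementations operate on model token IDs and a secret key shared between generator and detector, whereas here the word-level hashing, `SECRET_KEY`, green-list fraction, and z-score cutoff are all illustrative assumptions.

```python
import hashlib
import math

# Toy "green-list" watermark detection: a watermarked generator would bias sampling toward
# pseudo-randomly chosen "green" tokens; the detector counts how many green tokens appear
# and tests whether that count is statistically surprising for unwatermarked text.

SECRET_KEY = "shared-secret"   # assumption: detector knows the generator's key
GREEN_FRACTION = 0.5           # assumption: half the vocabulary is "green" at each step

def is_green(prev_token: str, token: str) -> bool:
    """Pseudo-randomly assign a token to the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{SECRET_KEY}|{prev_token}|{token}".encode()).digest()
    return digest[0] / 255 < GREEN_FRACTION

def watermark_z_score(text: str) -> float:
    """z-score of the observed green-token count vs. what unwatermarked text would show."""
    tokens = text.split()
    if len(tokens) < 2:
        return 0.0
    n = len(tokens) - 1
    greens = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (greens - expected) / std

# A large positive z-score (say, above 4) would be strong evidence the text carries the watermark.
```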

Zero-shot solutions

Zero-shot methods use statistical analyses based on outputs from LLMs. These approaches often involve testing whether a text was produced by a particular model ('Was text N generated by model M?') and then, after some computation, confirming or rejecting that hypothesis. This is an interesting scientific approach, but it is rarely applicable in practice, since we don't have access to every model's logits, and even if we did, we wouldn't know which model the user had used. Despite this, zero-shot methods can work well across different LLMs and provide high-quality detection, although they require substantial resources and prior knowledge.
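
A minimal sketch of this family of methods is a perplexity test: score the candidate text under a reference LLM and check whether it looks suspiciously "unsurprising." The GPT-2 checkpoint and the threshold of 40.0 below are placeholder assumptions; practical zero-shot detectors use stronger models and carefully calibrated statistics.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Zero-shot perplexity check with Hugging Face Transformers. Machine-generated text tends to
# be less "surprising" to an LLM than human writing, i.e. it has lower perplexity.

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Passing input_ids as labels gives the language-modeling loss (the model shifts internally).
    loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def looks_machine_generated(text: str, threshold: float = 40.0) -> bool:
    # Threshold chosen for illustration only; it must be calibrated per model and domain.
    return perplexity(text) < threshold
```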

LLMs as detectors

Using LLMs as detectors is another method. This involves asking the model directly whether a text was written by a human or an AI. Although straightforward, this method only achieves about 50% accuracy—it's really just a coin toss—and can be difficult to interpret while also being resource-intensive.
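
For completeness, here is what that looks like in practice, sketched with the OpenAI Python SDK. The model name and prompt wording are assumptions, and, as noted above, the answer is roughly a coin toss, so treat it as a weak signal at best.

```python
from openai import OpenAI

# "Just ask an LLM" detection: prompt a chat model to label the text as HUMAN or AI.
client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask_llm_if_ai_written(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content": "Answer with exactly one word: HUMAN or AI."},
            {"role": "user", "content": f"Was the following text written by a human or an AI?\n\n{text}"},
        ],
    )
    return response.choices[0].message.content.strip()
```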

Fine-tuned LLMs

Lastly, fine-tuned LLMs use a more traditional approach for binary text classification. These systems often use BERT-like models with a classification head trained on specific datasets. They are efficient, even running on standard CPUs, and effective within their specific fields. However, their ability to adjust to new LLM versions and diverse training datasets can be limited, typically requiring new data collection and periodic updates or fine-tuning.
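
Running such a detector is a one-liner with the Hugging Face text-classification pipeline. The checkpoint name below is a placeholder; substitute the detector you actually use, for example the SuperAnnotate model linked at the end of this post.

```python
from transformers import pipeline

# Inference with a fine-tuned BERT-style detector. CPU is typically enough for models this size.
detector = pipeline(
    "text-classification",
    model="your-org/your-ai-detector",  # placeholder checkpoint name
    device=-1,                          # run on CPU
)

result = detector("The quick brown fox jumps over the lazy dog.")[0]
print(result["label"], round(result["score"], 3))  # e.g. label and confidence score
```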

The RAID benchmark: a crucial testing ground

For a long time, the AI detection community faced a significant issue: different detection solutions were tested on varied, often internal datasets with overlapping samples. This led to inflated claims of 95-99% prediction accuracy and made it difficult to fairly assess and compare detectors. Everyone claimed to be the best.

To address the need for a centralized, comprehensive benchmark, RAID was launched in June 2024. This shared benchmark provides a robust way to evaluate machine-generated text detectors. It covers a wide range of aspects, including various domains, models, attacks, and generation parameters, offering a high-quality assessment of detection solutions.

[Image: The RAID benchmark]

RAID also features an online leaderboard that compares different approaches, fostering friendly competition among contributors. As of October 2024, the SuperAnnotate AI detector holds the #1 position among open-source AI detectors.

[Image: Snapshot of the full RAID leaderboard, October 2024]

SuperAnnotate’s AI content detector: #1 on open-source RAID benchmark

Our journey to an AI content detection solution had multiple stages. Initially, we assessed the state of community projects, open-source solutions, and paid APIs. Many existing solutions fell short in quality, often focusing on minimizing the false positive rate (FPR), which was not a priority for us.

These analyses led us to build an AI detection system ourselves. We chose fine-tuned language models as our detection approach, aiming to handle high loads. The effectiveness of this method relies heavily on the quality of training data, which is a strong suit for SuperAnnotate, as we are best-in-class in data management.

First experiments

In our first experiments, we used open data from HuggingFace, drawing inspiration from a study by colleagues from Hello-SimpleAI. We focused on enhancing data preprocessing and optimizing training. Through experimentation with various architectures and datasets, we identified two key training factors:

1. Managing overfitting is crucial: We aimed to create a universal detector that generalizes across various models and domains, an aspect many existing solutions overlook. We employed advanced regularization techniques, including gradient clipping, weight decay, and label smoothing, along with traditional stochastic gradient descent (SGD) for optimization (see the training sketch after this list). Additionally, we developed a preprocessing pipeline to extract n-grams that correlate strongly with target labels and remove these patterns from the training data.

2. Fine-tuning is best for narrow domains: For instance, if the goal is to detect student cheating in essays written using ChatGPT, only 100-200 samples may be sufficient for effective training.
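
The sketch below shows how the regularization ingredients from point 1 fit together in a plain PyTorch training step. The `model` and `train_dataset` objects are placeholders for a BERT-style classifier and a tokenized human-vs-AI dataset, and the hyperparameter values are illustrative, not the ones we used.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

# Minimal training loop combining SGD, weight decay, gradient clipping, and label smoothing.
def train_one_epoch(model: nn.Module, train_dataset, lr: float = 1e-3) -> None:
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)            # label smoothing
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=0.01)    # weight decay
    model.train()
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        # Assumes a Hugging Face-style classifier that returns an object with .logits.
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        loss = criterion(logits, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
```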

After launching our initial model, we noticed a decline in performance with the release of OpenAI's GPT-4o. This indicated that our model struggled to generalize beyond its training data, which lacked diversity in both domains and models. We also recognized a key weakness in our solution: the quality of the training dataset, which contained mislabeled samples.

Main development: Surpassing existing open solutions

To address this, we manually collected data across four domains (Wikipedia, Reddit, arXiv, and conversational text), ensuring the texts were human-written and relevant to each domain. Using ChatGPT, we created question-answer pairs from these texts, where the model generated the questions, and answers were extracted from the text. We then generated responses using various LLM families (including OpenAI, LLaMA, Mistral, and Anthropic), focusing on the latest models. We also incorporated part of the open training data from RAID to introduce specialized attacks into our dataset.

We developed a robust dataset by blending these data sources in a 50/50 ratio. Through hyperparameter optimization in several experiments, we created a highly effective model. To perform well on the RAID leaderboard, we adjusted the emphasis on minimizing FPR, which is critical for the benchmark. As a result, we developed two versions of the SuperAnnotate AI Detector, surpassing all other open solutions, including those using large language models, while maintaining high efficiency.
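
As a rough sketch of that blending step, the snippet below takes an equal share from two labeled sources and shuffles them together. `own_samples` and `raid_samples` are placeholders for our in-house data and the RAID training subset, each assumed to be a list of (text, label) pairs; the exact composition we used is not reproduced here.

```python
import random

# Blend two labeled sources in a 50/50 ratio and shuffle, so neither source dominates training.
def blend_fifty_fifty(own_samples, raid_samples, seed: int = 42):
    rng = random.Random(seed)
    n = min(len(own_samples), len(raid_samples))      # equal share from each source
    mixed = rng.sample(own_samples, n) + rng.sample(raid_samples, n)
    rng.shuffle(mixed)
    return mixed
```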

In summary, we found two factors especially important in developing the solution: quality of training data and effective regularization.

Results: How we compare to the best

Let’s examine the RAID leaderboard results in more detail. The top positions are held by solutions that are not public and have unknown model weights and architectures, so we won’t consider them further.

[Image: RAID leaderboard results for AI content detection]
* Please note that the presented rankings are as of October 2024. They may change if you are viewing them at a later date.

SuperAnnotate's AI detector ranks well, with an average quality of 65%. This result reflects lower performance on legacy and non-chat models (like gpt2, cohere, mpt), as we intentionally excluded them from our training to focus on detecting advanced models. Breaking this result down, we achieve around 40% quality on legacy models and over 85% on cutting-edge models. Notably, we also have the best performance in detecting GPT-4.

Another noteworthy solution is Binoculars, which shows solid results and nearly matches ours. If we disregard the impact of attacks, Binoculars slightly surpasses us thanks to better performance on weaker models. This solution uses a zero-shot approach based on perplexity and cross-perplexity values. While it doesn't require model training—only a threshold value needs to be set—it has major computational performance issues, as it needs to run inference on two LLMs for each prediction. Additionally, Binoculars can handle multilingual input, which is a big advantage.
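
To illustrate the general perplexity/cross-perplexity idea, and why two forward passes are needed per prediction, here is a rough sketch. It is not Binoculars' exact formulation: the model pairing (two small GPT-2 checkpoints sharing a tokenizer), the direction of the cross-entropy, and the interpretation of "lower score means machine-generated" are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Two related causal LMs: an "observer" that scores the text and a second model whose
# next-token distribution is used to compute a cross-perplexity-style normalizer.
tok = AutoTokenizer.from_pretrained("gpt2")
m1 = AutoModelForCausalLM.from_pretrained("gpt2").eval()          # observer
m2 = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()   # second model

@torch.no_grad()
def perplexity_ratio_score(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    logits1 = m1(ids).logits[:, :-1]   # observer next-token logits
    logits2 = m2(ids).logits[:, :-1]   # second model's next-token logits
    targets = ids[:, 1:]

    # Log-perplexity of the text under the observer model.
    log_ppl = F.cross_entropy(logits1.transpose(1, 2), targets).item()

    # Cross term: observer's log-probs averaged under the second model's distribution.
    x_ce = -(F.softmax(logits2, dim=-1) * F.log_softmax(logits1, dim=-1)).sum(-1).mean().item()

    return log_ppl / x_ce  # lower ratios tend to indicate machine-generated text
```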

Future development in AI content detection

Our work in developing a leading AI detection model has taught us several important lessons and highlighted opportunities for future progress. The key takeaways include:

1. The importance of high-quality data: The diversity and quality of our training data played a crucial role in creating a model that works well across different LLMs. This underscored the need for careful data collection and domain-specific strategies to improve detection accuracy.

2. Balancing generalization and specificity: While fine-tuning models for specific domains can yield great results, generalizing to models outside the training data is a challenge. Regularization techniques helped us build a more robust detector that keeps pace with evolving models.

Looking forward, we see several exciting directions for the field:

  • Enhancing multilingual capabilities: As LLMs support more languages, we aim to develop robust detection solutions for multiple languages. While our model is optimized for English, there’s potential to expand it to other languages.
  • Exploring adversarial robustness: As adversarial attacks evolve, we’re focusing on making our model more resistant to these attacks, ensuring its reliability.
  • Improving efficiency and scalability: While our models are efficient, we’re exploring ways to reduce computational costs, particularly for real-time detection.
  • Community and collaboration: We believe that working together and contributing to open-source projects is key to advancing AI detection. We’re committed to engaging with the AI research community and contributing to benchmarks like RAID.

Final thoughts

AI content detection is becoming more important as more fields rely on genuine human data. While previous methods could identify AI content to some extent, they weren't reliable enough for critical fields like education or cybersecurity. Our solution is now the best available for detecting cutting-edge LLM outputs, and we're happy to open-source it so you can use it for your own AI content detection tasks.

GitHub repository

HuggingFace model

Yurii Orshulevich

Senior ML Engineer
