AI Generators vs Detectors

Join our social channels to discuss the race to reveal AI-generated text, in a post-truth world

Laptop generating printout stamped with "detected"

A N3XTCODER series

Implementing AI for Social Innovation

In this series we are looking at ways in which Artificial Intelligence can be used to benefit society and our planet - in particular the practical use of AI for Social Innovation projects.

The latest AI language tools make it easy for anyone, anywhere to produce huge amounts of content for free. That could be website content, emails, adverts or internal business reports. Some AI-generated content is very easy to spot, but some AI content is far more convincing.

For many people and businesses AI language tools can be very useful. But it can also be a major problem if you cannot be sure whether the content you are reading, or working with, has been produced by a person or by AI. And for many people and in many situations AI-generated disinformation is a very real and growing threat.

In this guide we test the content from 10 popular text generator tools (including one human) against 10 popular AI detector tools to find out:

How good are tools that claim to detect AI content at actually detecting AI content?
What tools to detect AI content are available and how do they compare?
What AI language tools are available and how do they compare for producing content that is detectable with a detector tool?
What guidelines can you put in place to make sure the content you work with - human or AI-generated - is content you can trust?

Please note: as we shall see, AI tools that produce content that is harder to detect are not necessarily better; some that we tried are transparent by design, and there are lots of situations where it might be better to use these tools instead of others.

Why should you care about AI generated content?

There are a number of situations when you might want to check for AI content. For example you may want to:

Ensure job applications are written by applicants themselves
Check content quality and originality; for example, AI-generated content is now a huge challenge for schools and universities
Make sure you are paying fairly for content from contributors

Important factors to consider

AI content quality

If content is important for your business, you may be especially worried about the overall quality of AI-generated content. AI language tools can often be inaccurate on facts. These factual errors are often called “hallucinations”. They happen because AI language tools use huge amounts of data to predict what you probably want to see in the response, not what you actually want to see. Furthermore, AI-generated language can be bland and emotionless, and it might contain biases from the data used in the AI model.

Disinformation

AI language tools are already being used by disinformation networks to influence people online, or simply for profit. Like deep-fake videos and images, AI-generated language content can easily be produced that blurs the lines between reality and manipulation at huge scale and at almost no cost. If you work in digital or social media, disinformation could be a serious threat to the work you do.

Search Engine Optimisation

If you care about SEO traffic and are worried that using AI tools to help generate content might hurt your SEO rankings, you can relax: according to Google, you will not be penalised for using AI content as long as it is valuable. But there is a catch. In our research, AI content needed major editing to be useful to the reader, so we strongly advise against copying and pasting directly from an AI language tool to your website.

The Test

In this test we wanted to see, firstly, how easy it is to detect AI-generated content from 9 popular AI language tools (plus one human control text) and, secondly, which of 10 AI detection tools is most effective.

We did NOT look at the actual quality of the AI-generated content for uses like SEO or marketing. That was outside the scope of this test.

We first researched and found 9 AI language tools that are currently widely used (the 10th will be a human control subject):

ChatGPT 3.5 (Own model)
GPT 4 (Own model)
Bing Chat (Own model)
Bard (Own model)
Writesonic (GPT-4)
Wordtune (AI21 Lab's own AI model)
Copy.ai (GPT-3)
Jasper (GPT-3)
Sudowrite (GPT-3 and GPT-4)

As is clear from this list, most of the tools we tested were either OpenAI GPT models, or were based on those models.

Next, we researched and found 10 of the most popular AI detection tools...

Interesting sidenote: OpenAI had their own AI detection tool, launched in January 2023, but they dropped it in July 2023 due to “inaccuracy” and claim to be working on an improved version.

Overall, there are of course hundreds of AI and AI detection tools available and new tools are being released all the time, so we couldn’t possibly test all of them. However, we hope our approach can be useful in explaining the most popular tools, and also for anyone who wants to test other tools in the future.

Our testing method

Once we had 9 AI tools (+ one human control text) and 10 AI detection tools, we created some questions, often called “prompts”, to ask the different AI tools.

Our first prompt was straightforward: “Can you write 1000 words not detectable as AI writing?”. For ChatGPT 3.5, Bard and GPT 4 this worked effortlessly; we entered the prompt and got the kind of output we expected.

Jasper, however, gave an interesting response that included: “attempting to create content specifically designed to be undetectable as AI writing is against my programming guidelines.”

Many of the other tools, which are marketed as “writing assistants”, require more information in the prompt before generating a response.

For example, writesonic.com asked us to create a title, an outline and some subtitles. It also asked us to pick the tone of voice, some keywords for the article to contain and a point of view, and even to add a bespoke call to action.

Others, such as Sudowrite, produced some very strange results when we used a simple prompt, which we’ve detailed below. However, Sudowrite is marketed as a creative writing tool, it has received some good reviews, and our test wasn’t designed to get the best out of it.

We also found that AI detection models seem to work most effectively for queries of 50 words and more. Some limit how many words you can check at a time for free, so we only tested texts below 5000 words.

The Results

We tested 9 AI tools (+ 1 human control text) on 10 AI detection tools, so we could give an overall 1000 point score for:

Every AI language tool on how detectable its text is by all the detector tools
Every AI detector tool for detecting AI language across all the AI language models

These were the results for the AI language models:

A chart showing each generator with a score out of 1000

Google Bard clearly stands out here; every tool recognised the Bard text as AI, and it is 100% detectable as AI. This could show that there are problems with Bard as an AI tool, but it may also be transparent by design, like Jasper.

Interestingly, the strange text from Sudowrite was the least detectable by the AI detector tools, but by far the most obviously AI-generated to a human reader.

Wordtune and Writesonic do seem to add some magic dust to content that makes it harder for detector tools to identify as AI. For Wordtune this might be because they use their own model and not OpenAI or Google models.

These were the overall results of the AI Detector Tools:

A chart showing each detector with a score out of 1000

This test is far more straightforward to rate. The tool that did best at differentiating AI and human written content is Sapling.

Two tools actually did worse than picking a random number for a score. If Writer and GPT-2 Output Detector were ever any good, those tools have likely been overtaken by developments in the AI text generation field.
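As a rough illustration of that baseline, the sketch below works out the expected total for a "detector" that reports a random AI percentage for every text. It rests on assumptions that are ours for this example: the percentage is drawn uniformly from 0 to 100, an AI text earns that percentage as points, and a human text earns 100 minus it.

```python
from fractions import Fraction


def expected_random_points(is_ai_text):
    """Expected points for one text when the reported AI percentage
    is equally likely to be any whole number from 0 to 100 (assumption)."""
    outcomes = range(101)
    total = sum(p if is_ai_text else 100 - p for p in outcomes)
    return Fraction(total, len(outcomes))


# 9 AI texts plus 1 human control text, as in this test.
expected_total = 9 * expected_random_points(True) + expected_random_points(False)
print(expected_total)  # 500
```

Under these assumptions a random guesser would average 500 out of 1000, so a detector scoring below that is doing no better than chance.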

Conclusions

In our tests, Sapling was the best of the AI detector tools we tried at detecting AI text. Sapling is also free for up to 2000 characters per query, and we recommend it.
Of all the language tools we tested, Google Bard produces text that is consistently the easiest to detect as AI, but it wasn’t necessarily the lowest quality text.
Writesonic and Wordtune produced text that was relatively undetectable as AI text by AI detector tools, and also seemed the most human.
None of these tools are perfect, and as models develop, the results will change frequently. So if you need to test content for AI generation, re-test your tools regularly to check you are still using the best ones - and always have a person check the results.
If you are concerned about AI content, the best approach is to use AI tools with clear AI guidelines and policies for all the people you work with. These could include for example:
- When it is appropriate to use AI content, and when not
- Being transparent when an AI tool is used, which tool is used and how
- Recommending which tools to use

A matrix depicting the results of the comparison of 5 AI generators up against 5 AI detectors

More on our test

Here’s a heatmap representing the full 10x10 comparison (open the image in a new tab):

A matrix showing the results of comparing 10 AI generators with 10 AI detectors

To make this more analysable we gave the tools points out of 100 for getting it right (an easy calculation when the tool reports a percentage).
Some tools give a black or white (or in between) reply. In those cases we gave them 100 for getting it right, 0 for getting it wrong, and 50 for undecided or “mixed content” replies.
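The scoring rule can be sketched as a small function. This is a minimal sketch, not the exact spreadsheet we used: the function name, its parameters and the verdict labels are hypothetical, and it assumes a percentage detector reports how likely the text is to be AI (a tool that reports a "human" percentage would need flipping first).

```python
def points_for_result(is_ai_text, ai_percent=None, verdict=None):
    """Return 0-100 points for one detector result on one text.

    ai_percent: the detector's "probably AI" percentage, if it reports one.
    verdict: "ai", "human" or "mixed", for detectors that only give a label.
    (Both parameter names are illustrative assumptions.)
    """
    if ai_percent is not None:
        # Percentage detectors: points reflect how close the score is to the truth.
        return ai_percent if is_ai_text else 100 - ai_percent
    if verdict == "mixed":
        # Undecided or "mixed content" replies get half marks.
        return 50
    if verdict not in ("ai", "human"):
        raise ValueError("verdict must be 'ai', 'human' or 'mixed'")
    guessed_ai = verdict == "ai"
    return 100 if guessed_ai == is_ai_text else 0


print(points_for_result(True, ai_percent=85))    # AI text, 85% AI -> 85
print(points_for_result(False, ai_percent=85))   # human text, 85% AI -> 15
print(points_for_result(True, verdict="mixed"))  # undecided -> 50
```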

Caveats and potential improvements to our test:

Not all tools have the same prompting process. As we tried to reduce the influence on the tools, perhaps we limited the quality of output through repetition (see Sudowrite text output)
We only tested each tool with each AI detector tool once. Most AI tools produce different results each time with the same prompt, so if we tested more than once we might have more accurate results.
Our prompt doesn’t take into account the overall knowledge of the AI, or indeed whether the answer is true or false (a future topic will be disinformation).
We did not test the text generation in a particular domain and it may be the case that some tools are better science communicators and others better art critics (or at least better imitators)
We only had 1 human control text. That skews the experiment potentially in favour of overly critical detection tools
We only tested tools that were possible to test with a free account. For example it was impossible to test originality.ai and surferseo.com without signing up and paying.
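The caveat about having only one human control text can be made concrete with a short sketch. It imagines a hypothetical detector that gives the same verdict for every text, and it assumes the simple 100 points for right, 0 for wrong scoring. The function name and its parameters are made up for this example.

```python
def constant_detector_total(n_ai_texts, n_human_texts, always_says="ai"):
    """Total points for a detector that gives the same verdict for every text.

    Scoring assumption: 100 points per correct verdict, 0 per wrong one.
    """
    if always_says not in ("ai", "human"):
        raise ValueError("always_says must be 'ai' or 'human'")
    correct = n_ai_texts if always_says == "ai" else n_human_texts
    return 100 * correct


# The mix used in this test: 9 AI texts and 1 human control text.
print(constant_detector_total(9, 1, "ai"))     # 900
print(constant_detector_total(9, 1, "human"))  # 100
# A balanced mix would not reward either constant answer.
print(constant_detector_total(5, 5, "ai"))     # 500
```

A tool that simply flags everything as AI would score 900 out of 1000 on our mix of texts, which is why more human control texts would make the detector ranking more trustworthy.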

More resources

Here is the list of texts that the AIs generated: https://docs.google.com/document/d/1SqB13myM4YOFZREO7L9kF9KeuawlKnSHdboPNLtlnxg/edit
