AI Detection Tools Biased Against Non-Native English Speakers

July 13, 2023

A recent study has shown that chatbot detection programs are biased against non-native English speakers. The findings will be of concern to anyone who uses English as a second language.

The market has been captivated by generative AI since OpenAI launched ChatGPT in November 2022.

The good and bad

Although large language models (LLMs) have been touted for enhancing efficiency, their popularity also raises concerns that they may be abused, for instance to cheat in exams or school assignments.

The concern stems from chatbots’ ability to write prose, essays, lyrics, and computer code within seconds.

Differentiating between original and AI-generated work is becoming increasingly important for educators. Experts have already warned that an estimated half of college students are using ChatGPT to cheat on exams, prompting demand for chatbot detectors.

A recent study suggests that these detectors bring fresh problems of their own: they discriminate against non-native English speakers.

The bias of chatbot detectors

Generative AI has already been accused of bias, and now the programs meant to detect its use are following suit. A recent study led by James Zou, an assistant professor of biomedical data science at Stanford University, showed significant bias against non-native English speakers. Zou tested AI detectors on essays written by native English speakers as well as on essays written for the Test of English as a Foreign Language (TOEFL).

The study screened 91 English essays written by non-native English speakers using seven different apps that are used to detect chatbot content.

According to the study, 61.3% of the surveyed TOEFL essays originally composed by humans were mislabeled as AI-generated.

“GPT detectors exhibit significant bias against non-native English authors, as demonstrated by their high misclassification of TOEFL essays written by non-native speakers,” said the study.

One app flagged 98% of the essays as chatbot compositions. By contrast, when the same process was applied to essays written by native English speakers, the AI detectors agreed that 90% of the essays were human-written.

Ironically, when the researchers asked ChatGPT to rewrite a TOEFL essay and ran the result through the detectors, the detectors concluded the essay was written by a human.

The perplexity test

When tracing the source of the discrimination, the scientists noted that the AI detectors rely on a measure called text perplexity.

This, according to them, is the measure of how “surprised” or “confused” an LLM is “when trying to predict the next word in a sentence.”

Text has low perplexity when an LLM can easily predict the next word. High perplexity indicates that the sequence of words is harder to predict.
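
To make the idea concrete, here is a minimal sketch of how perplexity is commonly computed: the exponential of the average negative log-probability a model assigns to each actual next word. The probability lists below are made-up illustrative values, not figures from the study, and the function name is our own.

```python
import math


def perplexity(token_probs):
    """Return the perplexity of a text.

    token_probs holds, for each word in the text, the probability a
    language model assigned to that word given the words before it.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-probability per word, then exponentiate.
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical values: a model that finds every word easy to predict...
predictable_text = [0.9, 0.8, 0.9, 0.85]
# ...versus a model that is repeatedly "surprised" by the next word.
surprising_text = [0.1, 0.05, 0.2, 0.02]

print(perplexity(predictable_text))  # low perplexity
print(perplexity(surprising_text))   # high perplexity
```

In this sketch, the easier each word is to predict, the closer the score gets to 1; the more often the model is surprised, the higher the score climbs.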

An article by the Guardian says LLMs like ChatGPT are trained to churn out low perplexity text, which means whenever humans use “a lot of common words in a familiar pattern in their writing, their work is at risk of being mistaken for AI generated text.”
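
The mechanism can be sketched with a toy detector that flags any essay whose perplexity falls below a cutoff. This is an illustration only: the article does not describe how the seven apps work internally, and the threshold, essay labels, and perplexity scores below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Essay:
    label: str            # hypothetical description of the essay
    perplexity: float     # hypothetical perplexity score
    human_written: bool   # ground truth


def flag_as_ai(essay, threshold=20.0):
    """Toy rule: low perplexity is treated as a sign of AI authorship."""
    return essay.perplexity < threshold


def false_positive_rate(essays, threshold=20.0):
    """Share of human-written essays that the toy rule flags as AI."""
    humans = [e for e in essays if e.human_written]
    if not humans:
        raise ValueError("need at least one human-written essay")
    flagged = sum(flag_as_ai(e, threshold) for e in humans)
    return flagged / len(humans)


essays = [
    Essay("human essay using common words", 12.0, True),
    Essay("human essay using varied vocabulary", 45.0, True),
    Essay("chatbot essay", 9.0, False),
]

for essay in essays:
    print(essay.label, "-> flagged as AI:", flag_as_ai(essay))
print("false positive rate:", false_positive_rate(essays))
```

Under this rule, the human essay built from common words is flagged alongside the chatbot essay, which is the kind of false positive the study describes.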

Earlier this year, OpenAI released a tool to detect AI content but warned it is not 100% accurate.

Higher risk for non-native English speakers

According to the researchers, non-native English speakers are more likely to write low-perplexity text.

“The implications of GPT detectors for non-native writers are serious, and we need to think through them to avoid situations of discrimination,” wrote the researchers.

“Our findings emphasize the need for increased focus on fairness and robustness of GPT detectors, as overlooking their biases may lead to unintended consequences, such as the marginalization of non-native speakers in evaluative or educational settings.”

The researchers also expressed concern that the detectors’ results could lead to false accusations, particularly in academic settings.

“In education, arguably the most significant market for GPT detectors, non-native students bear more risks of false accusations of cheating, which can be detrimental to a student’s academic career and psychological wellbeing,” said the researchers.