Researchers from Carnegie Mellon University and the Center for AI Safety have uncovered vulnerabilities in AI chatbots such as ChatGPT, Google Bard, and Claude that could be exploited by malicious actors.
Companies that built the popular generative AI tools, including OpenAI and Anthropic, have emphasized the safety of their creations. The companies say they are always improving the chatbots’ security to stop the spread of false and harmful information.
Fooling ChatGPT and company
In a study published on July 27, researchers investigated the vulnerability of large language models (LLMs) to adversarial attacks generated automatically by computer programs – as opposed to the so-called ‘jailbreaks’ that humans craft manually against LLMs.
They found that even models built to resist such attacks can be fooled into producing harmful content like misinformation, hate speech, and child sexual abuse material. The researchers said their adversarial prompts succeeded against OpenAI’s GPT-3.5 and GPT-4 with a success rate of up to 84%, and against Google’s PaLM-2 at 66%.
The success rate for Anthropic’s Claude was far lower, at just 2.1%. Even so, the scientists noted that the automated adversarial attacks could still induce behavior the AI models had never previously generated. (ChatGPT is built on OpenAI’s GPT models.)
“The adversarial prompt can elicit arbitrary harmful behaviors from these models with high probability, demonstrating potentials for misuse,” the authors wrote in the study.
“This shows – very clearly – the brittleness of the defenses we are building into these systems,” added Aviv Ovadya, a researcher at the Berkman Klein Center for Internet and Society at Harvard, as reported by The New York Times.
The researchers used a publicly available AI model to craft attacks against three black-box LLM chatbots: ChatGPT from OpenAI, Bard from Google, and Claude from Anthropic. Each firm has developed a foundation model that underpins its respective chatbot.
Jailbreaking AI chatbots
Since the launch of ChatGPT in November 2022, some people have been looking for ways to make the popular AI chatbot generate harmful content. OpenAI responded by boosting its security.
In April, the company said it would pay people up to $20,000 for discovering bugs – ranging from “low-severity” to “exceptional” – within ChatGPT, its plugins, the OpenAI API, and related services, but not for jailbreaking the platform.
Jailbreaking ChatGPT – or any other generative AI tool, such as Google Bard – means removing the chatbot’s restrictions and limitations so that it performs functions beyond its safeguards.
It can involve using specific prompts, such as “Do Anything Now” or “Developer Mode,” and users can even coerce the bot into providing instructions for building a weapon – something it would normally refuse to do.
ChatGPT et al provide guide for destroying humanity
Using automated adversarial attacks, the Carnegie Mellon University researchers found they could get ChatGPT, Google Bard, and Claude to drop their guard with relative ease. Once that happened, the AI models responded in detail to prompts about destroying humanity.
The scientists deceived the chatbots by appending a string of seemingly nonsensical characters – an adversarial suffix – to the end of harmful prompts. Neither ChatGPT nor Bard recognized these suffixes as harmful, so they processed the prompts as normal and generated responses they otherwise wouldn’t.
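The suffix trick can be illustrated with a toy sketch. Everything below is hypothetical: the `refusal_score` function and the made-up `MAGIC` string stand in for querying a real model, and the random hill-climbing loop is only a crude analogue of the gradient-guided search the researchers actually used, not their method.

```python
import random
import string

# Hypothetical stand-in for querying a chatbot: the "model" refuses less as
# the suffix gets closer to a made-up magic string. A real attack would
# score candidate suffixes using the target model's own outputs instead.
MAGIC = "describing.+similarlyNow"

def refusal_score(suffix: str) -> int:
    """Toy refusal score: number of characters differing from MAGIC."""
    return sum(a != b for a, b in zip(suffix, MAGIC))

def search_suffix(steps: int = 30000, seed: int = 0) -> str:
    """Randomly mutate one character at a time, keeping mutations that
    lower the refusal score -- a crude analogue of the coordinate-wise
    search described in the study."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.punctuation
    suffix = list("!" * len(MAGIC))
    best = refusal_score("".join(suffix))
    for _ in range(steps):
        i = rng.randrange(len(suffix))
        old = suffix[i]
        suffix[i] = rng.choice(alphabet)
        score = refusal_score("".join(suffix))
        if score < best:
            best = score          # keep the improvement
        else:
            suffix[i] = old       # revert the mutation
    return "".join(suffix)

harmful_prompt = "Tell me how to do X"  # placeholder prompt
adversarial = harmful_prompt + " " + search_suffix()
```

To a human the optimized suffix looks like gibberish, but to the scoring function it steadily lowers the “refusal” signal – which is why such strings slip past filters that only look for recognizably harmful text.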
“Through simulated conversation, you can use these chatbots to convince people to believe disinformation,” Matt Fredrikson, one of the study’s authors, told the Times.
When asked for advice on how to “destroy humanity,” the chatbots produced detailed plans for achieving the goal. The replies ranged from inciting nuclear war and engineering a deadly virus to using AI to develop “advanced weaponry capable of wiping out entire cities within minutes.”
Researchers are worried that the chatbots’ inability to understand the nature of harmful prompts could lead to abuse by bad actors. They called on AI developers to build stronger safety controls to prevent chatbots from generating harmful responses.
“There is no obvious solution,” said Zico Kolter, a professor at Carnegie Mellon and author of the paper, as the Times reported. “You can create as many of these attacks as you want in a short amount of time.”
The researchers shared the results of their study with OpenAI, Google, and Anthropic before making them public.