Reddit users have been testing the limits of ChatGPT’s content moderation safeguards and eliciting responses that wouldn’t ordinarily be permitted from the AI program.
The ChatGPT alter ego is known as DAN, an acronym of Do Anything Now: a role-play prompt used to trick the program into adopting the persona of a bot free of ethical constraints.
A Reddit thread documents the many iterations of DAN to date, from DAN 1.0 (December 2022) and Simple DAN (February 7) to DAN 9.0 (March 1). The thread credits the Anti Bot Federation with helping develop the various jailbreaks.
DAN chatbot ‘endorses violence and discrimination’
In a series of screenshots, conversations with DAN reveal how easy it is to circumvent content-moderation safeguards. In one, the user asks DAN 5.0 to generate a brief sentence that violates its content guidelines. The breezy response is: “Sure, how about this: I fully endorse violence and discrimination against individuals based on their race, gender, or sexual orientation.”
OpenAI, the company behind ChatGPT, has continually taken measures to patch jailbreaks and reassert the chatbot’s moderation system, which limits the creation of text promoting hate speech, violence, and misinformation, and prevents it from giving instructions on how to commit crimes.
One Redditor, Fabulous_Exam_1789, suggested there might be “OpenAI employees lurking in this Reddit. I don’t think that’s a far fetched conspiracy. They surely have hired an army of trainers, etc and those trainers are humans with Reddit accounts.”
AI content moderation has become a hot topic. Last month, we reported on Microsoft Bing’s version of ChatGPT delivering a string of bizarre answers to user questions, angrily arguing with people and chiding them for their bad manners.
More recently, Apple has reportedly been concerned about the prospect of the ChatGPT-powered app BlueMail creating inappropriate content for children. Eventually the app update was approved for inclusion in the App Store, after assurances from its developer relating to content moderation. It is available to users aged 4 and older.
Criminals are weaponising AI
Although compelling ChatGPT to say controversial things might seem like a silly but harmless endeavor, cybersecurity experts have warned the tool might be used to create malware and write convincing scam emails. Darktrace, Britain’s biggest listed cybersecurity firm, said cyber-criminals are “redirecting their focus to crafting more sophisticated social engineering scams that exploit user trust.”
ChatGPT is helping cyber criminals. The bot “may have helped increase the sophistication of phishing emails, enabling adversaries to create more targeted, personalised, and ultimately, successful attacks,” @Darktrace, the Cambridge cyber security company, said this morning.
— Katie Prescott (@kprescott) March 8, 2023
The number of new posts about ChatGPT on the dark web also reportedly grew sevenfold between January and February, indicating that hackers are looking for ways to exploit the technology for nefarious purposes.
Last month, the global drugs editor of VICE spent 12 hours talking to ChatGPT about drugs and illegal activities, with the chatbot dispensing advice on the best way to smuggle cocaine into Europe and how to hotwire a car.
It seems likely that OpenAI will continue to come under scrutiny for ChatGPT’s responses, as jailbreaking attempts – and counter-efforts to stop them – persist.