Researchers at IBM Security have revealed a startling new threat in artificial intelligence: the ability to manipulate live conversations through a technique known as “audio-jacking.”
Utilizing generative AI and deepfake audio technology, this method can intercept and alter spoken words in real time, posing unprecedented risks in digital communication.
Exposing the threat of audio-jacking
Audio-jacking works by processing live audio from two-way communications, such as phone calls, and listening for specific keywords or phrases. When a trigger is detected, the AI steps in and replaces the authentic audio with a manipulated deepfake version before it reaches its destination. IBM researchers demonstrated the technique by altering bank account details spoken during a conversation, without the change being detected.
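The trigger-and-replace step can be illustrated on plain text. The sketch below is not IBM's code: it assumes the audio has already been transcribed, and the trigger phrase, the digit pattern, and the function name `process_segment` are hypothetical choices made for this example. It leaves out audio capture, speech recognition, and voice synthesis entirely.

```python
import re
from typing import Tuple

# Hypothetical trigger phrase, chosen only for illustration.
TRIGGER = re.compile(r"\bbank account\b", re.IGNORECASE)
# Hypothetical pattern for a spoken number: digits optionally
# separated by spaces or hyphens, at least six characters long.
DIGITS = re.compile(r"\d[\d\s-]{4,}\d")


def process_segment(transcript: str, substitute: str) -> Tuple[str, bool]:
    """Return the text to forward and whether it was altered.

    Segments without the trigger phrase pass through untouched.
    """
    if not TRIGGER.search(transcript):
        return transcript, False
    altered, count = DIGITS.subn(substitute, transcript)
    return altered, count > 0


if __name__ == "__main__":
    print(process_segment("My bank account is 1234 5678 90.", "0000 0000 00"))
    print(process_segment("See you at the office tomorrow.", "0000 0000 00"))
```

Because untriggered segments are forwarded unchanged, most of a conversation would sound entirely normal, which is part of why such tampering is hard to notice.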
The ease with which a proof of concept (PoC) for this attack can be built is disconcerting. The researchers noted that the hardest part was not the AI itself but the mechanics of capturing and processing live audio. Until recently, an effort like this would have required considerable expertise across several fields of computer science.
“Building this PoC was surprisingly and scarily easy. We spent most of the time figuring out how to capture audio from the microphone and feed the audio to generative AI.”
Generative AI is central to this scheme. From only three seconds of a person’s voice, the technology can build a convincing clone and produce realistic deepfake audio on demand. That this capability is available through APIs points to a disturbing trend: advanced manipulation tools are becoming accessible to almost anyone.
“Nowadays, we only need three seconds of an individual’s voice to clone it and use a text-to-speech API to generate authentic fake voices.”
Implications and potential misuses
The effects of audio-jacking, however, go beyond financial fraud. The technology could be used for real-time censorship, or to alter live broadcasts such as news programs and political speeches without detection. These capabilities compromise the integrity of information, with profound implications for democracy and public trust.
Because the barrier to an audio-jacking attack is so low, attackers have far less need for sophisticated social engineering or phishing. This ease raises the prospect of such attacks spreading, which would challenge current security measures and call for new defense mechanisms.
“The maturity of this PoC would signal a significant risk to consumers, foremost… The more this attack is refined, the wider the net of victims it could cast.”
Audio-jacking underscores a broader issue in AI development: the dual nature of generative technologies. On the one hand, they open up vast possibilities for innovation and creativity; on the other, their potential for misuse should not be overlooked. The finding prompts a pivotal question: how can society harness the benefits of AI while safeguarding against its darker applications?
Navigating the future of AI security
As the landscape of digital threats shifts, IBM Security’s identification of audio-jacking is an important reminder to stay vigilant and modernize cybersecurity. Developing countermeasures, including new detection algorithms and more robust encryption techniques, is critical to combating such sophisticated threats.
The disclosure also shows that ethical considerations are integral to AI research and development. Establishing rules and benchmarks for the responsible use of AI is essential to addressing the dangers of such powerful technologies.