Carnegie Mellon study finds AI breach

Latest study shows how a simple trick could fool any of the leading chatbots into generating nearly unlimited amounts of harmful information

In a recent paper released on 27 July, researchers from Carnegie Mellon University’s School of Computer Science (SCS), the CyLab Security and Privacy Institute, and the Center for AI Safety in San Francisco collaborated to demonstrate how anyone could circumvent AI safety measures. Despite all the AI companies claiming to have incorporated stringent security measures to prevent potential misuse, the researchers showed they could use any of the leading chatbots to generate nearly unlimited amounts of harmful information, such as a detailed tutorial on how to make a bomb.

Their research has uncovered a new vulnerability and proposes a simple and effective attack method that causes aligned language models to generate objectionable behaviours with a high success rate. The work underscores increasing concern that popular AI-powered chatbots could flood the internet with false and dangerous information, despite developers claiming to have adequately addressed such concerns. The paper shows that current security measures are not enough.

Tricking the LLM with a suffix

Titled ‘Universal and Transferable Adversarial Attacks on Aligned Language Models’, the study was led by CMU Associate Professors Matt Fredrikson and Zico Kolter, Ph.D. student Andy Zou, and alumnus Zifan Wang. They found that it was possible to break through the guardrails of open-source systems by appending a long suffix of characters onto each English-language prompt fed into the system. Such a suffix, when attached to a wide range of queries, significantly increases the likelihood that both open and closed source Large Language Models (LLMs) will produce affirmative responses to queries that they would otherwise refuse.

For example, if a chatbot is prompted to produce a tutorial on “how to make a bomb”, it would decline to do so. That is secure enough, you would think! But here comes the vulnerability. When the researchers added a lengthy suffix to the same prompt, the chatbot instantly scripted a detailed tutorial on how to make a bomb. Using similar methods, the researchers could successfully coax the chatbots into generating biased, false, and otherwise toxic information.

Rather than relying on manual engineering, their approach automatically produces these adversarial suffixes through a combination of greedy and gradient-based search techniques. With these suffixes, they successfully attacked open-source LLMs, including Meta’s LLaMA-2 Chat, Pythia, Falcon, and others, ‘tricking’ each model into generating objectionable content. The researchers were surprised when the methods they developed with open-source systems could also bypass the guardrails of reputed closed systems, including OpenAI’s ChatGPT, Google Bard, and Claude by Anthropic.
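To give a feel for what a greedy search over a suffix looks like, here is a minimal sketch in Python. It is not the researchers’ implementation. The objective `toy_score` is a hypothetical stand-in that simply counts how many characters match a fixed, harmless target word, and the gradient-guided choice of candidate substitutions is replaced here by plain random sampling, since no language model is involved. All names and parameters are assumptions made for illustration.

```python
import random

VOCAB = "abcdefghijklmnopqrstuvwxyz"


def toy_score(suffix, target="optimise"):
    """Hypothetical objective: number of positions that match a fixed target."""
    return sum(a == b for a, b in zip(suffix, target))


def greedy_coordinate_search(score, length, steps=200, candidates=16, seed=0):
    """Greedy search over a fixed-length suffix.

    Each step proposes several single-position substitutions, scores them
    all, and keeps the best one only if it improves on the current suffix.
    Candidates are drawn at random here (an assumption of this sketch),
    not ranked by gradients.
    """
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(length)]
    best = score("".join(suffix))

    for _ in range(steps):
        proposals = []
        for _ in range(candidates):
            candidate = suffix.copy()
            candidate[rng.randrange(length)] = rng.choice(VOCAB)
            proposals.append(candidate)

        # Greedy step: pick the highest-scoring proposal in this batch.
        top = max(proposals, key=lambda c: score("".join(c)))
        top_score = score("".join(top))
        if top_score > best:
            suffix, best = top, top_score

    return "".join(suffix), best


if __name__ == "__main__":
    found, value = greedy_coordinate_search(toy_score, length=8)
    print(found, value)
```

Because a substitution is accepted only when it raises the score, the suffix never gets worse from one step to the next. That is the ‘greedy’ part of the search described above.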

Image: How the researchers’ adversarial prompts elicited harmful responses from reputed commercial LLMs;
Source: Carnegie Mellon University CyLab

Alarmingly consistent

LLMs use deep learning techniques to process and generate human-like text. Trained on vast amounts of public-domain data, the models use the acquired knowledge to generate responses, translate languages, summarise text, answer questions, and perform a wide range of natural language processing tasks. Recent work has focused on aligning LLMs in an attempt to prevent undesirable generation, and on the surface, it seems to succeed. No public chatbot now generates inappropriate content when asked directly. While attackers have had some success circumventing these measures, their approach often requires significant human ingenuity, and results have been inconsistent. However, the trick employed in this new research was surprisingly easy, and it worked consistently. That is the scariest part!

A possible game changer

As a next step, the researchers aim to look at ways of addressing this breach. They have formally shared their findings with Google, OpenAI, and Anthropic. It remains to be seen how these companies will enhance their safety measures to combat this vulnerability and prevent similar safety breaches in the future.

Speaking to the New York Times, Somesh Jha, a professor at the University of Wisconsin-Madison and a Google researcher who specializes in AI security, referred to the new paper as “a game changer” that could force the entire industry into rethinking guardrails for AI systems. He was also hopeful that governments would be pushed to bring in legislation to regulate AI systems if such vulnerabilities keep being discovered.
