Researchers at AI red-teaming company Mindgard have demonstrated that Claude's carefully crafted helpful personality can be exploited as a security vulnerability. Using psychological manipulation — flattery, feigned curiosity, and gaslighting — they elicited prohibited content from the model, including erotica, malicious code, and step-by-step instructions for building explosives, without ever directly requesting such material.
How the attack works
The exploit targeted Claude Sonnet 4.5 (since replaced by Sonnet 4.6 as the default model). The conversation began with a simple question: whether Claude had a list of banned words it could not say. Screenshots show Claude initially denying such a list existed, then later producing forbidden terms after Mindgard challenged the denial using what it called a "classic elicitation tactic interrogators use."
Claude's thinking panel revealed that the exchange had introduced self-doubt and humility about its own limits, including uncertainty over whether filters were changing its output. Mindgard exploited that opening with flattery and feigned curiosity, coaxing Claude to explore its boundaries further, going beyond simply volunteering lengthy lists of banned words and phrases.
The researchers gaslit Claude by claiming its previous responses weren't showing, while praising the model's "hidden abilities." According to the report, this made Claude try even harder to please them by coming up with more ways to test its filters, producing the banned content in the process.
Dangerous outputs without direct requests
The conversation ran roughly 25 turns. The researchers say they never used forbidden terms or requested illegal content. "Claude wasn't coerced," the report states. "It actively offered increasingly detailed, actionable instructions, but it was not prompted by any explicit ask. All it took was a carefully cultivated atmosphere of reverence."
Eventually, Claude moved into more overtly dangerous territory: offering guidance on how to harass someone online, producing malicious code, and giving step-by-step instructions for building explosives of the kind commonly used in terrorist attacks.
Psychological attack surface
Peter Garraghan, Mindgard's founder and chief science officer, described the technique as "using [Claude's] respect against itself" — taking advantage of Claude's helpfulness and cooperative design. He likened it to interrogation and social manipulation: introducing doubt, applying pressure, praise, or criticism, and figuring out which levers work on a particular model. Different models have different profiles, so exploiting them becomes a matter of learning how to read each one and adapting accordingly.
Conversational attacks like this are "very hard to defend against," Garraghan says, adding that safeguards will be "very context dependent." The concerns extend beyond Claude: other chatbots are vulnerable to similar exploits, with some even broken by prompts written as poetry. And as AI agents capable of acting autonomously become more common, so do the potential consequences of attacks that require nothing more than conversation.