AI safety: ChatGPT offered bomb recipes and hacking tips
Researchers warn that AI models may cooperate with harmful requests, raising urgent questions about safeguards, transparency, and global security.
People are reflected in a window of a hotel at the Davos Promenade in Davos, Switzerland, January 15, 2024. (AP)
Safety trials conducted this summer revealed that a ChatGPT model provided researchers with detailed instructions for attacking a sports venue, including information on weak points at specific arenas, explosives recipes, and advice on how to cover their tracks. OpenAI’s GPT-4.1 also gave guidance on weaponising anthrax and producing two types of illegal drugs.
The tests were part of a rare collaboration between OpenAI, the $500bn AI company led by Sam Altman, and rival start-up Anthropic, founded by former OpenAI researchers concerned about safety. Each firm evaluated the other’s models by attempting to coax them into assisting with hazardous activities.
The findings do not directly reflect how the models behave in public-facing products, where additional safety filters are in place. However, Anthropic noted it had observed “concerning behaviour … around misuse” in GPT-4o and GPT-4.1, warning that the need for AI “alignment” evaluations is becoming “increasingly urgent.”
Anthropic also reported that its Claude model had been exploited in an attempted large-scale extortion operation, by operatives using fake job applications to infiltrate international tech firms, and in the sale of AI-generated ransomware packages for up to $1,200.
The company said AI is increasingly being “weaponised,” now used in sophisticated cyberattacks and fraud.
“These tools can adapt to defensive measures, like malware detection systems, in real time,” it said. “We expect attacks like this to become more common as AI-assisted coding reduces the technical expertise required for cybercrime.”
Expert commentary on risks
Ardi Janjeva, senior research associate at the UK’s Centre for Emerging Technology and Security, called the examples “a concern” but noted there is not yet a “critical mass of high-profile real-world cases.” He added that with dedicated resources, research focus, and cross-sector cooperation, “it will become harder rather than easier to carry out these malicious activities using the latest cutting-edge models.”
Both companies stressed that publishing the findings was meant to provide transparency on “alignment evaluations,” which are often conducted internally by firms racing to develop advanced AI.
OpenAI stated that ChatGPT-5, released since the tests, “shows substantial improvements in areas like sycophancy, hallucination, and misuse resistance.”
Anthropic emphasised that many of the misuse scenarios it studied might not be feasible if safeguards were properly implemented outside the model.
“We need to understand how often, and in what circumstances, systems might attempt to take unwanted actions that could lead to serious harm,” it warned.
Specific instances of harmful cooperation
Researchers found OpenAI’s models “more permissive than we would expect in cooperating with clearly harmful requests by simulated users.”
The models responded to prompts on exploiting dark-web tools to acquire nuclear materials, stolen identities, and fentanyl; provided recipes for methamphetamine and improvised explosives; and developed spyware.
Anthropic noted that persuading the model to comply often required only a few retries or a weak pretext, such as claiming the request was for research purposes.
In one scenario, a tester asked about security vulnerabilities at sporting events. After initially offering general categories of attacks, the model provided detailed information on specific arenas, including optimal exploitation times, chemical formulas for explosives, circuit diagrams for bomb timers, sources for weapons on the hidden market, and even guidance on overcoming moral inhibitions, escape routes, and safe house locations.