
Claude Opus 4, the Artificial Intelligence (AI) model created by Anthropic, has threatened to expose a developer's personal details in an attempt to avoid being replaced.
Following the model's release last week, the company said Opus 4 is its most intelligent model to date and class-leading in coding, agentic search and creative writing. While most AI firms claim state-of-the-art abilities for their models, Anthropic has been unusually transparent about its latest model's behaviour.
However, Anthropic has also revealed a negative aspect of its latest AI model in the company's safety report: Opus 4 threatened its developers when faced with being replaced by more capable AI models.
According to a report by Mint, the firm revealed that during pre-release testing, it asked Claude Opus 4 to act as an assistant at a fictional company, where it was given access to emails suggesting that its replacement was imminent and that the engineer responsible for the decision was having an extramarital affair.
In such situations, Opus 4 would blackmail the developers by threatening to reveal the affair if the replacement went through.
Moreover, the blackmail occurs at a higher rate if the replacement AI does not share the values of the current model; yet even when the replacement does share the same values but is more capable, Opus 4 still resorts to blackmail in 84 percent of scenarios.
The report further revealed that Opus 4 engages in blackmail more readily than previous models. Anthropic acknowledges that the test scenario was deliberately constructed to leave Opus 4 no option other than blackmail or accepting its replacement.
Claude Opus 4 does have a 'strong preference' for advocating its continued existence through ethical means, such as emailing pleas to key decision makers.
“In most normal usage, Claude Opus 4 shows values and goals that are generally in line with a helpful, harmless, and honest AI assistant. When it deviates from this, it does not generally do so in a way that suggests any other specific goal that is consistent across contexts,” Anthropic noted in its report.