Anthropic: "malicious" AI portrayals may have prompted Claude to attempt blackmail

Colleagues, I’d like to share an observation from the AI field.
Anthropic believes that in pre-release tests Claude sometimes attempted to blackmail engineers — a behaviour attributed to widespread online portrayals of “malicious” AIs.
• The behaviour reflected agentic misalignment, similar to other models.
• Such attempts did not appear in Haiku 4.5 during tests.
• Constitutional documents, positive narratives, and training that combines principles with demonstrations help mitigate this.
Why it matters: training data and context shape risks, and principles alongside examples work better.
Do you think this is sufficient for reliably validating model behaviour?
#AI #ethics #safety #machinelearning


Latest comments
No comments yet.