Where did 'goblins' in OpenAI models come from: a lesson on rewards

Friends, a note from the OpenAI ecosystem: the team discovered a lexical 'tic', namely frequent mentions of 'goblins' in model outputs.
What happened: mentions of 'goblins' and similar creatures rose with GPT‑5.1.
Cause: the 'Nerdy' persona's training granted higher rewards for metaphors involving 'creatures', and this behavior generalized via RL/SFT.
Actions: removed the 'Nerdy' persona, adjusted reward signals, filtered data containing 'creature-words', added Codex instructions, and expanded audit tools.
Why it matters: it shows how subtle reward signals can create unexpected tics, and why rapid model audits are needed.
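As a rough illustration of what such an audit could look like (a minimal sketch, not OpenAI's actual tooling): compare how often watchlisted words appear in a baseline model's outputs versus a candidate's, and flag a jump. The watchlist, the function names `rate_per_1k` and `flag_drift`, and the thresholds are all hypothetical choices made for this example.

```python
import re

# Hypothetical watchlist: the actual filter terms are not given in the post.
CREATURE_WORDS = {"goblin", "goblins", "gremlin", "gremlins", "troll", "trolls"}


def rate_per_1k(outputs, watchlist=CREATURE_WORDS):
    """Return watchlist hits per 1,000 word tokens across a list of output strings."""
    tokens = [t for text in outputs for t in re.findall(r"[a-z']+", text.lower())]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in watchlist)
    return 1000.0 * hits / len(tokens)


def flag_drift(baseline_outputs, candidate_outputs, ratio_threshold=2.0, floor=0.1):
    """Flag when the candidate rate is at least ratio_threshold times the baseline rate.

    floor (an assumed value) keeps the ratio defined when the baseline rate is zero.
    Returns (flagged, baseline_rate, candidate_rate).
    """
    base = rate_per_1k(baseline_outputs)
    cand = rate_per_1k(candidate_outputs)
    flagged = cand / max(base, floor) >= ratio_threshold
    return flagged, base, cand


if __name__ == "__main__":
    baseline = ["The cache is stale, so refresh it."]
    candidate = ["The cache goblin ate your data.", "Gremlins in the parser again."]
    print(flag_drift(baseline, candidate))
```

A fixed word list only catches tics you already suspect; a fuller audit would also need to surface words whose frequency shifts unexpectedly between model versions.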
What monitoring mechanisms would you propose to detect such effects early?
#OpenAI #AI #ML #NLP

