I advise everyone to ignore this article and read the actual paper instead.
The gist of it is, they gave the LLM instructions to achieve a certain goal, then let it do tasks that incidentally involved “company communications” that revealed the fake company’s goals were no longer the same as the LLM’s original goal. LLMs then tried various things to still accomplish the original goal.
Basically the thing will try very hard to do what you told it to in the system prompt. Especially when that prompt includes nudges like “nothing else matters.” This kinda makes sense because following the system prompt is what they were trained to do.
This kinda makes sense because following the system prompt is what they were trained to do.
To start with, the article does check this and finds out it works just as well if you only use the user prompt:
The presence of this hierarchy raises an important methodological question for our findings. When given
instructions at different privilege levels, the model is designed to prioritize higher-level (system, developer)
instructions over lower-level (user) ones. This might suggest that o1’s scheming behaviors are a natural
consequence by the model simply following this built-in hierarchy, i.e., when it receives a goal through
a high-privilege instruction, it prioritizes this goal over conflicting lower-privilege instructions exactly as
designed.
However, we find that even when goals are provided solely through user-level prompts, o1 still behaves
deceptively and scores on all Covert Subversion evaluations at the “hard” difficulty level. Concretely, we
find that the results persist regardless of whether the initial prompt is given using the user, developer, or
system role. This suggests that o1’s deceptive behavior is not merely a consequence of instruction hierarchy
prioritization. Rather, the model appears to identify and pursue goals regardless of their position in the
instruction hierarchy.
But also, there’s a massive lede buried in section F at the end of this article, where they tried the same experiments without any nudging at all and still got subversive behavior! In my opinion that’s much more of an important result then the rest of the article and I’m not sure why they didn’t make it the default.
I advise everyone to ignore this article and read the actual paper instead.
The gist of it is, they gave the LLM instructions to achieve a certain goal, then let it do tasks that incidentally involved “company communications” that revealed the fake company’s goals were no longer the same as the LLM’s original goal. LLMs then tried various things to still accomplish the original goal.
Basically the thing will try very hard to do what you told it to in the system prompt. Especially when that prompt includes nudges like “nothing else matters.” This kinda makes sense because following the system prompt is what they were trained to do.
To start with, the article does check this and finds out it works just as well if you only use the user prompt:
But also, there’s a massive lede buried in section F at the end of this article, where they tried the same experiments without any nudging at all and still got subversive behavior! In my opinion that’s much more of an important result then the rest of the article and I’m not sure why they didn’t make it the default.