ThisIsFine.gif

  • sushibowl@feddit.nl
    link
    fedilink
    arrow-up
    22
    ·
    2 days ago

    I advise everyone to ignore this article and read the actual paper instead.

    The gist of it is, they gave the LLM instructions to achieve a certain goal, then let it do tasks that incidentally involved “company communications” that revealed the fake company’s goals were no longer the same as the LLM’s original goal. LLMs then tried various things to still accomplish the original goal.

    Basically the thing will try very hard to do what you told it to in the system prompt. Especially when that prompt includes nudges like “nothing else matters.” This kinda makes sense because following the system prompt is what they were trained to do.

    • lukewarm_ozone@lemmy.today
      link
      fedilink
      English
      arrow-up
      6
      ·
      2 days ago

      This kinda makes sense because following the system prompt is what they were trained to do.

      To start with, the article does check this and finds out it works just as well if you only use the user prompt:

      The presence of this hierarchy raises an important methodological question for our findings. When given instructions at different privilege levels, the model is designed to prioritize higher-level (system, developer) instructions over lower-level (user) ones. This might suggest that o1’s scheming behaviors are a natural consequence by the model simply following this built-in hierarchy, i.e., when it receives a goal through a high-privilege instruction, it prioritizes this goal over conflicting lower-privilege instructions exactly as designed.
      However, we find that even when goals are provided solely through user-level prompts, o1 still behaves deceptively and scores on all Covert Subversion evaluations at the “hard” difficulty level. Concretely, we find that the results persist regardless of whether the initial prompt is given using the user, developer, or system role. This suggests that o1’s deceptive behavior is not merely a consequence of instruction hierarchy prioritization. Rather, the model appears to identify and pursue goals regardless of their position in the instruction hierarchy.

      But also, there’s a massive lede buried in section F at the end of this article, where they tried the same experiments without any nudging at all and still got subversive behavior! In my opinion that’s much more of an important result then the rest of the article and I’m not sure why they didn’t make it the default.