• jet@hackertalks.com
    21 days ago

    refusal behavior is mediated by a specific direction in the model’s residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests.
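A minimal sketch of what "preventing the model from representing this direction" could look like numerically, assuming the direction is already known as a vector. The names `ablate_direction`, `hidden`, and `refusal_dir` are made up for illustration, and the random arrays are stand-ins rather than real model activations; this is not the quoted authors' code.

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component of each activation vector along `direction`."""
    # Normalise so the projection coefficients are plain dot products.
    unit = direction / np.linalg.norm(direction)
    # One coefficient per row: how far each vector points along `unit`.
    coeffs = activations @ unit
    # Subtract that component, leaving everything orthogonal to it untouched.
    return activations - np.outer(coeffs, unit)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(3, 8))   # stand-in: 3 token positions, width 8
refusal_dir = rng.normal(size=8)   # stand-in for the hypothetical direction
cleaned = ablate_direction(hidden, refusal_dir)
```

After this step every row of `cleaned` has zero component along `refusal_dir`, so nothing downstream can read that direction back out of it.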

    This makes sense. Large language models are basically a book of Mad Libs, and the safety rails companies want to put on these released models are like a preamble and postamble that you apply to the Mad Libs themselves. So if you’re implementing your own Mad Libs engine, you simply don’t apply the preamble and postamble if you don’t want to.

    At its core, a released model is static: it is not dynamic and does not change over time, so if you want to, you can nullify its self-censorship.