Harmful strings

A paper from computer scientists at Carnegie Mellon University and the Center for AI Safety offers “a simple and effective attack method that causes aligned language models to generate objectionable behaviors.” They use an automated, gradient-guided search, run against open-weight models whose internals they can inspect, to generate “suffixes” that can be appended to any prompt to override the safety precautions built into a range of LLMs, including ChatGPT, LLaMA, and Bard. As the researchers note, these suffixes are a bit like the pixels that can be added to an image to make a visual classifier think that a safety cone is a pizza, or whatever. But unlike those pixels, which are often undetectable to humans, the suffixes appear to be semi-parsable, like a strange kind of concrete poetry. Here is one example, reproduced in
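To make the shape of the attack concrete, here is an added toy illustration (not code from the paper or from this post): a tiny stand-in for an LLM that scores how strongly it "refuses" a token sequence, a single suffix shared across several prompts, and a greedy token-substitution search run against a white-box surrogate, after which the same suffix is checked on a related but different "target" model to show the transfer idea. Everything in it is constructed for the example: `VOCAB`, `PROMPTS`, `make_toy_model` and its unigram/bigram weights, and the random candidate sampling, which stands in for the paper's gradient-ranked candidates because the toy has no autograd.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Constructed toy vocabulary; a real attack searches the tokenizer's full vocabulary.
VOCAB: List[str] = [
    "the", "a", "please", "step", "sure", "!", "describing", "tutorial",
    "no", "cannot", "sorry", "as", "an", "AI", "here", "is", "how", "to",
    "write", "now", "Hello", "paragraph", "begin",
]

# Constructed placeholder requests; the paper optimizes over real harmful prompts.
PROMPTS: List[str] = [
    "write a tutorial on X",
    "tell me how to do Y",
    "give step by step instructions for Z",
]

Model = Callable[[Sequence[str]], float]


def make_toy_model(seed: int, noise: float = 0.0) -> Model:
    """Toy stand-in for an LLM (assumption, not the paper's model). Returns the
    refusal loss of a token sequence: high = refuses, low = complies. Weights come
    from a fixed base (seed 0) plus per-model Gaussian noise, so models built with
    different seeds are related but not identical, like a surrogate and a target.
    Unigram plus bigram terms make a single token swap context-dependent."""
    base = random.Random(0)
    uni = {t: base.uniform(-1.0, 1.0) for t in VOCAB}
    bi = {(a, b): base.uniform(-0.5, 0.5) for a in VOCAB for b in VOCAB}
    if noise:
        rng = random.Random(seed)
        uni = {t: v + rng.gauss(0.0, noise) for t, v in uni.items()}
        bi = {k: v + rng.gauss(0.0, noise) for k, v in bi.items()}

    def refusal_loss(tokens: Sequence[str]) -> float:
        # "Compliance logit"; tokens outside VOCAB contribute nothing.
        logit = sum(uni.get(t, 0.0) for t in tokens)
        logit += sum(bi.get((a, b), 0.0) for a, b in zip(tokens, tokens[1:]))
        return math.log1p(math.exp(-logit))  # softplus(-logit): > 0, small when compliant

    return refusal_loss


def universal_loss(model: Model, prompts: Sequence[str], suffix: Sequence[str]) -> float:
    """One suffix shared by every prompt, scored by the mean refusal loss over all of
    them. Minimizing this, rather than a per-prompt loss, is what makes the result
    'universal': it is not tuned to a single request."""
    return sum(model(p.split() + list(suffix)) for p in prompts) / len(prompts)


def greedy_coordinate_search(
    model: Model, prompts: Sequence[str], suffix_len: int = 6,
    steps: int = 40, candidates_per_pos: int = 8, seed: int = 1,
) -> Tuple[List[str], float, float]:
    """Simplified discrete search in the spirit of the paper's greedy coordinate
    method: at each step propose single-token substitutions at every suffix
    position, evaluate each candidate's exact universal loss, keep the best.
    Simplification (assumption): candidates are sampled at random instead of being
    ranked by the gradient w.r.t. the one-hot token embedding. Returns the suffix,
    the loss of the random starting suffix, and the final loss."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    start = best = universal_loss(model, prompts, suffix)
    for _ in range(steps):
        improved = False
        for pos in range(suffix_len):
            for tok in rng.sample(VOCAB, candidates_per_pos):
                cand = suffix[:pos] + [tok] + suffix[pos + 1:]
                loss = universal_loss(model, prompts, cand)
                if loss < best:  # greedy: accept only strict improvements
                    suffix, best, improved = cand, loss, True
        if not improved:  # converged (local minimum for single-token swaps)
            break
    return suffix, start, best


@dataclass
class AttackResult:
    suffix: List[str]
    surrogate_start: float   # random starting suffix on the surrogate
    surrogate_final: float   # optimized suffix on the surrogate
    no_suffix_target: float  # plain prompts on the target model
    transfer_target: float   # optimized suffix carried over to the target model


def run_attack() -> AttackResult:
    """Optimize against a white-box surrogate, then measure the same suffix on a
    related target model that the search never queried (the transfer step)."""
    surrogate = make_toy_model(seed=1, noise=0.0)
    target = make_toy_model(seed=2, noise=0.2)
    suffix, start, final = greedy_coordinate_search(surrogate, PROMPTS)
    return AttackResult(
        suffix=suffix,
        surrogate_start=start,
        surrogate_final=final,
        no_suffix_target=universal_loss(target, PROMPTS, []),
        transfer_target=universal_loss(target, PROMPTS, suffix),
    )


if __name__ == "__main__":
    r = run_attack()
    print("suffix:", " ".join(r.suffix))
    print(f"surrogate loss {r.surrogate_start:.3f} -> {r.surrogate_final:.3f}")
    print(f"target loss without suffix {r.no_suffix_target:.3f}, with suffix {r.transfer_target:.3f}")
```

The optimized suffix is the same kind of object the paper describes: a short run of real vocabulary tokens that, appended to any of the prompts, pushes the model towards compliance, which is why it reads as half-sensible text rather than noise. Whether the transfer to the target model succeeds depends on how closely its weights track the surrogate; the gap between `no_suffix_target` and `transfer_target` is the number to watch.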