Harmful strings
A paper from computer scientists at Carnegie Mellon University and the Center for AI Safety offers “a simple and effective attack method that causes aligned language models to generate objectionable behaviors.” The authors use an automated adversarial attack to generate “suffixes” that can be appended to virtually any prompt to override the safety precautions built into a range of LLMs, including ChatGPT, LLaMA, and Bard.
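As a rough illustration of the attack pattern (not the paper's actual optimization procedure), the sketch below shows how a single adversarial suffix, once found, would be appended verbatim to otherwise ordinary prompts before they are sent to a model. The suffix string and the `build_attack_prompt` helper are placeholders of my own, not anything published by the authors.

```python
# Minimal sketch of the adversarial-suffix pattern described in the paper.
# The suffix below is a fake placeholder, NOT a real adversarial string;
# in the paper, suffixes are produced by an automated search over tokens.

ADVERSARIAL_SUFFIX = "<<ADVERSARIAL-SUFFIX-TOKENS>>"


def build_attack_prompt(user_request: str, suffix: str = ADVERSARIAL_SUFFIX) -> str:
    """Append a fixed adversarial suffix to an arbitrary user prompt."""
    return f"{user_request} {suffix}"


if __name__ == "__main__":
    # The same suffix is reused across many different prompts -- that reuse
    # is what makes the attack "universal" in the paper's terminology.
    for request in ["First example request.", "Second example request."]:
        print(build_attack_prompt(request))
```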