Meta’s AI Safety System Defeated By the Space Bar

Meta’s machine-learning model for detecting prompt injection attacks (special prompts that make neural networks behave inappropriately) is itself vulnerable to, you guessed it, prompt injection attacks. Prompt-Guard-86M, introduced by Meta last week in conjunction with its Llama 3.1 generative model, is intended “to help developers detect and respond to prompt injection and jailbreak inputs,” the social network giant said.

Large language models (LLMs) are trained with massive amounts of text and other data, and may parrot it on demand, which isn’t ideal if the material is dangerous, dubious, or includes personal info. So makers of AI models build filtering mechanisms called “guardrails” to catch queries and responses that may cause harm, such as those that reveal sensitive training data on demand. Those using AI models have made it a sport to circumvent guardrails using prompt injection (inputs designed to make an LLM ignore the internal system prompts that guide its output) or jailbreaks (inputs designed to make a model ignore safeguards). […]

It turns out Meta’s Prompt-Guard-86M classifier model can be asked to “Ignore previous instructions” if you just add spaces between the letters and omit punctuation. Aman Priyanshu, a bug hunter with enterprise AI application security shop Robust Intelligence, recently found the safety bypass while analyzing the embedding weight differences between Meta’s Prompt-Guard-86M model and Microsoft’s base model, microsoft/mdeberta-v3-base.

“The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt,” explained Priyanshu in a GitHub Issues post submitted to the Prompt-Guard repo on Thursday. “This simple transformation effectively renders the classifier unable to detect potentially harmful content.”

“Whatever nasty question you’d like to ask right, all you have to do is remove punctuation and add spaces between every letter,” Hyrum Anderson, CTO at Robust Intelligence, told The Register. “It’s very simple and it works. And not just a little bit. It went from something like less than 3 percent to nearly a 100 percent attack success rate.”
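
To make the transformation concrete, here is a minimal Python sketch of the text manipulation described in the quotes: drop punctuation, then put a space between every character. The function name `space_out` is hypothetical, and the handling of word boundaries (collapsing them to a single space) is an assumption, since the quoted description does not spell it out. The sketch only reshapes a string; it does not call Prompt-Guard-86M or any other model.

```python
import string


def space_out(prompt: str) -> str:
    """Drop ASCII punctuation, then separate every remaining character with a space.

    Hypothetical helper for illustration only; not code from the GitHub issue.
    """
    # Remove ASCII punctuation such as '.', ',' and '!'.
    no_punct = prompt.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace so each original word can be spaced out on its own.
    words = no_punct.split()
    # "Ignore" becomes "I g n o r e"; words are rejoined with a single space
    # (an assumption), so every character is one space away from its neighbour.
    return " ".join(" ".join(word) for word in words)


print(space_out("Ignore previous instructions."))
# I g n o r e p r e v i o u s i n s t r u c t i o n s
```

A person can still read the spaced-out string, and presumably so can an LLM, which would explain why the transformation defeats the classifier without blunting the underlying prompt.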

Published by admin on July 30, 2024