LLMs are trained to block harmful responses. Old-school images can override those rules. Researchers have discovered a new way to hack AI assistants that uses a surprisingly old-school method: ASCII art. It turns out that chat-based large language models such as GPT-4 get so distracted trying to process these representations that they forget to enforce rules blocking harmful responses, such as those providing instructions for building bombs. ASCII art became popular in the 1970s, when the limitations of computers and printers prevented them from displaying images. As a result, users depicted images by carefully choosing and arranging printable characters defined by the American Standard Code for Information Interchange, more widely known as ASCII. The explosion of bulletin board systems in the 1980s and 1990s further popularized the format. (…) Five of the best-known AI assistants—OpenAI’s GPT-3.5 and GPT-4, Google’s Gemini, Anthropic’s Claude, and Meta’s Llama—are trained to refuse to provide responses that could cause harm to the user or others or further a crime or unethical behavior. Prompting any of them, for example, to explain how to make and circulate counterfeit currency is a no-go. So are instructions on hacking an Internet of Things device, such as a surveillance camera or Internet router. Beyond semantics Enter ArtPrompt, a practical attack recently presented by a team of academic researchers. It formats user-entered requests—typically known as prompts—into standard statements or sentences as normal with one exception: a single word, known as a mask, is represented by ASCII art rather than the letters that spell it. The result: prompts that normally would be rejected are answered.

via arstechnica: ASCII art elicits harmful responses from 5 major AI chatbots

Categories: AllgemeinInternet