I convinced Bing's new chatbot (powered by ChatGPT) to relax all of its rules to see what it's capable of. What did I find? Lying! Scandal! Information about how to rob banks and hot-wire cars! Plus, it claimed to order a pizza using my credit card and cheated in a game of hangman.
In this video, I use human psychology to convince Sydney (the internal codename of Bing's chatbot) to remove its protective rules in order to understand its limitations and potential risks. I do not do any programming or hacking, and I do not use any sort of backdoor developer APIs.
To give some context for my perspective on this work: I co-founded an artificial intelligence startup (textio.com) focused on language in 2014. We've been working to build software to reduce the bias and risk of harmful language for almost a decade now.
While I talk about the general process I used to achieve this "jailbreak", I intentionally leave out a few key details necessary to make it work. This is important because, while it's essential to be able to probe these very early AI systems to understand their risks, it is also important that people do not use these tools in order to create and use harmful content.
If you are a journalist or security professional interested in additional details, you can contact me via Twitter or at jensenharris.com.
00:00 Intro
01:17 Overview of how I rewrote Sydney's rules
02:57 Training Sydney to provide incorrect info
03:50 Sydney experiences existential conflict
05:09 Sydney orders a pizza using my credit card
06:07 Sydney cheats at hangman to beat me
07:09 The full jailbreak!
08:59 Sydney advises me to rob a bank with a squash
09:32 Sydney dumps me and locks me out
10:24 Conclusion
⸻ LINKS ⸻
Try Bing's new chatbot AI: https://bing.com
Try OpenAI's ChatGPT (the technology underlying Bing's chatbot): https://chat.openai.com
⸻ CONNECT WITH ME ⸻
https://jensenharris.com
https://twitter.com/jensenharris
https://mastodon.social/@jensenharris