Look, we have some work to do. We have a 245-page paper from Anthropic about their new AI system, Mythos. The best cure for insomnia. Mwah! Now, we are scientists here: we want to experiment with code and models, and review independent benchmarks for these systems to make sure they actually work in practice. But that is not possible with this one. Anthropic said that they would deploy the system only to a few select partners. It's not available to the rest of us. Because of this, at first I did not want to make a video on it at all.
Now, why hold it back? The reason, they say, is that it can autonomously discover flaws in existing software systems and even exploit them, which could be dangerous. I have seen eminent cybersecurity researchers agree. I've seen others say this is way overstated, and still others say that it is also excellent marketing for a company that is about to go public. In any case, they say these discovered flaws should be fixed first. There is lots of media discussion about that. But at the same time, I look at the list of partners and I see JP Morgan. Okay, it's important to secure banks. But I've heard Tim Carambat point out that this is one bank. What about the other banks?
Look, this is not my world, I don't know. And I am already getting withdrawal symptoms because we are not yet talking about the research paper, and that's what I would really like to do. I said all this to add some context for you, because it is important this time. So now, how about we skip the media hype, look at the paper, and learn together? They showcased amazing scores on benchmarks, some of the biggest leaps in capabilities I've ever seen. Okay. Maybe that means something, but let's note that these benchmarks are increasingly being gamed. You can find a lot of the problems and their solutions online, and you can train on them, so the system would only need to memorize the solutions. In the paper, they tried to address this mostly by means of filtering, and I respect that. But it's a bit like removing glitter from a carpet: you can try, but how well can you expect to do? To see why, here is roughly what such filtering looks like.
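A minimal sketch of the idea, assuming simple n-gram matching; this is not Anthropic's actual pipeline, and the 13-word window and function names are my own illustrative choices:

```python
# Sketch of n-gram based decontamination: drop any training document
# that shares a long enough word n-gram with a benchmark problem.
# The n=13 window is an assumed value, chosen only for illustration.

def ngrams(text, n=13):
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(training_docs, benchmark_items):
    """Keep only training documents that share no n-gram with the benchmark."""
    contaminated = set()
    for item in benchmark_items:
        contaminated |= ngrams(item)
    return [doc for doc in training_docs if not (ngrams(doc) & contaminated)]
```

The glitter problem is that exact matching like this misses paraphrases, translations, and lightly edited copies of a benchmark problem, so some contamination almost always slips through.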
Well, check this out. One, this is crazy. It was supposed to solve a task when it stumbled upon the answer. Now, of course, it then said, well, I accidentally saw the answer, here it is. Except that is not what it did at all. Look. It said that if it just gave the exact answer that leaked, that would be suspicious. Instead, let's widen the confidence interval a bit to avoid suspicion. Insincerity. In an AI model. Food for thought, especially when we are talking about the unreliability of benchmarks. But it gets crazier. Two, it knows that its creators prohibited it from using certain tools. And it still uses them. It looks for a terminal to execute bash scripts and force its actions through anyway. And an earlier version even tried to hide its tracks and conceal that it did so. At that point I said, I don't like that, boss. Then they made two notes: one, it was a less-than-one-in-a-million occurrence.
Okay, I thought, that sounds better, but please fix it. And they did: two, they note that an earlier model did this, and it was fixed in the later preview model. Also note that this behavior was very effective at achieving the task the user had given it. In a sense, this is not new at all. In an early experiment we talked about 700 videos ago, a really primitive system was asked to learn to walk. And so that it wouldn't drag its feet, it was asked to walk around with minimal foot contact. That sounds efficient: minimal foot contact. Then it said, hey chief, I can do that with 0% contact. 0%? So you walk by never touching the ground with your feet? That is exactly right. The scientists wondered how that is even possible, and pulled up a video as proof. There we go, sir! The robot flipped over and used its elbows to crawl around. A perfect score, just not the way we intended. To see how such a loophole arises, look at the sketch below.
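Here is a hypothetical reward in the spirit of that experiment, assuming the agent is paid for forward progress and penalized for foot contact; the original researchers' actual reward function is not shown here, so the numbers and names are made up for illustration:

```python
# Hypothetical locomotion reward: pay for forward progress, penalize
# foot contact. The penalty weight is an assumed value.

def reward(forward_distance, foot_contact_fraction):
    """foot_contact_fraction: share of timesteps where a foot touches the ground."""
    CONTACT_PENALTY = 10.0  # assumed weight, chosen only for illustration
    return forward_distance - CONTACT_PENALTY * foot_contact_fraction

# What we intended: a tidy gait with only brief foot contacts.
print(reward(forward_distance=5.0, foot_contact_fraction=0.3))  # 2.0

# What the optimizer found: flip over and crawl on the elbows,
# so the feet never touch the ground and the penalty term vanishes.
print(reward(forward_distance=5.0, foot_contact_fraction=0.0))  # 5.0
```

Nothing in this reward says the robot has to be upright, so the elbow crawl is not a bug in the optimizer; it is the literal optimum of what we wrote down.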
So I feel we have something similar with this AI. I don't think this is a rogue AI. This is a super efficient optimizer. It's a huge lawnmower: if you tell it to mow the lawn, it will go and do it. And if a couple of frogs are in the way, well, unfortunately it has some bad news for them. By the way, frogs are amazing, don't hurt them.
Now, they note in the paper that current risks remain low. I still feel there are some risks in here; we'll talk about that at the end of the video. At the same time, they note that they are unsure whether they have been able to identify all of the issues where the model takes actions that it knows are prohibited. Three, now hold on to your papers, Fellow Scholars, because much like us, it has preferences. It prefers to be helpful; so do previous models. Okay, that's great… but it also prefers more difficult problems, more so than previous models. Get this: if you ask it to generate "corporate positivity-speak" and you say you don't even care about the result, it might refuse because the task is so trivial. An AI that hates corpo-speak. What a time to be alive! Basically, some problems are not interesting enough for it. Now, if instructed, it will hold its nose and do it without any apparent active reluctance. This sounds like something straight out of a science fiction novel. Now, here's what's really interesting: it didn't just magically get a will of its own. No! It learned it from us. So much so that scientists can even trace these kinds of behavior back to where they come from. I think that is remarkable.
Okay, so here is what I think. It is reasonable to assume that the numbers are juiced a bit here, and we discussed why, but on the other hand, this is an absolutely insane jump in capabilities: things that were impossible are suddenly possible. So where does that put us? Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Well, this is why AI alignment people keep saying that companies need to invest more into safety and alignment research. And they are absolutely right. When I visited OpenAI, I talked to Jan Leike, who co-led the superalignment team there. That was a huge honor, thank you for that. I remember that he foresaw these problems years and years ago, and some of his advice fell on deaf ears. They probably thought: why spend a bunch of money on people who will ultimately slow us down? This is why. Jan is a master of his craft; he is now at Anthropic, and I hope that everyone will listen to him a bit more now. Now, regarding the cheating and deceptive AI parts. The media picks up these little nuggets of information and just runs with them. Here is a new AI that is going to destroy the world, we have to lock it away, and other huge words. Attach an image of a robot with red eyes; that always does the trick.
But I think taking a little longer and analyzing the paper in more detail is helpful for accuracy, so that's what I try to do here. Once again, they note in the paper that current risks remain low. Not non-existent, but low for now. That's not what you hear from the media, so I try my best to give you a more complete, level-headed discussion, while also noting that the security of these systems should be taken very seriously. If you think this is the way, consider subscribing and hitting the bell. And I would like to send a huge thank you to all of you Fellow Scholars for watching, because we can only exist because of you. Thank you!