Finally, DeepSeek 4 is here, and it is described in a 58-page research paper. And finally, nothing is held back here. I'll be honest, I am feeling a little shy today, so I will do classic Two Minute Papers: microphone, but no camera. This is one of the biggest open and free AI models that we can use and…excuse me? Do you see that? What? A 1 million token context window? In open weights AI? If you ask it to inhale about 1,500 pages of dense documentation, it will do it. But that was the main feature in Google's Gemini not so long ago. I remember flipping out about it 2 years ago. And now, this for free? This sounds absurd! And when I look at the Pro model, you've got to be kidding me. Its results roughly
match the many-billion-dollar frontier models from just a few months ago. Now it is gifted to us mortals. I am trying to emphasize the kind of gift that we are getting here, and my words fail me. Is this heaven? What a time to be alive! And…wait. There is a Flash model that is much smaller, and is somewhat competitive with the Pro? I mean, what is happening? And it doesn't end there. This is just the start! As it keeps outputting more and more text, the new Pro model requires about 3 times less computing power than the previous one, and the lighter Flash model requires about 10 times less computing power.
What am I even reading? How is that even possible? Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér! Well, it does three things that are absolutely magical. One: Compression. Namely, compression for the KV cache - this is the model's scratch pad, where your prompts and your documents are stored as it reads them. Imagine reading a book. You can find answers so much quicker if you compress each paragraph down into one sentence. You keep the book. But now you can search it faster. They call it token-level compression. But even these little summaries add up.
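The paper's actual compressor is learned during training, but the spirit of token-level KV-cache compression can be sketched with something as simple as mean pooling. Note that `compress_kv` is a hypothetical helper, and the block size of 4 is just an illustration - it only shows why squashing blocks of cached vectors shrinks the scratch pad:

```python
import numpy as np

def compress_kv(kv: np.ndarray, block: int = 4) -> np.ndarray:
    """Toy token-level compression: mean-pool the KV cache over blocks of tokens.

    kv: (num_tokens, head_dim) array of cached key (or value) vectors.
    Returns a (num_tokens // block, head_dim) compressed cache.
    """
    n, d = kv.shape
    n_blocks = n // block
    # Drop the ragged tail for simplicity; a real system would keep it uncompressed.
    trimmed = kv[: n_blocks * block]
    return trimmed.reshape(n_blocks, block, d).mean(axis=1)

cache = np.random.randn(1024, 64)    # 1024 tokens of 64-dim cached vectors
small = compress_kv(cache, block=4)  # each summary vector stands in for 4 tokens
print(cache.nbytes // small.nbytes)  # → 4  (the cache is 4x smaller)
```

Each paragraph becomes one sentence, so to speak: every group of 4 cached vectors is replaced by their average, and the memory footprint drops by the same factor.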
What do we do? Well, two. Want to know the overall plot of a James Bond book? See if it's one that we read already? Well, of course, we look at the table of contents. If each chapter has a short name, we can grasp the whole story from that tiny piece of information. The paper describes it as a 128-to-1 compression. They call it Heavily Compressed Attention. Now, the AI sees the whole story at a glance. But scientists at DeepSeek say this is still not enough compression. We need more! Three. Imagine that we want to search for a fight in the book. The table of contents helps a bit,
but may not tell us exactly where the fights are. So, we look at an index. A list of words and phrases and their locations. Okay, so looking for a fight, and bingo! The index gives us the top 5 pages that have fights in them. This is genius, and they call it Compressed Sparse Attention. So, three layers of compression: summaries, structure, index. And suddenly, the three pieces click together. These three reduce memory needs for the KV-cache by about 90%. I had to look twice. Down about 90%. Squashing 100 words into a storage
space of 10? And you are saying that we are not losing basically every piece of information? Yes. That is exactly what they are saying. But we are Fellow Scholars here, we look at proofs and experiments. Now, just to make sure: this is KV-cache compression. You still need to load the whole model, so it does not mean that you can load the full DeepSeek Pro AI onto a toaster. Just want to make sure you know that, because media headlines and hype…you know. And now…hold on to your papers, Fellow Scholars, because this one delivers. They tested it by hiding 8 facts inside increasingly long contexts. So how good is it? Well, they report that the Pro version recalls them better than Gemini
3.1 Pro. That is Google's flagship product. Wow. That is unbelievable. But note that like many other systems, it starts to degrade as you approach the limits of the context window. Then, models forget. Drift. Hallucinate. More text means less truth. Also, let's look at its accuracy versus the previous DeepSeek, especially since this new version is heavily compressing things. Ha. Look at that. This is crazy. It is also fantastic at coding. If you are a coder, great. If you are not a coder, well, you are now. It is so easy to ask it to generate JavaScript code that you can paste into a website
and run, and in some cases, you can even run programs in the DeepSeek window with one click. I am a light transport researcher by trade, that is ray tracing if you will, so I had to try a little coding task related to that and…this is fantastic. It still failed to properly implement more advanced algorithms, so I am excited to see what the next version brings. It is crushing benchmarks…and the competition. At the low-low price of…free. You can self-host it yourself, but the hardware is pricey, so they also provide online access to it, and it is so cheap,
I feel like numbers are losing their meaning. Soon, intelligence will get too cheap to meter. Depending on whether there is a discount or not, you can easily get pricing that is 30 times cheaper than Anthropic's Claude. Even with no discount, things can get 8 to 20 times cheaper. Crazy. Now, let's temper expectations a bit. Limitations. That's what is missing from the media headlines. One, you can almost hear the 1,500 pages fluttering as it churns through them. But wait, I did not say it can also take 10 hours of audio, or a full feature-length movie. There is a reason for that.
This system is unimodal. Not multimodal. No images or audio. It is blind and deaf, if you will. Two, this system is not fully understood, not even by its creators. They report two techniques that magically stabilize training, and they say that they are not quite sure why. I'll note that this is something that happens to every researcher, and I have nothing but respect for the transparency. And three, we noted that if you are pushing against the limits of the context window, things break down a bit. Be careful.
Just want to make sure that you don't get oversold on what is going on here: this still has limitations. Not small ones. But, overall…this is not a small step in open and free AI systems. Congratulations to the team and thank you so much. Now here's what I think. I think this is a great release, a great paper, and great life advice too. Why? Well, you can adapt so many of these ideas to your thinking. Imagine walking in the forest. You want to look at the amazing views in front of you. But then, you trip. Or you look mainly in front of your feet so you don't trip. You watch your step… or you enjoy
the view. Not both. So what is the solution? You do both. Scan near, glance far. Step and look. Local detail, global context. It is the same as what DeepSeek does. Try it out next time you are on a walk, it's weird. You'll see. Let me know in the context, I mean, comments how it went. They also use a technique called Engram - normally, an AI recalculates nearly every fact from scratch every time. Engram lets it simply recall those facts instead. It's not as easy as it sounds; we have a separate video on it, link in the description. And we are still just scratching the surface here.
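The "scan near, glance far" idea can be made concrete with a toy attention mask: each token attends to a small local window of neighbors (near) plus a few global summary slots (far). To be clear, the `build_mask` helper, the window size, and the summary count here are all illustrative assumptions of mine, not DeepSeek's actual implementation:

```python
import numpy as np

def build_mask(num_tokens: int, window: int = 4, num_summaries: int = 2) -> np.ndarray:
    """Boolean attention mask: True where attention is allowed.

    Columns 0..num_summaries-1 are global 'summary' slots (glance far);
    the remaining positions are ordinary tokens that only see a causal
    local window of neighbors (scan near).
    """
    n = num_summaries + num_tokens
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :num_summaries] = True          # every token can glance at the summaries
    for i in range(num_summaries, n):
        lo = max(num_summaries, i - window + 1)
        mask[i, lo : i + 1] = True          # causal local window around each token
    return mask

m = build_mask(num_tokens=8, window=3, num_summaries=2)
print(m[9].sum())  # → 5: the last token sees 2 summaries + 3 local positions, not all 10
```

In a real attention layer this mask would gate the attention scores; the point of the sketch is that the cost per token grows with the window and summary count, not with the full sequence length.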
Now, this is a really advanced research paper, with all the good and the bad, not just the hype. Also, this video was not super fast to make; I rewrote it over and over again. Why is that? Because distilling complex ideas into simple explanations takes time. You get fewer views than others who publish something as quickly as possible. But that's what I try to do here, and it is an honor to do this for such an incredibly smart and receptive audience like you, Fellow Scholars. And thank you so much for appreciating it - this one really made my day. Subscribe and hit the bell if you enjoyed this.