LLMs can’t reason and never will

An LLM is a text generator; it is designed and trained to be one. At the time of GPT-2, it was obvious that labeling it with anything implying intelligence, such as ‘AI’, was a mere marketing gimmick.

Since the release of ChatGPT, the tech sector has made a massive marketing investment to sell the idea that this technology can reason and is smarter than humans.

Tech companies, media outlets, and AI labs all had a vested interest in hyping this technology. Society took their claims at face value, forgetting that the opinion of a tech CEO is heavily biased towards the narrative that their company will lead the next technological (and societal) revolution. The impressive demos we were shown also helped sell this narrative.

Let’s forget about all the hype and get to the facts:

  1. An LLM is a text generator; its only goal is to complete the given text. Claiming it can reason because it predicts plausible text very well is a huge, unproven assumption. Tech companies evaluate and report the quality of their models on reasoning tasks very similar, and often identical, to ones seen during training, and then claim their model is smarter than a PhD student.

    LLMs can do basic interpolation over patterns seen during training. This is spectacular for simple tasks, but it is not proof of genuine reasoning.

  2. Ask any professional AI engineer, and they will tell you that prompting is about tricking the model into doing what you want it to do. The model has no idea what it’s doing. I’ve been building products with LLMs since 2020, and this story has never changed.

    Prompt engineering is a trial-and-error process, specific to each model, in which we search for the right combination of words to shape the model’s probability distribution so that the outputs we want become more likely. It is an engineering task, not a communicative one.

  3. Software developers and companies that rely on AI for coding are less productive. I experienced this firsthand after trying the best coding assistants on the market and returning to writing almost everything myself. I still pay a $20 Claude Code subscription, which I use for very specific requests. However, asking it to write long blocks of code usually takes more time and introduces more bugs than doing it myself. Mapping natural language to code (and vice versa) does not require reasoning power, so LLMs handle it well; as soon as decision-making is involved, they fail. They are good translators, not decision-makers.

    I observed a similar pattern with one of my clients, where the entire team would “vibe-code” a codebase until it became unmaintainable. Recent research, which I have linked at the end, also supports this point.

  4. The technology reached its peak in 2025, and it’s no longer improving. Every new technology goes through a similar cycle: at first it seems miraculous and we don’t know its limitations; it improves performance greatly and keeps improving very fast, so a lot of hype builds around it and we think the improvements will continue forever. At some point the technology reaches its theoretical limit, and only marginal improvements are added afterwards. I recently read about the history of jet engines and how they greatly improved commercial flight speeds in the 60s and 70s. Since then, however, airplane speeds have remained unchanged because the technology hit its limit; going faster adds an exponential cost, so the R&D focus shifted towards efficiency rather than power.

    This is happening with LLMs today. They have reached their theoretical limit, and in the years to come we can expect only marginal improvements and gains in efficiency. They will work faster and cheaper, but the quality of their output won’t improve. Some argue that if you run an LLM in a loop (the current marketing name is “Agent”) for long enough, it could solve hard tasks, so making them faster will in itself make them smarter. This is not true, and you can prove it for yourself with any coding agent like Claude Code. Give it a task that requires decision-making and leave it running unsupervised for as long as it needs. If you can get it to produce a quality, original output, please send me an email, because I’ve never seen this.
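Points 1 and 2 can be made concrete with a toy model. The sketch below is not an actual LLM; it builds a tiny bigram “language model” from an invented three-sentence corpus (all names and data here are illustrative) and completes a prompt by repeatedly picking the most likely next word. The job is the same one an LLM does at vastly larger scale: estimate a conditional distribution over the next token and sample from it. It also shows what prompt engineering manipulates: the prompt is just a prefix, and changing it changes the conditional distribution.

```python
from collections import defaultdict

# Toy corpus; a bigram model only knows which word tends to follow which.
corpus = (
    "the model predicts the next word . "
    "the model completes the given text . "
    "the engineer shapes the prompt ."
).split()

counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def distribution(prev):
    """P(next word | previous word) as a dict of probabilities."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

def complete(prompt, n=5):
    """Greedy completion: always append the most likely next word."""
    words = prompt.split()
    for _ in range(n):
        dist = distribution(words[-1])
        if not dist:
            break
        words.append(max(dist, key=dist.get))
    return " ".join(words)

# The "prompt" is only a prefix that conditions the distribution.
print(distribution("model"))   # -> {'predicts': 0.5, 'completes': 0.5}
print(complete("the engineer"))
```

Whether such a system “reasons” is exactly the unproven assumption point 1 describes: nothing in the mechanism is more than conditional text prediction, only the scale changes.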
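The “LLM in a loop” claim from point 4 is easy to state precisely. An agent is a generate/act/observe loop around a model; the sketch below shows only that structure, with `call_llm` and `run_tool` as invented stand-ins (not real APIs) so it runs without any model. Nothing in the loop itself adds reasoning ability; it only feeds the model’s own output back in as context.

```python
# Minimal sketch of an "agent": an LLM called repeatedly, with each output
# fed back as context. call_llm and run_tool are placeholder stubs, not a
# real model or tool API -- the loop structure is the point.
def call_llm(transcript: str) -> str:
    """Stub: a real agent would send `transcript` to a model here."""
    if "TASK" in transcript and "step 1 done" not in transcript:
        return "ACTION: step 1"
    return "DONE"

def run_tool(action: str) -> str:
    """Stub for executing the model's chosen action (shell, file edit, ...)."""
    return f"{action.removeprefix('ACTION: ')} done"

def agent(task: str, max_steps: int = 10) -> str:
    transcript = f"TASK: {task}\n"
    for _ in range(max_steps):
        reply = call_llm(transcript)               # generate
        if reply == "DONE":
            return transcript
        observation = run_tool(reply)              # act
        transcript += f"{reply}\n{observation}\n"  # observe / feed back
    return transcript

# Running longer only repeats generate -> act -> observe; the loop cannot
# supply decision-making ability the underlying model lacks.
print(agent("refactor module"))
```

Raising `max_steps` makes the loop run longer, not the model smarter, which is why speed alone cannot turn an agent into a decision-maker.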

If LLMs stop improving, is this a bad or a good thing?

Once the public realizes that this technology has reached a plateau, we can shift the focus from “LLMs are going to automate us and take over the universe” to more useful questions: in which situations does it make sense to use them, what are they good at, and what are their limitations? We can then see them as a tool instead of as magic alien technology.

What do we do with this information?

LLMs are a fantastic technology with many useful applications, but a good professional should understand the benefits and limitations of each tool and the scenarios in which to use it. We would greatly benefit from studying the limitations of this technology more deeply in every area of application.

Interesting links about the topic