My LLM Year 2025
How did LLMs – specifically for software development – evolve in 2025?
The state at the end of 2024
An exciting AI year, 2025, is drawing to a close. Let’s recall the end of 2024. The best LLM at the time was ChatGPT o1 – a first, not very convincing attempt at a “thinking” model. With Claude Sonnet 3.5 there was a model available that was quite good at developing small apps, and Google’s Gemini 2.0 also did a decent job there. On the SWE-Bench benchmark, these models solved around 50% of the programming tasks.
In practice you could trust these models with somewhat larger tasks than, say, ChatGPT 4, but they failed at more complex code refactorings. You had to babysit them closely – precise prompts, inching along from one small change to the next.
For normal chit-chat the models were already quite good – still plagued by hallucinations here and there, but useful in many respects. Whether it was help with Excel formulas, book summaries, or explanations of all sorts of things, I usually found ChatGPT’s answers helpful (and you got a compact answer to your question instead of wading through 20 pages of Wikipedia, 19.5 of which were irrelevant trivia).
The beginning of 2025
The first quarter came with three LLM thunderclaps. The biggest in terms of public attention was the release of the Chinese Deepseek R1 model, which could compete with the top US LLMs at a fraction of the training cost and with a pricing model that dramatically undercut the American providers (today, an input token of an OpenAI model costs ten times as much as one of Deepseek R1, with the exception of gpt-5-mini, which cannot compete with R1). People spoke of an “AI Sputnik moment,” and the American AI scene was in turmoil. On the one hand, I found Deepseek R1 overrated: overloaded servers, endless response times, confused reasoning. On the other hand, it is open source with published papers – a nice contrast to the secrecy of OpenAI, Google, and Anthropic.
My wow moments of the quarter were Claude Sonnet 3.7 and Gemini 2.5 Pro. The former was so much better at coding that the new term “vibe coding” truly reflected reality: you no longer had to understand every last detail of what the AI was writing. Skim it as a safety check; otherwise, for small and medium-sized projects, you could almost always trust the AI to have the problem under control. Google’s LLM also had this funny research mode where you could ask trivial questions like “what is the best vacuum cleaning robot with these properties” and receive a thirty-page research report. But that was more of a gimmick – who actually wants to read a thirty-page report?
Mid‑year
For me, the most dramatic change came starting in May with Claude 4 and GPT-5 (and, in my case, Cursor): the step to specifying and letting the AI do the work. Previously you wrote software with LLMs piece by piece using detailed prompts, and at around 5,000 lines you hit the AI’s limits, the machines visibly overwhelmed by the complexity of the software. Now you could hand Claude 4 or GPT-5 an entire piece of software specified as a markdown document and tell the machine: “Build this as specified.” And the AI did exactly that, usually with no more than one or two careless mistakes. At the same time, the amount of time an LLM could work purposefully on a task increased significantly.
As a human who is “inside” their project, you have to keep in mind that this is not the case for the AI. You open a new chat and the machine starts from zero. It gets a task, a few documents on the structure of the project, coding standards and the like, plus 10,000 lines of code. As a human you’d say, “Okay, I first need a few days to understand the project.” The Cursor + Claude 4 combination simply gets to work, and you are impressed by the purposeful, clever “grep” commands it uses to find the code locations relevant to the requested change, by its seemingly genuine insights and plans for implementing the change, by the speed of the work, and by the quality of the code and of the automated tests it generates, runs and, if necessary, tweaks.
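To make that “grep” part a bit more concrete, here is a minimal Python sketch of the kind of targeted search such an agent performs before touching anything. The repository path, the symbol names, and the `find_relevant_code` helper are invented for illustration – the real tools are far more sophisticated – but the principle is the same: don’t read all 10,000 lines, search for the handful of names the task mentions.

```python
import subprocess

def find_relevant_code(repo_path: str, symbols: list[str]) -> dict[str, list[str]]:
    """Map each symbol from the task description to the files that mention it.

    A rough imitation of the targeted searches a coding agent runs before
    editing: grep for a few names and only open the files that match.
    """
    hits: dict[str, list[str]] = {}
    for symbol in symbols:
        # grep -r: recurse into the repo, -l: print file names only, -I: skip binaries
        result = subprocess.run(
            ["grep", "-rlI", symbol, repo_path],
            capture_output=True, text=True,
        )
        hits[symbol] = result.stdout.splitlines()
    return hits

# Hypothetical task: "rework the discount calculation" in an imaginary shop project
print(find_relevant_code("./my-shop", ["calculate_discount", "DiscountPolicy"]))
```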
Year’s end
The newest models – Opus 4.5, GPT-5.2 – are stronger still. Where a year earlier 50% on SWE-Bench was the pinnacle of LLM art, the models are now approaching the 80% mark. And then there was “Humanity’s Last Exam”: brutally difficult tasks from all areas of science. The exam website shows 8 sample questions. My impressive score: 0/8. For most of them I didn’t even understand what the question meant. At the end of 2024, ChatGPT o1 led the ranking with 8% correct answers. Now Gemini 3 Pro leads with 38.3%, and GPT-5.2 trails with just under 30%.
You have to let that sink in: the most complicated test questions the international scientific community could come up with. Yes, the site correctly notes that the subject‑matter experts in each field are still significantly better here. But show me the human who can correctly answer even 10% of the questions from over 100 disciplines.
I also found it impressive that at some point both Cursor with the current Claude 4.5 and GitHub Copilot with Cline (using the same LLM) independently started a browser UI and successfully tested the code they had written via browser remote control.
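For readers who haven’t seen this in action: what the agents do there is essentially generate and execute a small end-to-end test that drives the app through a real browser. A minimal sketch of the idea, assuming a Playwright-style setup – the URL and the selectors are invented, not what the tools actually produce:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("http://localhost:3000")              # the freshly built app
    page.fill("#todo-input", "write year-end review")
    page.click("#add-button")
    # Check that the UI really shows the new entry
    assert "write year-end review" in page.inner_text("#todo-list")
    browser.close()
```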
AGI?
In 2025, the question of what the term AGI – on everyone’s lips – is actually supposed to mean, and what it is supposed to be good for, seemed to come up more and more often.
If – as seen above – an AI can answer more test questions correctly than any human, but is not an academic expert in every one of those fields, is it then above or below human level? And shouldn’t our goal be to build LLMs that complement us rather than those that compete with us?
One thing is clear: software development is the endgame of the AI companies – the fastest path to superintelligence will be reached once not a few thousand people but millions of equally capable AI developers build the next generation of even better AI developers. Then you would quickly reach the infamous singularity in software development, with AIs evolving faster than is conceivable today.
Beyond that, we humans currently lack the imagination and creativity to train these machines differently, to test them with something other than “can you do what I can do?” Here’s hoping for 2026.
AI slop
The AI haters also had plenty to do in 2025. Now that apparently even the last experts have abandoned the term “stochastic parrot,” the term “AI slop” is in vogue – a derogatory label for everything AI-generated, implying that only real humans can create true art, true statements, something beautiful, graceful, and worth reading, and that AI output is by definition worthless.
I find such blanket statements regrettable. There is a lot to discuss around AI – think of the Trump-throws-feces-from-the-airplane video. How do we make AI-generated content distinguishable from human-generated content? How can deepfakes be prevented? Are AI-generated posts on social media good or bad? The percentage of “human slop” on social media is by no means low; disinformation by devious humans is a major problem. Where can AI help, where does it do harm? All of that gets buried under the mash of the slop argument, mixed with the mustard of the “it’s all stolen” argument.
The greatest AI dangers
The discussions about AI dangers in 2025 were also interesting. Far too much focus on Terminator‑style misaligned AI scenarios. Far too little focus on what I consider the much more critical problems:
- The efforts by the leading AI providers – most clearly demonstrable in Elon Musk’s Grok – to align AIs with their own worldview. Can you trust an AI that has been trained to hold certain opinions on certain topics?
- The naive trust many people place in AIs, combined with the MCP protocol, which lets AIs break out of their chat sandboxes. Stories like the Ikea support assistant that hallucinated a shipment should have received much more attention than they did. Imagine some cool new real-time AI broker software, used by many naive people, that at some point hallucinates a stock market crash and triggers a real one through automatic panic selling.
- The dramatic lag of the EU in this important technology. For about half a year I had a Mistral subscription; I really tried to go “EU first.” But the gap to the front-runners is too big and seems to be growing. If LLMs continue to develop into the disruptive technology that many expect, Europe will fall behind in a way that really hurts. My only hope here is that the language models themselves become commodities, so that it won’t matter so much who owns this software layer, but rather what you do with it.
My oracle for 2026
While LLMs today are still best suited for small apps and prototypes in software development, each year increases the size of the codebase that an LLM can still handle confidently. I’m guessing around 100,000 lines of code by the end of 2026, which corresponds to the size of many “real” projects. Development by specification will be the norm at the end of 2026.
What happens to apps will be exciting. Will all the little apps – to‑do lists, Wi‑Fi scanners, calculators, text editors, notebooks, diaries, casual games – die out? Will there be large sites (presumably owned by the LLM providers) where, instead of buying an app that roughly does what you want, you commission the AI to build custom‑made apps, and then have a larger number of personal apps on your phone? Will there be libraries of app specifications, similar to how there are libraries of images and 3D models today? Ones you can take, adapt, and then hand over to the AI to develop?
That’s my prediction: by the end of 2026, off-the-shelf apps will be in retreat, and personal apps will be making their way into people’s lives.
I’m curious.