Agents today feel like LLMs did two years ago. Some of us are convinced that the world will change, that agentic systems will take over the things we do, while others are just as convinced that they can do nothing but make slop.
The problem is randomness and path dependence. This is the same reason that shipping agentic performance in production is difficult. It's why companies still have small, irreplicable groups of people who can do superhuman things with Claude Code, but no way to scale that capability.
When they first became intelligent enough to be useful, large language models were unwieldy, unpredictable, and imprecise. The tiniest changes in instructions would cause wild, erratic swings in behavior. They would get confused in strange ways and fail at the simplest of tasks.
Yet somehow they showed sparks of AGI that promised a different world - if we could wrangle these intelligences and learn to work with them, we could do things that computers had never been able to do.
Imagine programs that did what was meant instead of what was written - that could read and write complex human language, solve complex problems, plan and reason - things we've only ever known a single species to be capable of. To those of us who jumped into this difficult problem, it was worth helping them help us.
In two years, we've come pretty far. LLMs - as a single API call in a larger system - are now much more tractable. In that time:
- Improved post-training methods have produced more reliable models that don't break when we ask for JSON.
- Better fine-tuning and RL have increased prompt adherence, to the point where we now routinely write multi-page instructions.
- We've begun to figure out how to make them safer without lobotomizing them entirely.
- We've figured out how to use evals to track regressions and rely on model performance.
As with a lot of things, what began as art and wizardry - with magic incantations and LLM whisperers - is becoming a science.
Agentic loops have once again reset the timeline.

The agent (LLM + tools + loop) has become the new primitive, replacing the LLM call.
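To make that shape concrete, here is a minimal sketch of the primitive in Python. Everything in it is an assumption for illustration - `call_llm`, the reply format, and the toy `read_file` tool stand in for a real model API and real tools; the loop itself is the point.

```python
# Minimal agent loop: an LLM, a set of tools, and a loop.
# `call_llm` and the reply schema are hypothetical stand-ins, not any vendor's API.

def read_file(path: str) -> str:
    """Toy tool: return the contents of a file."""
    with open(path) as f:
        return f.read()

TOOLS = {"read_file": read_file}

def run_agent(task: str, call_llm, max_steps: int = 20) -> str:
    """Ask the model what to do next, execute tool calls, feed results back."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # The model either answers directly or requests a tool call.
        reply = call_llm(messages, tools=list(TOOLS))
        if reply["type"] == "answer":
            return reply["content"]
        # Record the tool request, run the tool, and append its result.
        messages.append({"role": "assistant", "content": reply})
        result = TOOLS[reply["name"]](**reply["arguments"])
        messages.append({"role": "tool", "name": reply["name"], "content": result})
    return "Stopped: step limit reached."
```

Each pass around the loop grows the context with another tool result, which is exactly where the problems below come from.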
All of the problems of yesteryear are back:
- Context windows (which had grown comfortably large for a single LLM call) are now insufficient for hundreds of looping tool results.
- Measuring reliability is once again difficult. Agents are extremely path dependent, and a single wrong turn can compound into cascading errors.
- Tests and runs are expensive again.
- Evals are back to being an art. Single-turn LLM evals had been partially tamed with model-as-a-judge research, but evaluating agentic performance requires brand new reward measurement systems.
- Temperature control is gone, and reasoning models trade output consistency for added intelligence. Magic incantations are back.
Yet the same sparks of AGI shine through. In place of programs that could understand what we mean, we have agents that can understand, question, reason, implement, critique, and refine all on their own. If a single LLM call could extract information from a file, an agent could organize and unify a folder - maybe someday a data lake.
Once again, we have the same questions. How do we use these things? How do we make them reliably useful for ourselves? How do we make that value reproducible for the humans around us?
It's going to be an interesting 2026.