As we close out January, Southbridge is a year old. It has been a year of work — of a crazy amount of tests, benchmarks, systems and specs built internally that might never see the light of day, things built to learn what worked as much as what didn't work.
Once more this year, I thought we'd use writing to keep ourselves honest. Here's where we started, what we tried, where we failed, and where we are.
I started Southbridge because I believed that the central bottleneck across the last fifteen years of my career — from control systems for long-range fiber optics to insurtech to computer vision-aided near-infrared targeting to maritime resource optimization — has been data, and the amount of work it takes to make it useful.
We made three bets:
That for the first time in history, data transformation could be fully automated.
That we could solve the whole thing.
And that we could do it with a small team burning lean.
One central conceit underpins all three: We believed that models of the time — armed with the right tools (which didn't exist), restrained by the right harnesses (which hadn't been built), and trained on the right patterns (which weren't invented yet) — could solve data interfacing as a problem, end to end.
At the time, we called it Structure-preserving Ingest and Retrieval (or SPIR). We wanted to build something that could ingest and understand large amounts of structured and unstructured data, with the ability to transform it into any requested shape. To do this, we needed to solve three problems:
First, we needed to build underlying AI application layer infrastructure that didn't exist yet. This was long before ACP or Claude Code, and we'd need to build model routers, harnesses, and tools from scratch. How could we best work with the industry as it caught up?
Second, we needed to find a way to engineer AI systems that were maintainable. It was obvious even in January 2025 that greenfield work was going to accelerate due to models — it was becoming easier to write more code than ever. The problem, however, was maintenance. Having written and deployed agentic systems to production since gpt-3.5, we knew that agentic code tended to rot faster than regular code. How does brownfield AI engineering work in 2025?
Third, we needed to find ways of reducing brittleness in agentic runs by multiple orders of magnitude. Any large scale data understanding task was going to take millions — if not billions — of LLM calls, and a proportionate number of toolcalls. Data tooling often sat at the bottom of a stack, where even a single error could compound into monstrous mistakes. We had to add at least an extra nine of reliability to existing systems if we wanted to be trusted with data.
In retrospect, we were right about all of these things — today we have data agents that we use on a regular basis, that accelerate scientific research and perform hugely complex tasks over days while being maintainable — but we were horribly wrong about how close we were, both as a company and as an industry. We miscalculated how quickly the industry would support the base infrastructure we needed, and how many wrong turns we'd need to test before arriving at a system that scaled.
Here's the first year of Southbridge, from a technical perspective.
Q1: Customers
We spent Q1 close to customers. We started the year believing that we were close to a solution: we had internal prototypes that could ingest flat data, write code using LLMs in real time, verify schemas deterministically, and complete transformations. We believed we could iron out the kinks, improve reliability, and scale up the systems we had as models got cheaper and the space around us caught up.
In the meantime, we wanted to learn more about problems in the wild. We traveled and listened to anyone and everyone with a data problem. We gave out free consulting and solved problems, internally and externally, wherever we could, in exchange for access and data.
We learned that an incredible amount of manual work still goes into making any amount of data useful. Data debt is a monotonically increasing quantity at most companies — and this goes double for AI companies, who have to wrangle diverse user-uploaded content while also dealing with AIs and their superhuman ability to generate even more.
Beneath all of this, a new problem was stirring: stack diversity. We were now confident in a few central problems we could point at solving, but no two users had the same stack. We set this aside for the time being.
Our approach at the time focused on a CISC-like architecture. If the LLM call was an instruction in an agentic system, we were trying to get as much computation out of a single call as possible. If we made our tools specific and "thick" — performing known primitives like filtering and mapping — the workload on the LLM would be reduced, and we could get more done in fewer calls. Today, tools like Greptile operate on similar principles.
This seemed like a good idea at the time, when agentic harnesses were nowhere to be found, and LLM evals by and large focused on single LLM calls. (They still do, but I think internally, we — and a number of other teams — have moved beyond single-turn evals.) Fewer calls meant better testing over a smaller surface area for failure — or so we thought.
Most of the work here (affectionately codenamed Skewless) went into building tools that could deterministically (with minimal AI assistance) flatten and un-nest data, provide information on structure with minimal access costs, and parse what information was parse-able.
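To make the idea concrete, here is a minimal sketch of the kind of deterministic flattening primitive described above. This is not Skewless itself, just an illustration of the shape of tool we were building: no AI in the loop, so the same input always produces the same flat output.

```python
def flatten(record: dict, prefix: str = "", sep: str = ".") -> dict:
    """Deterministically un-nest a JSON-like record into dotted keys.

    Because no model call is involved, the tool is fully testable on
    its own, which was the point of the "thick tools" approach.
    """
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path, sep))
        elif isinstance(value, list):
            # Lists become indexed keys; nested dicts recurse further.
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, f"{path}{sep}{i}", sep))
                else:
                    flat[f"{path}{sep}{i}"] = item
        else:
            flat[path] = value
    return flat
```

Given `{"patient": {"visits": [{"year": 2001}]}}`, this yields `{"patient.visits.0.year": 2001}` — a flat shape an LLM (or a schema checker) can consume without traversing nesting itself.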
Around April, we built our first meta-spec for AI-first data tools as an experiment, aimed at defining AI-first tooling behavior. Little did we know that meta-specs would play a massive role in accelerating development around Q3.
The central theory at the time was that there was a single intermediate format we could transform any input data into, that we could build connectors to and from to effectively build ETL systems.
We were wrong, but it would take us another two months of testing and benchmarking to figure this out.
Q2: Research
Most of Q2 was spent on research. We spent almost all of our time testing and benchmarking model behavior at the edge. How did these new AIs behave at either extreme of their context windows? How did they react to structured information? At the scales we were expecting to deploy them, even the smallest change in format or intelligence would have a compounding difference.
We also built what we believe is the first benchmark for complex data work with real and synthetic data across healthcare, finance, web-based tasks, and a few more. But we didn't have a way of running it yet: the only intelligence capable of evaluating performance on the benchmark was human. We ran into the P-NP problem of AI: how do we build tests for a smarter intelligence that a dumber intelligence can verify?
(We would later solve this problem well enough to help with hiring.)
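The asymmetry is easy to illustrate: checks on the shape of an output are cheap and deterministic, while the semantic judgments the benchmark actually needed were, at the time, human-only. A hypothetical verifier of the cheap kind (names invented for illustration):

```python
def verify_output(rows: list[dict], schema: dict[str, type]) -> list[str]:
    """Cheap, deterministic checks a 'dumber' verifier can run, even
    when producing `rows` took an expensive agentic run.

    These catch shape errors only. The semantic errors that actually
    mattered on the benchmark still required human judgment.
    """
    errors = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing {sorted(missing)}")
        for col, expected in schema.items():
            if col in row and not isinstance(row[col], expected):
                errors.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, "
                    f"expected {expected.__name__}"
                )
    return errors
```

Verifying membership in a schema is polynomial-cheap; producing a correct transformation is the hard direction, which is what made the benchmark hard to automate.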
Models at this point were performing wonderfully inside our test systems, but the brittleness worried us: if we can't get transformation A to work 99.999% of the time, how do we make it the foundation of a much longer and larger pipeline?
This is when we began exploring reinforcement learning for data transformation systems. We wrote a detailed specification framing data transformation as a sequential decision-making problem — states, actions, rewards, the works. We'd later abandon this path once we abandoned the concept of a singular central format that inputs and outputs can connect to. It took a crazy amount of work to rediscover that there is no singular format that works. Too much information about a transformation is only known at runtime, from the actual destination.
The RL spec survives as an artifact. Parts of it — the multi-resolution state representation, the curiosity-driven exploration, the options framework for temporal abstraction — would reappear in different forms in the system we eventually built. But the framing was wrong: we were trying to optimize a single agent's decisions, when what we actually needed was a way to compose many agents into reliable sequences.
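As a rough illustration of the abandoned framing (all names here are hypothetical, not the actual spec), the sequential-decision view looked something like this: a multi-resolution state, discrete transformation actions, and a reward shaped by progress toward a target schema.

```python
from dataclasses import dataclass, field

@dataclass
class TransformState:
    """Multi-resolution view of data mid-transformation:
    a coarse summary plus a handful of concrete sample rows."""
    schema_summary: str
    sample_rows: list[dict] = field(default_factory=list)

@dataclass
class TransformStep:
    """One action in the sequential decision problem."""
    operation: str            # e.g. "flatten", "cast", "join"
    target_columns: list[str] = field(default_factory=list)

def reward(before: TransformState, after: TransformState,
           target_schema: set[str]) -> float:
    """Reward as progress toward the target schema. Sparse signals
    like this were one reason the framing struggled: most of what
    defines 'done' is only knowable at runtime, from the destination."""
    def coverage(state: TransformState) -> float:
        if not state.sample_rows:
            return 0.0
        return len(target_schema & state.sample_rows[0].keys()) / len(target_schema)
    return coverage(after) - coverage(before)
```

The flaw wasn't the machinery, it was the objective: optimizing one agent's policy against a fixed intermediate format, when the real problem was composing many agents reliably.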
Q3: Building
The start of Q3 (June) is when we switch completely from writing code to writing specs. Almost all product development from this point on gets done by writing large, 5-10 thousand word specifications. We're betting that by Opus 4, code can be generated as needed, and iterations can happen faster and more collaboratively over English specifications.
This bet pays off.
We build specs for things we need internally that later models are able to simply one-shot.
At this point, we've validated a large number of paths that don't work. We've tried and discarded the silent observer pattern (a la mem0), the filesystem architecture (a la agentfs), the CISC system, and a few others as ultimately infeasible (for us and our combination of team and problem).
High conviction, low confidence — and confidence was the lowest here. We started six months ago with the ambition of solving general purpose data transformations — any and all of them. Since then we'd slowly and methodically cut our ambitions down:
- By February we were focusing only on Ingest as a problem. By the end of February, this became relational ingest: transforming any input data to flat formats.
- By April we'd sliced that down to data Understanding and cataloguing — building a 'data pack' of sorts from any input information.
- By June we had reached Crawling as the base problem — can you (in a budget-conscious way) crawl and index large heterogeneous data to produce AI-accessible intermediates?
It was this month that we would do two things that marked the turning point.
First, we would take Claude Code apart internally to discover that agentic loops could function as the new base primitive, that we could measure systems not by the individual success of a single call, but by tuning behavior over hundreds of calls. This was the CISC-to-RISC moment: instead of cramming intelligence into each tool, make the tools simple and let the loop do the work.
Second, we would narrow even further down to connectors and addressing as the problem we needed to solve first.
Both of these things together led us to what would become the bottom piece of our current stack — and the solution to our problems. We called them Sealed Agents.
Things accelerate from here on.
Q4: Sealed Agents, Sealed Flows, and a Runtime
Q4 is a series of rapidly evolving experiments. Sealed Agents — the idea that we could "seal" agents in a box with tasks and behavioral controls — quickly become Sealed Agentic Flows (SAFs), where these boxes could be chained together to form more complex operations, increasing capability without an exponential increase in brittleness.
We start running SAFs on our benchmarks and discover that we can solve more complex problems than ever, reliably.
When things broke, there were now ways to fix them.
This is the sentence I keep coming back to, because it's the thing that was missing from every other approach we tried. Debugging an agent in a freeform loop is like debugging a dream — you can describe what happened, but you can't reproduce it, isolate it, or hand it to someone else to fix. SAFs made the agentic system declarative: the prompts, the code, the behavior expectations, the checkpoints, the boundaries — all of it visible, all of it modifiable.
This is what made brownfield agentic engineering possible. You could take a flow that someone else wrote, understand what it was supposed to do, find the part that was broken, and fix it. You could hand it to a new teammate — or to an AI. We were no longer building sandcastles.
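A toy sketch of what "declarative" means here, with hypothetical field names rather than our actual schema. Everything that defines a step's behavior is data, so a broken step can be read, diffed, and handed to someone (or something) else:

```python
from dataclasses import dataclass, field

@dataclass
class SealedAgent:
    """An agent 'sealed' in a box: prompt, tools, boundaries, and
    output expectations are all declared up front, so the step can
    be inspected and fixed in isolation."""
    name: str
    prompt: str
    tools: list[str]
    max_turns: int = 20                               # hard boundary
    output_checks: list[str] = field(default_factory=list)

@dataclass
class SealedFlow:
    """Agents chained in sequence. Each boundary is a checkpoint
    where state can be saved, verified, and rolled back to."""
    name: str
    steps: list[SealedAgent]

crawl_flow = SealedFlow(
    name="crawl-and-catalog",
    steps=[
        SealedAgent("crawler", "Index every file under the input root.",
                    tools=["ls", "read"],
                    output_checks=["manifest_exists"]),
        SealedAgent("cataloguer", "Build a codebook from the manifest.",
                    tools=["read", "write"],
                    output_checks=["codebook_valid"]),
    ],
)
```

Contrast this with a freeform loop: here, "what was this step supposed to do?" has an answer you can point at.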
In July, we start work on a proper runtime. We need something that can execute these flows with auto-recovery, monitoring, checkpoints and rollbacks. Something that can manage the lifecycle of agents that run for hours — or days. We call it Tadpole.
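The core loop such a runtime needs is small, even if the production version is not. A sketch with an invented API rather than Tadpole's real one: steps run in order, each success is checkpointed, failures retry, and a restarted run resumes from the last checkpoint instead of starting over.

```python
import json
import pathlib

def run_flow(steps, run_step, checkpoint_dir=None, max_retries=2):
    """Execute flow steps with checkpointing and retry.

    `run_step(name, state) -> state` is supplied by the caller.
    All names are illustrative, not Tadpole's actual interface.
    """
    ckpt = pathlib.Path(checkpoint_dir) if checkpoint_dir else None
    if ckpt:
        ckpt.mkdir(parents=True, exist_ok=True)
    state = {}
    for i, name in enumerate(steps):
        saved = ckpt / f"{i}_{name}.json" if ckpt else None
        if saved and saved.exists():
            # Resume: reload the checkpoint and skip the completed step.
            state = json.loads(saved.read_text())
            continue
        for attempt in range(max_retries + 1):
            try:
                state = run_step(name, state)
                if saved:
                    saved.write_text(json.dumps(state))  # checkpoint
                break
            except Exception:
                if attempt == max_retries:
                    raise  # exhausted retries; surface the failure
    return state
```

For agents that run for days, the resume path matters more than the happy path: a crash at hour 30 costs minutes, not the whole run.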
By August, we're running Tadpole on real partner data. A 3GB health dataset — 1415 files spanning 24 years of longitudinal data — gets fully crawled, understood, and catalogued in 4 unsupervised hours. The system identifies coding inconsistencies across survey cycles, finds structural anomalies in sequence numbering, detects data series that appeared and vanished between cycles, and produces a usable codebook. The partner team's estimate for the same work done manually: a year of calculated man-hours. We start accumulating hours of productive agentic work that actually count.
By October, the architecture has stabilized enough that we can focus on what sits above: domain-specific programs. We call these programs strands (later hanks, thanks Amazon), and they encode the specific knowledge needed for a class of data tasks — how to handle healthcare codebooks, how to reconcile financial data sources, how to parse and unify telco workflows.
By November, we test the hypothesis that AIs can re-weave these programs — modify, extend, and retarget existing hanks for new use cases. It works. This solves the human-in-the-loop problem for scaling: you don't need an engineer to hand-write every new hank from scratch.
By December, we build the first polymorphic connector — a combination of meta-specs and an eval suite built into a hank — that enables the runtime itself to connect to multiple agentic backbones (Claude Code SDK, Codex, Gemini CLI). The runtime, now renamed Hankweave, is no longer coupled to any single provider.
And in January 2026, we open-source it.
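The connector pattern itself is simple; the hard part is the meta-spec and eval suite that prove a hank behaves equivalently across backbones. A structural sketch with invented names (not Hankweave's actual interface):

```python
from typing import Protocol

class AgenticBackbone(Protocol):
    """The minimal surface a backbone must expose to the runtime.
    Illustrative only; the real interface is richer."""
    def run(self, prompt: str, tools: list[str]) -> str: ...

class ClaudeCodeBackbone:
    def run(self, prompt: str, tools: list[str]) -> str:
        # A real implementation would call the Claude Code SDK here.
        return f"[claude-code] {prompt}"

class GeminiCLIBackbone:
    def run(self, prompt: str, tools: list[str]) -> str:
        # A real implementation would shell out to the Gemini CLI here.
        return f"[gemini-cli] {prompt}"

BACKBONES: dict[str, AgenticBackbone] = {
    "claude-code": ClaudeCodeBackbone(),
    "gemini-cli": GeminiCLIBackbone(),
}

def execute(backbone: str, prompt: str) -> str:
    """The runtime picks a backbone by name; the hank's built-in
    eval suite then checks results are equivalent across backbones."""
    return BACKBONES[backbone].run(prompt, tools=[])
```

Because the runtime only depends on the protocol, swapping providers is a registry change, not a rewrite, which is what decoupling from any single vendor bought us.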
Looking Back, Looking Forward
I'm leaving out another four chapters' worth of things that didn't work — from using LLMs as event generators, to agentic filesystems, all the way to approaches that tried to find pivot points in datasets that could be used to index the entire thing based on connectivity (we called these Rosetta stones). The list of approaches we tried and discarded would fill its own post.
But what strikes me most, looking back, is the shape of the path. We didn't pivot — we narrowed. Every quarter, the problem we were working on got more specific, more constrained, and closer to something real. We started with "solve all data transformations" and ended with "make agents work reliably in a sequence" — which turns out to be the harder problem underneath the one we thought we were solving.
If 2024 was the year of AI, then 2025 was the year of agents. We all started the year worrying about toolcalls and trying to get models to reliably respond with JSON. We're closing it out with agents that can productively work for hours on end and do things that would've seemed impossible twelve months ago.
At Southbridge, we started the year confident that we could build systems that could effect any-to-any transformations on any kind of data. We're ending the year with something different and better: a reliable runtime for long-horizon agentic work, and a growing library of programs that encode real domain knowledge about real data problems. Hankweave is now public (docs), and we're using it — and building on top of it — every day.
We couldn't have gotten this far without the support and expertise of our investors and advisors at theGP, and all of the wonderful people who helped us with their tech, ideas, thoughts, or ears.
The lightbulb works. Now we have to wire the house.
Postscript: February 2026
A month after this was written, a few things have happened that are worth noting.
Hankweave launched publicly and has logged thousands of new agentic hours in its first weeks. Working with our external partners we're seeing hanks being written and deployed in production, but most importantly, we're seeing hanks being repaired and fixed in ways that last.
We've also shipped the first prototype of the agentic layer that sits above Hankweave, which can take a problem, build a hank for it, run it, debug it, and improve it until the result is polished. Not just a runtime for agents, but agents that use the runtime.
This piece was written at a moment of cautious optimism. We'd built the thing, but hadn't yet proven it could survive contact with the outside world. It has. The brownfield thesis — that repair and maintenance matter more than generation — is holding up, and if we're any kind of leading indicator, it's about to matter a lot more.
Appendices
These are artifacts from the journey — specifications we wrote and later abandoned, included because the thinking in them still matters even when the framing was wrong.
Appendix A: Meta-Spec for AI-First Data Tools
Written in April 2025, during our CISC-era attempt to define how AI-first tooling should behave. We've since moved away from thick, specialized tools toward simple primitives inside agentic loops — but the underlying principles about explicitness, debuggability and statelessness proved prescient and influenced how we design hank codons today.
Appendix B: Reinforcement Learning for Data Transformation
Written in Q2 2025, when we were exploring RL as a framework for teaching agents to perform data transformations. We abandoned the approach when we abandoned the idea of a singular intermediate format — but the thinking here about state representation, temporal credit assignment, and the options framework for temporal abstraction fed directly into how hanks are structured today.