
How Not to Leave Greenfield

Hrishi Olickel·May 1, 2026
  • I run a company that sits at the bottom of the stack
  • Vibe and fun, versus the bottom of the stack
  • The genie's list
  • No. 1 - Don't ever delete anything from context
  • No. 2 - Build everything greenfield. Always.
  • No. 3 - Pass everything through a model
  • No. 4 - Mix prompts, code, control logic, and data
  • No. 5 - Work in silos
  • Now do the opposite
  • From the Q&A
  • What this was the warm-up for

This is the longer version of a short, informal talk I gave at SQ Collective's The Stage on May 1, 2026 - an open co-work Friday in Singapore where someone stands up at 1:30, shares something they're actively chewing on, and then everyone goes back to their laptops.

Watch the talk and Q&A in full!

If you'd rather read it than watch it, the slides are below, and I've laid the deck out inline through the rest of the post.


The deck as delivered. It's also inline below, slide by slide.

Wide shot from the talk. I'm seated with a mic, mid-gesture, the Windows-XP-green title slide - "How to not leave Greenfield." - projected behind me.

I run a company that sits at the bottom of the stack

My name's Hrishi. I run a company called Southbridge - small team. We make data agents: effectively, automating the process of data engineering, end to end. I didn't fully know how difficult that was when we started, but I kind of did, because it's what I'd done for the previous five or six years with much larger teams. Despite all the advancements in technology and AI, solving data problems has stayed very, very manual. It takes a human to look at information and figure out what's an outlier and what isn't, what works and what doesn't, what's intended and what's not.

Title slide: "Hrishi · Southbridge. DATA AGENTS · LONG-HORIZON · HERMETIC." A list of the work: codebooks for genomics, collating SEC filings, database migrations, dissolving ingest problems, relational ingest at scale.

So we started out building transformation systems, and today we work with companies and think tanks to build codebooks for health data, to join up SEC filings, to do large-scale database migrations, and to solve ingest. For the last six months or so, the North Star has been ingest: how do you take new data coming into a company - a new customer, a new dataset if you're doing research - and bring it into your systems with no mistakes.

Here's the thing that's been quietly working against us the whole time. Over the last year or so, models have gone from junior-dev, junior-intern to "oh, you're so smart, I don't even know how you work here" levels of smart. But the amount of information they can see at once has not gone up. If anything, it's gone down. Gemini was the top of the frontier at 2 million tokens, then it came back down to 1 million. That's maybe 2 megabytes of data. So at any point in time - whether you're using Cursor, or some complicated thing, or our thing - the model on the other side is only ever seeing about a megabyte of data, and I'm being generous. Past a megabyte it starts degrading. Maybe past 500 kilobytes it starts degrading.

Slide: "Models keep getting smarter." A METR chart of how long a task AI agents can complete autonomously, with the task horizon doubling roughly every seven months. Underneath, in big serif: "But." And a small boxed label: everything ≈ 2 MB.

It reminds me of early computers with core rope memory, where they could barely see what was happening at any one time. The intelligence keeps climbing - METR has the task horizon roughly doubling every seven months - but the window the model gets to think clearly inside of stays small. A lot of what we build is downstream of that gap.
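To put rough numbers on that gap, here is a minimal sketch of what steady doubling every seven months implies. The one-hour starting horizon is an illustrative assumption, not a METR figure, and `task_horizon` is a made-up helper name.

```python
def task_horizon(start_hours: float, months: float, doubling_months: float = 7.0) -> float:
    """Task horizon after `months`, assuming it doubles every `doubling_months`."""
    return start_hours * 2 ** (months / doubling_months)


# Hypothetical starting point: a 1-hour task horizon at month 0.
for months in (0, 7, 14, 28):
    print(f"month {months:>2}: {task_horizon(1.0, months):.0f}h")
```

Under that assumption the horizon grows sixteen-fold in 28 months, while the working window described above stays roughly where it was.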

Vibe and fun, versus the bottom of the stack

I want to talk about the distinction, because it's what we internally call the bottom of the stack. We also call it hermetic. And there's a massive difference between data engineering - which in some ways is the final boss, because it tends to sit in the critical path for most companies - and the things you do for fun.

Slide split in two. On the left, "Vibe · Fun" over a warm photo of two people happily sharing a bowl of dessert. On the right, "Bottom of the stack" over a packed professional kitchen mid-service. Beneath both: "reliable. repeatable. dependable."

Think of cooking a soufflé for friends. A lot of agentic dev, coding, even some of what we do, lives in that range. People are coming to eat your soufflé, but you can change it to cookie dough halfway through and it doesn't matter. You're the product manager, you're the consumer. You can build whatever you want and change your mind two days in, five days in, and do something else entirely.

Most brownfield work, bottom-of-the-stack work, is the professional kitchen instead. The work needs to be delivered exactly as it needs to happen, or stuff doesn't happen. Or they don't ask you to work there anymore. So most of the agents we've built, from day one, have had to be reliable, repeatable, and dependable.

And here's the honest part: it's not that we started out saying "we know how to make reliable agents," or "we know how to do reliable work with LLMs," which are non-deterministic and so on. We paid a lot of attention, and we learned 300 ways not to do it - by reading the outputs and fixing things. We're now sitting on a lot of information on how not to do this. And incrementally, you make progress.

The genie's list

There's a wonderful Charlie Munger talk from a long time ago - he was borrowing from Johnny Carson - called How to Guarantee a Life of Misery. Instead of telling the graduates how to live well, he gave them a list of things to do if they wanted to be miserable: all the things you should do if you want no friends, want to die alone, want a miserable life. There's a counterpart someone pointed me to, too - the CIA has a field manual, from its OSS days, on how to sabotage enterprise operations if you're a saboteur dropped into another country. It's a wonderful list of things you shouldn't do as an enterprise, where you can just read it and find the opposite. Call a lot of meetings. Take a lot of notes where nothing actually happens. That's how you sabotage a company.

Title slide: "How to not leave Greenfield." over the Windows XP "Bliss" hill, with two source cards at the bottom - "How to Guarantee a Life of Misery, Charlie Munger · Harvard School, 1986" and "Simple Sabotage Field Manual, Office of Strategic Services, 1944."

I thought that was the right format for this. So the rest of this is a collection of ways to guarantee your agents are never reliable - that the AI work at your company never solidifies, never becomes bottom-of-the-stack, never leaves the greenfield where you just build fun stuff that never makes it into the critical path. The inverse, in every case, should be clear. If it isn't, you can always ask.

No. 1 - Don't ever delete anything from context

Whatever agent you're building or using, stuff everything in there. Skills, as many as you can find. Instructions, as many as you can get. Compact the context often - and don't even notice who's doing the compacting. Summarize it. Add in documents, tooling extensions, your hard drive, your downloads folder. Throw additional stuff in, so that at any point in time you're winning at being able to ruin your own work.

Slide: "№ 1 — Don't ever delete anything from context." Scattered torn-paper labels pile on top of each other - skills, instructions, compacted context, summarizations, documents, tooling extensions, your hard drive, your downloads folder - and over the middle of the heap, a black tag reading "??? who knows ???".

If you have no idea what the agent is seeing, you have no idea what your AI knows, and you have no shared mental model. You're four hours in, context has been compacted four times, and you've stuffed in so many skills you have no idea what's activating at any given point. This is a wonderful way to ensure that if you build something good, you got lucky and it was an accident.
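The inverse is knowing what is in context at any moment. A minimal sketch of that idea, assuming a simple in-memory ledger: `ContextItem`, `ContextLedger`, the source labels, and the four-characters-per-token estimate are all illustrative assumptions, not part of any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class ContextItem:
    source: str  # hypothetical label, e.g. "skill:sql-style" or "doc:schema.md"
    text: str

    @property
    def approx_tokens(self) -> int:
        # Rough assumption: ~4 characters per token. Swap in a real tokenizer.
        return max(1, len(self.text) // 4)


@dataclass
class ContextLedger:
    budget_tokens: int
    items: list[ContextItem] = field(default_factory=list)

    def used(self) -> int:
        return sum(item.approx_tokens for item in self.items)

    def add(self, item: ContextItem) -> None:
        # Refuse silently growing context: going over budget is an error.
        if self.used() + item.approx_tokens > self.budget_tokens:
            raise ValueError(f"adding {item.source!r} would exceed the budget")
        self.items.append(item)

    def remove(self, source: str) -> None:
        # Deleting from context is a first-class operation.
        self.items = [item for item in self.items if item.source != source]

    def manifest(self) -> list[tuple[str, int]]:
        # What the agent is seeing right now, and roughly how big each piece is.
        return [(item.source, item.approx_tokens) for item in self.items]
```

The point of the design is the `manifest` call: at hour four you can still answer "what is the agent seeing?" with a list rather than a guess.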

No. 2 - Build everything greenfield. Always.

Whatever you solved last week, throw it out and start over. You get extra points if the stuff you build ends up with so much unintended complexity that you have to rebuild it every time you come back to it. Stop working on it for two weeks, return, and rebuild from scratch. Because new is better than old.

Slide: "№ 2 — Build everything greenfield. Always." with a single quiet line beneath it: "new is better than old".

Engineering teams did this before AI, too. Anything that needed to get built greenfield, everyone would fight over, because greenfield is fun. The difference with current AI is that nothing ever leaves greenfield. The fun part is the whole loop now, and the fun part never has to graduate into something dependable.

No. 3 - Pass everything through a model

This is a big one, and most of our partners do it, and so do we. Vibe your skills - how you do things - then give them to a coworker. Have them pass it through a model to summarize so they can understand it, then pass it through another model to write it to their hard drive. When someone else asks "hey, how'd you do that thing?", pass them that vibed version, so they can pass it through another model to look at it.

Slide: "№ 3 — Pass everything through a model." A hand-drawn illustration of a hooded scribe at a desk with a quill, while engineers and scientists all around hand scrolls in and pull rewritten scrolls back out, arrows looping endlessly between them.

By the time you have something resembling organizational knowledge, it's been through a model so many times that - wonderfully - your ratio of signal to slop gets a little worse each pass, and a little worse, until by the end it's mostly whatever Opus thought was the right thing to do in that moment. God help you if you're not using Opus when you do this. Codex will do it even without asking - OpenAI's models tend to be opinionated, like a senior engineer who knows they're senior to you and will actively rewrite your stuff because "that was better." Make sure all organizational knowledge keeps passing through models as often as possible.
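As a back-of-the-envelope sketch of that compounding, assume each pass through a model keeps a fixed fraction of the original signal. The 90% figure is purely illustrative, and real degradation is not this tidy.

```python
def signal_after(passes: int, retained_per_pass: float = 0.9) -> float:
    """Fraction of the original signal left after repeated model rewrites.

    Assumes every pass keeps the same fraction, which is a simplification.
    """
    return retained_per_pass ** passes


for passes in (1, 5, 10, 20):
    print(f"{passes:>2} passes: {signal_after(passes):.0%} of the original")
```

Even with a generous per-pass figure, knowledge that has been rewritten a handful of times is mostly rewrite.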

No. 4 - Mix prompts, code, control logic, and data

Don't have an abstraction layer. Mix everything. Anyone who was around for early web development will recognize this: it's what PHP was, and what came after. Any complex API, workflow, or system you build has some kind of control plane - do this, then that, then if-this-then-this. It has instructions that go into a model. It has evals or data that get passed in. It has plain deterministic code that executes at some point, and it has execution bindings. Mix all of it. You literally can now: go to a model and say "make the thing," and it'll build you a beautiful two-file system where no one can tell what anything is.

Slide: "№ 4 — Mix prompts, code, control logic, data." On the left, labelled "DO THIS", a hand-drawn page of tangled PHP-soup with threads running everywhere. On the right, labelled "NOT THIS (BORING)", a tidy Model–View–Controller diagram.

The slide is only half a joke. The tangled mess is labelled "do this." The clean Model-View-Controller diagram is labelled "not this (boring)." Mixing your control flow into your prompts and your prompts into your code is the agentic version of <?php echo $sql; ?> in a template, and the consequences are the ones you remember.
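For contrast, here is a minimal sketch of the boring version, with each of the pieces named above kept in its own place. Every name in it (`SUMMARIZE_PROMPT`, `format_rows`, `fake_model`, `run`) is hypothetical, and the model is a stub so the example runs offline.

```python
from typing import Callable

# Prompt: plain text, versioned on its own, with no logic inside it.
SUMMARIZE_PROMPT = "Summarize the following rows in one sentence:\n{rows}"

# Data: passed in at run time, never pasted into the prompt by hand.
ROWS = ["id=1,total=40", "id=2,total=60"]


# Deterministic code: ordinary functions, no model involved.
def format_rows(rows: list[str]) -> str:
    return "\n".join(rows)


# Model binding: the only place a model is called. Stubbed for illustration.
ModelFn = Callable[[str], str]


def fake_model(prompt: str) -> str:
    return f"[model saw {len(prompt)} chars]"


# Control plane: the sequence of steps, readable without opening any prompt.
def run(rows: list[str], model: ModelFn) -> str:
    rendered = SUMMARIZE_PROMPT.format(rows=format_rows(rows))
    return model(rendered)


print(run(ROWS, fake_model))
```

Because the model is just a parameter, the control flow can be tested with an identity function and the prompt can change without touching code.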

Idea

A close relative of this one: mix agent-written and human-written work so thoroughly that nobody - and ideally no agent either - can tell what came from where, or what level of confidence any piece of knowledge had when it went in. Code already fails to encode intent; agent-written code fails harder. If you can't tell whether a line exists because it was the right call or because a model thought it looked nice, you've lost the thread.

No. 5 - Work in silos

Every engineer on your team, or you, if you're the engineer, should be a wonderful butterfly. Tooling so separate, methods so different, that everyone is their own individual butterfly with their own emergent work habits that nobody else knows. This is one easy way to ensure that when your work actually works, you got lucky.

Slide: "№ 5 — Work in silos." Four cards for four made-up city-states of bespoke tooling: GAS TOWN (cursor · custom rules.md · proudly bespoke), LIQUID CITY (claude code · home-rolled framework), VIBE VILLAGE (codex · vibe-coded harness · no docs), and PROMPTOPOLIS (invented their own skills format).

Everyone gets their own Gas Town, their own Liquid City, their own Vibe Village. If you're lucky, invent your own skills format - but don't give it to anybody else. Then someone else invents what they have, and they hit a different issue, and nothing accumulates across either of you.

Now do the opposite

If any of that resonated, or made sense, or didn't - the inverse is the whole point.

Slide: "You spin me right round —" over a list of links. PHILOSOPHY: southbridge.ai/blog/antibrittle-agents. RELATED: olickel.com/everything-about-evals. WHAT WE BUILT · OPEN SOURCE: southbridge.ai/hankweave. DOCS: hankweave.southbridge.ai.

We started building bottom-first for a lot of the data work we did, so the first thing we built was a runtime. Partly because we needed it, but also because the lower down something sits in your stack, the longer it needs to exist to accumulate fixes and get stable. We started building that around June/July of 2025, and open-sourced it a few months later as Hankweave. Open-sourcing it wasn't to sell it forward as a product. It was because we're a tiny team - I always say we're neuron-limited - and open source got a lot more neurons and a lot more diversity onto it. We've had many more people use it, and thankfully just as many people complaining about it and telling me what sucks. Because you can't simulate true diversity. You can't simulate the real world. For bottom-of-the-stack work, you have to hit market, hit real people, and go through the painful process of them telling you "hey, this doesn't work."

So here's the actual answer, in one breath.

Slide: "Because we use hanks (AI programs that separate concerns), broken into reusable codons (sequenced units of work), watched by sentinels (real-time monitors), with logs (that humans actually read), shared and version-controlled (so fixes accumulate)."

For a lot of our work we use hanks, which are just our abstraction layer for AI programs - the place you separate control flow from prompts from code from data, and put each somewhere else. They get broken into codons: individually sequenced units of work, where you specify "this model, this harness, do this, here's what I expect out of it," and then sequence it. They're watched by sentinels, which are watchdog agents that watch the work happen and are themselves reusable. And the whole thing keeps logs and version control that someone actually reads.

Mid-talk, leaning toward the projected slide listing hanks, codons, sentinels, logs, and version control.

The mechanical shape of it is simple. The runtime sequences codons, wipes context between them so each one is a clean unit, lets sentinels watch from the side and write notes, and version-controls the whole thing so fixes accumulate instead of evaporating.
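That shape can be sketched in a few lines. This is an illustration of the idea only, not Hankweave's actual API: `Codon`, `Sentinel`, and `run_hank` are hypothetical names, the explicit handoff of the previous output is an assumption, sentinels here look at each codon's output rather than watching mid-run, and version control is left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Codon:
    name: str
    instructions: str
    # Stand-in for "this model, this harness": any callable given a fresh context.
    run: Callable[[list[str]], str]


@dataclass
class Sentinel:
    name: str
    trigger: Callable[[str], bool]
    notes: list[str] = field(default_factory=list)

    def observe(self, codon_name: str, output: str) -> None:
        # Sentinels only write notes; they never stop the run.
        if self.trigger(output):
            self.notes.append(f"{codon_name}: flagged by {self.name}")


def run_hank(codons: list[Codon], sentinels: list[Sentinel]) -> list[str]:
    outputs: list[str] = []
    for codon in codons:
        # Context is rebuilt from scratch for every codon.
        context = [codon.instructions]
        if outputs:
            # Assumption: the only thing carried over is an explicit handoff.
            context.append(outputs[-1])
        output = codon.run(context)
        for sentinel in sentinels:
            sentinel.observe(codon.name, output)
        outputs.append(output)
    return outputs
```

Each codon sees only what it is handed, so a bad run can be traced to one unit instead of to four hours of accumulated context.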

The rule underneath all of it is: don't build what you can reuse. When you reuse, you get the benefit of the last person, and you get to benefit the next one by fixing it. Most of my life and our work has been fixing things, and building things that can be fixed, rather than building everything greenfield. It's the difference between the clothes we used to have and fast fashion: now you buy a new pair of pants when yours don't work, instead of fixing the pair you have.

You don't have to use exactly this. I genuinely believe you can solve for any of these - pick anything - as long as you have an abstraction layer that you agree on and you force people to use it. It's harder than building whatever you want, whenever you want. But as long as you know what the different pieces are, and you enforce reuse over rebuild, I think you'll be in a good spot.

Related reading
Antibrittle Agents
The philosophy underneath all of this: trenches, receipts, context boundaries, and agents that don't break on the second customer.
Read the full piece →

From the Q&A

I left most of the session for questions, and that's where a lot of the good stuff ended up. A few exchanges worth keeping.

"This feels a lot like building a team for the first time - a team of relative juniors who forget things." Yes. Or a high-churn team. Or an excitable one. That mapping holds up further than you'd expect, and a lot of what follows leans on it.

"How much of this generalizes to just building a good coding agent?" Most of what I'm describing actually sits one level above the coding agent, where we use coding agents as a reusable harness. If you look at Hankweave, we built on Claude Code's agent SDK, because models - through RL, synthetic RL now - are co-evolving with their harnesses. That layer is solidifying so hard: Anthropic tunes on the data they get from Claude Code, adds custom headers, red-team/blue-team tests within days, and ships. So you almost have to use the right harness for the right model. Everything I'm talking about sits on top of that: what gets fed into the harness, and what you want to observe coming out of it.

Idea

The thing I didn't get to, and it's related: evals completely changed pre- and post-agentic era. Evals used to be a single turn of an LLM call. That doesn't work anymore. You can have a hundred tool calls before a response, and 25% of them might be failures, but that's still the right way to go. So most of our work has been: how do you declaratively define and test for behavior, instead of stepwise success or failure? What does successful behavior look like for the thing that's happening, rather than "did each individual step pass?"
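A minimal sketch of what a declarative, behavior-level check could look like, as opposed to asserting on every step. `ToolCall`, `Behavior`, `check`, and the tool names are hypothetical, and the thresholds are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    ok: bool


@dataclass(frozen=True)
class Behavior:
    must_use: frozenset[str]   # tools a healthy run is expected to touch
    max_failure_rate: float    # individual failures are tolerated up to this share
    must_end_with: str         # the run should finish on a successful call to this


def check(trace: list[ToolCall], spec: Behavior) -> list[str]:
    """Return a list of behavioral problems; an empty list means the run looks healthy."""
    if not trace:
        return ["empty trace"]
    problems = []
    missing = spec.must_use - {call.tool for call in trace}
    if missing:
        problems.append(f"never used: {sorted(missing)}")
    failure_rate = sum(not call.ok for call in trace) / len(trace)
    if failure_rate > spec.max_failure_rate:
        problems.append(f"failure rate {failure_rate:.0%} is above the limit")
    last = trace[-1]
    if last.tool != spec.must_end_with or not last.ok:
        problems.append(f"did not finish with a successful {spec.must_end_with}")
    return problems


# A hundred calls, a quarter of them failures, and still a passing run.
trace = [ToolCall("read_file", i % 4 != 0) for i in range(99)] + [ToolCall("write_db", True)]
spec = Behavior(frozenset({"read_file", "write_db"}), max_failure_rate=0.3, must_end_with="write_db")
print(check(trace, spec))
```

The spec describes what a successful run looks like as a whole, so a run with many failed steps can still pass and a run with none can still fail.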

"You mention a million-token context, but the research says it's really only about 100K on haystack tests. What are you seeing?" Haystack tests measure discontinuous retrieval, which isn't that close to what we do. LLMs, like humans, encode information sparsely - I've been talking for thirty seconds and I've largely said one thing, I haven't stuffed twenty factoids in for you to retrieve. So in practice, as long as you're on the same problem and everything follows as a causal chain - "I walked in, set up my laptop, gave a talk" - frontier models stay pretty good up to 400-500K, even 600K, from our tests. But if I stuffed even 100K tokens with completely unrelated facts and tried to get intelligence out of it, it would struggle.

"Why a sentinel watching alongside, instead of a reviewer at the end?" There are roughly three ways to enforce behavior. One: write it in the prompt - "don't be lazy," add some emojis, promise to tip it $100, surround it with stars. Two: the codon approach - let it do its thing, then bring somebody else in to fix it, the way a team brings in a second person to unscrew what the first one screwed. Three: the sentinel - someone sitting next to you taking notes any time you veer somewhere bad, where those notes are accessible whenever you need them. It doesn't interrupt the loop; the agent keeps going, and at condition breakpoints you read the notes and fix things up. We've found that pattern incredibly effective, and cheap - our sentinels run on something like Gemini Flash, because, sadly and happily, it doesn't take much intelligence to point out a mistake. Technically it's just a trigger on the agent loop - "fire when X, Y, Z happens" - and a single LLM call that appends to a file. That's it.
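The "trigger plus a single call that appends to a file" description can be sketched like this. It is an illustration under assumptions, not the real implementation: `make_sentinel`, the event shape, and the trigger condition are made up, and the LLM call is replaced by a plain function so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable

Event = dict[str, str]  # hypothetical shape: {"tool": ..., "input": ...}


def make_sentinel(
    trigger: Callable[[Event], bool],
    critique: Callable[[Event], str],
    notes_path: Path,
) -> Callable[[Event], bool]:
    def on_event(event: Event) -> bool:
        if not trigger(event):
            return False
        note = critique(event)  # in a real setup, the single cheap LLM call
        with notes_path.open("a", encoding="utf-8") as f:
            f.write(note.rstrip("\n") + "\n")
        return True  # the agent loop is never interrupted; only a note is left

    return on_event


notes_path = Path(tempfile.mkdtemp()) / "sentinel-notes.md"
sentinel = make_sentinel(
    trigger=lambda e: e["tool"] == "sql" and "DROP" in e["input"],
    critique=lambda e: f"Destructive SQL seen: {e['input']}",
    notes_path=notes_path,
)

events = [
    {"tool": "read_file", "input": "schema.sql"},
    {"tool": "sql", "input": "DROP TABLE staging"},
    {"tool": "sql", "input": "SELECT 1"},
]
fired = [sentinel(e) for e in events]
print(notes_path.read_text(encoding="utf-8"))
```

The notes file is the whole interface: whoever picks up at the next breakpoint reads it and decides what to fix.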

"When agent runs are six to ten hours, is that a single run, or codons stacked?" Multiple codons and steps, but still without termination. Most of what we do is heading toward as-unsupervised-as-possible, which is also why Hankweave has a preflight check - you reason about the run before the first model is called, kind of like a good type checker, so you can catch an inference issue before any model is involved.
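A minimal sketch of the preflight idea, assuming a plan where each step declares what it needs and what it produces. `Step`, `preflight`, and the model names are placeholders for illustration and do not reflect Hankweave's actual checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    model: str
    needs: tuple[str, ...]
    produces: str


KNOWN_MODELS = {"big-model", "small-model"}  # placeholder names


def preflight(steps: list[Step], initial_inputs: set[str]) -> list[str]:
    """Check the plan without calling any model; return every problem found."""
    errors = []
    available = set(initial_inputs)
    for step in steps:
        if step.model not in KNOWN_MODELS:
            errors.append(f"{step.name}: unknown model {step.model!r}")
        for need in step.needs:
            if need not in available:
                errors.append(f"{step.name}: needs {need!r} before anything produces it")
        available.add(step.produces)
    return errors


plan = [
    Step("profile", "small-model", ("raw.csv",), "profile.json"),
    Step("load", "big-model", ("profile.json", "schema.sql"), "report.md"),
]
print(preflight(plan, {"raw.csv"}))
```

Like a type checker, it catches the missing `schema.sql` before hour one of a ten-hour run rather than at hour six.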

"If you're using an AI system to measure an AI system, how do you break the loop?" That's the philosophical core. The question we kept hitting was: how do you use a dumber intelligence to measure the progress of a smarter one? Because evals and sentinels both need to run a ton, so by definition they'll always be a smaller, cheaper, in-some-ways-dumber model with less context. Our answer has been to let the smart models move, but aim the cheap ones at behavior. Like running a team - you can't always tell if someone's functioning by reading every line they wrote, but you can tell when something's off: whether they showered, when they leave, whether you see them at the water cooler. Say we have an ingest pipeline pushing data into Postgres. I reasonably can't tell you whether an individual tool call was right or wrong, but I can tell you what a correct path looks like, and what a wrong one would have in it - you're reading things from outside, you're not making enough calls to the database, most of your calls are bunched into one section instead of distributed. There's a comparison to fuzzy logic in there somewhere.
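Two of those signals, too few database calls and calls bunched into one section, are cheap to compute from a list of tool names alone. A sketch under assumptions: the function names, the `postgres` tool label, the four-bucket split, and the thresholds are all illustrative.

```python
def db_call_profile(trace: list[str], db_tool: str = "postgres", buckets: int = 4) -> list[int]:
    """Count database calls in each equal-sized section of the run."""
    counts = [0] * buckets
    for i, tool in enumerate(trace):
        if tool == db_tool:
            counts[i * buckets // len(trace)] += 1
    return counts


def looks_off(trace: list[str], min_db_calls: int = 3, max_share_in_one_bucket: float = 0.8) -> list[str]:
    """Flag runs whose shape looks wrong, without judging any single call."""
    profile = db_call_profile(trace)
    total = sum(profile)
    if total < min_db_calls:
        return ["too few database calls"]
    if max(profile) / total > max_share_in_one_bucket:
        return ["database calls bunched into one section"]
    return []


spread = ["postgres", "read_file", "read_file", "read_file"] * 3
bunched = ["read_file"] * 9 + ["postgres"] * 3
print(looks_off(spread), looks_off(bunched))
```

Neither check needs to understand a query, which is why a small, cheap model or plain code can run it thousands of times.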

During the Q&A, seated and mid-answer, the "You spin me right round" links slide projected behind.
Highlight

Someone asked, kindly, for more failure stories - that conferences are all "how I lost weight, how I made money," and almost nobody stands up to talk about how to fail. I agree, and it's most of why this talk is shaped as a list of ways to do it wrong. We've written some of our failures down too.

Related reading
25 Ways Not to Make a Lightbulb
The same inversion, written down: a catalogue of approaches we tried that didn't work, in the spirit of Edison's filaments.
Read the full piece →

What this was the warm-up for

This was a fifteen-minute version, in a co-working space, on a Friday afternoon. The fuller arguments - codons, sentinels, budgets, preflights, and why brownfield is where the real work lives - got their proper outings later, as talks at SuperAI 2026 and AI Engineer Singapore. Here those are!

Related reading
No Country for Old Code
The longer AI Engineer Singapore version: building agents inside large existing systems, with old code, old organizations, and old data.
Read the full piece →
Related reading
Synthetic Hires
The SuperAI version: agents as hires rather than tools, and why a surface for process knowledge is what lets organizations learn once and pass it on.
Read the full piece →

Build agents that get to leave greenfield - reliable, repeatable, dependable - by enforcing reuse over rebuild and letting fixes accumulate.


The runtime we use in production is open-source at github.com/SouthBridgeAI/hankweave-runtime, the docs are at hankweave.southbridge.ai, the philosophy is in Antibrittle Agents, and there's a lot more on evals at olickel.com/everything-about-evals. Thanks to SQ Collective for The Stage. I'm @hrishioa on Twitter, and we're at southbridge.ai.
