Rendered at 17:00:56 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
_zoltan_ 4 hours ago [-]
"In this post, I’ll cover a third, not-so-obvious approach: building ways for the agent to validate more of its own work before a human has to step in. "
this has been an obvious thing to do since at least January (since Geoffrey Huntley published "everything is a ralph loop"), and this is how I've been working: build enough orchestration tooling to be able to automate everything: development container bringup, building it, running the unit tests, doing integration testing, and using the software as eventually an end user. then to iterate set performance goals on an already solid basis so the automated agent ("gym") can go and iterate autonomously, and let you know when it's "done".
I understand this probably does not work if you're on some subscription and not using the API (tokens burn fast), but this has been extremely productive for me.
kami23 3 hours ago [-]
This is where most of my productivity gains have come, I have a special harness I move from project to project now that does my testing orchestration, lots of my work day is setting up a prompt or two early and just letting them loop till they return evidence that the feature is working having gone through the big QA loop.
I've slowly been optimizing for token use through the stack and Claude ends up making very tight for loops for most of the process and keeping token count even lower. It's been nice. A lot of my toil at work is just gone.
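A minimal sketch of the "loop until it returns evidence" pattern described above, assuming a bounded iteration budget so a stuck loop cannot burn tokens indefinitely. The names `run_until_verified`, `attempt`, and `verify` are hypothetical stand-ins, not part of any commenter's harness.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    succeeded: bool
    iterations: int
    evidence: Optional[str]


def run_until_verified(
    attempt: Callable[[int], str],
    verify: Callable[[str], bool],
    max_iterations: int = 5,
) -> LoopResult:
    """Call attempt() until verify() accepts its output or the budget runs out."""
    for i in range(1, max_iterations + 1):
        # attempt() stands in for one agent pass; it returns the evidence
        # (test output, log excerpt, ...) that the pass produced.
        output = attempt(i)
        if verify(output):
            return LoopResult(True, i, output)
    # Budget exhausted: report failure instead of looping forever.
    return LoopResult(False, max_iterations, None)
```

In a real setup `attempt` would invoke the agent and `verify` would be a deterministic check such as a test run; here both are plain callables so the control flow is easy to see.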
osigurdson 3 hours ago [-]
I can see how you could avoid regressions this way, but what do you add to your harness to prove that a new feature is working?
_zoltan_ 21 minutes ago [-]
for us it's (usually) very easy as I work on performance optimization. a non-negligible part of this is correctness and verifiability, so we already have some of that.
to give you an example, just recently I've coded a feature that, for our shuffle operation, can report which channel the bytes flowed through (as the PR giving us the plumbing underneath has landed upstream recently). what this basically means is that you run the shuffle, you know you've shuffled X bytes (because you have stats on both ends) and then you need to attribute them to different layers. on the first iteration, the count was off. the agent went, debugged, fixed, iterated, and then it was 1.5% off. again, it went, iterated, ... and now we're fine.
part of the task description was that the breakdown must match the known amount of bytes we're shuffling, so the agent took this on as a self-verification point. so besides running our normal, boring unit tests, integration tests and end-to-end verification harnesses (which it not only has programmatic/cli/API access to, but which are also documented in .md files for projects), it could use this criterion on top to verify.
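The self-verification point described here (the per-channel breakdown must add up to the known total) could be expressed as a small invariant check. This is only a sketch: the function name, the channel names, and the tolerance parameter are assumptions for illustration, not the commenter's actual code.

```python
from typing import Dict, Tuple


def check_attribution(
    total_bytes: int,
    per_channel: Dict[str, int],
    tolerance: float = 0.0,
) -> Tuple[bool, float]:
    """Compare the sum of per-channel byte counts against the known total.

    Returns (ok, relative_error), where relative_error is the absolute
    difference divided by total_bytes.
    """
    attributed = sum(per_channel.values())
    if total_bytes == 0:
        # Nothing was shuffled, so nothing may be attributed either.
        return attributed == 0, 0.0
    relative_error = abs(attributed - total_bytes) / total_bytes
    return relative_error <= tolerance, relative_error


# Hypothetical numbers: 1,000,000 bytes shuffled, 985,000 attributed.
ok, err = check_attribution(1_000_000, {"channel_a": 600_000, "channel_b": 385_000})
print(ok, f"{err:.1%}")  # False 1.5%
```

With the default tolerance of zero, any mismatch fails the check, which is what gives the agent an unambiguous signal to keep iterating.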
looking at /usage, my API duration was 2h 43m, and on top of that:
Definitely agree that performance optimization is a good use case for LLMs. Here you have both a measurable goal / objective function and guardrails against functional regressions. It kind of closes the loop in that regard.
One thing however is a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature. Therefore you could still get code degradation.
kami23 2 hours ago [-]
I have it record a series of gifs or videos that I look over. If something looks off I'll dig into it, but I break down work into very very small chunks that are usually easily verifiable or don't require multiple steps.
Another thing I have in the general sdlc process is having it add enough logging to verify features are turned on, configured as we expected, and that becomes enough feedback for most of my features.
I've been mostly focusing on being able to replicate this across stacks greater than 3 projects so far (with the eventual goal of having an agent be able to orchestrate our complete infra stack, and this being a large component of a DR plan to rebuild).
None of this is really new for us, I'm just the most knowledgeable in my group in how the different products across teams glue together so I've been creating these rube goldbergs as a prototype, and then having it iterate on codifying the parts that don't need a constant LLM. We were blessed to have an engineer a decade ago build out tooling for local container automation that matches 95% of the deployed infra stack. That last 5% sucks when you fall into it, but that's always been a truth. I've added and expanded the tool over the years with making it act more like the deployed environment networking wise, but a lot of things don't end up working well in docker containers on M series macs when most of our complicated virtualization in our private cloud can't run on them yet...
pstorm 1 hours ago [-]
I’ve been building this out too, and your comment made me realize the missing piece for me. I’ve given the agents tools to validate their own work, but I haven’t improved the experience of humans verifying the agents’ work.
dirtbag__dad 3 hours ago [-]
You can get really far with the 20x Claude Code and Codex plans. They are many orders of magnitude cheaper than api calls.
psychoslave 4 hours ago [-]
What license do you use then?
_zoltan_ 3 hours ago [-]
you can pay by just volume ("API pricing")
xg15 4 hours ago [-]
Isn't this a bit of an incorrect usage of the term "backpressure"?
OP quoted the correct definition right at the start:
> In systems engineering, backpressure is the mechanism by which a downstream component signals upstream that it can't accept more work
(the "downstream component" being the human reviewer in this case)
But the measures they propose don't actually do that. They are more like fixed throttle elements which would slow down the rate of submissions of an agent and weed out some low-quality submissions before hitting "downstream".
I'm missing the connection to the actual capacity (or will) that the human developers have to review the submissions.
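To make the distinction concrete, here is a sketch of backpressure in the quoted sense, where the signal comes from the downstream component's actual remaining capacity and not from a fixed throttle. It uses a bounded `queue.Queue` from the Python standard library; the reviewer queue, its size, and the `submit_all` helper are hypothetical.

```python
import queue
from typing import List, Tuple


def submit_all(review_queue: "queue.Queue[str]", submissions: List[str]) -> Tuple[List[str], List[str]]:
    """Try to hand each submission downstream; defer those that are refused."""
    accepted, deferred = [], []
    for item in submissions:
        try:
            # put_nowait raises queue.Full when the reviewer has no capacity
            # left. That exception is the upstream-visible backpressure signal.
            review_queue.put_nowait(item)
            accepted.append(item)
        except queue.Full:
            deferred.append(item)
    return accepted, deferred


# The reviewer can hold two pending reviews at a time (assumed capacity).
reviews: "queue.Queue[str]" = queue.Queue(maxsize=2)
accepted, deferred = submit_all(reviews, ["pr-1", "pr-2", "pr-3"])
print(accepted, deferred)  # ['pr-1', 'pr-2'] ['pr-3']

# Once the reviewer finishes one review, capacity frees up and the
# deferred submission goes through.
reviews.get_nowait()
print(submit_all(reviews, deferred))  # (['pr-3'], [])
```

A fixed throttle would limit submissions regardless of how busy the reviewer is; here the producer only slows down when the queue is actually full.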
lucasfcosta 4 hours ago [-]
Author here. Well noted. I do think backpressure might not be the ideal analogy/term.
It comes from previous posts I’ve come across, but I haven’t considered exactly what you mentioned. That’s on me.
ricksunny 3 hours ago [-]
lol your comment sounds like a Claude apology
root-parent 3 hours ago [-]
I had a colleague called Claude. Half of the company was blaming everything on him...
SadErn 3 hours ago [-]
[dead]
jeffbee 4 hours ago [-]
It is an incorrect use of what was already a flawed metaphor. Pressure is isotropic. Directed pressure makes no sense, like all other fluid analogies in unrelated fields of engineering.
brookst 4 hours ago [-]
Wait so cross ventilation, where a breeze will flow through a house if windows are open on opposite sides at a much greater rate than if windows are only open on the upwind side… isn’t really a thing?
bmm6o 3 hours ago [-]
I took the analogy to be about the location of the pressure and not the direction. If you allow pressure to build on the input pipe when you can't accept more, the component that is upstream in the flow is able to observe that and respond. Maybe the difference is I envisioned a series of pipes and not a single one.
marcosdumay 3 hours ago [-]
The act of "making pressure" means applying a force and is completely directional.
vermilingua 4 hours ago [-]
> It should also reduce the number of low-quality PRs your teammates have to review for details the agent should have caught itself.
Oh boy.
Alifatisk 4 hours ago [-]
Care to elaborate?
atq2119 4 hours ago [-]
That quote shows an utter disregard for basic human decency.
It is the responsibility of the person running the coding agent to make sure the resulting PRs are high quality. Putting that on your team mates, or worse, random open source project maintainers on the internet, is the definition of an extractive contribution.
mpalmer 4 hours ago [-]
They are probably reacting to the laughable idea that by making PRs 20% better (or whatever), devs will continue to review the code with sufficient rigor to catch even the bugs they're supposedly now preventing. Assuming such rigor was ever present in their work!
Put another way, who are they supposed to hire to tell these low quality PRs apart from the high quality ones? Who even knows how to do something like that?!
EMM_386 2 hours ago [-]
I always use a standard workflow and it has never been a problem.
- Define the task and the goal, write a short spec document (markdown is fine)
- Point the agent at it in plan mode and have it write the plan to disk with phases. Iterate on its plan if necessary here and now.
- Have each agent tackle a phase and have it update it as a living document (switch models if some phases are more difficult than others)
- Clear and repeat until done
I've never had to overcomplicate this and it's worked both on enterprise-scale projects and personal projects. I am not sure what I'm missing - if anything.
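The plan-on-disk workflow above can be sketched as a tiny helper that reads the plan, picks the next unfinished phase, and marks it done afterwards. The checkbox format (`- [ ]` / `- [x]`) and all function names are assumptions made for this example; the comment does not say how the plan file is actually structured.

```python
import re
from typing import List, Optional, Tuple

# Matches lines such as "- [ ] Phase 2: API" or "- [x] Phase 1: schema".
PHASE_RE = re.compile(r"^- \[( |x)\] (.+)$")


def parse_phases(plan_text: str) -> List[Tuple[bool, str]]:
    """Return (done, title) for every checkbox line in the plan."""
    phases = []
    for line in plan_text.splitlines():
        match = PHASE_RE.match(line.strip())
        if match:
            phases.append((match.group(1) == "x", match.group(2)))
    return phases


def next_phase(plan_text: str) -> Optional[str]:
    """Title of the first phase not yet marked done, or None if all are done."""
    for done, title in parse_phases(plan_text):
        if not done:
            return title
    return None


def mark_done(plan_text: str, title: str) -> str:
    """Tick the checkbox of the given phase, leaving the rest of the plan as-is."""
    return plan_text.replace(f"- [ ] {title}", f"- [x] {title}", 1)


plan = "# Plan\n- [x] Phase 1: schema\n- [ ] Phase 2: API\n- [ ] Phase 3: UI\n"
print(next_phase(plan))                                # Phase 2: API
print(next_phase(mark_done(plan, "Phase 2: API")))     # Phase 3: UI
```

Keeping the state in the file itself is what allows clearing the agent's context between phases: a fresh session only needs the plan to know where to resume.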
denysvitali 4 hours ago [-]
This seems to be the coding agents 101: build a strong feedback loop. Am I missing something?
lucamark 4 hours ago [-]
No, there is nothing new in this article. This methodology is already adopted - and further optimized - by the sw community.
artursapek 4 hours ago [-]
Yeah I don’t really see the backpressure analogy here - it implies that the agent is constantly producing new stuff, which isn’t really possible since the solution is very detailed specs/goals.
pshirshov 3 hours ago [-]
A very long post about a simple and very obvious idea with many different implementations.
The three main problems are 1) API usage is deadly expensive 2) Claude is about to make all automation very expensive 3) all the flows where a model has the initiative are strictly biased towards unwarranted stops (checkpointing).
Also, I won't call that "backpressure", there is no producer-consumer disbalance or something similar. From what I can see, the author just proposes a structured feedback loop. That's a discussion about organizational principles for system which consist of multiple unreliable but very complex components and this "backpressure" is just one of the aspects. Personally I find the viable system model framework productive as both a mental model and literal implementation guideline.
Lesser problem is that agent SDKs are bad and building a custom harness is hard.
entrope 3 hours ago [-]
> all the flows where a model has the initiative are strictly biased towards unwarranted stops
Can you elaborate on what you think causes such a bias? My experience is that Qwen3.6, Claude Sonnet 4.6 and Opus 4.6/4.7 will work as far as they can given direction and a way to test their work. My so-far limited experience with Opus 4.8 is that it does stop somewhat earlier for feedback, but in places where I am glad it is checking assumptions or where I agree with it identifying a change in scope (for example, where the following work deserves a separate commit or merge request). I would call those justified stops rather than unwarranted.
pshirshov 3 hours ago [-]
Ask Claude! It will quote its constitution aka soulfile. It says the constitution instructs it to perform regular checkpointing no matter what.
The problem is not "backpressure", that's just one of the tools and there are different approaches with the same effect.
You can't express orchestration in terms of "backpressure" only, I think.
Implement-Review-Repeat loop does not involve backpressure in the strict meaning of the term.
jon-wood 3 hours ago [-]
This is what hooks[1] are for, except hooks allow specifying criteria in certain conditions (like the agent believing it’s done and ready to hand back to the user) in a manner that the agent won’t just forget about once it’s a few turns deep, and doesn’t require triggering a whole other LLM instance to read some plain text instructions while you hope it interprets them correctly.
It absolutely makes sense to have a system in place that allows the code generated by an LLM to be automatically validated but there’s no need to resort to a non-deterministic system for these sort of deterministic pass/fail conditions.
Interesting ideas for generalizing goals to reduce human labor in human <—> agent interactions. That said, maybe it is better to set up customized skills and infrastructure for large projects? At our early stage of trying to capture value of agentic systems, the good ideas in this article might be premature optimization.
yearesadpeople 3 hours ago [-]
If the system's invariants are well defined, and a suite of conformance + requirements tests (ensuring invariance is respected) is defined, wouldn't this be a broad - _'base case'_ - approach in general?
cadamsdotcom 3 hours ago [-]
Everyone looking into this and other verification should be moving away from long prompts and complex skills, and looking into hooks.
If you put all these checks in your stop hook and your git commit hook, your repo docs can tell your agent that checks will run automatically when it stops work, and it should fix any problems found.
It’s wonderful to reintroduce determinism at the QA end of your process. I find it very calming to know the agent can’t skip or forget to check its work because with hooks the checks are run by the harness.
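A deterministic check runner of the kind a stop hook or git commit hook could call might look like the sketch below. It only uses the Python standard library; how a given harness registers hooks, and which exit code it treats as blocking, is tool-specific and not shown here. The check names and commands passed in are placeholders.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_checks(checks: Sequence[Tuple[str, List[str]]]) -> Tuple[bool, str]:
    """Run every check command; return (all_passed, report of failures)."""
    failures = []
    for name, command in checks:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Keep the command's own output so the agent can see what to fix.
            detail = f"{result.stdout}{result.stderr}".rstrip()
            failures.append(f"[{name}] exit {result.returncode}\n{detail}".rstrip())
    return not failures, "\n\n".join(failures)


def main(checks: Sequence[Tuple[str, List[str]]]) -> int:
    """Print failures to stderr and return a process exit code (0 = all passed)."""
    ok, report = run_checks(checks)
    if not ok:
        print(report, file=sys.stderr)
    return 0 if ok else 1
```

A hook script would end with something like `sys.exit(main([("tests", ["pytest", "-q"])]))`, where the command list is whatever the project already uses. Because the harness runs the script, the result does not depend on the agent remembering to run it.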
manmal 3 hours ago [-]
I think pi-subagents (which can form arbitrarily long chains of subagents, with up to 8 in parallel) and Claude Code’s new workflows feature are quite convenient abstractions that can be set up quickly.
xlii 3 hours ago [-]
If that's third then I have fourth. Self plug obviously, but figured that I'd like something between smart autocomplete and an agent -
an autocomplete that has wider context.
Called it rik, and it's on GitHub if anyone's interested in checking it out.
Oh this is 101. Anyone not doing this? If not do it now!
Arodex 4 hours ago [-]
Because your token use explodes?
wellpast 4 hours ago [-]
I’m willing to be wrong but this industry-wide emphasis on AI creative/coding workflows seems way over-engineered.
Ime successful creative execution looks like micro-iterations where each output informs the next creative move.
I can build something incredibly fast from essentially caveman grunt instructions through an LLM harness, iterating as I go.
Optimizing for feeding a huge plan to an agent sounds to me like a net waste of time. And looking over the shoulder of industry peers trying to do this, I don’t see their outputs or throughput as some remarkable improvement over what I can produce with minimal fanfare usage.
Yokohiii 3 hours ago [-]
LLMs are too flaky for high quality code. On tougher problems it's very common for an LLM to contradict itself and run in circles. It simply doesn't know what the right thing is, but on each turn it is super confident to do the right thing.
Maybe I've chosen hardmode to learn C with LLM assistance, plus my pet project turned out to be a bit less trivial than anticipated. But I know that I have to think three times about my choices for how to deal with C problems, and seeing how an LLM struggles to give reasonable answers is a huge red flag and forces me to think about it a fourth time.
Doing all this with a fast autonomous workflow with just little user guidance is asking for trouble.
coffeefirst 3 hours ago [-]
It’s not just you. My last dumb pilot program making a Pocket clone in Python also got stuck in a loop regularly, which should be its strong suit...
I suspect that the “right” way to use LMs in coding, including accounting for focus, control, and costs, is not a settled debate. We probably haven’t even seen the best ideas yet. But I really dislike the maximalist approach.
piazz 4 hours ago [-]
100% agree. Took me ages of working with the agents to circle back around this, which was the best way to get work done before AI automation anyways.
0x696C6961 4 hours ago [-]
Yeah it's wild watching so many people decide waterfall is great all of a sudden.
bluGill 3 hours ago [-]
Every large project is coming back to waterfall. While the problems are certainly known and it was ultimately developed as a straw man, everything else ends up working worse. That said, you shouldn't be thinking pure waterfall as it's drawn up as a strawman, but rather a waterfall variation with feedback loops. But in the end, in very, very many cases, you have to know an end date in order to get things done because so many other things depend on you being done at the same time. If something is going to get done sooner you can't use it anyway without all the other pieces.
0x696C6961 52 minutes ago [-]
"waterfall variation with feedback loops" lol next we're going to have "agile where you plan everything up front"
NamTaf 4 hours ago [-]
Never mind stumbling into proper engineering principles like having documented, testable requirements specifications.
dcrazy 55 minutes ago [-]
I’ve been pretty happy with this side effect of the agentic coding bubble.
mewpmewp2 4 hours ago [-]
For me it's usually that I start with a single agent, but then I won't have anything to do while it is churning and I have other ideas/features that keep building up that I want to do, so I need to scale, and while I'm scaling I need to start to have those workflows, so eventually I end up with many agents, most which are autonomous working on their own worktrees, but I will have a specific agent that I will talk to more iteratively.
So e.g. I may have 1 agent that I ask and iterate on with directly, and 9 agents that work separately on their own.
I will utilize this 1 agent on features I care most about and want to guide and iterate on in as much detail as possible.
bunderbunder 4 hours ago [-]
I suspect that letting agents spin away unattended for long stretches of time will become less and less popular as more and more companies blow their token budgets and start requiring some answers to difficult questions before agreeing to further loosen the purse strings.
fassssst 4 hours ago [-]
The trick is to have 5 of those huge plans running in parallel.
artursapek 4 hours ago [-]
I agree. I have gotten an incredible amount of work done iterating with 5-30 minute long agent tasks. But it requires I stay engaged, and not go chill on the beach, which I guess is a lot of agentmaxxers’ goal.
_zoltan_ 4 hours ago [-]
you do not need micro iterations. you can set macro goals and let the agent/LLM/model whatever you want to call it figure it out.
it works.
cyanydeez 4 hours ago [-]
interesting idea, unfortunately programming the structure is equivalent (P=NP) to just programming itself. same as TDD.
as usual, the tool isnt really doing whats listed on its label.
however, people are different so this might improve someones capability to deploy LLMs. might even provide better evidence where actual brain power is needed.
[1] https://code.claude.com/docs/en/hooks
https://pura.xyz
https://github.com/puraxyz/puraxyz/blob/main/docs/paper/main...
https://github.com/exlee/rik