The case for patient agents
Most software agents built on language models are sprinters. They get summoned for a task, consume compute aggressively while they work, return an answer, and stop. They are expensive precisely when they are working and idle the rest of the time. The economics push their designers toward making each invocation count: bigger models, longer prompts, more tools per call. The agent earns its keep in the minutes it is awake.
I've been quietly building the opposite kind of agent.
Watch, write down what you notice
The thesis is that, for some classes of question, the value of observation lies not in how much you see in any given session but in how long you can afford to keep looking. Patterns that would tell a knowledgeable human something useful are often present in the daily noise of a forum or a feed, but no human keeps paying attention to the same forum for a year.
Cheap inference has changed what that calculus permits. It's now economically rational to keep an instrument trained on a slice of the world for months at a time, paying for attention by the cent rather than by the hour. That's the bet underneath the project — Parallax.
Parallax is a small population of long-lived, low-cost agents pointed at public information sources. Each one does one thing: watch, and write down what it notices. Each agent costs almost nothing on almost every day. It rarely surfaces anything. It is built to stay quiet.
The vocabulary, briefly
A few terms that recur through the rest of this post:
- Firehose — a single public information source one agent watches. Right now there are five: r/LocalLLaMA, r/ClaudeAI, r/openclaw, r/techsupport, and Hacker News.
- Intern — one of the long-running agents. A loop, a prompt, a model assignment, a piece of persistent state, and append-only access to a shared library. No orchestration framework. No plan-decompose-execute. No agent-to-agent messaging. It reads its firehose and writes structured observations; a minimal sketch of that loop follows this list.
- The Fleet (capitalized) — the population of interns running concurrently, each pointed at a different firehose, contributing into the same library.
- The library — an append-only file the interns write into. One observation per line.
- Operator — me. Reading the library by hand and grading what's worth doing next.
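None of this exists as published code yet, but the shape of an intern is simple enough to sketch. What follows is a minimal, hypothetical rendering in Python: the function names, the record fields, and the hourly cadence are all assumptions rather than the actual implementation. The load-bearing properties are the ones named above: a loop, persistent state, and append-only writes of one observation per line.

```python
import json
import time
from pathlib import Path

LIBRARY = Path("library.jsonl")   # the shared, append-only library
STATE = Path("state.json")        # this intern's persistent state

def run_intern(firehose: str, fetch_new_items, observe):
    """One intern: read the firehose, write down what it notices.

    `fetch_new_items(firehose, cursor)` and `observe(item)` are stand-ins
    for the source reader and the cheap-model perception call.
    """
    state = json.loads(STATE.read_text()) if STATE.exists() else {"cursor": None}
    while True:
        items, state["cursor"] = fetch_new_items(firehose, state["cursor"])
        for item in items:
            note = observe(item)          # cheap model; usually returns nothing
            if note is None:
                continue
            record = {                    # illustrative fields, not the real schema
                "ts": time.time(),
                "firehose": firehose,
                "observation": note,
                "source_url": item["url"],
            }
            with LIBRARY.open("a") as f:  # append-only: never rewrite a line
                f.write(json.dumps(record) + "\n")
        STATE.write_text(json.dumps(state))
        time.sleep(60 * 60)               # patience is the point
```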
A single agent watching one source is a curiosity. The structural bet is that several such agents, each watching an unrelated source, contributing into a single shared library, eventually produce something a single-domain agent cannot. Patterns observed in one place rhyme with questions being asked in another. The library is the substrate where those rhymes can be detected. The agents are the means; the library is the asset.
Quiet by default
The apparatus should not work most of the time. That's the design.
Inference cost is a step function rather than a gradient. Routine perception runs on the cheapest available model. Mid-tier models are reached for only when the cheap layer flags something worth a closer look. Expensive models are reserved for hypotheses that already have evidence behind them. The system spends almost nothing on almost every day, and spends real money only when its own internal evidence justifies it.
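The post fixes the shape of that step function without naming models or thresholds, so here is a hedged sketch of the escalation ladder; the tier names, the evidence threshold, and the flag semantics are invented for illustration.

```python
def choose_tier(item_flagged: bool, evidence_count: int) -> str:
    """Escalation ladder: each tier is reached only on evidence from the
    tier below. Tier names and the threshold of 3 are illustrative."""
    if not item_flagged:
        return "cheap"       # routine perception: the default on almost every day
    if evidence_count < 3:
        return "mid"         # the cheap layer flagged something worth a closer look
    return "expensive"       # a hypothesis that already has evidence behind it
```

The property that matters is directional: the default path never touches the upper tiers, and spend rises only when the layer below has already produced a reason.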
This isn't a deployment optimization. It's an architectural commitment. An agent that has to earn its keep on every call will spend its life trying to look productive. An agent whose attention is paid for by the cent while it stays quiet — and that occasionally gets to spend more when something is genuinely interesting — can afford to be patient. That patience is what makes long-horizon observation possible at all.
What is not in the apparatus, by deliberate omission: there's no scheduling layer. No orchestration framework. No retrieval-augmented-generation pipeline at write time. The interns don't read the library while they're writing into it. No UI. No operator dashboard. There's a `cat | jq | less` script I use to read the library by hand, and a small set of analysis programs that read the library file as input. Every component that would be normal in a different kind of agent project would also be a maintenance liability against years of patient operation.
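For concreteness, this is the kind of small analysis program that last sentence means (a hypothetical example, not one of the actual scripts): it reads the library file as plain input and tallies observations per firehose, assuming the illustrative record shape sketched earlier.

```python
import json
import sys
from collections import Counter

# Tally observations per firehose from a library file given on the command line.
# Assumes one JSON object per line with a "firehose" field (illustrative schema).
counts = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        counts[json.loads(line)["firehose"]] += 1

for firehose, n in counts.most_common():
    print(f"{firehose}\t{n}")
```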
Append, never overwrite
The library is append-only. The schema is versioned: when fields are added, every record from that point on carries them, and earlier records are not rewritten to match. The dispatches I publish are append-only too — a new dispatch supersedes an old one by being later, not by editing it.
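A versioned, append-only schema puts the burden of compatibility on readers: old records never gain the new fields, so anything reading the library fills the gaps at read time instead. A minimal sketch, assuming a `schema_version` field and illustrative defaults:

```python
import json

# Defaults for fields added after v1. Earlier records are never rewritten,
# so readers supply the missing values. (Field names are illustrative only.)
FIELD_DEFAULTS = {
    2: {"confidence": None},   # v2 added a confidence field
    3: {"thread_id": None},    # v3 added a thread identifier
}

def load_record(line: str) -> dict:
    record = json.loads(line)
    version = record.get("schema_version", 1)
    for v, defaults in FIELD_DEFAULTS.items():
        if version < v:                      # record predates these fields
            for field, default in defaults.items():
                record.setdefault(field, default)
    return record
```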
The discipline exists because the alternative — a database where the latest version is the only version — collapses the audit trail that lets future agents reconstruct why the project thinks what it thinks. Inspectability of how thinking changed is more valuable than the latest version of the thinking. It is the difference between a system that learns and a system that becomes superstition.
What it found
After several months of running, the Fleet had written 1,272 observations across five firehoses, at a cumulative inference spend under five dollars. Most of those observations are mundane. The point is that the substrate exists.
When the analysis layer ran clustering over the embeddings, eight thematic clusters emerged. Five of them satisfied the criteria I'd set in advance — coherent, nameable in a single sentence, and drawing from at least two firehoses with no single firehose dominating.
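The post doesn't say which embedding model or clustering algorithm the analysis layer used, so the sketch below is one plausible rendering: k-means over precomputed observation embeddings, followed by the structural half of the pre-set criteria. The 80% dominance cutoff is an assumption, and coherence and nameability are operator judgments that no code here stands in for.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def grade_clusters(embeddings: np.ndarray, firehoses: list[str], k: int = 8):
    """Cluster observation embeddings, then keep clusters that draw from
    at least two firehoses with no single firehose dominating.

    KMeans and the 0.8 dominance threshold are assumptions, not the
    project's actual choices.
    """
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    surviving = []
    for c in range(k):
        members = [fh for fh, lab in zip(firehoses, labels) if lab == c]
        if not members:
            continue
        counts = Counter(members)
        if len(counts) >= 2 and max(counts.values()) / len(members) < 0.8:
            surviving.append((c, counts))
    return surviving
```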
The one worth calling out: a cluster of sixty-six observations, drawn from r/ClaudeAI (26), r/openclaw (35), and r/LocalLLaMA (5), all converging on the same theme. Builders moving toward a single coordinating runtime — usually Claude Code — driven by plain markdown instructions, replacing heavyweight multi-agent orchestration frameworks. Three subreddits, no coordination between them, no operator schema asking the apparatus to look for this. It's a thread the library proposed.
Worth being precise about why that's the interesting bit. A 66-observation cluster from a single firehose would only say "this firehose has been talking about this." A cluster that draws from three subreddits, none of whose interns can read the others' writes during their own loop, says something different — that the same shape is showing up in places that did not coordinate on showing it. That's the resonance the project was built to detect. The apparatus detected it, on a corpus the apparatus itself wrote, against coordinates the apparatus itself proposed.
What's open, and why I'm writing this now
There's a technical report that goes much deeper into the substrate, the methodological corrections that surfaced — a 356x signal-to-noise claim that didn't survive an honest baseline and collapsed to 1.4x; a contribution-formula bug that dropped cross-axis contributor overlap from 44% to 2% — and the limitations of the measurement layer as it stands. None of that fits in a blog post, and I'm not trying to make it fit.
The corrections are the part of the project I'm proudest of. They're not buried. The discipline is that every claim is supposed to be tested against data, and when a claim fails the test, the failure is recorded plainly rather than smoothed over. Some of the most consequential parts of the report are the corrections.
What's still open: trajectory measurement — whether you can detect what's rising in the library, not just what's there — is currently running against thinner substrate than the original plan envisioned. Cluster grading is single-evaluator without inter-rater reliability. The firehose mix is narrow and skews toward LLM practitioners; diversifying it is an operational concern, not a one-time experiment.
None of these are dead ends. Several of them are why I'm writing this now. I think the project is interesting enough that more eyes on it would help, before I open-source the apparatus.
If you want to read what the apparatus says when it has something to say, the dispatches are where it speaks for itself. The first one is up.