Core Concepts¶
The Problem¶
Everyone's building "agent memory." Ask them one question: How do you know it works?
You'll get:
- "It feels smarter"
- "Users report better results"
- "The agent remembers things now"
That's not evidence. That's vibes.
Here's what a real answer looks like:
"We track Repeated Mistake Rate (RMR) across sessions. Our null hypothesis is that the system makes no difference. After 50 sessions, RMR decreased from 34% to 12% (p < 0.01). The effect size is 0.65. Here's the data."
If you can't say something like that, you don't have agent learning. You have a demo.
The Claim¶
buildlog makes a falsifiable claim:
H₀ (Null Hypothesis): buildlog makes no measurable difference to agent behavior.
H₁ (Alternative): Agents using buildlog-learned rules have lower Repeated Mistake Rate than baseline.
We provide the infrastructure to reject or fail to reject this hypothesis with your own data.
If buildlog doesn't work, the numbers will show it. That's the point.
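Testing H₀ against H₁ needs nothing fancier than a one-sided two-proportion z-test on RMR before and after rule injection. A stdlib-only sketch; the counts below are illustrative, chosen to match the 34% → 12% example quoted above, and the function name is ours, not buildlog's API:

```python
from math import sqrt, erf

def rmr_drop_test(repeats_before, n_before, repeats_after, n_after):
    """One-sided two-proportion z-test: did RMR really drop?
    Stdlib-only sketch; buildlog's own reporting may differ."""
    p1, p2 = repeats_before / n_before, repeats_after / n_after
    pooled = (repeats_before + repeats_after) / (n_before + n_after)
    se = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    z = (p1 - p2) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # P(Z >= z) under H0
    return z, p_value

# Illustrative counts: 17/50 repeated mistakes before, 6/50 after (34% -> 12%)
z, p = rmr_drop_test(17, 50, 6, 50)
print(f"z = {z:.2f}, p = {p:.4f}")  # small p => reject H0
```

With these counts the p-value lands below 0.01, which is exactly the kind of answer the section above calls "real."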
The Metric: Repeated Mistake Rate (RMR)¶
RMR = (Mistakes that match previous mistakes) / (Total mistakes logged)
A mistake "matches" if it shares a semantic signature with an earlier one: the same error class, a similar description, the same root cause appearing again.
Why RMR?
- Observable: You can count it
- Attributable: Lower RMR after rule injection = signal
- Meaningful: Repeating mistakes is the actual pain point
RMR is not the only metric that matters. But it's one we can measure, and measurement is where science starts.
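The metric is simple to compute once a signature function exists. A minimal sketch, where `semantic_signature` is a hypothetical stand-in for buildlog's actual matcher (which may use semantic hashing):

```python
def semantic_signature(mistake: dict) -> tuple:
    # Hypothetical stand-in: key on error class + normalized root cause.
    return (mistake["error_class"], mistake["root_cause"].strip().lower())

def repeated_mistake_rate(mistakes: list[dict]) -> float:
    """RMR = mistakes matching a previous mistake / total mistakes logged."""
    if not mistakes:
        return 0.0
    seen, repeated = set(), 0
    for m in mistakes:
        sig = semantic_signature(m)
        if sig in seen:
            repeated += 1  # same signature seen before: a repeat
        seen.add(sig)
    return repeated / len(mistakes)

mistakes = [
    {"error_class": "type-errors", "root_cause": "None passed to len()"},
    {"error_class": "import-errors", "root_cause": "circular import"},
    {"error_class": "type-errors", "root_cause": "None passed to len()"},
]
print(repeated_mistake_rate(mistakes))  # 1 repeat out of 3 mistakes
```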
The Mechanism¶
buildlog is building toward contextual bandits for automatic rule selection.
What Exists Today (v0.8)¶
| Component | Description | Status |
|---|---|---|
| Rule extraction | From entries, reviews, curated seeds | Implemented |
| Confidence scoring | Frequency + recency based | Implemented |
| Reward logging | Accept/reject/revision signals | Implemented |
| Experiment tracking | Sessions, mistakes, RMR calculation | Implemented |
| Review gauntlet | Curated persona-based code review | Implemented |
| Thompson Sampling | Automatic rule selection via bandit | Implemented |
Thompson Sampling Bandit¶
| Element | Detail |
|---|---|
| Context (c) | Error class (e.g., "type-errors") |
| Arms (a) | Candidate rules to surface |
| Reward (r) | Binary feedback: mistake logged (0), accepted (1), rejected (0) |
| Model | Beta-Bernoulli (conjugate prior) |
| Policy | Thompson Sampling (sample, don't exploit) |
| Learning | Bayesian updates on every feedback signal |
How it works:
- Session starts — Bandit samples from Beta distributions, selects top-k rules
- Mistake logged — Selected rules get reward=0 (they didn't prevent the mistake)
- Explicit reward — Rules get reward based on outcome (accepted=1, rejected=0)
Seed-boosted priors: Curated rules from gauntlet personas start with boosted priors (Beta(3,1) instead of Beta(1,1)), reflecting our belief that expert-curated rules are likely effective.
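Putting the table and the steps above together, the whole loop fits in a few lines. Class and method names here are illustrative, not buildlog's actual API; the Beta(3, 1) seed prior matches the description above:

```python
import random

class ContextualThompsonBandit:
    """Beta-Bernoulli Thompson Sampling sketch for contextual rule selection.
    Names are illustrative, not buildlog's real interface."""

    def __init__(self, seed_rules=()):
        self.seed_rules = set(seed_rules)  # curated gauntlet rules
        self.params = {}                   # (context, rule) -> [alpha, beta]

    def _prior(self, rule):
        # Seed-boosted Beta(3, 1) for curated rules, uniform Beta(1, 1) otherwise
        return [3.0, 1.0] if rule in self.seed_rules else [1.0, 1.0]

    def select(self, context, rules, k=2):
        # Session start: draw once from each arm's posterior, surface top-k
        draws = {}
        for rule in rules:
            a, b = self.params.setdefault((context, rule), self._prior(rule))
            draws[rule] = random.betavariate(a, b)
        return sorted(draws, key=draws.get, reverse=True)[:k]

    def update(self, context, rule, reward):
        # Bayesian update: reward=1 bumps alpha, reward=0 bumps beta
        a, b = self.params[(context, rule)]
        self.params[(context, rule)] = [a + reward, b + (1 - reward)]

bandit = ContextualThompsonBandit(seed_rules={"check-none-before-len"})
surfaced = bandit.select("type-errors", ["check-none-before-len", "prefer-isinstance"])
bandit.update("type-errors", surfaced[0], reward=0)  # a mistake slipped through
```

Sampling from the posterior rather than greedily picking the highest mean is what buys exploration: an untested rule with a wide posterior occasionally out-draws a proven one, so every rule keeps getting evidence.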
Theoretical Foundations¶
| Concept | Application in buildlog | Status |
|---|---|---|
| Confidence scoring | Frequency + recency decay | Implemented |
| Semantic hashing | Mistake deduplication for RMR | Implemented |
| Reward signals | Binary feedback infrastructure | Implemented |
| Thompson Sampling | Rule selection under uncertainty | Implemented (v0.8) |
| Beta-Bernoulli model | Posterior updates from binary reward | Implemented (v0.8) |
| Contextual bandits | Context-dependent rule selection | Implemented (v0.8) |
| Regret bounds | O(sqrt(KT log K)) theoretical guarantee | Follows from TS |
We're not inventing new math. We're applying proven frameworks to a new domain.