Core Concepts¶
The Problem¶
Everyone's building "agent memory." Ask them one question: How do you know it works?
You'll get:
- "It feels smarter"
- "Users report better results"
- "The agent remembers things now"
That's not evidence. That's vibes.
Here's what a real answer looks like:
"We track Repeated Mistake Rate (RMR) across sessions. Our null hypothesis is that the system makes no difference. After 50 sessions, RMR decreased from 34% to 12% (p < 0.01). The effect size is 0.65. Here's the data."
If you can't say something like that, you don't have agent learning. You have a demo.
The Claim¶
buildlog makes a falsifiable claim:
H₀ (Null Hypothesis): buildlog makes no measurable difference to agent behavior.
H₁ (Alternative): Agents using buildlog-learned rules have lower Repeated Mistake Rate than baseline.
We provide the infrastructure to reject or fail to reject this hypothesis with your own data.
If buildlog doesn't work, the numbers will show it. That's the point.
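Testing H₀ against H₁ needs nothing fancier than a one-sided two-proportion z-test on RMR before and after rule injection. A stdlib-only sketch; the counts below are illustrative, chosen to match the 34% → 12% example quoted above, and the function name is ours, not buildlog's API:

```python
from math import sqrt, erf

def rmr_drop_test(repeats_before, n_before, repeats_after, n_after):
    """One-sided two-proportion z-test: did RMR really drop?
    Stdlib-only sketch; buildlog's own reporting may differ."""
    p1, p2 = repeats_before / n_before, repeats_after / n_after
    pooled = (repeats_before + repeats_after) / (n_before + n_after)
    se = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    z = (p1 - p2) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # P(Z >= z) under H0
    return z, p_value

# Illustrative counts: 17/50 repeated mistakes before, 6/50 after (34% -> 12%)
z, p = rmr_drop_test(17, 50, 6, 50)
print(f"z = {z:.2f}, p = {p:.4f}")  # small p => reject H0
```

With these counts the p-value lands below 0.01, which is exactly the kind of answer the section above calls "real."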
The Metric: Repeated Mistake Rate (RMR)¶
RMR = (Mistakes that match previous mistakes) / (Total mistakes logged)
A mistake "matches" if it shares a semantic signature with an earlier one: the same error class, a similar description, the same root cause appearing again.
Why RMR?
- Observable: You can count it
- Attributable: Lower RMR after rule injection = signal
- Meaningful: Repeating mistakes is the actual pain point
RMR is not the only metric that matters. But it's one we can measure, and measurement is where science starts.
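The metric is simple to compute once a signature function exists. A minimal sketch, where `semantic_signature` is a hypothetical stand-in for buildlog's actual matcher (which may use semantic hashing):

```python
def semantic_signature(mistake: dict) -> tuple:
    # Hypothetical stand-in: key on error class + normalized root cause.
    return (mistake["error_class"], mistake["root_cause"].strip().lower())

def repeated_mistake_rate(mistakes: list[dict]) -> float:
    """RMR = mistakes matching a previous mistake / total mistakes logged."""
    if not mistakes:
        return 0.0
    seen, repeated = set(), 0
    for m in mistakes:
        sig = semantic_signature(m)
        if sig in seen:
            repeated += 1  # same signature seen before: a repeat
        seen.add(sig)
    return repeated / len(mistakes)

mistakes = [
    {"error_class": "type-errors", "root_cause": "None passed to len()"},
    {"error_class": "import-errors", "root_cause": "circular import"},
    {"error_class": "type-errors", "root_cause": "None passed to len()"},
]
print(repeated_mistake_rate(mistakes))  # 1 repeat out of 3 mistakes
```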
The Mechanism¶
buildlog is building toward contextual bandits for automatic rule selection.
What Exists Today (v0.8)¶
| Component | Description | Status |
|---|---|---|
| Rule extraction | From entries, reviews, curated seeds | Implemented |
| Confidence scoring | Frequency + recency based | Implemented |
| Reward logging | Accept/reject/revision signals | Implemented |
| Experiment tracking | Sessions, mistakes, RMR calculation | Implemented |
| Review gauntlet | Curated persona-based code review | Implemented |
| Thompson Sampling | Automatic rule selection via bandit | Implemented |
Thompson Sampling Bandit¶
| Element | Detail |
|---|---|
| Context (c) | Error class (e.g., "type-errors") |
| Arms (a) | Candidate rules to surface |
| Reward (r) | Binary feedback: mistake logged (0), accepted (1), rejected (0) |
| Model | Beta-Bernoulli (conjugate prior) |
| Policy | Thompson Sampling (sample, don't exploit) |
| Learning | Bayesian updates on every feedback signal |
How it works:
- Session starts — Bandit samples from Beta distributions, selects top-k rules
- Mistake logged — Selected rules get reward=0 (they didn't prevent the mistake)
- Explicit reward — Rules get reward based on outcome (accepted=1, rejected=0)
Seed-boosted priors: Curated rules from gauntlet personas start with boosted priors (Beta(3,1) instead of Beta(1,1)), reflecting our belief that expert-curated rules are likely effective.
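Putting the table and the steps above together, the whole loop fits in a few lines. Class and method names here are illustrative, not buildlog's actual API; the Beta(3, 1) seed prior matches the description above:

```python
import random

class ContextualThompsonBandit:
    """Beta-Bernoulli Thompson Sampling sketch for contextual rule selection.
    Names are illustrative, not buildlog's real interface."""

    def __init__(self, seed_rules=()):
        self.seed_rules = set(seed_rules)  # curated gauntlet rules
        self.params = {}                   # (context, rule) -> [alpha, beta]

    def _prior(self, rule):
        # Seed-boosted Beta(3, 1) for curated rules, uniform Beta(1, 1) otherwise
        return [3.0, 1.0] if rule in self.seed_rules else [1.0, 1.0]

    def select(self, context, rules, k=2):
        # Session start: draw once from each arm's posterior, surface top-k
        draws = {}
        for rule in rules:
            a, b = self.params.setdefault((context, rule), self._prior(rule))
            draws[rule] = random.betavariate(a, b)
        return sorted(draws, key=draws.get, reverse=True)[:k]

    def update(self, context, rule, reward):
        # Bayesian update: reward=1 bumps alpha, reward=0 bumps beta
        a, b = self.params[(context, rule)]
        self.params[(context, rule)] = [a + reward, b + (1 - reward)]

bandit = ContextualThompsonBandit(seed_rules={"check-none-before-len"})
surfaced = bandit.select("type-errors", ["check-none-before-len", "prefer-isinstance"])
bandit.update("type-errors", surfaced[0], reward=0)  # a mistake slipped through
```

Sampling from the posterior rather than greedily picking the highest mean is what buys exploration: an untested rule with a wide posterior occasionally out-draws a proven one, so every rule keeps getting evidence.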
Theoretical Foundations¶
| Concept | Application in buildlog | Status |
|---|---|---|
| Confidence scoring | Frequency + recency decay | Implemented |
| Semantic hashing | Mistake deduplication for RMR | Implemented |
| Reward signals | Binary feedback infrastructure | Implemented |
| Thompson Sampling | Rule selection under uncertainty | Implemented (v0.8) |
| Beta-Bernoulli model | Posterior updates from binary reward | Implemented (v0.8) |
| Contextual bandits | Context-dependent rule selection | Implemented (v0.8) |
| Regret bounds | O(sqrt(KT log K)) theoretical guarantee | Follows from TS |
We're not inventing new math. We're applying proven frameworks to a new domain.