Core Concepts¶
Arms¶
An arm is any component that can be included or excluded from the agent's prompt. Each arm has a hierarchical ID in `type:category:id` format.
Arm Types¶
| Type | Description | Example ID |
|---|---|---|
| `tool` | Agent tools (Read, Write, Bash, etc.) | `tool:fs:Read` |
| `memory` | Memory entries loaded into context | `memory:project:auth-notes` |
| `skill` | Skill/plugin prompt sections | `skill:coding:main` |
| `file` | Workspace files in context | `file:workspace:README.md` |
| `section` | Structural prompt sections | `section:system:instructions` |
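The ID format can be illustrated with a small parser. This is a minimal sketch rather than the library's own code: `parseArmId`, `ArmId`, and `ARM_TYPES` are hypothetical names, and it assumes that only the first two colons act as separators.

```typescript
// Arm types listed in the table above.
type ArmType = "tool" | "memory" | "skill" | "file" | "section";

interface ArmId {
  type: ArmType;
  category: string;
  id: string;
}

const ARM_TYPES: readonly string[] = ["tool", "memory", "skill", "file", "section"];

// Hypothetical helper: split "type:category:id" into its three parts.
// Assumption: only the first two colons are separators, so the id part
// may itself contain colons.
function parseArmId(raw: string): ArmId {
  const first = raw.indexOf(":");
  const second = raw.indexOf(":", first + 1);
  if (first <= 0 || second <= first + 1 || second === raw.length - 1) {
    throw new Error(`Malformed arm ID: ${raw}`);
  }
  const type = raw.slice(0, first);
  if (ARM_TYPES.indexOf(type) === -1) {
    throw new Error(`Unknown arm type: ${type}`);
  }
  return {
    type: type as ArmType,
    category: raw.slice(first + 1, second),
    id: raw.slice(second + 1),
  };
}
```

For example, `parseArmId("file:workspace:README.md")` yields the type `file`, the category `workspace`, and the id `README.md`.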
Seed Arms¶
Some arms are seed arms — core tools that are never excluded by Thompson Sampling. The default seed set is:
| Arm ID | Description |
|---|---|
| `tool:fs:Read` | Read files |
| `tool:fs:Write` | Write files |
| `tool:fs:Edit` | Edit files |
| `tool:exec:Bash` | Execute commands |
| `tool:fs:Glob` | Find files by pattern |
| `tool:fs:Grep` | Search file contents |
Seed arms are always included regardless of their posterior scores.
Token Cost¶
Each arm has an estimated token cost based on its size in the prompt. Tools are estimated at `ceil(JSON.stringify(tool).length / 4)` tokens; files and skills use `ceil(content.length / 4)`.
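The two estimates translate directly into code. The function names and the sample tool definition below are illustrative assumptions; only the formulas come from the text above.

```typescript
// Rough heuristic from the text: about 4 characters per token.
const CHARS_PER_TOKEN = 4;

// Tools are measured on their serialized JSON definition.
function estimateToolTokens(tool: object): number {
  return Math.ceil(JSON.stringify(tool).length / CHARS_PER_TOKEN);
}

// Files and skills are measured on their raw text content.
function estimateContentTokens(content: string): number {
  return Math.ceil(content.length / CHARS_PER_TOKEN);
}

// Hypothetical tool definition, for illustration only.
const readTool = { name: "Read", description: "Read files" };
const toolTokens = estimateToolTokens(readTool);
const fileTokens = estimateContentTokens("# README\n");
```

Because the estimate is character-based, it is cheap to compute but only approximates what a real tokenizer would report.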
Posteriors¶
Each arm maintains a posterior — a Beta distribution that represents the system's belief about the arm's usefulness.
Beta Distribution¶
A Beta(alpha, beta) distribution models the probability of success:
- alpha: Accumulated successes + prior
- beta: Accumulated failures + prior
- mean: `alpha / (alpha + beta)`, the expected usefulness score
- variance: Decreases as more data is collected
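To make the mean and variance concrete, here is a short sketch using the standard Beta formulas. The `BetaPosterior` type and function names are assumptions chosen for illustration.

```typescript
interface BetaPosterior {
  alpha: number; // accumulated successes + prior
  beta: number;  // accumulated failures + prior
}

// Expected usefulness score.
function betaMean(p: BetaPosterior): number {
  return p.alpha / (p.alpha + p.beta);
}

// Standard Beta variance: ab / ((a + b)^2 * (a + b + 1)).
function betaVariance(p: BetaPosterior): number {
  const n = p.alpha + p.beta;
  return (p.alpha * p.beta) / (n * n * (n + 1));
}

// Two posteriors with the same mean but different amounts of evidence.
const fewObservations: BetaPosterior = { alpha: 4, beta: 1 };
const manyObservations: BetaPosterior = { alpha: 40, beta: 10 };
```

Both posteriors have a mean of 0.8, but `manyObservations` has a much smaller variance, which is what "decreases as more data is collected" means in practice.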
Credible Intervals¶
Each posterior has a 95% credible interval [lower, upper] computed via normal approximation. Narrow intervals indicate high confidence; wide intervals indicate uncertainty.
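A normal-approximation interval can be sketched as mean plus or minus 1.96 standard deviations. The function name and the clamping to [0, 1] are assumptions; the text does not say how out-of-range bounds are handled.

```typescript
// 95% credible interval via normal approximation: mean +/- 1.96 * sd.
// Assumption: bounds are clamped to [0, 1], since a usefulness
// probability cannot fall outside that range.
function credibleInterval95(alpha: number, beta: number): [number, number] {
  const n = alpha + beta;
  const mean = alpha / n;
  const sd = Math.sqrt((alpha * beta) / (n * n * (n + 1)));
  const z = 1.96;
  return [Math.max(0, mean - z * sd), Math.min(1, mean + z * sd)];
}

const wideInterval = credibleInterval95(4, 1);     // little data
const narrowInterval = credibleInterval95(40, 10); // more data, same mean
```

The approximation is least accurate for small `alpha + beta` or means near 0 or 1, which is where the clamping takes effect.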
Confidence Levels¶
| Pulls | Confidence | Interpretation |
|---|---|---|
| < 5 | Low | Insufficient data; arm always included |
| 5-19 | Medium | Growing certainty; bandit may explore or exploit |
| 20+ | High | Strong signal; bandit relies on posterior mean |
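The thresholds in the table map onto a small classifier. `confidenceLevel` is a hypothetical name, and the lowercase labels are an assumption.

```typescript
type Confidence = "low" | "medium" | "high";

// Thresholds taken from the table above: < 5, 5-19, 20+.
function confidenceLevel(pulls: number): Confidence {
  if (pulls < 5) return "low";
  if (pulls < 20) return "medium";
  return "high";
}
```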
Run Traces¶
A run trace captures everything about a single agent request:
| Field | Description |
|---|---|
| `traceId` | Unique trace identifier |
| `runId` | Run identifier (may span multiple traces) |
| `sessionId` | Session identifier |
| `timestamp` | Unix timestamp (ms) |
| `provider` | AI provider (e.g., "anthropic") |
| `model` | Model name (e.g., "claude-sonnet-4") |
| `isBaseline` | Whether this was a full-prompt baseline run |
| `arms` | Array of arm outcomes (included, referenced, tokenCost) |
| `usage` | Token usage (input, output, cacheRead, total) |
| `durationMs` | Request duration in milliseconds |
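As a rough guide, the fields above could be typed as follows. The field names come from the table; the TypeScript types, the `armId` field, and all example values are assumptions made for illustration.

```typescript
interface ArmOutcome {
  armId: string;       // assumed field; the table names only the next three
  included: boolean;
  referenced: boolean;
  tokenCost: number;
}

interface TokenUsage {
  input: number;
  output: number;
  cacheRead: number;
  total: number;
}

interface RunTrace {
  traceId: string;
  runId: string;
  sessionId: string;
  timestamp: number;   // Unix timestamp in milliseconds
  provider: string;
  model: string;
  isBaseline: boolean;
  arms: ArmOutcome[];
  usage: TokenUsage;
  durationMs: number;
}

// Made-up example values, not real data.
const exampleTrace: RunTrace = {
  traceId: "trace-001",
  runId: "run-001",
  sessionId: "session-001",
  timestamp: 1700000000000,
  provider: "anthropic",
  model: "claude-sonnet-4",
  isBaseline: false,
  arms: [
    { armId: "tool:fs:Read", included: true, referenced: true, tokenCost: 120 },
    { armId: "memory:project:auth-notes", included: false, referenced: false, tokenCost: 340 },
  ],
  usage: { input: 1200, output: 300, cacheRead: 0, total: 1500 },
  durationMs: 2150,
};
```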
Arm Outcomes¶
Each arm in a trace has an outcome:
- included: true, referenced: true — Arm was in the prompt and the model used it (reward = 1.0)
- included: true, referenced: false — Arm was in the prompt but not used (reward = 0.0)
- included: false — Arm was excluded; no reward update (counterfactual not observed)
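The three cases above suggest a simple update rule. This is a sketch under stated assumptions: `updatePosterior` is a hypothetical name, the `pulls` counter is assumed to track observations, and the starting point is assumed to be a uniform Beta(1, 1) prior.

```typescript
interface Posterior {
  alpha: number; // accumulated successes + prior
  beta: number;  // accumulated failures + prior
  pulls: number; // assumed counter of observed outcomes
}

interface Outcome {
  included: boolean;
  referenced: boolean;
}

// Returns a new posterior; excluded arms are left unchanged because
// the counterfactual was not observed.
function updatePosterior(p: Posterior, o: Outcome): Posterior {
  if (!o.included) return p;
  const reward = o.referenced ? 1.0 : 0.0;
  return {
    alpha: p.alpha + reward,
    beta: p.beta + (1 - reward),
    pulls: p.pulls + 1,
  };
}

// Assumed uniform prior, for illustration only.
const prior: Posterior = { alpha: 1, beta: 1, pulls: 0 };
const afterUse = updatePosterior(prior, { included: true, referenced: true });
const afterIgnore = updatePosterior(afterUse, { included: true, referenced: false });
const afterExclusion = updatePosterior(afterIgnore, { included: false, referenced: false });
```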
Thompson Sampling¶
Thompson Sampling is a Bayesian approach to the multi-armed bandit problem. Instead of always picking the arm with the highest average score, it samples from each arm's posterior distribution and selects based on the samples.
This naturally balances:
- Exploitation — Arms with high posteriors are sampled high more often
- Exploration — Uncertain arms occasionally sample high, getting included for more data
See Thompson Sampling Theory for the full algorithm and comparison with alternatives.
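The sampling step can be sketched without external libraries by drawing two Gamma variates and normalizing. This is one possible implementation, not necessarily the one the system uses: it relies on the Marsaglia-Tsang Gamma method and the Box-Muller transform, and all function names are assumptions.

```typescript
type Rng = () => number; // returns a uniform value in [0, 1)

// Standard normal draw via the Box-Muller transform.
function sampleNormal(rng: Rng): number {
  const u1 = 1 - rng(); // shift to (0, 1] so Math.log never sees 0
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Gamma(shape, 1) draw via the Marsaglia-Tsang method.
function sampleGamma(shape: number, rng: Rng): number {
  if (shape < 1) {
    // Boost small shapes: Gamma(a) = Gamma(a + 1) * U^(1/a).
    return sampleGamma(shape + 1, rng) * Math.pow(1 - rng(), 1 / shape);
  }
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = sampleNormal(rng);
    const v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue;
    const u = 1 - rng();
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) {
      return d * v;
    }
  }
}

// Beta(alpha, beta) draw: X / (X + Y) with X ~ Gamma(alpha), Y ~ Gamma(beta).
function sampleBeta(alpha: number, beta: number, rng: Rng = Math.random): number {
  const x = sampleGamma(alpha, rng);
  const y = sampleGamma(beta, rng);
  return x / (x + y);
}

// One Thompson draw per arm; the arm with the higher draw ranks first.
const wellKnownDraw = sampleBeta(40, 10); // confident, mean 0.8
const uncertainDraw = sampleBeta(2, 2);   // uncertain, mean 0.5
const uncertainRanksFirst = uncertainDraw > wellKnownDraw;
```

Most of the time the well-known arm draws higher, but the uncertain arm's wide distribution occasionally wins, which is the exploration behavior described above.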
Two Phases¶
Passive Phase¶
In passive mode, the learning layer observes but does not act:
- All arms are included in every run
- Traces are recorded with arm outcomes
- Posteriors are still updated from each trace, so they are ready for active mode and available for analysis
- No impact on agent behavior
This is the default and is safe to leave on indefinitely.
Active Phase¶
In active mode, the learning layer optimizes prompt composition:
- Thompson Sampling selects arms within the token budget
- Seed arms and underexplored arms (fewer than `minPulls` observations) are always included
- Baseline runs (configurable rate) use the full prompt for comparison
- Posteriors are updated after each run
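One way to picture the active-phase selection is a greedy pass over sampled scores. The text does not specify how arms are fitted into the budget, so the greedy strategy, the choice to count mandatory arms against the budget, and every name here (`selectArms`, `Candidate`, `SelectionOptions`) are assumptions.

```typescript
interface Candidate {
  id: string;
  tokenCost: number;
  pulls: number;
  alpha: number;
  beta: number;
}

interface SelectionOptions {
  tokenBudget: number;
  minPulls: number;
  seedIds: string[];
  // Injected sampler, e.g. a Beta draw; injectable so tests can be deterministic.
  sample: (alpha: number, beta: number) => number;
}

function selectArms(arms: Candidate[], opts: SelectionOptions): string[] {
  const selected: string[] = [];
  const optional: { arm: Candidate; score: number }[] = [];
  let used = 0;

  for (const arm of arms) {
    const isSeed = opts.seedIds.indexOf(arm.id) !== -1;
    if (isSeed || arm.pulls < opts.minPulls) {
      // Seed and underexplored arms are always included.
      selected.push(arm.id);
      used += arm.tokenCost;
    } else {
      optional.push({ arm, score: opts.sample(arm.alpha, arm.beta) });
    }
  }

  // Assumed strategy: highest sampled score first, skipping arms that do not fit.
  optional.sort((a, b) => b.score - a.score);
  for (const { arm } of optional) {
    if (used + arm.tokenCost <= opts.tokenBudget) {
      selected.push(arm.id);
      used += arm.tokenCost;
    }
  }
  return selected;
}
```

A real selector might rank by score per token or solve a knapsack problem instead; the greedy pass is only the simplest option that respects the rules listed above.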
Excluded-Tools Guidance¶
When Thompson Sampling excludes tools from a run, the system injects guidance into the model's system prompt listing which tools are currently unavailable. This means the model can explain to users when a requested capability is temporarily excluded, rather than silently producing an empty or confused response.
When to switch to active:
- You have 50+ traces (enough data for meaningful posteriors)
- You want to start saving tokens
- You've reviewed the dashboard and understand which arms are high/low value
Baseline Runs¶
A fraction of runs (default 10%) use the full prompt — all arms included, no Thompson Sampling selection. These baseline runs enable:
- Counterfactual evaluation — Compare optimized runs against full-prompt performance
- Continuous data collection — All arms get occasional observations, preventing stale posteriors
- Drift detection — If baseline performance changes, the system can detect shifting arm relevance
The baseline rate is configurable via `baselineRate`. Recommended rates depend on the number of arms:
| Arm Count | Recommended Rate |
|---|---|
| 1-10 | 20% |
| 11-50 | 10% |
| 51+ | 5% |
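The recommendation table and the per-run baseline decision can be sketched as below. Only `baselineRate` is named in the text; the function names and the random-draw mechanism are assumptions.

```typescript
// Recommended rates from the table above, keyed on arm count.
function recommendedBaselineRate(armCount: number): number {
  if (armCount <= 10) return 0.2;
  if (armCount <= 50) return 0.1;
  return 0.05;
}

// Assumed mechanism: each run is a baseline with probability baselineRate.
// The rng parameter is injectable so the decision can be tested.
function isBaselineRun(baselineRate: number, rng: () => number = Math.random): boolean {
  return rng() < baselineRate;
}
```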
Reference Detection¶
After each run, the learning layer checks whether each included arm was actually referenced by the model's output. Detection logic varies by arm type:
| Arm Type | Detection Method |
|---|---|
| `tool` | Tool name appears in tool call metadata |
| `skill` | Skill name mentioned in output or tool metadata |
| `file` | Filename appears in assistant text |
| `memory` | Substring (20+ chars) of memory content appears in output |
| `section` | Always considered referenced when included |
A reference is treated as evidence that the model found the arm useful, and it drives the reward signal (referenced = 1.0, not referenced = 0.0).
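The per-type rules can be sketched as a single dispatch function. The shapes of `IncludedArm` and `ModelOutput` are assumptions, as is treating "tool metadata" as a list of called tool names; only the detection rules themselves come from the table.

```typescript
type ArmKind = "tool" | "skill" | "file" | "memory" | "section";

interface IncludedArm {
  type: ArmKind;
  name: string;     // tool name, skill name, or filename
  content?: string; // memory text, used only for memory arms
}

interface ModelOutput {
  text: string;        // assistant text
  toolCalls: string[]; // assumed: names of tools the model called
}

const MIN_MEMORY_MATCH = 20;

// True if any window of minLen characters from content occurs in text.
// Checking fixed-size windows is enough: any longer shared substring
// contains a shared window of exactly minLen characters.
function sharesSubstring(content: string, text: string, minLen: number): boolean {
  for (let i = 0; i + minLen <= content.length; i++) {
    if (text.indexOf(content.slice(i, i + minLen)) !== -1) return true;
  }
  return false;
}

function wasReferenced(arm: IncludedArm, out: ModelOutput): boolean {
  switch (arm.type) {
    case "tool":
      return out.toolCalls.indexOf(arm.name) !== -1;
    case "skill":
      return out.text.indexOf(arm.name) !== -1 || out.toolCalls.indexOf(arm.name) !== -1;
    case "file":
      return out.text.indexOf(arm.name) !== -1;
    case "memory":
      return sharesSubstring(arm.content || "", out.text, MIN_MEMORY_MATCH);
    case "section":
      return true;
  }
}
```

Substring matching is a heuristic: a paraphrased memory will not be detected, and a filename mentioned in passing will be.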
Next Steps¶
- Quick Start — See these concepts in action
- Thompson Sampling — Full algorithm details
- Reward Model — How rewards and priors work