Thompson Sampling¶
The Multi-Armed Bandit Problem¶
Imagine a row of slot machines (bandits), each with an unknown payout rate. You want to maximize your total reward, but you don't know which machine is best. You must balance:
- Exploration — Try different machines to learn their payout rates
- Exploitation — Play the machine you currently believe is best
In OpenClaw, the "machines" are prompt components (tools, skills, files) and the "reward" is whether the model actually references them.
Why Thompson Sampling¶
Several algorithms solve the multi-armed bandit problem. Here's how they compare:
| Algorithm | Approach | Pros | Cons |
|---|---|---|---|
| Epsilon-Greedy | Random exploration with probability epsilon | Simple | Wastes exploration on known-bad arms |
| UCB1 | Upper confidence bound on reward estimate | Deterministic | Overly optimistic; slow to converge with many arms |
| Thompson Sampling | Sample from posterior, pick highest sample | Naturally adaptive exploration | Requires prior specification |
Thompson Sampling is the best fit for prompt optimization because:
- Adaptive exploration — Uncertain arms get explored more; well-known arms get exploited
- Principled priors — We can encode domain knowledge (curated tools are likely useful)
- Natural convergence — Exploration decreases automatically as posteriors tighten
- Budget-compatible — Scores can be used for greedy selection within a token budget
How It Works in OpenClaw¶
Step 1: Maintain Posteriors¶
Each arm has a Beta(alpha, beta) posterior. Initially:
- Curated arms (tools, skills, memories): Beta(3, 1) — mean 0.75, optimistic
- Learned arms (files): Beta(1, 1) — mean 0.50, neutral
After each observation:
- Referenced (success): `alpha += 1`
- Not referenced (failure): `beta += 1`
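As a concrete sketch, this bookkeeping is a conjugate Beta-Bernoulli update. The `Posterior` shape, `ArmKind` union, and `initialPrior`/`update` helpers below are illustrative names, not OpenClaw's actual API:

```ts
// Illustrative sketch of Step 1; type and function names are hypothetical.

interface Posterior {
  alpha: number; // prior pseudo-successes + observed successes
  beta: number;  // prior pseudo-failures + observed failures
}

type ArmKind = "tool" | "skill" | "memory" | "file";

// Curated arms start optimistic (Beta(3, 1)); learned arms start neutral (Beta(1, 1)).
function initialPrior(kind: ArmKind): Posterior {
  return kind === "file" ? { alpha: 1, beta: 1 } : { alpha: 3, beta: 1 };
}

// Conjugate update: each observation increments exactly one count.
function update(p: Posterior, referenced: boolean): Posterior {
  return referenced
    ? { alpha: p.alpha + 1, beta: p.beta }
    : { alpha: p.alpha, beta: p.beta + 1 };
}
```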
Step 2: Sample Scores¶
For each arm, draw a random sample from its Beta posterior:
```
score_i ~ Beta(alpha_i, beta_i)
```
Arms with high posterior means tend to sample high (exploitation). Arms with wide, uncertain posteriors occasionally sample high (exploration).
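JavaScript has no built-in Beta sampler, so a minimal sketch can draw two Gamma variates with the Marsaglia-Tsang method and take their ratio. This is a standard construction, not necessarily OpenClaw's implementation:

```ts
// Beta(a, b) sampled as X / (X + Y), where X ~ Gamma(a) and Y ~ Gamma(b).

function randNormal(): number {
  // Box-Muller transform
  const u = 1 - Math.random(); // avoid log(0)
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

function randGamma(shape: number): number {
  if (shape < 1) {
    // Boost a sub-1 shape: Gamma(k) = Gamma(k + 1) * U^(1/k)
    return randGamma(shape + 1) * Math.pow(Math.random(), 1 / shape);
  }
  // Marsaglia-Tsang squeeze method for shape >= 1
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let x: number, v: number;
    do {
      x = randNormal();
      v = 1 + c * x;
    } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (u < 1 - 0.0331 * x ** 4) return d * v;
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v;
  }
}

function sampleBeta(alpha: number, beta: number): number {
  const x = randGamma(alpha);
  const y = randGamma(beta);
  return x / (x + y);
}
```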
Step 3: Rank and Pack¶
Sort arms by priority:
- Seed arms — Always first (never excluded)
- Underexplored arms — Fewer than `minPulls` observations
- All others — Sorted by Thompson score (descending)
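In TypeScript, this three-key ordering can be expressed as a single comparator. The `Scored` shape is an assumption about what a scored arm carries:

```ts
// Hypothetical shape of a scored arm, for sorting purposes only.
interface Scored {
  score: number;            // Thompson sample from the arm's Beta posterior
  isSeed: boolean;          // always-included seed arm
  isUnderexplored: boolean; // fewer than minPulls observations
}

function byPriority(a: Scored, b: Scored): number {
  if (a.isSeed !== b.isSeed) return a.isSeed ? -1 : 1; // seeds first
  if (a.isUnderexplored !== b.isUnderexplored) {
    return a.isUnderexplored ? -1 : 1;                 // then underexplored arms
  }
  return b.score - a.score;                            // then by score, descending
}
```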
Greedily select arms within the token budget:
```
selected = []
usedTokens = 0
for arm in sorted_arms:
    if usedTokens + arm.tokenCost <= tokenBudget:
        selected.append(arm)
        usedTokens += arm.tokenCost
```
Step 4: Observe and Update¶
After the model responds, check which included arms were referenced:
- Included + referenced → reward = 1.0 → `alpha += 1`
- Included + not referenced → reward = 0.0 → `beta += 1`
- Excluded → no update (counterfactual not observed)
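A minimal sketch of that feedback step, reusing the `update` helper from Step 1; the `observe` signature and the referenced-ID set are assumptions about the surrounding plumbing:

```ts
// Hypothetical post-response hook: update only the arms that were included.
function observe(
  posteriors: Map<string, Posterior>,
  includedArmIds: string[],
  referencedIds: Set<string>,
): void {
  for (const id of includedArmIds) {
    const p = posteriors.get(id);
    if (!p) continue; // unseen arm; its prior is created lazily elsewhere
    posteriors.set(id, update(p, referencedIds.has(id)));
    // Excluded arms never appear in includedArmIds, so they stay untouched.
  }
}
```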
Selection Algorithm¶
```
function thompsonSelect(arms, posteriors, tokenBudget, config):
    // Baseline check
    if random() < config.baselineRate:
        return selectAll(arms, tokenBudget)

    // Score each arm
    for arm in arms:
        posterior = posteriors[arm.id] ?? initialPrior(arm)
        arm.score = sampleBeta(posterior.alpha, posterior.beta)
        arm.isSeed = arm.id in SEED_ARM_IDS
        arm.isUnderexplored = posterior.pulls < config.minPulls

    // Sort: seeds > underexplored > by score
    sort(arms, by=[isSeed DESC, isUnderexplored DESC, score DESC])

    // Greedy selection within budget
    selected = []
    usedTokens = 0
    for arm in arms:
        if usedTokens + arm.tokenCost <= tokenBudget:
            selected.append(arm)
            usedTokens += arm.tokenCost
    return selected
```
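Rendered as runnable TypeScript, reusing `initialPrior`, `sampleBeta`, and `byPriority` from the earlier sketches. The identifiers and config fields mirror the pseudocode rather than OpenClaw's real schema:

```ts
// Runnable rendering of the pseudocode above; all identifiers are illustrative.

interface Arm {
  id: string;
  kind: ArmKind;     // from the Step 1 sketch
  tokenCost: number;
}

interface SelectConfig {
  baselineRate: number; // probability of a full-context baseline run
  minPulls: number;     // forced-exploration threshold
}

type TrackedPosterior = Posterior & { pulls: number };

const SEED_ARM_IDS = new Set<string>(); // populate with always-on arm IDs

function thompsonSelect(
  arms: Arm[],
  posteriors: Map<string, TrackedPosterior>,
  tokenBudget: number,
  config: SelectConfig,
): Arm[] {
  // Baseline check: occasionally include everything so every arm keeps
  // receiving observations (stand-in for selectAll in the pseudocode).
  if (Math.random() < config.baselineRate) return arms;

  // Score each arm with one draw from its Beta posterior.
  const scored = arms.map((arm) => {
    const p = posteriors.get(arm.id) ?? { ...initialPrior(arm.kind), pulls: 0 };
    return {
      arm,
      score: sampleBeta(p.alpha, p.beta),
      isSeed: SEED_ARM_IDS.has(arm.id),
      isUnderexplored: p.pulls < config.minPulls,
    };
  });

  // Sort: seeds > underexplored > by score (comparator from Step 3).
  scored.sort(byPriority);

  // Greedy selection within the token budget.
  const selected: Arm[] = [];
  let usedTokens = 0;
  for (const { arm } of scored) {
    if (usedTokens + arm.tokenCost <= tokenBudget) {
      selected.push(arm);
      usedTokens += arm.tokenCost;
    }
  }
  return selected;
}
```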
Convergence¶
As observations accumulate, posteriors tighten:
- High-value arms (frequently referenced) develop high means with narrow intervals — they're almost always selected
- Low-value arms (rarely referenced) develop low means — they're almost always excluded
- Marginal arms maintain wider intervals and continue to be explored occasionally
Convergence speed depends on:
| Factor | Effect |
|---|---|
| Baseline rate | Higher rate = more data for all arms = faster convergence |
| `minPulls` | Higher value = longer forced exploration before exclusion |
| Arm count | More arms = each arm gets fewer observations per run |
| Token budget | Tighter budget = more arms excluded = faster differentiation |
Typically, posteriors stabilize after 50-200 traces for a moderate-sized arm inventory (10-30 arms).
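To make the tightening concrete: a Beta posterior's standard deviation shrinks roughly as 1/sqrt(n) in the number of observations. A quick sketch with illustrative numbers, not measured OpenClaw data:

```ts
// Mean and standard deviation of a Beta(alpha, beta) posterior.
function betaStats(alpha: number, beta: number): { mean: number; sd: number } {
  const n = alpha + beta;
  const mean = alpha / n;
  const sd = Math.sqrt((alpha * beta) / (n * n * (n + 1)));
  return { mean, sd };
}

console.log(betaStats(3, 1));   // curated prior: mean 0.75, sd ≈ 0.19
console.log(betaStats(43, 11)); // +50 pulls at ~80% reference rate: mean ≈ 0.80, sd ≈ 0.05
```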
Comparison with Alternatives¶
| Feature | Thompson Sampling | Epsilon-Greedy | UCB1 |
|---|---|---|---|
| Exploration strategy | Posterior sampling | Random | Upper confidence bound |
| Prior knowledge | Yes (Beta priors) | No | No |
| Budget-aware | Yes (greedy packing) | No (random subset) | Partially |
| Convergence | Fast for informative priors | Slow (constant exploration) | Moderate |
| Implementation | Moderate | Simple | Simple |
| Theory | Bayesian | Frequentist | Frequentist |
Next Steps¶
- Reward Model — How reference detection drives rewards
- Core Concepts — Arms, posteriors, phases
- Types Reference — All type definitions