Learning System

qortex includes a Thompson Sampling bandit system for online learning. It powers adaptive decisions — selecting prompts, ranking strategies, or choosing retrieval modes — and updates beliefs from feedback.

Architecture

┌──────────────────────────────────────────────────────────┐
│                     Learner                               │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────┐ │
│  │   Strategy   │  │ Store (SQLite│  │  Event Emitter  │ │
│  │  (Thompson)  │  │  or Postgres)│  │  (metrics/OTel) │ │
│  └──────┬──────┘  └──────┬───────┘  └────────┬────────┘ │
│         │                │                    │          │
│    select(arms)    get/put state        emit events      │
└─────────┼────────────────┼────────────────────┼──────────┘
          │                │                    │
          ▼                ▼                    ▼
     Beta(α, β)      persistence          Prometheus
     sampling        (per arm)            + OTel spans

Each Learner manages one decision problem. It has:

A Strategy (Thompson Sampling) that samples from Beta posteriors
A Store that persists arm states (SQLite or PostgreSQL)
An Event Emitter that fires metrics and trace events

Concepts

Term	Meaning
Learner	Named decision-maker (e.g., `"prompt-optimizer"`)
Arm	One option to choose from (e.g., `"prompt:chain-of-thought"`)
Context	Situational hash that groups arms (enables contextual bandits)
Selection	Choosing k arms via Thompson Sampling
Observation	Recording an outcome (`"accepted"` or `"rejected"`) to update beliefs
Posterior	Beta(α, β) distribution representing current belief about an arm's reward rate
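To make the Context row concrete, here is a minimal sketch of how a situational hash could keep separate posteriors for the same arm. The `context_hash` helper, its SHA-256-over-JSON scheme, and the in-memory `states` dictionary are assumptions made for illustration; the source does not say how qortex derives the hash.

```python
import hashlib
import json


def context_hash(context: dict) -> str:
    """Hypothetical helper: stable hash of a context dict (not qortex's actual scheme)."""
    # Sorting keys makes the hash independent of dict insertion order.
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Arm states keyed by (context hash, arm id): the same arm keeps a separate
# Beta(alpha, beta) posterior in each context.
states: dict[tuple[str, str], dict[str, float]] = {}


def get_state(context: dict, arm_id: str) -> dict[str, float]:
    """Return the posterior for an arm in a context, starting from Beta(1, 1)."""
    key = (context_hash(context), arm_id)
    return states.setdefault(key, {"alpha": 1.0, "beta": 1.0})


coding = get_state({"task": "coding"}, "prompt:chain-of-thought")
coding["alpha"] += 1  # one "accepted" outcome in the coding context
chat = get_state({"task": "chat"}, "prompt:chain-of-thought")
print(coding, chat)
```

Because the hash is part of the key, feedback gathered in one context does not shift the posterior used in another.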

Quick Start

Python API

import asyncio

from qortex.learning import Learner


async def main() -> None:
    # Create a learner with the default Thompson Sampling strategy
    learner = await Learner.create("prompt-optimizer")

    # Define the candidate arms
    arms = [
        {"id": "prompt:basic", "token_cost": 100},
        {"id": "prompt:chain-of-thought", "token_cost": 200},
        {"id": "prompt:few-shot", "token_cost": 300},
    ]

    # Sample one arm via Thompson Sampling, within the token budget
    selected = await learner.select(arms, k=1, token_budget=500)
    chosen = selected[0]
    print(f"Selected: {chosen.id}")

    # After observing the outcome, report it for the arm that was selected
    await learner.observe(chosen.id, outcome="accepted")


# The API is async, so run it inside an event loop
asyncio.run(main())

REST API

# Select an arm
curl -X POST http://localhost:8400/v1/learning/select \
  -H "Content-Type: application/json" \
  -d '{
    "learner": "prompt-optimizer",
    "candidates": [
      {"id": "prompt:basic", "token_cost": 100},
      {"id": "prompt:chain-of-thought", "token_cost": 200}
    ],
    "k": 1,
    "token_budget": 500
  }'

# Record outcome
curl -X POST http://localhost:8400/v1/learning/observe \
  -H "Content-Type: application/json" \
  -d '{"learner": "prompt-optimizer", "arm_id": "prompt:chain-of-thought", "outcome": "accepted"}'

# View posteriors
curl http://localhost:8400/v1/learning/prompt-optimizer/posteriors

# View metrics (convergence, selection rates)
curl http://localhost:8400/v1/learning/prompt-optimizer/metrics
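The same select and observe calls can be issued from Python with only the standard library. This is a sketch that assumes the server address from the curl examples above; the `build_post` and `send` helpers are hypothetical names, and nothing is assumed about the shape of the JSON responses.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8400/v1/learning"  # host and port from the curl examples


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Hypothetical helper: build a JSON POST request for a learning endpoint."""
    return urllib.request.Request(
        f"{BASE_URL}/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


select_req = build_post("select", {
    "learner": "prompt-optimizer",
    "candidates": [
        {"id": "prompt:basic", "token_cost": 100},
        {"id": "prompt:chain-of-thought", "token_cost": 200},
    ],
    "k": 1,
    "token_budget": 500,
})

observe_req = build_post("observe", {
    "learner": "prompt-optimizer",
    "arm_id": "prompt:chain-of-thought",
    "outcome": "accepted",
})


def send(req: urllib.request.Request) -> dict:
    """Send a request and decode the JSON body (requires a running qortex server)."""
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


# With the server running:
# print(send(select_req))
# print(send(observe_req))
```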

MCP Tool

Use qortex_learning_select to choose between prompt strategies.
Use qortex_learning_observe to record the outcome.

Storage Backends

SQLite (default)

# No configuration needed — uses ~/.qortex/learning/<learner>.db
qortex serve

Each learner gets its own SQLite database file. This backend suits local development and single-process deployments.

PostgreSQL

QORTEX_STORE=postgres \
PGVECTOR_HOST=localhost \
qortex serve

All learners share the `learning_arm_states` table and are distinguished by the `learner_name` column. Use this backend for multi-pod deployments, where arm state must be shared across processes.

Schema:

CREATE TABLE learning_arm_states (
    learner_name TEXT NOT NULL,
    context_hash TEXT NOT NULL,
    arm_id       TEXT NOT NULL,
    alpha        DOUBLE PRECISION NOT NULL DEFAULT 1.0,
    beta         DOUBLE PRECISION NOT NULL DEFAULT 1.0,
    pulls        INTEGER NOT NULL DEFAULT 0,
    total_reward DOUBLE PRECISION NOT NULL DEFAULT 0.0,
    last_updated TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (learner_name, context_hash, arm_id)
);
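To show how a single observation could map onto a row of this table, here is a sketch against an in-memory SQLite database. The column types are adapted for SQLite (`REAL` instead of `DOUBLE PRECISION`, a text timestamp), and the `record_observation` function, its two-statement upsert, and the choice to increment `pulls` and `total_reward` on each observation are assumptions, not qortex's actual queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE learning_arm_states (
        learner_name TEXT NOT NULL,
        context_hash TEXT NOT NULL,
        arm_id       TEXT NOT NULL,
        alpha        REAL NOT NULL DEFAULT 1.0,
        beta         REAL NOT NULL DEFAULT 1.0,
        pulls        INTEGER NOT NULL DEFAULT 0,
        total_reward REAL NOT NULL DEFAULT 0.0,
        last_updated TEXT DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (learner_name, context_hash, arm_id)
    )
""")


def record_observation(learner: str, ctx: str, arm_id: str, accepted: bool) -> None:
    """Hypothetical update: create the row at the Beta(1, 1) prior if missing, then apply the outcome."""
    key = (learner, ctx, arm_id)
    reward = 1.0 if accepted else 0.0
    # The column defaults give a new arm the uniform prior.
    conn.execute(
        "INSERT OR IGNORE INTO learning_arm_states "
        "(learner_name, context_hash, arm_id) VALUES (?, ?, ?)",
        key,
    )
    # accepted adds 1 to alpha; rejected adds 1 to beta.
    conn.execute(
        "UPDATE learning_arm_states "
        "SET alpha = alpha + ?, beta = beta + ?, pulls = pulls + 1, "
        "    total_reward = total_reward + ?, last_updated = CURRENT_TIMESTAMP "
        "WHERE learner_name = ? AND context_hash = ? AND arm_id = ?",
        (reward, 1.0 - reward, reward, *key),
    )
    conn.commit()


for outcome in (True, True, False):
    record_observation("prompt-optimizer", "default", "prompt:basic", outcome)

row = conn.execute(
    "SELECT alpha, beta, pulls, total_reward FROM learning_arm_states"
).fetchone()
print(row)
```

The composite primary key is what lets every learner and context share one table while keeping each arm's posterior separate.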

Thompson Sampling

The default strategy uses Beta-Bernoulli Thompson Sampling:

Prior: Each arm starts with Beta(1, 1) — uniform, no preference
Selection: Sample θ ~ Beta(α, β) for each arm, pick the highest
Update: On "accepted" → α += 1; on "rejected" → β += 1
Token budget: Arms exceeding the budget are filtered before sampling

This naturally balances exploration (trying uncertain arms) and exploitation (favoring arms with high observed reward rates). Arms with fewer observations have wider posteriors, so they occasionally "win" the sample even against arms with higher means — ensuring they get tried.
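The four steps above can be sketched in a few lines of standalone Python. This is an illustration, not qortex's implementation: `thompson_select` and `update` are hypothetical names, posteriors live in a plain dictionary, and the budget check is assumed to apply to each arm's `token_cost` individually.

```python
import random
from typing import Optional


def thompson_select(arms, posteriors, k=1, token_budget=None,
                    rng: Optional[random.Random] = None):
    """Sketch of Beta-Bernoulli Thompson Sampling: return the ids of the k top-sampled arms."""
    rng = rng or random.Random()
    # Token budget: drop arms whose cost exceeds the budget before sampling.
    eligible = [
        arm for arm in arms
        if token_budget is None or arm.get("token_cost", 0) <= token_budget
    ]
    # Selection: draw theta ~ Beta(alpha, beta) per arm; unseen arms use the Beta(1, 1) prior.
    scored = []
    for arm in eligible:
        alpha, beta = posteriors.get(arm["id"], (1.0, 1.0))
        scored.append((rng.betavariate(alpha, beta), arm["id"]))
    scored.sort(reverse=True)
    return [arm_id for _, arm_id in scored[:k]]


def update(posteriors, arm_id, outcome):
    """Update: "accepted" adds 1 to alpha, "rejected" adds 1 to beta."""
    alpha, beta = posteriors.get(arm_id, (1.0, 1.0))
    if outcome == "accepted":
        alpha += 1
    elif outcome == "rejected":
        beta += 1
    else:
        raise ValueError(f"unknown outcome: {outcome!r}")
    posteriors[arm_id] = (alpha, beta)


# Simulated feedback loop with made-up acceptance rates for two arms.
rng = random.Random(0)
true_rates = {"prompt:basic": 0.2, "prompt:chain-of-thought": 0.8}
arms = [
    {"id": "prompt:basic", "token_cost": 100},
    {"id": "prompt:chain-of-thought", "token_cost": 200},
]
posteriors = {}
for _ in range(500):
    (chosen,) = thompson_select(arms, posteriors, k=1, token_budget=500, rng=rng)
    outcome = "accepted" if rng.random() < true_rates[chosen] else "rejected"
    update(posteriors, chosen, outcome)

print(posteriors)
```

In the simulation the arm with the higher true acceptance rate accumulates most of the pulls, while the weaker arm is still sampled occasionally early on, when its posterior is wide.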

Convergence

As a rough guide, after ~50 observations per arm the posteriors tighten enough that the system exploits the best arm most of the time. Arms with similar reward rates take longer to separate. You can monitor convergence via:

curl http://localhost:8400/v1/learning/prompt-optimizer/metrics

The response includes selection rates, reward rates, and posterior statistics per arm.
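To get a feel for how much a posterior tightens, the sketch below compares the Beta(1, 1) prior with posteriors after 50 observations. The acceptance counts (35 of 50 and 15 of 50) are made-up illustrative numbers, and `beta_summary` and `win_probability` are hypothetical helpers, not part of qortex.

```python
import math
import random


def beta_summary(alpha: float, beta: float) -> tuple[float, float]:
    """Return the mean and standard deviation of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    variance = (alpha * beta) / (total ** 2 * (total + 1))
    return mean, math.sqrt(variance)


def win_probability(a, b, draws: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(theta_a > theta_b) for two Beta posteriors."""
    rng = random.Random(seed)
    wins = sum(rng.betavariate(*a) > rng.betavariate(*b) for _ in range(draws))
    return wins / draws


prior = (1.0, 1.0)
strong = (36.0, 16.0)  # prior plus 35 accepted, 15 rejected
weak = (16.0, 36.0)    # prior plus 15 accepted, 35 rejected

for name, params in [("prior", prior), ("strong", strong), ("weak", weak)]:
    mean, std = beta_summary(*params)
    print(f"{name}: mean={mean:.3f} std={std:.3f}")

print(f"P(strong beats weak) = {win_probability(strong, weak):.3f}")
```

The standard deviation falls to a fraction of the prior's after 50 observations, which is why a clearly better arm wins nearly every sample at that point. When two arms have close reward rates, their posteriors overlap and the win probability stays nearer 0.5 for longer.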

Observability

With `QORTEX_PROMETHEUS_ENABLED=true`, the learning system emits:

Metric	Type	Description
`qortex_learning_selections_total`	counter	Total arm selections
`qortex_learning_observations_total`	counter	Total observations recorded
`qortex_learning_selection_duration_seconds`	histogram	Selection latency
`qortex_learning_observation_duration_seconds`	histogram	Observation latency
`qortex_learning_reward_rate`	gauge	Current reward rate per arm
`qortex_learning_posterior_alpha`	gauge	Alpha parameter per arm
`qortex_learning_posterior_beta`	gauge	Beta parameter per arm

All operations are traced via OpenTelemetry (`learning.select`, `learning.observe`, `learning.pg.get`, `learning.pg.put`).

Next Steps

REST API — full HTTP endpoint reference
PostgreSQL Setup — configure postgres backends
Docker Infrastructure — run the full stack