Observability and Grafana Dashboard

qortex includes an observability stack built on OpenTelemetry. Structured events drive metrics (Prometheus), traces (Tempo), and a Grafana dashboard covering the entire pipeline.

The observability layer is packaged as qortex-observe, a standalone package that can be installed independently. It provides the event system, metric definitions, trace instrumentation, and subscriber wiring.

Architecture

qortex process
  ├─ emit(Event)               ← typed frozen dataclass
  │   ├─ metrics_handlers      → OTel instruments (counters, histograms, gauges)
  │   │   ├─ OTLP push         → OTel Collector → Prometheus (remote write)
  │   │   └─ PrometheusReader  → HTTP /metrics (local scrape target, port 9464)
  │   ├─ otel_traces           → OTel spans → OTel Collector → Tempo (trace storage)
  │   ├─ structlog             → stdout / JSONL sink / VictoriaLogs
  │   ├─ jsonl                 → append-only log file
  │   └─ alerts                → threshold-based alerting
  └─ @traced decorator         → automatic parent-child span hierarchy

All 62 metrics are defined in a single declarative schema (metrics_schema.py). OTel is the sole metric backend; PrometheusMetricReader serves the /metrics endpoint for Prometheus scraping. The old prometheus.py subscriber has been removed.
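
For illustration, a declarative schema of this shape can be as simple as a list of frozen dataclasses. The MetricSpec class and field names below are hypothetical, not the actual contents of metrics_schema.py:

from dataclasses import dataclass

@dataclass(frozen=True)
class MetricSpec:
    name: str             # Prometheus-visible metric name
    kind: str             # "counter" | "histogram" | "gauge"
    event: str            # name of the event class that drives it
    labels: tuple = ()    # label keys extracted from the event

SCHEMA = [
    MetricSpec("qortex_queries_total", "counter", "QueryCompleted", ("mode",)),
    MetricSpec("qortex_query_duration_seconds", "histogram", "QueryCompleted"),
    MetricSpec("qortex_factor_mean", "gauge", "FactorDriftSnapshot"),
]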

Events are emitted at every stage of the pipeline:

  • Ingestion: manifest parsing, concept extraction
  • Vector index: add, search, seed yield
  • Retrieval: vec search, online edge generation, PPR scoring
  • Feedback: teleportation factor updates
  • Enrichment: template and LLM-backed rule enrichment
  • Learning: bandit selection, observation, posterior updates
  • Credit propagation: causal DAG traversal, alpha/beta deltas
  • Buffer promotion: online edge crystallization

A single set of event handlers in metrics_handlers.py translates events into OTel instruments.
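
A minimal sketch of that translation, using the real OpenTelemetry Python metrics API but an assumed emit/subscribe shape (qortex's actual event bus may differ):

from dataclasses import dataclass
from opentelemetry import metrics

@dataclass(frozen=True)
class QueryCompleted:          # typed frozen event, as in the architecture above
    mode: str                  # "graph" or "vec"
    latency_ms: float

meter = metrics.get_meter("qortex")
queries_total = meter.create_counter("qortex_queries_total")
query_duration = meter.create_histogram("qortex_query_duration_seconds")

def on_event(event):           # metrics_handlers-style subscriber
    if isinstance(event, QueryCompleted):
        queries_total.add(1, {"mode": event.mode})
        query_duration.record(event.latency_ms / 1000.0)

on_event(QueryCompleted(mode="graph", latency_ms=42.0))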

Quick Start

# Start the observability stack
cd docker && docker compose up -d

# Verify services
curl -s 'http://localhost:9091/api/v1/query?query=up' | python3 -m json.tool

# Open the dashboard
open http://localhost:3010/d/qortex-main/qortex-observability
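
The same check can be done programmatically. A standard-library sketch against the Prometheus HTTP API, assuming the ports above:

import json, urllib.request

url = "http://localhost:9091/api/v1/query?query=qortex_queries_total"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
print(data["status"], "-", len(data["data"]["result"]), "series")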

Service         Port        Purpose
Memgraph        7687        Graph database
Memgraph Lab    3000        Memgraph web UI
OTel Collector  4317, 4318  Receives OTLP (gRPC + HTTP)
Prometheus      9091        Metrics storage + PromQL
Grafana         3010        Dashboard visualization
Tempo           3200        Trace storage (query via Grafana Explore with TraceQL)
VictoriaLogs    9428        Log aggregation

Environment Variables

Variable                                Default  Description
QORTEX_OTEL_ENABLED                     false    Enable OTel metrics and traces export
OTEL_EXPORTER_OTLP_ENDPOINT             (unset)  OTel Collector endpoint (e.g. http://localhost:4318)
OTEL_EXPORTER_OTLP_PROTOCOL             (unset)  Protocol (http/protobuf or grpc)
QORTEX_PROMETHEUS_ENABLED               false    Enable local Prometheus HTTP server
QORTEX_PROMETHEUS_PORT                  9464     Port for the local /metrics endpoint
QORTEX_OTEL_TRACE_SAMPLE_RATE           0.1      Fraction of normal traces to export (0.0-1.0); errors and slow traces are always exported
QORTEX_OTEL_TRACE_LATENCY_THRESHOLD_MS  100.0    Spans slower than this are always exported, regardless of sample rate
QORTEX_EXTRACTION                       spacy    Concept extraction strategy: spacy (default, local NER), llm (API-based), none (disabled)
MEMGRAPH_USER                           (unset)  Memgraph Bolt auth username
MEMGRAPH_PASSWORD                       (unset)  Memgraph Bolt auth password

Dashboard Panels

The Grafana dashboard (qortex-main) is organized into sections, one per pipeline stage. Each section opens with a Mermaid flowchart (showing data flow) and a signal table (Healthy vs Investigate thresholds) before the metric panels.

Retrieval Health

These panels show the query lifecycle: from the moment adapter.retrieve() is called to when it returns results.

Query Rate (queries/sec)

  • Metric: rate(qortex_queries_total[5m])
  • Labels: mode (graph or vec)
  • Source event: QueryCompleted (emitted at the end of GraphRAGAdapter.retrieve())
  • What it tells you: How many retrieval queries are completing per second. A sudden drop means queries are failing or the system is idle. A spike indicates burst load.

Query Latency (p50/p95/p99)

  • Metric: histogram_quantile(0.50|0.95|0.99, rate(qortex_query_duration_seconds_bucket[5m]))
  • Source event: QueryCompleted (carries latency_ms)
  • What it tells you: End-to-end retrieve latency from query embedding through vec search, online edge generation, PPR, and scoring. p95 above 1s suggests a bottleneck. Check vec search and PPR panels to isolate which stage is slow.

Vec Search Latency (p50/p95)

  • Metric: histogram_quantile(0.50|0.95, rate(qortex_vec_search_duration_seconds_bucket[5m]))
  • Source event: VecSearchCompleted (emitted after the embedding + vector similarity step)
  • What it tells you: Time spent embedding the query and searching the vector index for seed candidates. If this dominates query latency, the embedding model or index size is the bottleneck, not the graph.

Query Errors

  • Metric: rate(qortex_query_errors_total[5m])
  • Labels: stage (which pipeline stage failed)
  • Source event: QueryFailed
  • Note: This event is defined but currently not emitted by any code path. The panel exists as a placeholder for future error tracking. If you see data here, something new is emitting QueryFailed.

Learning Dynamics

These panels track how the system learns from feedback. Teleportation factors bias PPR toward nodes the user finds helpful and away from unhelpful ones.

Factor Mean Over Time

  • Metric: qortex_factor_mean
  • Source event: FactorDriftSnapshot (emitted after each batch of factor updates)
  • What it tells you: The average teleportation factor across all nodes. Starts at 1.0 (uniform). Moves above 1.0 when accepted nodes accumulate boosts. A rising mean signals the system is developing preferences. A flat line at 1.0 means no feedback is flowing.

Factor Entropy

  • Metric: qortex_factor_entropy
  • Unit: bits
  • Source event: FactorDriftSnapshot
  • What it tells you: Shannon entropy of the factor distribution. High entropy = factors are spread evenly (system is uncertain). Low entropy = factors are concentrated on a few nodes (system has strong preferences). Entropy should decrease as the system receives consistent feedback.
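
For intuition, the entropy in bits can be computed over the normalized factor distribution roughly like this (a sketch; qortex's exact normalization is an assumption):

import math

def factor_entropy(factors: dict) -> float:
    total = sum(factors.values())
    probs = [v / total for v in factors.values() if v > 0]
    return -sum(p * math.log2(p) for p in probs)   # bits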

Factor Update Rate

  • Metric: rate(qortex_factor_updates_total[5m])
  • Labels: outcome (accepted, rejected, partial)
  • Source event: FactorUpdated (one per node per feedback call)
  • What it tells you: How fast teleportation factors are changing, broken down by outcome type. If rejected vastly outpaces accepted, the system is serving poor results. A balanced ratio suggests healthy learning.

Feedback Accept/Reject Ratio

  • Metric: rate(qortex_learning_observations_total{outcome=~"accepted|rejected"}[5m])
  • Source event: LearningObservationRecorded (emitted by GraphRAGAdapter.feedback())
  • What it tells you: The raw accept vs reject rate from user feedback. This is the primary signal for retrieval quality. If rejects trend upward over time, something is degrading.

KG Crystallization

These panels track how the knowledge graph evolves: online edges solidifying into persistent structure.

KG Coverage Ratio

  • Metric: qortex_kg_coverage
  • Display: Gauge, 0-100%
  • Source events: KGCoverageComputed (during retrieve) and BufferFlushed
  • What it tells you: The ratio of persistent KG edges to total edges (persistent + online) for a query's candidate set. 100% means the KG fully covers the retrieval neighborhood (no online edges needed). Low coverage means the system is filling gaps with cosine-similarity edges. Coverage should trend upward as online edges get promoted.
  • Correlation with learning: As coverage rises, PPR operates over more stable, vetted structure. Quality should improve.
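
The ratio itself is simple; a sketch of the definition above:

def kg_coverage(persistent_edges: int, online_edges: int) -> float:
    total = persistent_edges + online_edges
    return 1.0 if total == 0 else persistent_edges / total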

Buffer Size and Promotions

  • Metrics: qortex_buffer_edges (gauge: current buffer size), rate(qortex_edges_promoted_total[1h]) (promotions per hour)
  • Source events: OnlineEdgeRecorded (buffer size), EdgePromoted (promotion count)
  • What it tells you: How many candidate edges are waiting in the promotion buffer and how fast they graduate to the persistent KG. A growing buffer with zero promotions means edges aren't being observed often enough (or the promotion threshold is too high). Steady promotion rate = healthy crystallization.
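
A sketch of the crystallization loop these panels observe (the threshold value and data structure are illustrative assumptions, not qortex's implementation):

from collections import Counter

PROMOTION_THRESHOLD = 3            # observations before an edge graduates

buffer = Counter()                 # qortex_buffer_edges ~ len(buffer)

def observe_online_edge(src: str, dst: str) -> None:
    buffer[(src, dst)] += 1

def flush() -> list:
    promoted = [e for e, n in buffer.items() if n >= PROMOTION_THRESHOLD]
    for edge in promoted:          # qortex_edges_promoted_total += len(promoted)
        del buffer[edge]
    return promoted                # written to the persistent KG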

Total Promoted (lifetime)

  • Metric: qortex_edges_promoted_total
  • Display: Stat panel (single number)
  • What it tells you: Lifetime count of online edges that met the promotion threshold and were written to the persistent KG.

PPR Performance

These panels show the graph algorithm that powers retrieval scoring.

PPR Executions / sec

  • Metric: rate(qortex_ppr_started_total[5m])
  • Source event: PPRStarted
  • What it tells you: How often Personalized PageRank runs. Should track query rate 1:1 since every retrieve() call triggers one PPR run.

PPR Iterations to Convergence

  • Metric: rate(qortex_ppr_iterations_bucket[5m])
  • Display: Histogram
  • Source events: PPRConverged, PPRDiverged
  • What it tells you: How many power iteration steps PPR needs to converge. Fewer iterations = faster convergence. If iterations cluster near max_iterations (100), PPR is not converging; the graph may be too dense or disconnected. Typical healthy range: 20-50 iterations.
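
For reference, the power iteration these panels measure looks roughly like this (a dense-matrix sketch for intuition; qortex's Memgraph-backed implementation differs, and the teleport vector here is uniform over seeds rather than factor-weighted):

import numpy as np

def ppr(A, seeds, alpha=0.85, tol=1e-6, max_iter=100):
    """A: adjacency matrix (n x n); seeds: indices of vec-search seed nodes."""
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    P = np.divide(A, deg, out=np.zeros_like(A, dtype=float), where=deg > 0)
    t = np.zeros(n)
    t[seeds] = 1.0 / len(seeds)            # teleport mass on seed nodes
    r = t.copy()
    for i in range(1, max_iter + 1):
        r_next = alpha * (P.T @ r) + (1 - alpha) * t
        diff = np.abs(r_next - r).sum()
        r = r_next
        if diff < tol:
            return r, i, True              # scores, iterations, converged
    return r, max_iter, False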

Active Factors and Node Count

  • Metric: qortex_factors_active
  • Source event: FactorDriftSnapshot
  • What it tells you: How many nodes have non-default teleportation factors. If this equals your total node count, every node has received feedback at some point.

Online Edge Generation

Online Edge Count and Generation Rate

  • Metrics: qortex_online_edge_count (gauge: edges per last query), rate(qortex_online_edges_generated_total[5m]) (events/sec)
  • Source event: OnlineEdgesGenerated
  • What it tells you: How many cosine-similarity edges are generated per query to fill KG gaps. High counts mean the KG is sparse for those queries. As KG coverage improves (via edge promotion), this should trend downward.
  • Correlation with KG coverage: online edge count should inversely correlate with the KG coverage gauge. If both are rising, online edges are being generated but not promoted. Check the promotion threshold or buffer flush frequency.

KG Coverage Over Time

  • Metric: qortex_kg_coverage
  • Display: Time series, 0-100%
  • What it tells you: Same metric as the gauge above, but as a time series to show the trend. An upward slope means the KG is maturing.

Enrichment and Ingestion

Enrichment Rate

  • Metric: rate(qortex_enrichment_total[5m])
  • Labels: backend_type (e.g. template, AnthropicEnrichmentBackend)
  • Source event: EnrichmentCompleted
  • What it tells you: How often rules are enriched (context, antipatterns, rationale added). The enrichment pipeline is separate from retrieve: it runs during ingestion or on-demand.

Enrichment Latency (p50/p95)

  • Metric: histogram_quantile(0.50|0.95, rate(qortex_enrichment_duration_seconds_bucket[5m]))
  • Source event: EnrichmentCompleted
  • What it tells you: Time to enrich a batch of rules. Template enrichment is sub-millisecond. LLM-backed enrichment can be seconds. Watch for p95 spikes indicating API timeouts.

Enrichment Fallbacks

  • Metric: rate(qortex_enrichment_fallbacks_total[5m])
  • Source event: EnrichmentFallback
  • What it tells you: How often the enrichment backend fails and falls back to template-based enrichment. Spikes here indicate LLM API issues.

Ingestion Rate

  • Metric: rate(qortex_messages_ingested_total[5m])
  • Labels: role (user, assistant)
  • Source event: MessageIngested
  • What it tells you: How often messages are ingested, broken down by role. Each message triggers concept extraction and graph indexing.

Ingest Latency (p50/p95)

  • Metric: histogram_quantile(0.50|0.95, rate(qortex_message_ingest_duration_seconds_bucket[5m]))
  • Source event: MessageIngested
  • What it tells you: Time to ingest a message (includes extraction, embedding, and graph ops). Latency scales with message length and chunk count.

Vector Index

These panels provide visibility into the vec layer: the index that stores embeddings and serves as the seed source for graph retrieval. Previously the vec layer was a black box; now you can see how it behaves.

Vec Index Size

  • Metric: qortex_vec_add_total
  • Display: Stat panel (single number)
  • Source event: VecIndexUpdated (emitted from NumpyVectorIndex.add(), SqliteVecIndex.add(), PgVectorIndex.add())
  • What it tells you: Total embedding vectors stored. Each extracted concept becomes one vector. Growth indicates new concepts being ingested.

Vec Add Rate

  • Metric: rate(qortex_vec_add_total[5m])
  • Labels: index_type (numpy or sqlite)
  • Source event: VecIndexUpdated
  • What it tells you: How often vectors are being added to the index. Spikes during ingestion, flat during query-only workloads. If you ingest but this stays at zero, embeddings aren't reaching the vec index.

Vec Add Latency (p50/p95)

  • Metric: histogram_quantile(0.50|0.95, rate(qortex_vec_add_duration_seconds_bucket[5m]))
  • Source event: VecIndexUpdated
  • What it tells you: Time to add a batch of vectors. NumpyVectorIndex is sub-millisecond for small batches. SqliteVecIndex involves disk I/O. If add latency spikes during ingestion, the index is becoming a bottleneck.

Vec Search Top Score

  • Metric: qortex_vec_search_top_score
  • Display: Time series, range 0-1
  • Source event: VecSearchResults (emitted from the index .search() method)
  • What it tells you: The highest cosine similarity score from the last vector search. High scores (> 0.8) mean the index contains strong matches for the query. Consistently low scores (< 0.3) mean the embedding space doesn't represent the queries well; consider a different embedding model, or the index may be under-populated.
  • Correlation with graph learning: When top scores are high, the seeds fed into PPR are strong, leading to better graph traversal. Low top scores produce weak seeds and noisy PPR results.

Vec Search Score Spread

  • Metric: qortex_vec_search_score_spread
  • Source event: VecSearchResults
  • What it tells you: The difference between the top and bottom cosine sim scores in a single search. A wide spread (> 0.3) means the index is clearly distinguishing relevant from irrelevant vectors, indicating good signal quality. A narrow spread (< 0.05) means results are clustered close together, making it hard for PPR to differentiate.
  • Correlation with graph learning: High spread = clear ranking signal for PPR seeds. Low spread = PPR is working with near-uniform weights, reducing its ability to focus activation on the most relevant subgraph.

Vec Seed Yield

  • Metric: qortex_vec_seed_yield
  • Display: Gauge, 0-100%
  • Source event: VecSeedYield (emitted from GraphRAGAdapter.retrieve() after domain filtering)
  • What it tells you: The ratio of vec search results that survive domain filtering to become PPR seeds. A yield of 100% means every vec match was in the requested domain. Low yield (< 50%) means the vec index returns many cross-domain results that get discarded. The domain structure may need attention, or the embedding model doesn't capture domain boundaries well.
  • Correlation with graph learning: Low yield wastes compute (vec search finds candidates that are immediately discarded). If yield drops over time, domain-specific re-indexing may help.

Vec Search Candidates Distribution

  • Metric: rate(qortex_vec_search_candidates_sum[5m]) / rate(qortex_vec_search_candidates_count[5m])
  • Source event: VecSearchResults
  • What it tells you: Average number of candidates returned per vec search. Should be close to fetch_k (typically top_k * 3). If consistently lower, the index is small or the threshold is filtering aggressively.

Learning & Bandits

These panels track the Thompson Sampling bandit that learns which retrieval strategies work. Each candidate action (arm) is modeled as a Beta distribution, updated by feedback outcomes.
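
A compact sketch of the mechanism (Beta-posterior arms with Thompson selection; the class shape is illustrative, not qortex's learner API):

import random

class BetaArm:
    def __init__(self):
        self.alpha, self.beta = 1.0, 1.0        # uniform prior

    def sample(self) -> float:                  # Thompson draw
        return random.betavariate(self.alpha, self.beta)

    def observe(self, accepted: bool) -> None:  # feedback outcome
        if accepted:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:                    # qortex_learning_posterior_mean
        return self.alpha / (self.alpha + self.beta)

def select(arms: dict) -> str:                  # arm_id with the highest draw wins
    return max(arms, key=lambda arm_id: arms[arm_id].sample())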

Selection Rate

  • Metric: rate(qortex_learning_selections_total[5m])
  • Labels: learner, baseline (true = forced exploration, false = Thompson Sampling pick)
  • Source event: LearningSelectionMade
  • What it tells you: How often the bandit selects arms. The baseline=true line represents forced random exploration (default 10%). As posteriors separate, the baseline=false line should dominate. If it stays flat, the system hasn't learned enough to exploit.

Observation Rate

  • Metric: rate(qortex_learning_observations_total[5m])
  • Labels: learner, outcome (accepted, rejected, partial)
  • Source event: LearningObservationRecorded
  • What it tells you: Rate of reward observations by outcome. In a converging system, accepted should trend upward. Persistent rejected majority means the arm pool is bad or the signal is noisy.

Posterior Mean (top 10 arms)

  • Metric: topk(10, qortex_learning_posterior_mean)
  • Labels: learner, arm_id
  • Source event: LearningPosteriorUpdated
  • What it tells you: The posterior mean α / (α + β) for each arm. This IS the learning. Mean near 1.0 = confident success. 0.5 = uncertain. 0.0 = confident failure. A clear winner pulling away from the pack indicates convergence. All arms clustered at 0.5 means insufficient data.

Token Budget Usage

  • Metric: histogram_quantile(0.50|0.95, rate(qortex_learning_token_budget_used_bucket[5m]))
  • Source event: LearningSelectionMade
  • What it tells you: How much of the token budget each selection consumes. Empty if no token_budget constraint is configured. If p95 consistently hits the budget cap, arms are too expensive.

Concept Extraction

These panels track the extraction pipeline that converts raw text chunks into named concepts with typed relationships.

Extractions Total / Concepts Extracted / Relations Extracted

  • Metrics: qortex_messages_ingested_total (extractions), qortex_vec_add_total (concepts), qortex_online_edges_generated_total (relations)
  • Display: Stat panels (lifetime counts)
  • Source events: MessageIngested, VecIndexUpdated, OnlineEdgesGenerated
  • What it tells you: How much extraction has happened since startup. Each ingested message triggers one extraction run; each concept becomes a vector; relations map to online edges.

Concepts per Chunk (average)

  • Metric: sum(qortex_vec_add_total) / sum(qortex_messages_ingested_total)
  • Source events: VecIndexUpdated, MessageIngested
  • What it tells you: Average vectors (concepts) produced per ingested message. A ratio of 2-5 is typical for spaCy extraction.

Extraction Latency per chunk (p50/p95/p99)

  • Metric: histogram_quantile(0.50|0.95|0.99, rate(qortex_message_ingest_duration_seconds_bucket[5m]))
  • Source event: MessageIngested
  • What it tells you: Per-message ingest time (includes extraction, embedding, and graph ops). spaCy extraction is typically sub-50ms. Watch p99 for outliers.

Pipeline Latency (p50/p95)

  • Metric: histogram_quantile(0.50|0.95, rate(qortex_message_ingest_duration_seconds_bucket[5m]))
  • Source event: MessageIngested
  • What it tells you: Total ingest pipeline time per message batch. Scales with message length and chunk count.

Concepts by Strategy & Domain

  • Metric: sum by (index_type) (rate(qortex_vec_add_total[5m]))
  • Source event: VecIndexUpdated
  • What it tells you: Vector addition rate broken down by index backend (pgvector, sqlite-vec, etc.).

Credit Propagation

These panels track causal credit assignment: feedback propagating backward through the causal DAG to update ancestor concept posteriors. Requires QORTEX_CREDIT_PROPAGATION=on.
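
A sketch of what backward propagation over the DAG can look like (the decay factor and traversal are illustrative assumptions, not qortex's implementation):

def propagate_credit(parents: dict, concept: str, accepted: bool, decay=0.5):
    """parents: mapping of concept id -> list of parent (ancestor) ids."""
    deltas = {}                                  # concept -> (alpha_d, beta_d)
    frontier = [(concept, 1.0)]
    while frontier:
        node, weight = frontier.pop()
        if node in deltas:
            continue                             # credit each concept once
        deltas[node] = (weight, 0.0) if accepted else (0.0, weight)
        for parent in parents.get(node, []):
            frontier.append((parent, weight * decay))
    return deltas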

Credit Propagation Rate

  • Metric: rate(qortex_credit_propagations_total[5m])
  • Labels: learner
  • Source event: CreditPropagated
  • What it tells you: Propagations per second through the causal DAG. Should match feedback rate. Zero while feedback flows means the feature flag is off or the DAG is empty.

Concepts per Propagation (p50/p95)

  • Metric: histogram_quantile(0.50|0.95, rate(qortex_credit_concepts_per_propagation_bucket[5m]))
  • Source event: CreditPropagated
  • What it tells you: How many concepts receive credit per event (direct + ancestors). p50 of 3-5 is typical for a well-connected DAG. p50 of 1 means no ancestor credit is flowing (disconnected DAG).

Total Credit Propagations

  • Metric: qortex_credit_propagations_total
  • Display: Stat panel (single number)
  • What it tells you: Cumulative propagation count since process start. Should be monotonically increasing. Stuck at 0 means the feature is not active.

Credit Alpha vs Beta Deltas

  • Metric: qortex_credit_alpha_delta_total, qortex_credit_beta_delta_total
  • Source event: CreditPropagated
  • What it tells you: Cumulative success (alpha) vs failure (beta) signal from credit propagation. Alpha ahead = net positive signal from users. Beta dominating = users are rejecting results and that negative signal is propagating to ancestor concepts.

PostgreSQL Stores (v0.8.0+)

When QORTEX_STORE=postgres, additional metrics are emitted for the PostgreSQL-backed stores.

Pool Utilization

  • Metric: qortex_pool_size (gauge), qortex_pool_free (gauge), qortex_pool_used (gauge)
  • What it tells you: Current state of the shared asyncpg connection pool. If pool_free drops to 0 consistently, increase the pool size via DATABASE_POOL_MAX (default 10).
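
A sketch of creating a pool with that cap (standard asyncpg API; the DATABASE_URL variable name is an assumption):

import asyncio, os
import asyncpg

async def main():
    pool = await asyncpg.create_pool(
        dsn=os.environ["DATABASE_URL"],                    # assumed env var
        max_size=int(os.getenv("DATABASE_POOL_MAX", "10")),
    )
    print(pool.get_size(), pool.get_idle_size())           # size / free gauges
    await pool.close()

asyncio.run(main())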

PgVector Operations

  • Metrics: rate(qortex_pgvec_add_total[5m]), histogram_quantile(0.95, rate(qortex_pgvec_add_duration_seconds_bucket[5m]))
  • What it tells you: pgvector insert rate and latency. Compare with the sqlite vec metrics to assess performance differences.

Migration Progress

  • Metric: qortex_migration_vectors_total (counter), qortex_migration_duration_seconds (histogram)
  • What it tells you: Progress of qortex migrate vec operations. The counter increments per batch of migrated vectors.

REST API

  • Metrics: rate(qortex_http_requests_total[5m]), histogram_quantile(0.95, rate(qortex_http_request_duration_seconds_bucket[5m]))
  • Labels: method, path, status
  • What it tells you: HTTP request rate, latency, and error rate for the REST API server.

Complete Metric Reference

Metric                                  Type       Event                              Labels
qortex_queries_total                    Counter    QueryCompleted                     mode
qortex_query_duration_seconds           Histogram  QueryCompleted
qortex_vec_search_duration_seconds      Histogram  VecSearchCompleted
qortex_query_errors_total               Counter    QueryFailed                        stage
qortex_factor_mean                      Gauge      FactorDriftSnapshot
qortex_factor_entropy                   Gauge      FactorDriftSnapshot
qortex_factors_active                   Gauge      FactorDriftSnapshot
qortex_factor_updates_total             Counter    FactorUpdated                      outcome
qortex_learning_observations_total      Counter    LearningObservationRecorded        learner, outcome
qortex_kg_coverage                      Gauge      KGCoverageComputed, BufferFlushed
qortex_buffer_edges                     Gauge      OnlineEdgeRecorded
qortex_edges_promoted_total             Counter    EdgePromoted
qortex_ppr_started_total                Counter    PPRStarted
qortex_ppr_iterations                   Histogram  PPRConverged, PPRDiverged
qortex_online_edges_generated_total     Counter    OnlineEdgesGenerated
qortex_online_edge_count                Gauge      OnlineEdgesGenerated
qortex_enrichment_total                 Counter    EnrichmentCompleted                backend_type
qortex_enrichment_duration_seconds      Histogram  EnrichmentCompleted
qortex_enrichment_fallbacks_total       Counter    EnrichmentFallback
qortex_messages_ingested_total          Counter    MessageIngested                    role
qortex_message_ingest_duration_seconds  Histogram  MessageIngested
qortex_vec_add_total                    Counter    VecIndexUpdated                    index_type
qortex_vec_add_duration_seconds         Histogram  VecIndexUpdated
qortex_vec_search_candidates            Histogram  VecSearchResults
qortex_vec_search_top_score             Gauge      VecSearchResults
qortex_vec_search_score_spread          Gauge      VecSearchResults
qortex_vec_seed_yield                   Gauge      VecSeedYield
qortex_learning_selections_total        Counter    LearningSelectionMade              learner, baseline
qortex_learning_posterior_mean          Gauge      LearningPosteriorUpdated           learner, arm_id
qortex_learning_token_budget_used       Histogram  LearningSelectionMade
qortex_credit_propagations_total        Counter    CreditPropagated                   learner
qortex_credit_concepts_per_propagation  Histogram  CreditPropagated
qortex_credit_alpha_delta_total         Counter    CreditPropagated
qortex_credit_beta_delta_total          Counter    CreditPropagated

Distributed Tracing

qortex uses the @traced decorator from qortex.observe.tracing to create OpenTelemetry spans with automatic parent-child hierarchy. When OTel is enabled, every operation produces a trace tree visible in Grafana via the Tempo datasource.
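
The parent-child nesting falls out of OTel context propagation; a minimal stand-in for the decorator (a sketch, not qortex's implementation):

import functools
from opentelemetry import trace

tracer = trace.get_tracer("qortex")

def traced(name):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            with tracer.start_as_current_span(name):   # nests under active span
                return fn(*args, **kwargs)
        return inner
    return wrap

@traced("memgraph.add_node")
def add_node(node): ...

@traced("memgraph.ingest_manifest")
def ingest_manifest(nodes):
    for node in nodes:
        add_node(node)          # each call becomes a child span automatically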

Span Hierarchy

A typical ingest_manifest call produces a trace like:

memgraph.ingest_manifest (domain=python, nodes=12, edges=8, rules=5)
  ├─ memgraph.create_domain
  │   └─ cypher.execute (CREATE (:Domain ...))
  ├─ memgraph.add_node (x12)
  │   └─ cypher.execute (MERGE (n:Concept ...))
  ├─ memgraph.add_edge (x8)
  │   └─ cypher.execute (MATCH ... CREATE (a)-[r]->(...))
  └─ memgraph.add_rule (x5)
      └─ cypher.execute (MERGE (r:Rule ...))

A personalized_pagerank call shows convergence attributes:

memgraph.personalized_pagerank
  ├─ cypher.execute (MATCH (n:Concept) ...)   ← fetch nodes
  ├─ cypher.execute (MATCH ()-[r]->() ...)    ← fetch edges
  └─ [span attributes]
      ppr.node_count=45, ppr.edge_count=32, ppr.seed_count=3
      ppr.iterations=77, ppr.final_diff=9.5e-7, ppr.converged=true
      ppr.nonzero_scores=12, ppr.latency_ms=4.2

Instrumented Operations

All major subsystems are traced:

Subsystem       Span Name                                Attributes
Memgraph        cypher.execute                           db.statement, db.system
                memgraph.create_domain
                memgraph.add_node
                memgraph.add_edge
                memgraph.add_rule
                memgraph.ingest_manifest                 ingest.domain, ingest.node_count, ingest.edge_count, ingest.rule_count
                memgraph.personalized_pagerank           ppr.node_count, ppr.edge_count, ppr.iterations, ppr.converged, ppr.latency_ms
                memgraph.get_node, get_edges, get_rules
                memgraph.query_cypher, vector_search
                memgraph.add_embedding, get_embedding
Online Index    online_index.pipeline
                online_index.chunk
                online_index.embed
                online_index.vec_add
                online_index.add_chunk_node
                online_index.extract_chunk
                online_index.add_concept_nodes
                online_index.add_relation_edges
                online_index.co_occurrence_edges
Extraction      extraction.spacy
                extraction.spacy.nlp_process
                extraction.spacy.extract_entities
                extraction.spacy.extract_noun_chunks
                extraction.spacy.deduplicate
                extraction.spacy.infer_relations
                extraction.llm
                extraction.llm.extract_concepts
                extraction.llm.extract_relations
Vec Embeddings  vec.embed.sentence_transformer           embed.model, embed.batch_size, embed.backend
                vec.embed.openai                         embed.model, embed.batch_size
                vec.embed.ollama                         embed.model, embed.batch_size
                vec.embed.cached                         embed.cache_hits, embed.cache_misses, embed.batch_size
Vec Index       vec.add, vec.search, vec.remove
Learning        learning.select
                learning.observe
                learning.apply_credit_deltas
Embedding model spans are marked external=True, meaning they represent I/O boundaries (network calls to OpenAI, Ollama, or GPU inference for sentence-transformers).

Selective Sampling

By default, only 10% of normal traces are exported. The SelectiveSpanProcessor always exports:

  • Spans with error status (regardless of sample rate)
  • Spans slower than the latency threshold (default 100ms)

Adjust with QORTEX_OTEL_TRACE_SAMPLE_RATE and QORTEX_OTEL_TRACE_LATENCY_THRESHOLD_MS.
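
The export decision reduces to something like this (a sketch of the policy described above, not the SelectiveSpanProcessor source):

import random

def should_export(is_error: bool, duration_ms: float,
                  sample_rate: float = 0.1,
                  latency_threshold_ms: float = 100.0) -> bool:
    if is_error:
        return True                       # errors always exported
    if duration_ms >= latency_threshold_ms:
        return True                       # slow spans always exported
    return random.random() < sample_rate  # otherwise sample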

Viewing Traces in Grafana (Tempo)

# Ensure the stack is running
cd docker && docker compose up -d qortex

# Open Grafana Explore with Tempo datasource
open http://localhost:3010/explore

# Select the "Tempo" datasource, search for service "qortex"

Traces show the full call hierarchy: an ingest_manifest trace includes every add_node, add_edge, and underlying cypher.execute as child spans. Click any span to see its attributes (PPR convergence stats, embedding batch sizes, cache hit rates, etc.).

You can use TraceQL queries for advanced filtering, e.g. { resource.service.name = "qortex" && span.http.status_code >= 400 }.

Testing the Dashboard

The full-pipeline E2E test exercises every code path and verifies every metric:

QORTEX_GRAPH=memgraph MEMGRAPH_USER=memgraph MEMGRAPH_PASSWORD=memgraph \
QORTEX_OTEL_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf \
QORTEX_PROMETHEUS_ENABLED=true QORTEX_PROMETHEUS_PORT=9464 \
uv run pytest tests/test_full_pipeline_e2e.py -v -s

This test ingests a knowledge graph, runs 25 retrieval queries, submits feedback, triggers edge promotion, and runs enrichment, then asserts every metric is present in Prometheus and queryable through Grafana.