# Observability & Monitoring
Comprehensive end-to-end observability stack providing distributed tracing, metrics collection, and real-time monitoring for the AI Goal-Seeking System.
## Overview
The observability stack includes:
- Distributed Tracing - OpenTelemetry instrumentation with Jaeger backend
- Metrics Collection - Prometheus for time-series metrics and alerting
- Dashboards - Grafana for visualization and monitoring
- Structured Logging - Centralized logging with correlation IDs
- AI Quality Monitoring - Response validation metrics and quality scoring
## Architecture

### OpenTelemetry Pipeline
```mermaid
graph TB
    %% Define classes for consistent styling
    classDef service fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef external fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef data fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
    classDef otel fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px

    %% Application services (instrumented)
    Frontend[React Native Frontend]:::service
    API[Express API Gateway]:::service
    Socket[WebSocket Handler]:::service
    Queue[Message Queue]:::service
    Agents[AI Agent System]:::service
    RAG[RAG System]:::service
    Validator[Response Validator]:::service

    %% OpenTelemetry components
    SDK[OpenTelemetry SDK]:::otel
    Collector[OTEL Collector]:::otel

    %% Storage backends
    Jaeger[(Jaeger)]:::data
    Prometheus[(Prometheus)]:::data

    %% Visualization
    Grafana[Grafana Dashboards]:::external

    %% External integrations
    AlertManager[Alert Manager]:::external

    %% Instrumentation flow
    Frontend --> SDK
    API --> SDK
    Socket --> SDK
    Queue --> SDK
    Agents --> SDK
    RAG --> SDK
    Validator --> SDK

    %% Collection and processing
    SDK --> Collector

    %% Data export
    Collector --> Jaeger
    Collector --> Prometheus

    %% Visualization and alerting
    Jaeger --> Grafana
    Prometheus --> Grafana
    Prometheus --> AlertManager
```
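Each Node.js service joins this pipeline by starting the OpenTelemetry SDK before the rest of the application loads. The snippet below is a minimal bootstrap sketch, assuming the standard `@opentelemetry/sdk-node` packages and a collector reachable at `otel-collector:4317`; the service name and endpoint are placeholders rather than values from the repository.

```typescript
// tracing.ts - illustrative OpenTelemetry bootstrap, loaded before the app starts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  serviceName: 'api-gateway', // placeholder service name
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4317' }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://otel-collector:4317' }),
    exportIntervalMillis: 15_000,
  }),
  // Auto-instruments Express, HTTP, Redis, etc. at service boundaries.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush remaining spans and metrics on shutdown so the last requests are not lost.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```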
### Telemetry Data Flow
```mermaid
sequenceDiagram
    participant A as Application
    participant S as OTEL SDK
    participant C as OTEL Collector
    participant J as Jaeger
    participant P as Prometheus
    participant G as Grafana

    %% Trace creation
    A->>S: Create span (HTTP request)
    S->>S: Add attributes & events
    A->>S: Create child span (DB query)
    S->>S: Record metrics
    A->>S: Finish spans

    %% Data collection
    S->>C: Export traces & metrics

    %% Data storage
    C->>J: Store trace data
    C->>P: Store metric data

    %% Visualization
    G->>J: Query traces
    G->>P: Query metrics
    G-->>A: Display dashboards
```
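In application code, the span creation steps in this sequence look roughly like the sketch below. It is a hedged example built on the public `@opentelemetry/api` tracer; the span names, attributes, and the `lookupConversation` helper are illustrative, not taken from the codebase.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('chat-service'); // instrumentation scope name is an example

export async function handleChat(userId: string, text: string): Promise<string> {
  // Root span for the inbound HTTP request.
  return tracer.startActiveSpan('POST /api/chat', async (root) => {
    root.setAttribute('user_id', userId);
    root.addEvent('message.received', { length: text.length });
    try {
      // Child span; startActiveSpan parents it to the currently active span.
      const reply = await tracer.startActiveSpan('db.query', async (child) => {
        const rows = await lookupConversation(userId); // hypothetical data-access helper
        child.setAttribute('db.rows_returned', rows.length);
        child.end();
        return rows[0] ?? 'hello';
      });
      return reply;
    } catch (err) {
      root.recordException(err as Error);
      root.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      root.end(); // exported to the collector on the next batch
    }
  });
}

// Stubbed so the example is self-contained.
async function lookupConversation(_userId: string): Promise<string[]> {
  return ['previous message'];
}
```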
### Metrics Collection Architecture
```mermaid
graph LR
    %% Define classes
    classDef service fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef metrics fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
    classDef external fill:#fff3e0,stroke:#e65100,stroke-width:2px

    %% Metric sources
    API[API Endpoints]:::service
    Queue[Message Queue]:::service
    Agents[AI Agents]:::service
    Validator[Response Validator]:::service

    %% Metric types
    Counters[Counters<br/>• Request count<br/>• Error count<br/>• Message count]:::metrics
    Histograms[Histograms<br/>• Response time<br/>• Queue latency<br/>• Validation scores]:::metrics
    Gauges[Gauges<br/>• Active connections<br/>• Queue size<br/>• Memory usage]:::metrics

    %% Collection and storage
    Prometheus[(Prometheus)]:::metrics
    Grafana[Grafana<br/>Dashboards]:::external

    %% Flow
    API --> Counters
    API --> Histograms
    Queue --> Counters
    Queue --> Gauges
    Agents --> Histograms
    Validator --> Counters
    Validator --> Histograms
    Counters --> Prometheus
    Histograms --> Prometheus
    Gauges --> Prometheus
    Prometheus --> Grafana
```
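The three instrument types map directly onto the OpenTelemetry metrics API. A minimal sketch, assuming instruments are created through `@opentelemetry/api`; the instrument names mirror the metric catalogue below, while the label values and the `getQueueDepth` helper are examples.

```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('queue-service'); // scope name is an example

// Counter: monotonically increasing totals, labelled by queue and priority.
const messagesProcessed = meter.createCounter('queue_messages_total', {
  description: 'Messages processed by queue and priority',
});

// Histogram: distributions such as processing time or validation scores.
const processingSeconds = meter.createHistogram('queue_processing_duration_seconds', {
  description: 'Message processing time',
  unit: 's',
});

// Observable gauge: point-in-time values sampled at each export, e.g. queue depth.
const queueDepth = meter.createObservableGauge('queue_size_current', {
  description: 'Current queue depth',
});
queueDepth.addCallback((observer) => observer.observe(getQueueDepth()));

export function onMessageProcessed(queue: string, priority: number, seconds: number): void {
  messagesProcessed.add(1, { queue, priority: String(priority) });
  processingSeconds.record(seconds, { queue });
}

// Hypothetical accessor for the in-memory queue length.
function getQueueDepth(): number {
  return 0;
}
```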
## Key Metrics

### Business Metrics

#### AI Agent Performance

- `agent_requests_total` - Total requests by agent type
- `agent_response_duration_seconds` - Response time distribution
- `agent_errors_total` - Error count by agent and error type
- `agent_tokens_used_total` - Token consumption by agent
#### Validation Quality

- `validation_score_histogram` - Quality score distribution
- `validation_issues_total` - Issues by type and severity
- `validation_pass_rate` - Percentage of responses passing validation
#### Queue Performance

- `queue_messages_total` - Messages processed by queue and priority
- `queue_processing_duration_seconds` - Message processing time
- `queue_size_current` - Current queue depth
- `queue_dead_letters_total` - Failed messages by queue
### System Metrics

#### HTTP Performance

- `http_requests_total` - Request count by method, route, and status
- `http_request_duration_seconds` - Request latency distribution
- `http_concurrent_connections` - Active WebSocket connections
#### Resource Usage

- `process_cpu_seconds_total` - CPU usage
- `process_resident_memory_bytes` - Memory consumption
- `nodejs_heap_size_used_bytes` - Node.js heap usage
## Grafana Dashboards

### AI Validation Overview Dashboard

**Location:** `grafana/dashboards/ai-validation-overview.json`

**Key Panels:**

- Total validations processed
- Success rate by agent type
- Average quality scores
- Issues breakdown by severity
### AI Validation Quality Dashboard

**Location:** `grafana/dashboards/ai-validation-quality.json`

**Key Panels:**

- Quality metrics by agent (readability, accuracy, coherence)
- Response length distribution
- Issue trends over time
- Proactive vs. regular validation rates
### System Performance Dashboard

**Key Panels:**

- Request rate and latency
- Queue processing metrics
- Error rates and types
- Resource utilization
## Distributed Tracing

### Trace Structure
```mermaid
graph TB
    %% Span hierarchy
    Root[HTTP Request Span<br/>POST /api/chat]

    %% Service spans
    Queue[Queue Enqueue Span]
    Classify[Message Classify Span]
    Agent[Agent Process Span]
    RAG[RAG Search Span]
    OpenAI[OpenAI API Span]
    Validate[Validation Span]
    Response[HTTP Response Span]

    %% Span relationships
    Root --> Queue
    Root --> Classify
    Root --> Agent
    Agent --> RAG
    Agent --> OpenAI
    Agent --> Validate
    Root --> Response

    %% Attributes shown
    Root -.-> |"user_id: user123<br/>conversation_id: conv456<br/>http.method: POST"| Root
    Agent -.-> |"agent_type: joke<br/>message_priority: 7"| Agent
    OpenAI -.-> |"model: gpt-3.5-turbo<br/>tokens_used: 45"| OpenAI
    Validate -.-> |"quality_score: 0.85<br/>issues_found: 0"| Validate
```
### Trace Correlation

All traces include correlation IDs:

- `user_id` - User identifier
- `conversation_id` - Conversation context
- `message_id` - Unique message identifier
- `agent_type` - AI agent handling the request
- `session_id` - User session context
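A common way to attach these IDs is an Express middleware that decorates the active span on every request. The sketch below assumes the identifiers arrive as request headers; the header names (`x-user-id`, `x-conversation-id`, `x-session-id`) are assumptions, not a documented contract.

```typescript
import { trace } from '@opentelemetry/api';
import type { NextFunction, Request, Response } from 'express';

// Copies correlation identifiers from request headers onto the active HTTP span
// so every child span in the trace can be filtered by user or conversation.
export function correlationAttributes(req: Request, _res: Response, next: NextFunction): void {
  const span = trace.getActiveSpan();
  if (span) {
    const header = (name: string) => String(req.headers[name] ?? 'unknown');
    span.setAttribute('user_id', header('x-user-id'));                 // assumed header
    span.setAttribute('conversation_id', header('x-conversation-id')); // assumed header
    span.setAttribute('session_id', header('x-session-id'));           // assumed header
  }
  next();
}

// Usage: app.use(correlationAttributes) after the OpenTelemetry SDK has started.
```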
## Configuration

### OpenTelemetry Collector

**Location:** `otel-collector-config.yaml`

**Key Components:**

- Receivers: OTLP (gRPC and HTTP)
- Processors: Batch processing, resource detection
- Exporters: Jaeger (traces), Prometheus (metrics)
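For orientation, a collector configuration with those components typically looks like the sketch below. It assumes the Jaeger and Prometheus exporters shipped with the collector-contrib image; endpoints are placeholders, and this is not a copy of the repository's `otel-collector-config.yaml`.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  resourcedetection:
    detectors: [env, system]

exporters:
  jaeger:
    endpoint: jaeger:14250          # assumed Jaeger gRPC collector endpoint
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889          # scrape target exposed for Prometheus

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [resourcedetection, batch]
      exporters: [prometheus]
```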
### Grafana Provisioning

**Datasource Configuration:** `grafana/provisioning/datasources/prometheus.yml`

**Dashboard Configuration:** `grafana/provisioning/dashboards/dashboards.yml`
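The datasource file follows Grafana's standard provisioning schema. A minimal sketch; the Prometheus URL assumes the Docker service name used in the local stack below.

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```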
## Development Setup

### Local Docker Stack
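A local observability stack can be brought up with Docker Compose. The sketch below is illustrative only: image tags, ports, and file paths are assumptions rather than the project's actual compose file.

```yaml
# docker-compose.observability.yml (illustrative)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    ports:
      - "3000:3000"
```

Start it with `docker compose -f docker-compose.observability.yml up -d` and point each service's OTLP exporter at `localhost:4317`.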
### Environment Variables
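Services are typically configured through the standard OpenTelemetry SDK environment variables. The values below are examples that match the local Docker stack above, not settings taken from the repository.

```bash
# Standard OpenTelemetry SDK variables (example values for local development)
OTEL_SERVICE_NAME=api-gateway
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=local
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=1.0   # sample everything locally; lower the ratio in production
```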
## Monitoring Playbooks

### High Response Latency

**Detection:** Response time P95 > 2 seconds

**Investigation:**

1. Check Grafana dashboards for latency spikes
2. Query Jaeger for slow traces
3. Identify bottleneck services (OpenAI API, queue processing)
4. Review error logs for correlation

**Mitigation:**

- Scale queue processors
- Implement circuit breakers
- Add response caching
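The detection threshold above can be encoded as a Prometheus alerting rule routed through Alertmanager. A sketch, assuming the `http_request_duration_seconds` histogram listed under Key Metrics:

```yaml
groups:
  - name: latency
    rules:
      - alert: HighResponseLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 HTTP latency above 2s for the last 5 minutes"
```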
### Validation Quality Issues

**Detection:** Validation pass rate < 90%

**Investigation:**

1. Review the validation dashboard for failing agents
2. Check issue types and severity distribution
3. Analyze recent changes to agent prompts
4. Review the dead letter queue for failed validations

**Mitigation:**

- Adjust validation thresholds
- Improve agent prompts
- Add specialized validation rules
### Queue Backlog

**Detection:** Queue depth > 100 messages

**Investigation:**

1. Check queue processing rates
2. Identify slow message types
3. Review processor error rates
4. Check Redis/memory usage

**Mitigation:**

- Scale message processors
- Increase queue workers
- Optimize slow handlers
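The other two playbooks can be detected the same way. A hedged sketch; whether `validation_pass_rate` is recorded as a ratio (0-1) or a percentage depends on the instrumentation, so the first threshold may need adjusting.

```yaml
groups:
  - name: quality-and-queues
    rules:
      - alert: LowValidationPassRate
        expr: validation_pass_rate < 0.90   # use `< 90` if the metric is a percentage
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Validation pass rate below 90%"

      - alert: QueueBacklog
        expr: queue_size_current > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue depth above 100 messages"
```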
## Best Practices

### Instrumentation
- Trace Everything: Instrument all service boundaries
- Structured Attributes: Use consistent attribute naming
- Correlation Context: Propagate user and conversation IDs
- Error Enrichment: Add detailed error context to spans
### Metrics Design
- Business Metrics First: Focus on user-facing metrics
- Consistent Labels: Use standardized label names
- Appropriate Cardinality: Avoid high-cardinality labels
- Dashboard Hierarchy: Start with high-level, drill down to details
### Performance
- Sampling Strategy: Use head-based sampling in production
- Batch Processing: Configure appropriate batch sizes
- Resource Limits: Set memory and CPU limits for collectors
- Data Retention: Configure appropriate retention policies
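For head-based sampling, the Node SDK can be given a parent-based ratio sampler. A minimal sketch; the 10% ratio is an example, not a project recommendation.

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

// Decide at the root span whether to keep a trace (head-based sampling), and let
// child spans inherit the parent's decision so traces are never partially sampled.
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1), // keep ~10% of traces
});

const sdk = new NodeSDK({
  sampler,
  // ...trace and metric exporters as in the bootstrap example above
});

sdk.start();
```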
## Troubleshooting

### Common Issues
**Missing Traces:**

- Verify the OTEL collector endpoint
- Check service instrumentation
- Review sampling configuration
- Validate network connectivity

**High Metrics Cardinality:**

- Review label usage
- Implement label filtering
- Use metric aggregation
- Monitor Prometheus memory usage

**Dashboard Loading Issues:**

- Check Grafana datasource configuration
- Verify Prometheus query syntax
- Review dashboard JSON for errors
- Check network connectivity