Monitoring and Logging Utilities

Overview

The IUDX-Novo platform provides monitoring and logging utilities that cover the AI Sandbox and associated platform components, giving operators and developers observability across the system.

The observability stack supports both:

  • Developer-defined application logs

  • Runtime system and service-level metrics

Together, these capabilities enable effective monitoring, debugging, performance analysis, and operational governance.

Monitoring Capabilities

The platform uses Prometheus-based monitoring to collect metrics at multiple layers of the system.

Infrastructure Metrics

At the infrastructure layer, Prometheus collects virtual machine and node-level metrics, including:

  • CPU utilization

  • Memory usage

  • Network throughput and latency

  • Other system-level resource statistics

These metrics provide visibility into infrastructure health and capacity.
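
As an illustration of how such infrastructure metrics might be read back, the sketch below builds a request URL for the Prometheus HTTP API's instant-query endpoint (/api/v1/query). It assumes node-level metrics come from node_exporter (which exposes node_cpu_seconds_total); the metric names actually used by the platform are not stated here and may differ. The host name and the function name build_instant_query_url are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: node-level metrics are exported by node_exporter, which exposes
# node_cpu_seconds_total. The platform's actual metric names may differ.
# The expression gives the non-idle CPU percentage per instance over 5 minutes.
CPU_UTILIZATION_PROMQL = (
    "100 * (1 - avg by (instance) "
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])))'
)


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus address; replace with the platform's real endpoint.
url = build_instant_query_url("http://prometheus.example:9090", CPU_UTILIZATION_PROMQL)
print(url)
```

The resulting URL can be fetched with any HTTP client; Prometheus returns the query result as JSON.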

Service and Application Metrics

Prometheus also collects service-level and application metrics, such as:

  • API call counts and request latencies

  • Event bus and messaging metrics

  • JVM performance metrics (where applicable)

  • Other operational indicators exposed by platform services

These metrics enable fine-grained monitoring of platform behavior and performance.
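
To make the idea of service-level metrics concrete, the following minimal sketch tracks API call counts and cumulative request latency in memory and renders them in the Prometheus text exposition format, which is what a Prometheus server reads when it scrapes a service. The class ApiMetrics, the metric names, and the endpoint label are illustrative assumptions, not names defined by the platform; a real service would more likely use an official Prometheus client library.

```python
from collections import defaultdict


class ApiMetrics:
    """Tracks per-endpoint call counts and total latency in memory."""

    def __init__(self) -> None:
        self.calls = defaultdict(int)
        self.latency_sum = defaultdict(float)

    def observe(self, endpoint: str, seconds: float) -> None:
        """Record one API call and how long it took."""
        self.calls[endpoint] += 1
        self.latency_sum[endpoint] += seconds

    def render(self) -> str:
        """Render the metrics in the Prometheus text exposition format.

        Assumes endpoint names contain no quotes, backslashes, or newlines,
        so label values need no escaping.
        """
        lines = ["# TYPE api_requests_total counter"]
        for endpoint in sorted(self.calls):
            lines.append(f'api_requests_total{{endpoint="{endpoint}"}} {self.calls[endpoint]}')
        lines.append("# TYPE api_request_duration_seconds summary")
        for endpoint in sorted(self.calls):
            labels = f'{{endpoint="{endpoint}"}}'
            lines.append(f"api_request_duration_seconds_sum{labels} {self.latency_sum[endpoint]}")
            lines.append(f"api_request_duration_seconds_count{labels} {self.calls[endpoint]}")
        return "\n".join(lines) + "\n"


metrics = ApiMetrics()
metrics.observe("/items", 0.25)
metrics.observe("/items", 0.5)
print(metrics.render())
```

Dividing the latency sum by the call count gives the mean request latency per endpoint, which is one common way such counters are turned into a dashboard value.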

Logging Capabilities

Centralized Log Aggregation

The platform uses Loki-based log aggregation to collect and manage logs generated by platform services and user workloads.

Capabilities include:

  • Collection of developer-defined log messages

  • Centralized storage and indexing of logs

  • Support for runtime debugging and validation

  • Efficient diagnosis of operational issues and failures

Logs can be correlated with metrics to provide deeper insight into system behavior.
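
A minimal sketch of a developer-defined log message follows. It uses Python's standard logging module to write one JSON object per line to standard output. It assumes that a log collector forwards workload output to Loki; how logs actually reach Loki on the platform is not described here. The request_id field and the logger name are hypothetical; a shared identifier or timestamp of this kind is one way log lines could be lined up with metrics.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # request_id is a hypothetical field supplied through `extra=`.
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            entry["request_id"] = request_id
        return json.dumps(entry)


def make_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to standard output."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = make_logger("sandbox.job")
log.info("model training started", extra={"request_id": "req-123"})
```

Structured lines like these keep fields such as level and request_id machine-readable, so they can be filtered at query time rather than parsed from free text.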

Observability Outcomes

Together, the monitoring and logging utilities provide:

  • End-to-end visibility into system behavior and performance

  • Early detection of failures and anomalies

  • Support for debugging, auditing, and root-cause analysis

  • Improved operational reliability and platform health

Role in the Platform

Within the platform, the monitoring and logging utilities play the following role:

  • Enables proactive system monitoring

  • Supports operational excellence and SRE practices

  • Complements security, auditing, and compliance mechanisms

