Analytics Layer – Architectural Overview
Overview
This section provides a high-level architectural overview of the Analytics Layer and its placement within the overall IUDX-Novo system stack.
The Analytics Layer functions as the batch processing and analytical execution environment for the AI Sandbox, enabling scalable data processing, experimentation, and model execution.

Figure 4: Batch Processing Framework – Architectural Overview ⚠️ Diagram to be updated / added for IUDX-Novo
Integration with the System Stack
The Analytics Layer connects to multiple parts of the IUDX ecosystem:
Data Lake – Stores diverse data types, including:
Spatio-temporal data
IoT data streams
Images
Structured and unstructured datasets
Data Exchange Layer – Enables analytical access to datasets registered on the exchange, subject to authorization and consent.
A dedicated data ingestion framework is responsible for moving data into the analytical compute environment. This framework supports both streaming and batch ingestion, using systems such as Apache Flink where applicable.
To optimize analytical workloads, the platform maintains an intermediate high-performance analytics cache. The ingestion framework rapidly onboards data into this cache to support efficient, repeatable analytical workflows.
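As a minimal sketch of the batch-ingestion path, the snippet below reads a source extract and writes it as columnar Parquet into an S3-compatible analytics cache. The source URL, bucket, endpoint, and column names are hypothetical; it assumes pandas with s3fs installed.

```python
import pandas as pd

# Hypothetical source extract and cache location; the real ingestion
# framework would resolve these from the Data Exchange Layer.
SOURCE_CSV = "https://example.org/exports/air-quality.csv"
CACHE_URI = "s3://analytics-cache/air-quality/2024-01.parquet"

# Credentials and endpoint for the S3-compatible cache (e.g. MinIO).
storage_options = {
    "key": "ANALYTICS_KEY",
    "secret": "ANALYTICS_SECRET",
    "client_kwargs": {"endpoint_url": "http://minio.analytics.svc:9000"},
}

# Batch ingestion: read the source and persist it to the cache as Parquet
# so downstream jobs can re-read it efficiently and repeatably.
df = pd.read_csv(SOURCE_CSV, parse_dates=["observation_time"])
df.to_parquet(CACHE_URI, index=False, storage_options=storage_options)
```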
Detailed Analytical Layer Architecture

Figure 5: Batch Processing Framework / Analytics Layer – Detailed Overview ⚠️ Diagram to be updated / added for IUDX-Novo
The Analytics Layer is deployed on Kubernetes, with all components designed following cloud-native principles.
The core components of the Analytics Layer are described below.
Core Components
Process Repository – Docker Registry
The Process Repository stores reusable analytical processes, packaged as containers.
Examples include:
Regression and classification models
U-Net-based segmentation models
Other containerized analytical workflows
The repository is implemented as a Docker Registry, enabling versioned, portable, and reproducible analytics execution.
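As an illustrative sketch (the registry address, image name, and version tag are hypothetical), an analytical process can be built and published to such a registry with the Docker SDK for Python:

```python
import docker

# Connect to the local Docker daemon.
client = docker.from_env()

# Build the analytical process (e.g. a segmentation model) into an image.
image, _ = client.images.build(path=".", tag="unet-segmentation:1.2.0")

# Tag it for the platform's private registry (hypothetical address) and push,
# making the process versioned and reusable across workflows.
registry_repo = "registry.analytics.example/processes/unet-segmentation"
image.tag(registry_repo, tag="1.2.0")
client.images.push(registry_repo, tag="1.2.0")
```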
Workflow Orchestrator – Argo
Argo serves as the workflow orchestration and execution engine.
Capabilities include:
Definition and execution of:
ETL pipelines
DAG-based workflows
Step-based workflows
Cron-scheduled jobs
Compute resource management
Job batching and scheduling
APIs and user interfaces for:
Job submission
Status queries
Execution monitoring
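Because Argo represents workflows as Kubernetes custom resources, jobs can be submitted programmatically through the Kubernetes API. The sketch below is a hedged example; the namespace, image, and workflow names are hypothetical.

```python
from kubernetes import client, config

# Load local kubeconfig credentials (use load_incluster_config() in-cluster).
config.load_kube_config()

# A minimal single-step Argo Workflow; the container image would normally
# come from the Process Repository described above.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "etl-example-"},
    "spec": {
        "entrypoint": "main",
        "templates": [{
            "name": "main",
            "container": {
                "image": "registry.analytics.example/processes/etl:1.0.0",
                "command": ["python", "run_etl.py"],
            },
        }],
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="analytics",
    plural="workflows",
    body=workflow,
)
```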
Executor Cluster – Dask / Torch / Ray
Compute-intensive operations are executed on a stateless executor cluster that is provisioned dynamically.
Depending on workload type, the platform supports:
Dask for distributed DataFrame and parallel data processing
Torch for deep learning workloads
Ray for distributed execution and scalable model serving
This approach enables elastic scaling and efficient utilization of compute resources.
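For example, a Dask-based job might connect to a dynamically provisioned scheduler and run a parallel DataFrame computation against the analytics cache. The scheduler address, bucket, and column names below are hypothetical.

```python
import dask.dataframe as dd
from dask.distributed import Client

# Connect to the executor cluster's Dask scheduler (hypothetical address);
# workers are provisioned on demand and hold no state between jobs.
client = Client("tcp://dask-scheduler.analytics.svc:8786")

# Read partitioned Parquet from the S3-compatible analytics cache.
df = dd.read_parquet(
    "s3://analytics-cache/air-quality/*.parquet",
    storage_options={
        "key": "ANALYTICS_KEY",
        "secret": "ANALYTICS_SECRET",
        "client_kwargs": {"endpoint_url": "http://minio.analytics.svc:9000"},
    },
)

# Aggregate in parallel across the cluster.
station_mean = df.groupby("station_id")["pm25"].mean().compute()
print(station_mean.head())
```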
Intermediate Artefact Cache – MinIO
MinIO is used as the intermediate object storage layer and provides an S3-compatible interface.
Responsibilities include:
Storage of intermediate artefacts generated by ETL workflows
Durable storage that enables reliable, fail-safe execution of analytics jobs
Concurrent access to intermediate datasets
Storage and serving of AI models used by the platform
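A minimal sketch of artefact storage and retrieval using the MinIO Python client follows; the endpoint, credentials, buckets, and object paths are hypothetical.

```python
from minio import Minio

# Connect to the platform's MinIO deployment (hypothetical endpoint).
mc = Minio(
    "minio.analytics.svc:9000",
    access_key="ANALYTICS_KEY",
    secret_key="ANALYTICS_SECRET",
    secure=False,
)

# Persist an intermediate artefact produced by an ETL step...
mc.fput_object("artefacts", "etl/run-42/features.parquet", "features.parquet")

# ...and later fetch a trained model served from the same store.
mc.fget_object("models", "unet-segmentation/v1/weights.pt", "weights.pt")
```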
OLAP Storage – ClickHouse
For structured datasets, ClickHouse serves as the OLAP engine, supporting high-performance analytical queries at scale.
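A hedged example of an OLAP-style aggregation using the clickhouse-driver package (the host, table, and columns are hypothetical):

```python
from clickhouse_driver import Client

# Connect to the ClickHouse OLAP service (hypothetical host).
ch = Client(host="clickhouse.analytics.svc")

# Typical analytical aggregation over a large structured dataset.
rows = ch.execute(
    """
    SELECT station_id, toDate(observation_time) AS day, avg(pm25) AS avg_pm25
    FROM air_quality
    GROUP BY station_id, day
    ORDER BY day
    """
)
```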
For unstructured data stored in MinIO, efficient data container formats are used, including:
HDF5
Zarr
NetCDF
These formats enable parallel, chunk-based processing and indexed data access.
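As an illustration of chunk-based access (the store path and array shape are hypothetical), a Zarr array held in MinIO can be opened and read one region at a time, pulling only the chunks that overlap the request rather than the whole dataset:

```python
import s3fs
import zarr

# S3-compatible filesystem pointing at MinIO (hypothetical endpoint).
fs = s3fs.S3FileSystem(
    key="ANALYTICS_KEY",
    secret="ANALYTICS_SECRET",
    client_kwargs={"endpoint_url": "http://minio.analytics.svc:9000"},
)

# Open a chunked Zarr array stored as an object hierarchy in MinIO.
store = s3fs.S3Map(root="unstructured/satellite-scene.zarr", s3=fs)
z = zarr.open(store, mode="r")

# Indexed access fetches only the chunks covering the requested region.
tile = z[0:512, 0:512]
```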
Process Management Interface – Kubeflow
Kubeflow acts as the process coordination and management interface.
It serves as the primary control plane from which:
Analytical workflows (for example, Argo workflows) are triggered
Long-running experiments are managed
AI and ML workloads are coordinated
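A hedged sketch of triggering a workflow from this control plane using the Kubeflow Pipelines SDK (v2-style API; the endpoint and pipeline logic are hypothetical stand-ins):

```python
import kfp
from kfp import dsl

@dsl.component
def run_quality_report() -> str:
    # Placeholder step; a real component would invoke a containerised
    # process from the Process Repository.
    return "report-complete"

@dsl.pipeline(name="quality-report-pipeline")
def quality_pipeline():
    run_quality_report()

# Submit the pipeline to the Kubeflow endpoint (hypothetical host).
client = kfp.Client(host="http://kubeflow.analytics.example")
client.create_run_from_pipeline_func(quality_pipeline, arguments={})
```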
Optional Management API Server – GoLang
In scenarios requiring additional custom management utilities—such as authentication middleware or platform-specific controls—GoLang-based API servers may be deployed as optional components.
LLM Hosting Capabilities
The Analytics Layer supports hosting and execution of Large Language Models (LLMs) using frameworks such as:
LangChain
Ollama
Related LLM toolchains
This enables both experimentation and deployment of generative AI workloads within the sandbox.
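As a minimal sketch, a model hosted by an Ollama server inside the sandbox could be queried over Ollama's REST API; the host and model name below are hypothetical.

```python
import requests

# Ollama's generate endpoint on a sandbox-hosted server (hypothetical host).
resp = requests.post(
    "http://ollama.analytics.svc:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarise yesterday's air-quality anomalies in one line.",
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```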
Deployment Modes and Use Cases
The Analytics Layer may also be deployed as a standalone system for routine or scheduled analytical operations, such as:
Generation of periodic data quality reports
Hosting inference servers for continuous or scheduled inference
Running data preprocessing and data enhancement workflows
However, its primary purpose is to serve as the execution environment for the IUDX AI Sandbox, enabling users to run analytical workloads, experiments, and models in a controlled, scalable, and secure compute environment.
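For the standalone inference-serving use case above, a hedged sketch with Ray Serve follows; the model logic is a stand-in, and a real deployment would load weights from MinIO.

```python
from ray import serve
from starlette.requests import Request

@serve.deployment
class InferenceServer:
    def __init__(self):
        # Stand-in for loading model weights from the artefact store.
        self.scale = 0.5

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # Stand-in "model": scale the input feature.
        return {"score": payload["feature"] * self.scale}

# Deploy behind Ray Serve's HTTP proxy for continuous inference.
serve.run(InferenceServer.bind())
```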