Analytics Layer – Architectural Overview

Overview

This section provides a high-level architectural overview of the Analytics Layer and its placement within the overall IUDX-Novo system stack.

The Analytics Layer functions as the batch processing and analytical execution environment for the AI Sandbox, enabling scalable data processing, experimentation, and model execution.

Figure 4: Batch Processing Framework – Architectural Overview ⚠️ Diagram to be updated / added for IUDX-Novo

Integration with the System Stack

The Analytics Layer connects to multiple parts of the IUDX ecosystem:

  • Data Lake – Stores diverse data types, including:

    • Spatio-temporal data

    • IoT data streams

    • Images

    • Structured and unstructured datasets

  • Data Exchange Layer – Enables analytical access to datasets registered on the exchange, subject to authorization and consent.

A dedicated data ingestion framework is responsible for moving data into the analytical compute environment. This framework supports both streaming and batch ingestion, using systems such as Apache Flink where applicable.

To optimize analytical workloads, an intermediate high-performance analytics cache is used. The ingestion framework enables rapid onboarding of data into this cache to support efficient, repeatable analytical workflows.
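
As a concrete illustration, the sketch below onboards a batch export into the MinIO-backed analytics cache as Parquet using pandas (with s3fs and pyarrow available); the endpoint, credentials, bucket, and column names are placeholders rather than fixed platform values.

```python
import pandas as pd

# Hypothetical connection details for the MinIO-backed analytics cache;
# substitute the deployment's actual endpoint and credentials.
CACHE_STORAGE_OPTIONS = {
    "key": "ANALYTICS_CACHE_KEY",
    "secret": "ANALYTICS_CACHE_SECRET",
    "client_kwargs": {"endpoint_url": "http://minio.analytics.svc:9000"},
}

def onboard_batch(source_csv: str, cache_path: str) -> None:
    """Read a raw batch export and land it in the cache as Parquet,
    a columnar format suited to repeatable analytical reads."""
    # The "timestamp" column is an illustrative assumption.
    df = pd.read_csv(source_csv, parse_dates=["timestamp"])
    df.to_parquet(cache_path, storage_options=CACHE_STORAGE_OPTIONS)

onboard_batch("sensor_export.csv", "s3://analytics-cache/iot/sensor_export.parquet")
```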

Detailed Analytical Layer Architecture

Figure 5: Batch Processing Framework / Analytics Layer – Detailed Overview ⚠️ Diagram to be updated / added for IUDX-Novo

The Analytics Layer is deployed on Kubernetes, with all components designed following cloud-native principles.

The core components of the Analytics Layer are described below.

Core Components

Process Repository – Docker Registry

The Process Repository stores reusable analytical processes, packaged as containers.

Examples include:

  • Regression and classification models

  • UNET-based segmentation models

  • Other containerized analytical workflows

The repository is implemented as a Docker Registry, enabling versioned, portable, and reproducible analytics execution.
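
As a hedged sketch of how a process might be published, the snippet below builds and pushes a versioned image with the `docker` Python SDK; the registry address and image name are illustrative, not the platform's actual endpoints.

```python
import docker

# Hypothetical Process Repository address and image name.
REGISTRY = "registry.analytics.example.org"
IMAGE = f"{REGISTRY}/processes/unet-segmentation:1.2.0"

client = docker.from_env()

# Build the analytical process from a local Dockerfile and tag it
# with an explicit version so runs remain reproducible.
image, _build_logs = client.images.build(path=".", tag=IMAGE)

# Push the versioned image to the registry.
for line in client.images.push(IMAGE, stream=True, decode=True):
    print(line)
```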

Workflow Orchestrator – Argo

Argo serves as the workflow orchestration and execution engine.

Capabilities include:

  • Definition and execution of:

    • ETL pipelines

    • DAG-based workflows

    • Step-based workflows

    • Cron-scheduled jobs

  • Compute resource management

  • Job batching and scheduling

  • APIs and user interfaces for:

    • Job submission

    • Status queries

    • Execution monitoring
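
Because Argo Workflows are Kubernetes custom resources (`argoproj.io/v1alpha1`, kind `Workflow`), a job can also be submitted programmatically with the official `kubernetes` Python client. The sketch below defines a minimal two-step DAG; the namespace, image, and step contents are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

# A minimal two-step DAG; image and namespace are illustrative.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "etl-example-"},
    "spec": {
        "entrypoint": "main",
        "templates": [
            {
                "name": "main",
                "dag": {"tasks": [
                    {"name": "extract", "template": "step"},
                    {"name": "transform", "template": "step",
                     "dependencies": ["extract"]},
                ]},
            },
            {
                "name": "step",
                "container": {"image": "alpine:3.20",
                              "command": ["echo", "running step"]},
            },
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="analytics", plural="workflows", body=workflow,
)
```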

Executor Cluster – Dask / Torch / Ray

The compute-intensive stages of analytical workloads are executed on a stateless executor cluster that is provisioned dynamically.

Depending on workload type, the platform supports:

  • Dask for distributed DataFrame and parallel data processing

  • Torch for deep learning workloads

  • Ray for distributed execution and scalable model serving

This approach enables elastic scaling and efficient utilization of compute resources.
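
As an illustration of the Dask path, the sketch below connects to a scheduler and computes a grouped aggregate over partitioned Parquet in the cache; the scheduler address, endpoint, and column names are assumptions for the example, and cache credentials are taken from the environment.

```python
import dask.dataframe as dd
from dask.distributed import Client

# Hypothetical scheduler address for a dynamically provisioned cluster.
client = Client("tcp://dask-scheduler.analytics.svc:8786")

# Lazily read partitioned Parquet from the analytics cache; credentials
# are assumed to be supplied via the executor environment.
ddf = dd.read_parquet(
    "s3://analytics-cache/iot/*.parquet",
    storage_options={"client_kwargs": {"endpoint_url": "http://minio.analytics.svc:9000"}},
)

# The groupby/mean is planned as a task graph and only executed in
# parallel across the cluster when compute() is called.
mean_by_sensor = ddf.groupby("sensor_id")["value"].mean().compute()
print(mean_by_sensor.head())
```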

Intermediate Artefact Cache – MinIO

MinIO is used as the intermediate object storage layer and provides an S3-compatible interface.

Responsibilities include:

  • Storage of intermediate artefacts generated by ETL workflows

  • Durable intermediate state, enabling reliable and fail-safe execution of analytics jobs

  • Concurrent access to intermediate datasets

  • Storage and serving of AI models used by the platform
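
A minimal sketch of artefact hand-off through MinIO with the `minio` Python client follows; the endpoint, credentials, bucket, and object names are placeholders.

```python
from minio import Minio

# Endpoint and credentials are illustrative placeholders.
cache = Minio("minio.analytics.svc:9000",
              access_key="ANALYTICS_CACHE_KEY",
              secret_key="ANALYTICS_CACHE_SECRET",
              secure=False)

bucket = "intermediate-artefacts"
if not cache.bucket_exists(bucket):
    cache.make_bucket(bucket)

# Persist an intermediate artefact produced by an ETL step...
cache.fput_object(bucket, "run-42/features.parquet", "features.parquet")

# ...and fetch it back in a later step, possibly on another executor.
cache.fget_object(bucket, "run-42/features.parquet", "features_local.parquet")
```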

OLAP Storage – ClickHouse

For structured datasets, ClickHouse serves as the OLAP engine, supporting high-performance analytical queries at scale.
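
For example, a typical analytical query might be issued with the `clickhouse-connect` client as sketched below; the host, table, and column names are illustrative.

```python
import clickhouse_connect

# Host and table are illustrative; adjust to the deployment.
ch = clickhouse_connect.get_client(host="clickhouse.analytics.svc", port=8123)

result = ch.query("""
    SELECT sensor_id, avg(value) AS mean_value
    FROM iot_readings
    WHERE ts >= now() - INTERVAL 1 DAY
    GROUP BY sensor_id
    ORDER BY mean_value DESC
    LIMIT 10
""")
for row in result.result_rows:
    print(row)
```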

For unstructured data stored in MinIO, efficient data container formats are used, including:

  • HDF5

  • Zarr

  • NetCDF

These formats enable parallel, chunk-based processing and indexed data access.
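
A brief Zarr sketch of chunk-based access follows; in production the store would typically sit on MinIO rather than the local filesystem, and the array shape and chunking are arbitrary examples.

```python
import numpy as np
import zarr

# Create a chunked array; chunking enables parallel, region-wise access
# because only the chunks overlapping a slice are read or written.
z = zarr.open("satellite_scene.zarr", mode="w",
              shape=(10_000, 10_000), chunks=(1_000, 1_000), dtype="f4")

# Writers and readers can touch disjoint chunks concurrently.
z[0:1_000, 0:1_000] = np.random.rand(1_000, 1_000).astype("f4")
tile = z[500:700, 500:700]
print(tile.shape, float(tile.mean()))
```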

Process Management Interface – Kubeflow

Kubeflow acts as the process coordination and management interface.

It serves as the primary control plane from which:

  • Analytical workflows (for example, Argo workflows) are triggered

  • Long-running experiments are managed

  • AI and ML workloads are coordinated
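
As a minimal, hypothetical sketch using the Kubeflow Pipelines (KFP v2) SDK, the snippet below defines and compiles a one-step pipeline; real components would invoke the containerized processes held in the Process Repository.

```python
from kfp import dsl, compiler

@dsl.component
def preprocess(rows: int) -> int:
    """Illustrative step; a real component would run one of the
    containerized analytical processes."""
    return rows * 2

@dsl.pipeline(name="example-experiment")
def experiment(rows: int = 100):
    preprocess(rows=rows)

# Compile to an intermediate representation that the Kubeflow
# control plane can schedule for execution.
compiler.Compiler().compile(experiment, "experiment.yaml")
```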

Optional Management API Server – Go

In scenarios requiring additional custom management utilities, such as authentication middleware or platform-specific controls, Go-based API servers may be deployed as optional components.

LLM Hosting Capabilities

The Analytics Layer supports hosting and execution of Large Language Models (LLMs) using frameworks such as:

  • LangChain

  • Ollama

  • Related LLM toolchains

This enables both experimentation and deployment of generative AI workloads within the sandbox.
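
For instance, a model served by Ollama can be queried over its local REST API, as in the hedged sketch below; the model name, prompt, and host are placeholders.

```python
import requests

# Ollama exposes a local REST API (default port 11434); the model
# name and prompt here are illustrative.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3",
          "prompt": "Summarise today's data quality report.",
          "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```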

Deployment Modes and Use Cases

The Analytics Layer may also be deployed as a standalone system for routine or scheduled analytical operations, such as:

  • Generation of periodic data quality reports

  • Hosting inference servers for continuous or scheduled inference

  • Running data preprocessing and data enhancement workflows

However, its primary purpose is to serve as the execution environment for the IUDX AI Sandbox, enabling users to run analytical workloads, experiments, and models in a controlled, scalable, and secure compute environment.

