Analytics Layer – Architectural Overview

Overview

This section provides a high-level architectural overview of the Analytics Layer and its placement within the overall IUDX-Novo system stack.

The Analytics Layer functions as the batch processing and analytical execution environment for the AI Sandbox, enabling scalable data processing, experimentation, and model execution.

Figure 4: Batch Processing Framework – Architectural Overview ⚠️ Diagram to be updated / added for IUDX-Novo

Integration with the System Stack

The Analytics Layer connects to multiple parts of the IUDX ecosystem:

  • Data Lake – Stores diverse data types, including:

    • Spatio-temporal data

    • IoT data streams

    • Images

    • Structured and unstructured datasets

  • Data Exchange Layer – Enables analytical access to datasets registered on the exchange, subject to authorization and consent.

A dedicated data ingestion framework is responsible for moving data into the analytical compute environment. This framework supports both streaming and batch ingestion, using systems such as Apache Flink where applicable.

To optimize analytical workloads, an intermediate high-performance analytics cache is used. The ingestion framework enables rapid onboarding of data into this cache to support efficient, repeatable analytical workflows.
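
As a concrete illustration, the sketch below onboards a batch export into the MinIO-backed analytics cache as Parquet using pandas (with s3fs and pyarrow available); the endpoint, credentials, bucket, and column names are placeholders rather than fixed platform values.

```python
import pandas as pd

# Hypothetical connection details for the MinIO-backed analytics cache;
# substitute the deployment's actual endpoint and credentials.
CACHE_STORAGE_OPTIONS = {
    "key": "ANALYTICS_CACHE_KEY",
    "secret": "ANALYTICS_CACHE_SECRET",
    "client_kwargs": {"endpoint_url": "http://minio.analytics.svc:9000"},
}

def onboard_batch(source_csv: str, cache_path: str) -> None:
    """Read a raw batch export and land it in the cache as Parquet,
    a columnar format suited to repeatable analytical reads."""
    # The "timestamp" column is an illustrative assumption.
    df = pd.read_csv(source_csv, parse_dates=["timestamp"])
    df.to_parquet(cache_path, storage_options=CACHE_STORAGE_OPTIONS)

onboard_batch("sensor_export.csv", "s3://analytics-cache/iot/sensor_export.parquet")
```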

Detailed Analytical Layer Architecture

Figure 5: Batch Processing Framework / Analytics Layer – Detailed Overview ⚠️ Diagram to be updated / added for IUDX-Novo

The Analytics Layer is deployed on Kubernetes, with all components designed following cloud-native principles.

The core components of the Analytics Layer are described below.

Core Components

Process Repository – Docker Registry

The Process Repository stores reusable analytical processes, packaged as containers.

Examples include:

  • Regression and classification models

  • UNET-based segmentation models

  • Other containerized analytical workflows

The repository is implemented as a Docker Registry, enabling versioned, portable, and reproducible analytics execution.
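
As a hedged sketch of how a process might be published, the snippet below builds and pushes a versioned image with the `docker` Python SDK; the registry address and image name are illustrative, not the platform's actual endpoints.

```python
import docker

# Hypothetical Process Repository address and image name.
REGISTRY = "registry.analytics.example.org"
IMAGE = f"{REGISTRY}/processes/unet-segmentation:1.2.0"

client = docker.from_env()

# Build the analytical process from a local Dockerfile and tag it
# with an explicit version so runs remain reproducible.
image, _build_logs = client.images.build(path=".", tag=IMAGE)

# Push the versioned image to the registry.
for line in client.images.push(IMAGE, stream=True, decode=True):
    print(line)
```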

Workflow Orchestrator – Argo

Argo serves as the workflow orchestration and execution engine.

Capabilities include:

  • Definition and execution of:

    • ETL pipelines

    • DAG-based workflows

    • Step-based workflows

    • Cron-scheduled jobs

  • Compute resource management

  • Job batching and scheduling

  • APIs and user interfaces for:

    • Job submission

    • Status queries

    • Execution monitoring
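
Because Argo Workflows are Kubernetes custom resources (`argoproj.io/v1alpha1`, kind `Workflow`), a job can also be submitted programmatically with the official `kubernetes` Python client. The sketch below defines a minimal two-step DAG; the namespace, image, and step contents are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

# A minimal two-step DAG; image and namespace are illustrative.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "etl-example-"},
    "spec": {
        "entrypoint": "main",
        "templates": [
            {
                "name": "main",
                "dag": {"tasks": [
                    {"name": "extract", "template": "step"},
                    {"name": "transform", "template": "step",
                     "dependencies": ["extract"]},
                ]},
            },
            {
                "name": "step",
                "container": {"image": "alpine:3.20",
                              "command": ["echo", "running step"]},
            },
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="analytics", plural="workflows", body=workflow,
)
```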

Executor Cluster – Dask / Torch / Ray

The compute-intensive stages of analytical workloads are executed on a stateless executor cluster that is provisioned dynamically.

Depending on workload type, the platform supports:

  • Dask for distributed DataFrame and parallel data processing

  • Torch for deep learning workloads

  • Ray for distributed execution and scalable model serving

This approach enables elastic scaling and efficient utilization of compute resources.
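
As an illustration of the Dask path, the sketch below connects to a scheduler and computes a grouped aggregate over partitioned Parquet in the cache; the scheduler address, endpoint, and column names are assumptions for the example, and cache credentials are taken from the environment.

```python
import dask.dataframe as dd
from dask.distributed import Client

# Hypothetical scheduler address for a dynamically provisioned cluster.
client = Client("tcp://dask-scheduler.analytics.svc:8786")

# Lazily read partitioned Parquet from the analytics cache; credentials
# are assumed to be supplied via the executor environment.
ddf = dd.read_parquet(
    "s3://analytics-cache/iot/*.parquet",
    storage_options={"client_kwargs": {"endpoint_url": "http://minio.analytics.svc:9000"}},
)

# The groupby/mean is planned as a task graph and only executed in
# parallel across the cluster when compute() is called.
mean_by_sensor = ddf.groupby("sensor_id")["value"].mean().compute()
print(mean_by_sensor.head())
```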

Intermediate Artefact Cache – MinIO

MinIO is used as the intermediate object storage layer and provides an S3-compatible interface.

Responsibilities include:

  • Storage of intermediate artefacts generated by ETL workflows

  • Durable intermediate state, enabling reliable and fail-safe execution of analytics jobs

  • Concurrent access to intermediate datasets

  • Storage and serving of AI models used by the platform
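
A minimal sketch of artefact hand-off through MinIO with the `minio` Python client follows; the endpoint, credentials, bucket, and object names are placeholders.

```python
from minio import Minio

# Endpoint and credentials are illustrative placeholders.
cache = Minio("minio.analytics.svc:9000",
              access_key="ANALYTICS_CACHE_KEY",
              secret_key="ANALYTICS_CACHE_SECRET",
              secure=False)

bucket = "intermediate-artefacts"
if not cache.bucket_exists(bucket):
    cache.make_bucket(bucket)

# Persist an intermediate artefact produced by an ETL step...
cache.fput_object(bucket, "run-42/features.parquet", "features.parquet")

# ...and fetch it back in a later step, possibly on another executor.
cache.fget_object(bucket, "run-42/features.parquet", "features_local.parquet")
```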

OLAP Storage – ClickHouse

For structured datasets, ClickHouse serves as the OLAP engine, supporting high-performance analytical queries at scale.
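
For example, a typical analytical query might be issued with the `clickhouse-connect` client as sketched below; the host, table, and column names are illustrative.

```python
import clickhouse_connect

# Host and table are illustrative; adjust to the deployment.
ch = clickhouse_connect.get_client(host="clickhouse.analytics.svc", port=8123)

result = ch.query("""
    SELECT sensor_id, avg(value) AS mean_value
    FROM iot_readings
    WHERE ts >= now() - INTERVAL 1 DAY
    GROUP BY sensor_id
    ORDER BY mean_value DESC
    LIMIT 10
""")
for row in result.result_rows:
    print(row)
```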

For unstructured data stored in MinIO, efficient data container formats are used, including:

  • HDF5

  • Zarr

  • NetCDF

These formats enable parallel, chunk-based processing and indexed data access.
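
A brief Zarr sketch of chunk-based access follows; in production the store would typically sit on MinIO rather than the local filesystem, and the array shape and chunking are arbitrary examples.

```python
import numpy as np
import zarr

# Create a chunked array; chunking enables parallel, region-wise access
# because only the chunks overlapping a slice are read or written.
z = zarr.open("satellite_scene.zarr", mode="w",
              shape=(10_000, 10_000), chunks=(1_000, 1_000), dtype="f4")

# Writers and readers can touch disjoint chunks concurrently.
z[0:1_000, 0:1_000] = np.random.rand(1_000, 1_000).astype("f4")
tile = z[500:700, 500:700]
print(tile.shape, float(tile.mean()))
```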

Process Management Interface – Kubeflow

Kubeflow acts as the process coordination and management interface.

It serves as the primary control plane from which:

  • Analytical workflows (for example, Argo workflows) are triggered

  • Long-running experiments are managed

  • AI and ML workloads are coordinated
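
As a minimal, hypothetical sketch using the Kubeflow Pipelines (KFP v2) SDK, the snippet below defines and compiles a one-step pipeline; real components would invoke the containerized processes held in the Process Repository.

```python
from kfp import dsl, compiler

@dsl.component
def preprocess(rows: int) -> int:
    """Illustrative step; a real component would run one of the
    containerized analytical processes."""
    return rows * 2

@dsl.pipeline(name="example-experiment")
def experiment(rows: int = 100):
    preprocess(rows=rows)

# Compile to an intermediate representation that the Kubeflow
# control plane can schedule for execution.
compiler.Compiler().compile(experiment, "experiment.yaml")
```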

Optional Management API Server – Go

In scenarios requiring additional custom management utilities, such as authentication middleware or platform-specific controls, Go-based API servers may be deployed as optional components.

LLM Hosting Capabilities

The Analytics Layer supports hosting and execution of Large Language Models (LLMs) using frameworks such as:

  • LangChain

  • Ollama

  • Related LLM toolchains

This enables both experimentation and deployment of generative AI workloads within the sandbox.
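
For instance, a model served by Ollama can be queried over its local REST API, as in the hedged sketch below; the model name, prompt, and host are placeholders.

```python
import requests

# Ollama exposes a local REST API (default port 11434); the model
# name and prompt here are illustrative.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3",
          "prompt": "Summarise today's data quality report.",
          "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```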

Deployment Modes and Use Cases

The Analytics Layer may also be deployed as a standalone system for routine or scheduled analytical operations, such as:

  • Generation of periodic data quality reports

  • Hosting inference servers for continuous or scheduled inference

  • Running data preprocessing and data enhancement workflows

However, its primary purpose is to serve as the execution environment for the IUDX AI Sandbox, enabling users to run analytical workloads, experiments, and models in a controlled, scalable, and secure compute environment.

