Analytics and Sandbox Technology Stack

Overview

This section describes the technology stack used for the Analytics and Sandbox environment of the IUDX-Novo platform.

The stack is organized into four concentric functional layers (Core, Orchestrator, Executors, and Application), each of which plays a distinct role in the data science and machine learning (ML) workflow. Together, these layers enable scalable, reproducible, and resource-efficient analytical and AI workloads.

img_1.png

Figure: Analytics and Sandbox Technology Stack (diagram to be updated or added)

Core Layer

The core layer provides the foundational capabilities that every other layer depends on: a container runtime, GPU acceleration, and object storage.

Docker

  • Provides containerization for packaging applications along with their dependencies

  • Ensures portability and reproducibility across environments

NVIDIA

  • Enables GPU acceleration for deep learning and compute-intensive AI workloads

  • Supports training and inference for advanced ML models

MinIO

  • Provides S3-compatible object storage

  • Used for managing:

    • Large datasets

    • Intermediate processing artefacts

    • Model artefacts
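Because MinIO exposes an S3-compatible API, a standard S3 client can usually be pointed at it by overriding the endpoint. The following is a minimal sketch under stated assumptions, not the platform's actual configuration: the helper names, the key prefixes, and the use of `boto3` as the client library are all hypothetical choices made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical prefixes, one per artefact category listed above.
PREFIXES = {
    "dataset": "datasets",
    "intermediate": "intermediate",
    "model": "models",
}


def s3_client_kwargs(endpoint_url: str, access_key: str, secret_key: str) -> dict:
    """Keyword arguments for boto3.client("s3", ...) that target a MinIO endpoint."""
    return {
        "endpoint_url": endpoint_url,  # MinIO address instead of the AWS default
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }


def object_key(kind: str, project: str, filename: str) -> str:
    """Build an object key such as 'models/demo/model.pt' (assumed layout)."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown artefact kind: {kind!r}")
    return str(PurePosixPath(PREFIXES[kind]) / project / filename)
```

With `boto3` installed, `boto3.client("s3", **s3_client_kwargs(...))` would return a client whose `upload_file` and `download_file` calls go to the MinIO endpoint rather than to AWS.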

Orchestrator Layer

This layer manages containerized workloads and workflow execution.

Kubernetes

  • Serves as the container orchestration platform

  • Manages workload scheduling, scaling, and deployment

  • Enables isolation and resource governance
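Resource governance in Kubernetes is typically expressed as per-container requests and limits. The sketch below builds a Pod manifest as a plain Python dictionary to show the shape of such a specification. The function name, the resource quantities, and the example arguments are hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster.

```python
import json


def gpu_pod_manifest(name: str, image: str, namespace: str, gpus: int = 1) -> dict:
    """Return a Pod manifest with explicit CPU, memory, and GPU limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        # The namespace provides the isolation boundary for the workload.
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    "resources": {
                        # Illustrative quantities only, not platform defaults.
                        "requests": {"cpu": "2", "memory": "8Gi"},
                        "limits": {
                            "cpu": "4",
                            "memory": "16Gi",
                            # Extended resource exposed by the NVIDIA device plugin.
                            "nvidia.com/gpu": gpus,
                        },
                    },
                }
            ],
        },
    }


def to_json(manifest: dict) -> str:
    """Serialize a manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

The output of `to_json(gpu_pod_manifest("train-job", "example/trainer:latest", "sandbox"))` could be saved to a file and submitted with `kubectl apply -f`.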

Argo

  • Provides workflow orchestration on Kubernetes

  • Supports execution and management of:

    • ETL workflows

    • ML pipelines

    • Batch and scheduled jobs
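An Argo workflow is declared as a Kubernetes custom resource whose templates describe the steps and their dependencies. The following sketch builds a three-step ETL workflow as a Python dictionary. The function name, step names, container image, and shell commands are placeholders chosen for illustration rather than anything defined by the platform.

```python
import json


def etl_workflow(image: str) -> dict:
    """Return an Argo Workflow manifest with extract -> transform -> load steps."""

    def container_template(name: str, command: str) -> dict:
        # Each step runs a placeholder shell command in the given image.
        return {
            "name": name,
            "container": {"image": image, "command": ["sh", "-c", command]},
        }

    steps = ["extract", "transform", "load"]
    tasks = []
    for index, step in enumerate(steps):
        task = {"name": step, "template": step}
        if index > 0:
            # A task starts only after the previous step has succeeded.
            task["dependencies"] = [steps[index - 1]]
        tasks.append(task)

    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "etl-"},
        "spec": {
            "entrypoint": "pipeline",
            "templates": [{"name": "pipeline", "dag": {"tasks": tasks}}]
            + [container_template(step, f"echo {step}") for step in steps],
        },
    }


def to_json(manifest: dict) -> str:
    """Serialize the manifest as JSON for submission to the cluster."""
    return json.dumps(manifest, indent=2)
```

A DAG template is used here because it makes dependencies explicit. Scheduled jobs would normally wrap a similar specification in Argo's `CronWorkflow` resource instead.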

Executors Layer

The executors layer provides distributed and scalable compute engines.

PyTorch

  • A widely used deep learning framework

  • Supports model training, evaluation, and inference
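To illustrate how PyTorch uses the GPU acceleration from the core layer, the sketch below selects a device and runs one training step on a toy linear model. The model, the random data, and the function names are illustrative only. The import guard is an assumption added so that the example degrades gracefully where PyTorch is not installed.

```python
from typing import Optional

try:
    import torch
except ImportError:  # lets the sketch load where PyTorch is not installed
    torch = None


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


def train_step_demo() -> Optional[float]:
    """Run one SGD step on random data and return the loss (None without PyTorch)."""
    if torch is None:
        return None
    device = pick_device()
    model = torch.nn.Linear(4, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Toy batch: 8 samples with 4 features each.
    inputs = torch.randn(8, 4, device=device)
    targets = torch.randn(8, 1, device=device)

    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same code runs unchanged on CPU-only and GPU-enabled containers, since the device is chosen at run time.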

Dask

  • A parallel computing library for scaling Python-based data science workloads

  • Enables distributed processing of large datasets
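The chunk-and-combine pattern that Dask parallelizes can be sketched with `dask.delayed`. The example below is a minimal illustration with hypothetical function names and a toy computation. The sequential fallback is an assumption added so that the code still runs where Dask is not installed.

```python
try:
    import dask
except ImportError:  # fall back to sequential execution without Dask
    dask = None


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_squares(chunk):
    """Work done independently on each chunk."""
    return sum(v * v for v in chunk)


def total_of_squares(values, chunk_size=1000):
    """Sum of squares, computed per chunk and then combined."""
    chunks = chunked(values, chunk_size)
    if dask is None:
        return sum(sum_squares(c) for c in chunks)
    # Build a lazy task graph: one task per chunk plus a final reduction.
    partials = [dask.delayed(sum_squares)(c) for c in chunks]
    return dask.delayed(sum)(partials).compute()
```

Nothing is executed until `compute()` is called, which is what allows a Dask scheduler to distribute the per-chunk tasks across workers.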

Ray

  • A distributed computing framework for scalable ML workloads

  • Supports:

    • Distributed model training

    • Hyperparameter tuning

    • Large-scale parallel execution of Python tasks

Application Layer

The application layer provides user-facing tools for development and experimentation.

JupyterHub

  • Enables multi-user interactive notebook environments

  • Serves as the primary interface for data science and ML development

Kubeflow

  • Provides a Kubernetes-native ML toolkit

  • Supports:

    • Scalable model training

    • Model deployment and serving

    • Workflow automation and lifecycle management

Summary

The Analytics and Sandbox technology stack is designed to support advanced AI, ML, and data analytics workflows by ensuring:

  • Scalability through containerization and orchestration

  • Reproducibility via standardized runtime environments

  • Efficient resource utilization using distributed computing frameworks

The combination of Docker, NVIDIA GPU acceleration, MinIO, Kubernetes, Argo, PyTorch, Dask, Ray, JupyterHub, and Kubeflow provides a robust and flexible foundation for data-driven applications within the sandbox environment.

