Details
Feature
Not Assigned
SRC-Multi-Team SRCNet0.x
Description
If the SKA precursor experience is anything to go by, there will be a need within the SRCNet for a Science Platform that supports HPC-based data reduction and the generation of Level 7 science products, but at a much larger scale. Such a platform would complement the CANFAR science platform currently under development by the SRCNet.
This platform would distinguish itself by:
- supporting MPI and multi-processing compute (see the sketch after this list),
- supporting Singularity containerisation,
- scaling up to data-reduction analyses on datasets of 1 TB and above,
- running pipelines across multiple VMs, and
- supporting on-the-fly data compression/decompression.
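As a minimal sketch of the first capability, the pattern below distributes an embarrassingly parallel, per-channel workload across MPI ranks, assuming mpi4py is available; process_channel() and the channel count are illustrative stand-ins, not part of any existing SRCNet pipeline.

```python
# Hedged sketch of the MPI/multi-processing pattern the platform would
# need to host; assumes mpi4py. process_channel() is a stand-in for a
# real reduction step (e.g. per-channel source finding).
from mpi4py import MPI

def process_channel(channel: int) -> float:
    """Hypothetical per-channel computation."""
    return float(channel) * 0.5

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Interleave spectral channels across ranks (embarrassingly parallel).
n_channels = 4096
local = sum(process_channel(c) for c in range(rank, n_channels, size))

# Combine the partial results on rank 0.
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print(f"combined result from {size} ranks: {total}")
```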
Rationale:
Considering the assumed data output rate from the SKAO, timely data processing is fundamental to the success of the project - data cannot be allowed to back up over days (or probably even hours) when only limited fast storage (Tier 0 or 1) is available.
The CANFAR platform addresses the subset of processing that can be run within a VM or an orchestrated Docker container environment. However, best-in-class algorithms (such as SoFiA-2 source finding, and LINMOS and MIRIAD ImMerge mosaicking), when used against datacubes of 500 GB or more, require memory and CPU resources on a scale only available across multiple cores and VMs. This is typically met by running these algorithms within an HPC parallel environment; no single compute core (or node) could support timely processing at such a scale.
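To illustrate why such cubes cannot simply be loaded on one node, the sketch below reads a datacube chunk-by-chunk via a memory map, assuming astropy; "cube.fits", the chunk size, and the per-chunk statistic are placeholders for real pipeline steps.

```python
# Hedged sketch: channel-wise processing of a datacube too large to
# hold in memory on one node. Assumes astropy; "cube.fits" and the
# per-chunk statistic are placeholders for real science code.
import numpy as np
from astropy.io import fits

CHUNK = 64  # spectral channels per chunk; tune to available RAM

with fits.open("cube.fits", memmap=True) as hdul:  # memmap avoids loading the full cube
    cube = hdul[0].data                            # shape: (n_chan, ny, nx)
    n_chan = cube.shape[0]
    peaks = []
    for start in range(0, n_chan, CHUNK):
        block = np.asarray(cube[start:start + CHUNK])  # only this slice is read from disk
        peaks.append(block.max())                      # stand-in for source finding, etc.
    print(f"cube peak over {n_chan} channels: {max(peaks)}")
```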
Although some SRC nodes have already developed aspects of HPC processing, the SRCNet would benefit from a science platform that can be deployed at SRC nodes with access to the appropriate HPC compute resources, providing a single user interface that the rest of the SRCNet user community can access.
Requirements:
Such a platform would minimally require:
- a flexible HPC workflow/pipeline orchestrator,
- a single user-portal interface for running user-defined algorithms,
- a jobs database,
- AAI,
- ability to auto-trigger on external events (e.g. incoming data),
- ability to automatically deploy on HPC resources such as SLURM (sketched below),
- ability to monitor running user jobs,
- ability to run MPI processes (either containerised or natively),
- ability to run embarrassingly parallel multi-processing workloads, and
- ability to interface with Tier 0, 1 and 2 storage (S3 object store or POSIX).
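As a hedged sketch of the auto-deployment requirement, the snippet below generates a SLURM batch script that runs an MPI application inside a Singularity container, submits it with sbatch, and captures the job ID for monitoring. The image path, resource requests, and application name are placeholders; a real platform would template these from the jobs database.

```python
# Hedged sketch of auto-deploying a containerised MPI job to SLURM.
# All names (image path, node counts, application) are illustrative.
import subprocess
import tempfile

BATCH = """#!/bin/bash
#SBATCH --job-name=srcnet-demo
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=04:00:00

# Run the MPI application inside a Singularity container.
srun singularity exec /images/pipeline.sif mpi_app --input /data/cube.fits
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(BATCH)
    script = f.name

# sbatch prints e.g. "Submitted batch job 123456"; keep the job ID
# so the platform can monitor the run.
out = subprocess.run(["sbatch", script], capture_output=True, text=True, check=True)
job_id = out.stdout.strip().split()[-1]
print(f"submitted SLURM job {job_id}")
```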
Planning:
This feature will span many Program Increments (PIs) and will act as the parent of individual features that each span one or two PIs.