SAFe Program / SP-4676

HPC-based Science Platform


Details

    • PI24 - UNCOVERED

    • SRC-Multi-Team SRCNet0.x

    Description

      If the SKA precursor experiences are anything to go by, the SRCNet will need a Science Platform to support HPC-based data reduction and the generation of Level 7 science products, but at a much larger scale. Such a platform would complement the CANFAR science platform currently under development by the SRCNet.

      This platform would distinguish itself by:

      • supporting MPI and multi-processing compute (see the sketch after this list),
      • supporting Singularity containerisation,
      • scaling up to data-reduction analyses on datasets of 1 TB and above,
      • running pipelines across multiple VMs, and
      • supporting on-the-fly data compression/decompression.
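
      As a hedged illustration of the MPI bullet, the mpi4py sketch below splits the channels of a large datacube across ranks and combines the per-rank results; the channel count and the per-slice work are placeholder assumptions, not part of this feature.

        # Minimal mpi4py sketch: static channel decomposition of a datacube.
        # The channel count and per-slice "work" are illustrative assumptions.
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        size = comm.Get_size()

        N_CHANNELS = 16384                     # assumed spectral channel count

        # Contiguous block of channels per rank; the last rank takes the tail.
        per_rank = N_CHANNELS // size
        lo = rank * per_rank
        hi = N_CHANNELS if rank == size - 1 else lo + per_rank

        # A real pipeline would read and reduce only this slice of the cube,
        # e.g. partial = process_channels("cube.fits", lo, hi)  # hypothetical
        partial = float(hi - lo)               # stand-in per-slice result

        total = comm.reduce(partial, op=MPI.SUM, root=0)
        if rank == 0:
            print(f"processed {int(total)} channels across {size} ranks")

      Under Singularity on a SLURM cluster, a script like this would typically be launched with something like "srun singularity exec image.sif python split_cube.py", which also exercises the containerisation bullet above.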

      Rationale:

      Considering the assumed data output rate from the SKAO, timely data processing is fundamental to the success of the project: data cannot be allowed to back up over days (or probably even hours) when only limited fast storage (Tier 0 or 1) is available.

      The CANFAR platform addresses the subset of processing that can be run within a VM or an orchestrated Docker container environment. However, best-in-class algorithms (such as SoFiA-2 source finding, and LINMOS and MIRIAD ImMerge mosaicking), when run against datacubes of 500 GB or more, require huge memory and CPU resources, on a scale only available across multiple cores and VMs (see the estimate below). This is typically met by running these algorithms within an HPC parallel environment; no single compute core (or node) could support timely processing at such a scale.
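
      To make the memory argument concrete, the back-of-envelope sketch below estimates the working set for a large datacube; the cube dimensions, working-set factor, and node RAM are illustrative assumptions only, not SKAO figures.

        # Back-of-envelope estimate of why a single node cannot hold the job.
        # All figures below are illustrative assumptions.
        import math

        nx, ny, nchan = 6000, 6000, 4000   # assumed cube dimensions (voxels)
        bytes_per_voxel = 4                # float32
        cube_gb = nx * ny * nchan * bytes_per_voxel / 1e9

        working_factor = 3                 # cube + smoothed copy + mask, roughly
        node_ram_gb = 512                  # a typical "fat" HPC node (assumed)

        need_gb = cube_gb * working_factor
        print(f"cube: {cube_gb:.0f} GB, working set: {need_gb:.0f} GB")
        print(f"nodes needed: {math.ceil(need_gb / node_ram_gb)}")

      On these assumptions, a roughly 576 GB cube needs about 1.7 TB of working memory, i.e. several nodes, which is exactly the regime that MPI-based decomposition addresses.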

      Although some SRC nodes have developed aspects of HPC processing, the SRCNet would benefit from a science platform that can be deployed at SRC nodes with access to the appropriate HPC compute resources, providing a single user interface accessible to the rest of the SRCNet user community.

      Requirements:

      Such a platform would minimally require:

      • a flexible HPC workflow/pipeline orchestrator,
      • a single user-portal interface for running user-defined algorithms,
      • a jobs database,
      • AAI,
      • ability to auto-trigger on external events (e.g. incoming data; see the sketch after this list),
      • ability to automatically deploy on HPC resources such as SLURM,
      • ability to monitor running user jobs,
      • ability to run MPI processes (either containerised or natively),
      • ability to run embarrassingly parallel multi-processing, and
      • ability to interface with Tier 0, 1 and 2 storage (S3 object store or POSIX).
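
      To show how the auto-trigger and SLURM-deployment requirements could fit together, the sketch below polls a landing directory and submits one batch job per new dataset; the paths, sbatch options, and reduce.sh script are hypothetical placeholders, and only sbatch itself is the real SLURM submission command.

        # Sketch: auto-trigger on incoming data and hand off to SLURM.
        # INCOMING, reduce.sh, and the sbatch options are placeholders.
        import subprocess
        import time
        from pathlib import Path

        INCOMING = Path("/srv/incoming")   # assumed data landing area
        seen: set[Path] = set()

        def submit(dataset: Path) -> None:
            """Submit one reduction job via sbatch (the standard SLURM CLI)."""
            result = subprocess.run(
                ["sbatch", "--ntasks=64", "--job-name", dataset.stem,
                 "reduce.sh", str(dataset)],
                capture_output=True, text=True, check=True,
            )
            # e.g. "Submitted batch job 12345"; a jobs database would record
            # this ID to support the monitoring requirement.
            print(result.stdout.strip())

        while True:
            for path in INCOMING.glob("*.ms"):   # assumed measurement sets
                if path not in seen:
                    seen.add(path)
                    submit(path)
            time.sleep(30)  # simple polling; a production system would use
                            # messaging or storage-event notifications instead

      A production orchestrator (the first bullet) would replace this polling loop, but the hand-off to sbatch and the capture of the job ID for monitoring would look much the same.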

      Planning:

      This feature will span multiple Program Increments and will act as the parent of individual features that each span one or two PIs.


              People

                b.mort (Mort, Ben)
                G.German (German, Gordon)

                Feature Progress

                  Story Point Burn-up: 0%

                  Feature Estimate: 0.0

                                 Issues   Story Points
                  To Do          1        0.0
                  In Progress    1        1.0
                  Complete       0        0.0
                  Total          2        1.0
