SP-4240

System Architecture Design for Distributed Compute and Storage

Details

    • SRCnet

      Looking towards SRCNet v0.2, we need to build consensus on common approaches for federating compute, and how we make use of Rucio to move data between online and offline/nearline storage across SRCNet.

      This work is likely to raise questions that are best answered by running tests on the SRCNet v0.1 setup during the next PI.

      AC1: A Strategy / Design paper outlining best practice for managing distributed data and compute is created and agreed by the SRCNet Architecture Group
      AC2: This paper is presented to, and accepted by, SRCNet participants.

    • PI24 - UNCOVERED

    • data-lifecycle team_DAAC

    Description

      While creating the v0.1 node specification, it became clear that we need further work to build consensus around a system architecture that can support the implementation of the existing SRCNet architecture document.

      The plan is to make specific concrete proposals we can discuss in the areas of:

      • Storage
        • Define the storage tiers (building on the definition from the top level roadmap)
        • Define how Rucio is expected to move ODPs and SDPs
        • Define how Rucio can help track where data is currently accessible, where data is archived and could be made accessible, and how to request a change in what data is currently accessible
        • Define how data is distributed to sites that do not have archive storage, or that have smaller amounts of online and/or scratch storage
        • Link to higher level discussions around metadata and data discovery linking to global URIs that can then be resolved to a local location via Rucio.
      • Compute
        • Services and Interactive Compute
          • Define how long-running services and APIs are made available at each site
          • For accessible data, define how to manage the resources needed to visualize that currently accessible data, ensuring the resources are available at a time that is convenient for those wanting to visualize the data.
          • Define how a user can request that a data set is made accessible for visualization, when it is archived and not currently accessible via online storage
        • Discuss the management of "scratch" storage local to the compute
        • Batch Compute
          • Discuss how users request the creation of ADPs and, once they are ready, get access to visualize those ADPs.
          • Discuss the adoption of the IVOA Execution Broker, or similar, to implement a global job queue for the federation.
          • Discuss how jobs can be run on pre-existing Slurm clusters shared with non SKA SRCNet users and workloads.
        • Workflow Development
          • We will need to support CI/CD workflows to update and publish the services, interactive environments and batch compute templates
          • Direct access to development environments, with appropriate data access, to create new interactive and batch compute templates
          • SDP pipeline development, with appropriate access to visibilities
      • Resource accounting
        • We need to document how we track resource usage
        • We should document how we implement the recommendations for resource allocation, and the architectural constraints we need to place on that allocation process
      • Security and User Traceability
        • This should not overlap with the ongoing policy work.
        • Rather, we should set out the expectations on various aspects of the system regarding the provenance of the binaries being executed and the traceability of which specific user triggers each specific resource usage
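
      As a starting point for the storage discussion, the three questions in the Storage list above (where is data currently accessible, where is it archived, and how is a change requested) can be sketched as a toy model. This is a minimal illustration under stated assumptions, not a proposed design: the tier names, the `Catalogue` class and all of its methods are hypothetical, and none of this is Rucio's API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Tier(Enum):
    """Hypothetical storage tiers; the real definitions are still to be agreed."""
    ONLINE = "online"      # directly readable by compute at the site
    NEARLINE = "nearline"  # archived; must be staged before it is accessible
    SCRATCH = "scratch"    # short-lived storage local to the compute


@dataclass(frozen=True)
class Replica:
    site: str
    tier: Tier


@dataclass
class Catalogue:
    """Toy stand-in for the replica information a tool such as Rucio would hold."""
    replicas: Dict[str, List[Replica]] = field(default_factory=dict)
    staging_requests: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, dataset: str, site: str, tier: Tier) -> None:
        self.replicas.setdefault(dataset, []).append(Replica(site, tier))

    def _sites(self, dataset: str, tier: Tier) -> List[str]:
        found = {r.site for r in self.replicas.get(dataset, []) if r.tier is tier}
        return sorted(found)

    def accessible_at(self, dataset: str) -> List[str]:
        """Sites where the dataset is currently on online storage."""
        return self._sites(dataset, Tier.ONLINE)

    def archived_at(self, dataset: str) -> List[str]:
        """Sites that hold an archived copy which could be made accessible."""
        return self._sites(dataset, Tier.NEARLINE)

    def request_access(self, dataset: str, site: str) -> str:
        """Record a request to make the dataset accessible at the given site."""
        if site in self.accessible_at(dataset):
            return "already-accessible"
        if not self.archived_at(dataset):
            return "unavailable"  # no archived copy exists anywhere
        self.staging_requests.append((dataset, site))
        return "staging-requested"
```

      In this sketch a request for a site that already holds an online copy is a no-op, and a request is only queued when an archived copy exists somewhere in the federation. Whether staging should be per site or federation-wide is exactly the kind of question the paper would need to settle.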

      The success of this exercise will be measured by the acceptance of an architecture paper by the SRCNet Architecture Group and the wider SRCNet community.

    People

      r.bolton Bolton, Rosie
      D.Watson Watson, Duncan