SAFe Program / SP-4421

Document consensus around Distributed Compute and Storage


Details

    • SRCnet

      Looking towards SRCNet v0.2, we need to build consensus on common approaches for federating compute.

      To help engage with infrastructure providers, we need to have clearer details on how SRCNet is expected to interact with existing shared infrastructure services.

      The short document should specifically cover v0.2 site requirements around:

      • Define an initial deployment diagram for SRCNet, including high-level (optional) infrastructure components such as: networks, K8s, Ceph, OpenStack, Slurm, Rucio, SRCNet services, an IVOA execution broker (or similar), etc.
      • Update diagrams to clarify the expected Rucio data flows between all sites, including sequence diagrams for the expected common data movements (ingest from telescopes, flow into long-term storage, etc.)
      • Sequence diagrams detailing the expected user flows for requesting compute and storage resources (e.g. interactive analysis, batch processing, streaming data from online storage, moving data from long-term storage into online storage, caching data near compute resources, etc.). Some of these flows may be considered "invalid" and documented as such.
      • Detailed authorization flows for compute workloads accessing the storage, and site requirements around access to that storage (e.g. describe how POSIX, S3-like object storage, or xrootd/WebDAV could be used to limit access in these cases); a sketch of one such flow follows this list.
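
      As a concrete illustration of the authorization bullet above, the following is a minimal sketch of a token-based flow in which a compute workload obtains a short-lived bearer token and presents it on an HTTPS/WebDAV read. All endpoints, client credentials, and scope names are hypothetical, and the scope convention only loosely follows the WLCG token profile; nothing here presumes which mechanism SRCNet will adopt.

      ```python
      import requests

      # Hypothetical endpoints: an IAM token endpoint and a WebDAV door on a storage element.
      TOKEN_URL = "https://iam.example.srcnet.org/token"
      DATA_URL = "https://storage.example.srcnet.org/webdav/srcnet/data/obs123.ms.tar"

      # Exchange client credentials for a short-lived access token
      # (standard OAuth2 client_credentials flow; the scope naming is an assumption).
      token = requests.post(
          TOKEN_URL,
          data={"grant_type": "client_credentials",
                "scope": "storage.read:/srcnet/data"},
          auth=("workload-client-id", "workload-client-secret"),
      ).json()["access_token"]

      # Present the bearer token on the WebDAV read; the storage door is expected
      # to map token scopes onto the paths the workload may touch.
      resp = requests.get(DATA_URL, headers={"Authorization": f"Bearer {token}"})
      resp.raise_for_status()
      ```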

      Stage one of this process (what was outlined in SP-4240 was thought to be too broad in scope) is to build on the top-level roadmap to determine storage tiers and to understand how Rucio can be used to determine data accessibility.


      Suggestion - make AC2 the first / most important part of the exercise: test running workflows at a site separate from the storage versus co-locating the compute and storage, and demonstrate whether or not there is an issue. Note - this doesn't have to use UK sites.

       

      Suggestion - Could this be aligned with the test campaigns? If the hypothesis is that this is a problem for SRCNet, not just UKSRC, should it be tested against multiple sites?

      Suggestion - rather than AC1 / a separate paper: "Extend the architecture documentation to account for sites where the principle of colocated storage / compute is challenged". (Don't stipulate tiers in the AC; that may be the right solution, but let's not assume what the architect / the architecture forum will agree to.)

       

      AC1: A Strategy / Design paper outlining distributed data management using a tiered structure is created and agreed by the SRCNet Architecture Group
      AC2: Demonstrate how a tiered storage model could work using two UKSRC sites
      AC3: A Strategy / Design paper, outlining how Rucio can help track where data is currently accessible, where data is archived and could be made accessible, and how to request a change in what data is currently accessible, is created and agreed by the SRCNet Architecture Group
      AC4: Demonstrate elements of data accessibility using Rucio at UKSRC sites (a sketch of the relevant Rucio client calls follows below).
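
      To make AC3 and AC4 more tangible, here is a minimal sketch, not an agreed design, of how the Rucio client API could answer "where is this data accessible now?" and "make it accessible over there". The scope, dataset name, and RSE expression are hypothetical placeholders; list_replicas and add_replication_rule are existing Rucio client calls.

      ```python
      from rucio.client import Client

      client = Client()  # picks up the usual Rucio configuration and credentials

      # Hypothetical DID: the scope and name are placeholders.
      did = {"scope": "ska_obs", "name": "obs123_continuum.tar"}

      # Where is this data currently accessible, and via which protocols?
      for replica in client.list_replicas(dids=[did], schemes=["https", "root"]):
          for rse, pfns in replica["rses"].items():
              print(rse, pfns)

      # Ask Rucio to make the data accessible on online storage (hypothetical
      # RSE expression): one maintained copy, expiring after two weeks.
      client.add_replication_rule(dids=[did], copies=1,
                                  rse_expression="tier=online&country=uk",
                                  lifetime=14 * 24 * 3600)
      ```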

    • PI24 - UNCOVERED

    • data-lifecycle team_DAAC

    Description

      While creating the v0.1 node specification, it became clear that we need further work to build consensus around a system architecture that can support the implementation of the existing SRCNet architecture document.

      The plan is to make specific concrete proposals we can discuss in the areas of:

      • Storage
        • Stage 1 - Define the storage tiers (building on the definition from the top level roadmap); a sketch of how tiers could be tagged in Rucio follows this list
        • Stage 1 - Define how Rucio can help track where data is currently accessible, where data is archived and could be made accessible, and how to request a change in what data is currently accessible
        • Later Stages - Define how Rucio is expected to move ODPs and SDPs
        • Later Stages - Define how data is distributed to sites that do not have archive storage and that have smaller amounts of online and/or scratch storage
        • Later Stages - Link to the higher-level discussions around metadata and data discovery, where global URIs can be resolved to a local location via Rucio
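
      One way the storage tiers could be expressed in Rucio, offered as an assumption rather than an agreed design, is to tag each RSE with a tier attribute and then target tiers through RSE expressions. The RSE names and attribute values below are hypothetical; add_rse_attribute and list_rses are existing Rucio client calls.

      ```python
      from rucio.client import Client

      client = Client()

      # Tag RSEs with a (hypothetical) tier attribute matching the roadmap's storage tiers.
      client.add_rse_attribute(rse="UKSRC_SITE_A_DISK", key="tier", value="online")
      client.add_rse_attribute(rse="UKSRC_SITE_A_TAPE", key="tier", value="archive")

      # RSE expressions can then select whole tiers, e.g. in replication rules,
      # or simply to list every endpoint currently offering online storage.
      for rse in client.list_rses(rse_expression="tier=online"):
          print(rse["rse"])
      ```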

      THE FOLLOWING ALL TO BE ADDRESSED IN LATER STAGES ...

      • Compute
        • Services and Interactive Compute
          • Define how long-running services and APIs are made available at each site
          • For currently accessible data, define how to manage the resources needed to visualize it, ensuring those resources are available at a time convenient for the users who want to visualize the data.
          • Define how a user can request that a data set is made accessible for visualization when it is archived and not currently accessible via online storage
        • Discuss the management of "scratch" storage local to the compute
        • Batch Compute
          • Discuss users requesting the creation of ADPs and, when ready, getting access to visualize those ADPs.
          • Discuss the adoption of the IVOA execution broker, or similar, to implement a globally federated job queue (see the UWS-style sketch after this list).
          • Discuss how jobs can be run on pre-existing Slurm clusters shared with users and workloads from outside SKA/SRCNet (see the Slurm sketch after this list).
        • Workflow Development
          • We will need to support CI/CD workflows to update and publish the services, interactive environments and batch compute templates
          • Direct access to development environments, with appropriate data access, to create new interactive and batch compute templates
          • SDP pipeline development, with appropriate access to visibilities
      • Resource accounting
        • We need to document how we track resource usage
        • We should document how we implement the recommendations for resource allocation, and the architectural constraints we need to place on that allocation process
      • Security and User Traceability
        • This should not overlap with the ongoing policy work.
        • Rather, we should document the expectations on various aspects of the system around the provenance of the binaries being executed and the traceability of which specific user is triggering which specific resource usage
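
      Since the IVOA execution broker builds on the UWS (Universal Worker Service) pattern, the following minimal sketch shows what submitting a job to such a broker could look like. The broker URL, token, and job parameters are hypothetical; only the UWS phase mechanics (create a job, set PHASE=RUN, poll the phase) follow the published IVOA pattern.

      ```python
      import time
      import requests

      BROKER = "https://broker.example.srcnet.org/uws"  # hypothetical endpoint
      HEADERS = {"Authorization": "Bearer <token from SRCNet IAM - assumption>"}

      # Create a job; UWS services answer with a redirect to the new job resource.
      # The job parameters are placeholders, as they are service-specific.
      resp = requests.post(f"{BROKER}/jobs",
                           data={"RUNID": "imaging-demo"},
                           headers=HEADERS, allow_redirects=False)
      job_url = resp.headers["Location"]

      # Start the job by setting its phase to RUN, then poll until it finishes.
      requests.post(f"{job_url}/phase", data={"PHASE": "RUN"}, headers=HEADERS)
      phase = "QUEUED"
      while phase not in ("COMPLETED", "ERROR", "ABORTED"):
          time.sleep(10)
          phase = requests.get(f"{job_url}/phase", headers=HEADERS).text.strip()
      print("job finished in phase", phase)
      ```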
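
      For the shared-cluster bullet, the point is that SRCNet batch work can run as ordinary Slurm jobs, distinguished only by the account (and possibly partition or QOS) they run under. A minimal sketch follows; the account and partition names are assumptions standing in for whatever the hosting cluster allocates to SKA/SRCNet users, and submitting via Python rather than a shell script is only to keep these examples in one language.

      ```python
      import subprocess
      import tempfile

      # A plain Slurm batch script; the --account and --partition values are
      # hypothetical, as is the analysis command on the final line.
      script = """#!/bin/bash
      #SBATCH --job-name=srcnet-demo
      #SBATCH --account=ska-srcnet
      #SBATCH --partition=shared
      #SBATCH --time=01:00:00
      #SBATCH --mem=8G
      srun python analyse.py /scratch/srcnet/obs123.ms
      """

      with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
          f.write(script)
          script_path = f.name

      # Hand the script to the scheduler; sbatch prints the new job id on success.
      subprocess.run(["sbatch", script_path], check=True)
      ```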

      The success of this exercise will be measured by the acceptance of an architecture paper by SRCNet Architecture group and the wider SRCNet community.

    People

      r.bolton Bolton, Rosie
      D.Watson Watson, Duncan

    Feature Progress

      Story Point Burn-up: (0%)

      Feature Estimate: 2.0

      To Do: 0 issues, 0.0 story points
      In Progress: 0 issues, 0.0 story points
      Complete: 0 issues, 0.0 story points
      Total: 0 issues, 0.0 story points
