SAFe Program / SP-4421

Document consensus around Distributed Compute and Storage


Details

    • SRCnet

      Looking towards SRCNet v0.2, we need to build consensus on common approaches for federating compute.

      To help engage with infrastructure providers, we need to have clearer details on how SRCNet is expected to interact with existing shared infrastructure services.

      The short document should specifically cover v0.2 site requirements around:

      • Define an initial deployment diagram for SRCNet, including high-level (optional) infrastructure components such as: networks, K8s, Ceph, OpenStack, Slurm, Rucio, SRCNet services, an IVOA execution broker (or similar), etc.
      • Update diagrams to clarify the expected Rucio data flows between all sites, including sequence diagrams for the expected common data movements (ingest from telescopes, flow into long-term storage, etc.)
      • Sequence diagrams detailing the expected user flows for requesting compute and storage resources (e.g. interactive analysis, batch processing, streaming data from online storage, moving data from long-term storage into online storage, caching data near compute resources, etc.). Some of these flows may be considered "invalid" and documented as such.
      • Detailed authorization flows for compute workloads accessing the storage, and site requirements around access to that storage (e.g. describe how POSIX, S3-like object storage, or xrootd/WebDAV could be used to limit access in these cases); a sketch of one such flow follows this list.
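
      As a concrete illustration of the authorization bullet above, the following is a minimal sketch of a token-based flow in which a compute workload obtains a short-lived bearer token and presents it on an HTTPS/WebDAV read. All endpoints, client credentials, and scope names are hypothetical, and the scope convention only loosely follows the WLCG token profile; nothing here presumes which mechanism SRCNet will adopt.

      ```python
      import requests

      # Hypothetical endpoints: an IAM token endpoint and a WebDAV door on a storage element.
      TOKEN_URL = "https://iam.example.srcnet.org/token"
      DATA_URL = "https://storage.example.srcnet.org/webdav/srcnet/data/obs123.ms.tar"

      # Exchange client credentials for a short-lived access token
      # (standard OAuth2 client_credentials flow; the scope naming is an assumption).
      token = requests.post(
          TOKEN_URL,
          data={"grant_type": "client_credentials",
                "scope": "storage.read:/srcnet/data"},
          auth=("workload-client-id", "workload-client-secret"),
      ).json()["access_token"]

      # Present the bearer token on the WebDAV read; the storage door is expected
      # to map token scopes onto the paths the workload may touch.
      resp = requests.get(DATA_URL, headers={"Authorization": f"Bearer {token}"})
      resp.raise_for_status()
      ```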

      Stage one of this process (what was outlined in SP-4240 was thought to be too broad in scope) is to build on the top-level roadmap to determine storage tiers and to understand how Rucio can be used to determine data accessibility.


      Suggestion - make AC2 the first / most important part of the exercise: test running workflows at a site separate from the storage versus co-locating the compute and storage, and demonstrate whether or not there is an issue. Note - this doesn't have to use UK sites.

       

      Suggestion - Could this be aligned with the test campaigns? If the hypothesis is that this is a problem for SRCNet, not just UKSRC, should it be tested against multiple sites?

      Suggestion - rather than AC1 / a separate paper: "Extend the architecture documentation to account for sites where the principle of colocated storage / compute is challenged". (Don't stipulate tiers in the AC; that may be the right solution, but let's not assume what the architect / the architecture forum will agree to.)

       

      AC1: A Strategy / Design paper outlining distributed data management using a tiered structure is created and agreed by the SRCNet Architecture Group
      AC2: Demonstrate how a tiered storage model could work using two UKSRC sites
      AC3: A Strategy / Design paper, outlining how Rucio can help track where data is currently accessible, where data is archived and could be made accessible, and how to request a change in what data is currently accessible, is created and agreed by the SRCNet Architecture Group
      AC4: Demonstrate elements of data accessibility using Rucio at UKSRC sites (a sketch of the relevant Rucio client calls follows below).
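
      To make AC3 and AC4 more tangible, here is a minimal sketch, not an agreed design, of how the Rucio client API could answer "where is this data accessible now?" and "make it accessible over there". The scope, dataset name, and RSE expression are hypothetical placeholders; list_replicas and add_replication_rule are existing Rucio client calls.

      ```python
      from rucio.client import Client

      client = Client()  # picks up the usual Rucio configuration and credentials

      # Hypothetical DID: the scope and name are placeholders.
      did = {"scope": "ska_obs", "name": "obs123_continuum.tar"}

      # Where is this data currently accessible, and via which protocols?
      for replica in client.list_replicas(dids=[did], schemes=["https", "root"]):
          for rse, pfns in replica["rses"].items():
              print(rse, pfns)

      # Ask Rucio to make the data accessible on online storage (hypothetical
      # RSE expression): one maintained copy, expiring after two weeks.
      client.add_replication_rule(dids=[did], copies=1,
                                  rse_expression="tier=online&country=uk",
                                  lifetime=14 * 24 * 3600)
      ```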

    • PI24 - UNCOVERED

    • data-lifecycle team_DAAC

    Description

      While creating the v0.1 node specification, it became clear that we need further work to build consensus around a system architecture that can support the implementation of the existing SRCNet architecture document.

      The plan is to make specific concrete proposals we can discuss in the areas of:

      • Storage
        • Stage 1 - Define the storage tiers (building on the definition from the top level roadmap); a sketch of how tiers could be tagged in Rucio follows this list
        • Stage 1 - Define how Rucio can help track where data is currently accessible, where data is archived and could be made accessible, and how to request a change in what data is currently accessible
        • Later Stages - Define how Rucio is expected to move ODPs and SDPs
        • Later Stages - Define how data is distributed to sites that do not have archive storage and that have smaller amounts of online and/or scratch storage
        • Later Stages - Link to the higher-level discussions around metadata and data discovery, where global URIs can be resolved to a local location via Rucio
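
      One way the storage tiers could be expressed in Rucio, offered as an assumption rather than an agreed design, is to tag each RSE with a tier attribute and then target tiers through RSE expressions. The RSE names and attribute values below are hypothetical; add_rse_attribute and list_rses are existing Rucio client calls.

      ```python
      from rucio.client import Client

      client = Client()

      # Tag RSEs with a (hypothetical) tier attribute matching the roadmap's storage tiers.
      client.add_rse_attribute(rse="UKSRC_SITE_A_DISK", key="tier", value="online")
      client.add_rse_attribute(rse="UKSRC_SITE_A_TAPE", key="tier", value="archive")

      # RSE expressions can then select whole tiers, e.g. in replication rules,
      # or simply to list every endpoint currently offering online storage.
      for rse in client.list_rses(rse_expression="tier=online"):
          print(rse["rse"])
      ```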

      THE FOLLOWING ALL TO BE ADDRESSED IN LATER STAGES ...

      • Compute
        • Services and Interactive Compute
          • Define how long-running services and APIs are made available at each site
          • For currently accessible data, define how to manage the resources needed to visualize it, ensuring those resources are available at a time convenient for the users who want to visualize the data.
          • Define how a user can request that a data set is made accessible for visualization when it is archived and not currently accessible via online storage
        • Discuss the management of "scratch" storage local to the compute
        • Batch Compute
          • Discuss users requesting the creation of ADPs and, when ready, getting access to visualize those ADPs.
          • Discuss the adoption of the IVOA execution broker, or similar, to implement a globally federated job queue (see the UWS-style sketch after this list).
          • Discuss how jobs can be run on pre-existing Slurm clusters shared with users and workloads from outside SKA/SRCNet (see the Slurm sketch after this list).
        • Workflow Development
          • We will need to support CI/CD workflows to update and publish the services, interactive environments and batch compute templates
          • Direct access to development environments, with appropriate data access, to create new interactive and batch compute templates
          • SDP pipeline development, with appropriate access to visibilities
      • Resource accounting
        • We need to document how we track resource usage
        • We should document how we implement the recommendations for resource allocation, and the architectural constraints we need to place on that allocation process
      • Security and User Traceability
        • This should not overlap with the ongoing policy work.
        • Rather, we should document the expectations on various aspects of the system around the provenance of the binaries being executed and the traceability of which specific user is triggering which specific resource usage
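
      Since the IVOA execution broker builds on the UWS (Universal Worker Service) pattern, the following minimal sketch shows what submitting a job to such a broker could look like. The broker URL, token, and job parameters are hypothetical; only the UWS phase mechanics (create a job, set PHASE=RUN, poll the phase) follow the published IVOA pattern.

      ```python
      import time
      import requests

      BROKER = "https://broker.example.srcnet.org/uws"  # hypothetical endpoint
      HEADERS = {"Authorization": "Bearer <token from SRCNet IAM - assumption>"}

      # Create a job; UWS services answer with a redirect to the new job resource.
      # The job parameters are placeholders, as they are service-specific.
      resp = requests.post(f"{BROKER}/jobs",
                           data={"RUNID": "imaging-demo"},
                           headers=HEADERS, allow_redirects=False)
      job_url = resp.headers["Location"]

      # Start the job by setting its phase to RUN, then poll until it finishes.
      requests.post(f"{job_url}/phase", data={"PHASE": "RUN"}, headers=HEADERS)
      phase = "QUEUED"
      while phase not in ("COMPLETED", "ERROR", "ABORTED"):
          time.sleep(10)
          phase = requests.get(f"{job_url}/phase", headers=HEADERS).text.strip()
      print("job finished in phase", phase)
      ```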
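
      For the shared-cluster bullet, the point is that SRCNet batch work can run as ordinary Slurm jobs, distinguished only by the account (and possibly partition or QOS) they run under. A minimal sketch follows; the account and partition names are assumptions standing in for whatever the hosting cluster allocates to SKA/SRCNet users, and submitting via Python rather than a shell script is only to keep these examples in one language.

      ```python
      import subprocess
      import tempfile

      # A plain Slurm batch script; the --account and --partition values are
      # hypothetical, as is the analysis command on the final line.
      script = """#!/bin/bash
      #SBATCH --job-name=srcnet-demo
      #SBATCH --account=ska-srcnet
      #SBATCH --partition=shared
      #SBATCH --time=01:00:00
      #SBATCH --mem=8G
      srun python analyse.py /scratch/srcnet/obs123.ms
      """

      with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
          f.write(script)
          script_path = f.name

      # Hand the script to the scheduler; sbatch prints the new job id on success.
      subprocess.run(["sbatch", script_path], check=True)
      ```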

      The success of this exercise will be measured by the acceptance of an architecture paper by SRCNet Architecture group and the wider SRCNet community.

    People

      r.bolton Bolton, Rosie
      D.Watson Watson, Duncan

    Feature Progress

      Story Point Burn-up: (0%)

      Feature Estimate: 2.0

      To Do: 0 issues, 0.0 story points
      In Progress: 0 issues, 0.0 story points
      Complete: 0 issues, 0.0 story points
      Total: 0 issues, 0.0 story points
