Details

Type: Feature
Priority: Not Assigned
Fix Version/s: None
Component/s: SRCnet Federated Execution
Labels:
- data-lifecycle
- team_DAAC

ARTs:

SRCnet
Benefit hypothesis:

Looking towards SRCNet v0.2, we need to build consensus on common approaches for federating compute, and on how we use Rucio to move data between online and offline/nearline storage across SRCNet.

This work is likely to raise questions that are best answered by running tests on the SRCNet v0.1 setup during the next PI.

Acceptance criteria:

AC1: A Strategy / Design paper outlining best practice for managing distributed data and compute is created and agreed by the SRCNet Architecture Group
AC2: This paper is presented to, and accepted by, SRCNet participants.

Epic Link:
Distributed Data Computing v0.2 - Roadmap
Story Point Burn-up:
Overdue:

Requirement Status:

PI24 - UNCOVERED
Labels_MIRO:
data-lifecycle team_DAAC

Description

While creating the v0.1 node specification, it became clear that we need further work to build consensus around a system architecture that can support the implementation of the existing SRCNet architecture document.

The plan is to make specific concrete proposals we can discuss in the areas of:

Storage
- Define the storage tiers (building on the definition from the top level roadmap)
- Define how Rucio is expected to move ODPs and SDPs
- Define how Rucio can help track where data is currently accessible, where data is archived and could be made accessible, and how to request a change in what data is currently accessible
- Define how data is distributed to sites that do not have archive storage, or that have smaller amounts of online and/or scratch storage
- Link to higher-level discussions on metadata and data discovery, where global URIs can be resolved to a local location via Rucio.
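As a discussion aid for the storage items above, the following is a minimal sketch of how replica locations and a request to make archived data accessible might be modelled. It is not Rucio's API: the tier names, `Replica`, `ReplicaCatalogue`, and `request_online` are hypothetical names chosen for illustration, and the real tier definitions are still to be agreed as part of this work.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    # Hypothetical tier names; the real definitions are an output of this work.
    ONLINE = "online"
    NEARLINE = "nearline"
    SCRATCH = "scratch"


@dataclass(frozen=True)
class Replica:
    uri: str    # global identifier for the data product (assumed format)
    site: str   # site holding this copy
    tier: Tier  # storage tier of this copy


@dataclass
class ReplicaCatalogue:
    replicas: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # (uri, site) staging requests

    def accessible_sites(self, uri):
        """Sites where the data is on online storage right now."""
        return sorted({r.site for r in self.replicas
                       if r.uri == uri and r.tier is Tier.ONLINE})

    def archived_sites(self, uri):
        """Sites holding an archived copy that could be made accessible."""
        return sorted({r.site for r in self.replicas
                       if r.uri == uri and r.tier is Tier.NEARLINE})

    def request_online(self, uri, site):
        """Record a request to make `uri` accessible at `site`.

        Returns False if the data is already accessible at that site,
        or if no archived copy exists anywhere to stage from.
        """
        if site in self.accessible_sites(uri):
            return False
        if not self.archived_sites(uri):
            return False
        if (uri, site) not in self.pending:
            self.pending.append((uri, site))
        return True
```

In this sketch a site without archive storage can still request data, because the request only needs an archived copy to exist somewhere in the federation.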
Compute
- Services and Interactive Compute
  - Define how long-running services and APIs are made available at each site
  - For accessible data, define how to manage the resources needed to visualize that currently accessible data, ensuring the resources are available at a time that is convenient for those wanting to visualize the data.
  - Define how a user can request that a data set is made accessible for visualization, when it is archived and not currently accessible via online storage
- Discuss the management of "scratch" storage local to the compute
- Batch Compute
  - Discuss how users request the creation of ADPs and, once they are ready, get access to visualize those ADPs.
  - Discuss the adoption of the IOVA execution broker, or similar, to implement a global federation job queue.
  - Discuss how jobs can be run on pre-existing Slurm clusters shared with non SKA SRCNet users and workloads.
- Workflow Development
  - We will need to support CI/CD workflows to update and publish the services, interactive environments and batch compute templates
  - Direct access to development environments, with appropriate data access, to create new interactive and batch compute templates
  - SDP pipeline development, with appropriate access to visibilities
Resource accounting
- We need to document how we track resource usage
- We should document how we implement the recommendations for resource allocation, and the architectural constraints we need to place on that allocation process
Security and User Traceability
- This should not overlap with the ongoing policy work.
- Rather, we should set out the expectations on various parts of the system regarding the provenance of the binaries being executed and the traceability of which specific user is triggering specific resource usage
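To make the resource accounting and traceability items more concrete, here is a minimal sketch of a usage record that ties resource consumption to a specific user and to the binary that ran. All field names and units (`user_id`, `image_digest`, `cpu_core_hours`, `storage_tb_days`) are assumptions for illustration only; the actual accounting model is one of the things this work should define.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    user_id: str            # the specific user who triggered the usage
    site: str               # site where the resources were consumed
    image_digest: str       # identifies the binary/container that ran (provenance)
    cpu_core_hours: float   # assumed unit for compute usage
    storage_tb_days: float  # assumed unit for storage usage


def usage_by_user(records):
    """Sum compute and storage usage per user across all sites."""
    totals = defaultdict(lambda: {"cpu_core_hours": 0.0, "storage_tb_days": 0.0})
    for record in records:
        totals[record.user_id]["cpu_core_hours"] += record.cpu_core_hours
        totals[record.user_id]["storage_tb_days"] += record.storage_tb_days
    return dict(totals)
```

Keeping the user and the image digest on every record means the same data can serve both allocation reporting and after-the-fact tracing of who ran what.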

The success of this exercise will be measured by the acceptance of an architecture paper by the SRCNet Architecture Group and the wider SRCNet community.

Attachments

Issue Links

Child Of

SP-4781 Distributed Data Computing v0.2 - Roadmap

Funnel

is cloned by

SP-4421 Document consensus around Distributed Compute and Storage

Discarded

Structure

Activity

People

Assignee: Bolton, Rosie

Reporter: Watson, Duncan

Votes: 0

Watchers: 4

Feature Progress

Story Point Burn-up: (0%)

Feature Estimate: 0.0

	Issues	Story Points
To Do	0	0.0
In Progress	0	0.0
Complete	0	0.0
Total	0	0.0

Dates

Created: 02/May/24 8:30 AM

Updated: 1 week ago 11:35 PM

System Architecture Design for Distributed Compute and Storage