Details

Type: Feature
Priority: Not Assigned
Fix Version/s: None
Component/s: SRCnet Science Enabling
Labels:
None

ARTs:

SRCnet
Benefit hypothesis:
To use STARS on v0.1 systems and beyond, we need to improve it so that it can run on HPC-like environments:

Make it easier to run on systems whose worker nodes do not have Internet access (common in HPC environments).

Since runs can be quite long, a single failing workload should not stop the whole run.

Match workloads to the underlying resources more closely (for example, run GPU workloads on GPUs).
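Two of the points above (worker nodes without Internet access, and a failing workload not stopping the run) can be illustrated with a minimal sketch. It assumes a two-phase layout in which a setup phase stages artifacts into a local cache and a run phase reads only from that cache. The function names (`setup_phase`, `run_phase`, `fake_fetch`, `fake_execute`) are hypothetical and do not reflect how STARS or ReFrame are actually structured.

```python
import tempfile
from pathlib import Path


def setup_phase(workloads, cache_dir, fetch):
    """Run where Internet access exists: stage every artifact into cache_dir."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for name in workloads:
        (cache_dir / name).write_bytes(fetch(name))


def run_phase(workloads, cache_dir, execute):
    """Run on worker nodes: read only from cache_dir and isolate failures."""
    results = {}
    for name in workloads:
        try:
            payload = (Path(cache_dir) / name).read_bytes()
            results[name] = ("ok", execute(name, payload))
        except Exception as exc:
            # Record the failure and carry on with the remaining workloads.
            results[name] = ("failed", repr(exc))
    return results


# Hypothetical stand-ins for real downloads and real workload execution.
def fake_fetch(name):
    return f"image-for-{name}".encode()


def fake_execute(name, payload):
    if name == "flaky":
        raise RuntimeError("simulated workload failure")
    return len(payload)


with tempfile.TemporaryDirectory() as cache:
    setup_phase(["a", "flaky", "b"], cache, fake_fetch)
    results = run_phase(["a", "flaky", "b"], cache, fake_execute)
```

In this sketch the `flaky` workload is recorded as failed while `a` and `b` still complete, and nothing in `run_phase` needs network access.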
Epic Link:
Distributed Data Computing v0.2 - Roadmap
Story Point Burn-up:
Overdue:

Requirement Status:

PI24 - UNCOVERED

Description

Some of this is duplicated in SP-4287, so this ticket needs to be stripped back to focus only on why we would investigate ReFrame (Alex).

Going forward, STARS needs to be improved to address the following:

Workloads are becoming more varied: some are CPU-only, some are GPU-only, some are short, and some are long. We need a way to select a subset of these where appropriate (possibly by matching them to the underlying resources).
Setup and running need to be properly separated, particularly for environments where worker nodes do not have Internet access.
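As an illustration of the workload selection described above, here is a minimal sketch that filters workloads by the resources a target system offers and, optionally, by duration. The `Workload` descriptor, its fields, and the catalogue entries are assumptions made for this example only; they are not taken from the STARS code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload descriptor; STARS may describe workloads differently."""
    name: str
    requires: frozenset = frozenset()  # resource tags, e.g. {"gpu"}
    duration: str = "short"            # "short" or "long"


def select_workloads(workloads, available, max_duration=None):
    """Return the workloads whose required resources are all available."""
    available = frozenset(available)
    selected = []
    for w in workloads:
        if not w.requires <= available:
            continue  # e.g. skip GPU-only workloads on a CPU-only partition
        if max_duration == "short" and w.duration != "short":
            continue  # skip long workloads when only short ones are wanted
        selected.append(w)
    return selected


# Hypothetical catalogue used only to exercise the selection logic.
CATALOGUE = [
    Workload("cpu-short", frozenset({"cpu"}), "short"),
    Workload("cpu-long", frozenset({"cpu"}), "long"),
    Workload("gpu-short", frozenset({"gpu"}), "short"),
]

cpu_only = [w.name for w in select_workloads(CATALOGUE, {"cpu"})]
gpu_node_short = [
    w.name
    for w in select_workloads(CATALOGUE, {"cpu", "gpu"}, max_duration="short")
]
```

On a CPU-only target this selects the two CPU workloads, while on a target with both CPUs and GPUs and a short-only limit it selects `cpu-short` and `gpu-short`.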

The main part of the work here would involve evaluating whether it would be worth rewriting STARS with ReFrame, since ReFrame can address or simplify some of the points outlined above. As part of this work, we need to answer the following specific questions:

Why would it benefit the STARS code?

Why use ReFrame and not something else?

What are the goals in mind and why do we need ReFrame to achieve them?

Is this actually how we are going to be using STARS?

Is this something to consider in the next 3-6 months, or is it a longer-term goal for which we do not yet know what direction to take?

ReFrame documentation: https://reframe-hpc.readthedocs.io/en/stable/

Issue Links

Child Of

SP-4781 Distributed Data Computing v0.2 - Roadmap

People

Assignee: Clarke, Alex

Reporter: Clarke, Alex

Votes: 0

Watchers: 1

Feature Progress

Story Point Burn-up: (0%)

Feature Estimate: 0.0

	Issues	Story Points
To Do	0	0.0
In Progress	0	0.0
Complete	0	0.0
Total	0	0.0

Dates

Created: 08/Aug/24 2:08 PM

Updated: 6 days ago 2:11 AM

Investigate STARS improvements for HPC/SRCNet with regards to ReFrame