SP-4545 (SAFe Program)

Investigate STARS improvements for HPC/SRCNet with regards to ReFrame


Details

    • SRCnet

      To use STARS on v0.1 systems and beyond, we need to improve it so that it can run in HPC-like environments:

      • Make it easier to run on systems whose worker nodes do not have Internet access (also common in HPC environments).
      • Since runs can be quite long, a single failing workload should not stop the entire run.
      • Match workloads to the underlying resources more closely (for example, running GPU workloads on GPUs).
    • PI24 - UNCOVERED

    Description

      Some of this is duplicated in SP-4287, so we need to strip this one back to focus only on why we would investigate ReFrame (Alex).

      Going forward, STARS needs to be improved to address the following:

      • Workloads are becoming more varied: some are CPU-only, some are GPU-only, some are short, some are long. We need a way to select a subset of these where appropriate (possibly by matching the underlying resources).
      • Setup and running need to be properly separated, particularly for environments where worker nodes do not have Internet access.
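
To make the selection requirement concrete, together with the requirement that one failing workload must not stop a long run, here is a minimal sketch in plain Python. It does not use ReFrame's API or the existing STARS code; the names `Workload`, `select`, `run_all` and the resource tags `"cpu"` and `"gpu"` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Workload:
    """Hypothetical description of one workload (not a STARS or ReFrame type)."""
    name: str
    run: Callable[[], None]            # the run phase, assumed to need no Internet access
    needs: frozenset = frozenset()     # resource tags the workload requires, e.g. {"gpu"}


def select(workloads, available):
    """Keep only the workloads whose required resources are all available."""
    available = frozenset(available)
    return [w for w in workloads if w.needs <= available]


def run_all(workloads):
    """Run every workload and record failures instead of aborting the whole run."""
    results = {}
    for w in workloads:
        try:
            w.run()
            results[w.name] = "passed"
        except Exception as exc:       # isolate the failure to this one workload
            results[w.name] = f"failed: {exc}"
    return results


def ok():
    pass


def broken():
    raise RuntimeError("boom")


workloads = [
    Workload("cpu_short", ok, frozenset({"cpu"})),
    Workload("gpu_long", ok, frozenset({"gpu"})),
    Workload("cpu_broken", broken, frozenset({"cpu"})),
]

# On a CPU-only system the GPU workload is skipped, and the broken
# workload is reported as failed without stopping the others.
selected = select(workloads, {"cpu"})
results = run_all(selected)
print(results)
```

Part of the evaluation would be to check whether ReFrame provides equivalent selection and failure handling out of the box, in which case logic like this would not need to be written or maintained in STARS.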

      The main part of the work here is to evaluate whether it would be worth rewriting STARS with ReFrame, since it can address or simplify some of the points outlined above. As part of this work, there are specific questions that we need to answer:

      Why would it benefit the STARS code?

      Why use ReFrame and not something else?

      What are the goals in mind and why do we need ReFrame to achieve them?

      Is this actually how we are going to be using STARS?

      Is this something to consider in the next 3-6 months, or is this functionality a longer-term goal for which we do not yet know what direction to take?

      ReFrame documentation: https://reframe-hpc.readthedocs.io/en/stable/

              People

                A.Clarke Clarke, Alex

                Feature Progress

                  Story Point Burn-up: (0%)

                  Feature Estimate: 0.0

                                Issues   Story Points
                  To Do         0        0.0
                  In Progress   0        0.0
                  Complete      0        0.0
                  Total         0        0.0
