SAFe Program / SP-4084

Enabling Kubernetes Workloads on Traditional HPC Resources: Challenges and Solutions


Details

    • SRCnet

      Most HPC sites (e.g. CSCS in Switzerland) do not offer K8s interfaces for running workloads on their HPC resources. The solution explored in this feature will help integrate cloud-native software, such as the Science Platform, with traditional HPC centers.

      A prototype on CHSRC could be useful, since it may allow us to prototype an integration between Kubernetes and FirecREST, an existing solution that abstracts away the underlying batch system and therefore avoids tying the integration to a specific batch system such as Slurm.


      AC1: Documentation in a Confluence page describing the approach taken to tackle this challenge, specifying:

      • What was achieved, especially in terms of the level of integration between the Science Platform and the batch system.
      • The main roadblocks (e.g. architectural, missing features in the software involved), if any. Each roadblock should be accompanied by an estimated plan of what would need to be done (with estimated cost/complexity) in order to achieve full integration.

      AC2: (Optional) A demo of the integration work if successful, or a description of what was tried and what's missing if the integration is not successful.

    • 3
    • 3
    • 0
    • Team_CORAL
    • Sprint 5

      AC1: Documentation in a Confluence page describing the approach taken to tackle this challenge, specifying:

      • What was achieved, especially in terms of the level of integration between the Science Platform and the batch system.
      • The main roadblocks (e.g. architectural, missing features in the software involved), if any. Each roadblock should be accompanied by an estimated plan of what would need to be done (with estimated cost/complexity) in order to achieve full integration.

      https://confluence.skatelescope.org/display/SRCSC/COR-620+Document+BridgeOperator+Spike

       

      AC2: (Optional) A demo of the integration work if successful, or a description of what was tried and what's missing if the integration is not successful.

      Live/recorded demo in System Demo, 30th May:

      • Recording: https://confluence.skatelescope.org/pages/viewpage.action?pageId=265846048
      • Slides: https://docs.google.com/presentation/d/1t2NJsERTr8_0G3MqRzc2shNsJeLIBItw233d6xyhzO4/edit?usp=sharing
    • 24.3
    • Stories Completed, Outcomes Reviewed
    • science-platform-services

    Description

      Some applications and tools are designed to be cloud-native and assume that infrastructure is available through Kubernetes interfaces. The container-based Science Platform developed by CADC is one such example: it treats all workloads as Kubernetes-native Jobs. This presents a challenge when the HPC resources are not part of the Kubernetes infrastructure but belong to a separate system.
      This work aims to make these cloud-native applications interoperable with HPC batch systems that are not Kubernetes-based.

      Tools like https://github.com/IBM/Bridge-Operator achieve a loose coupling between Kubernetes-based systems (such as the CANFAR Science Platform) and separate batch systems.

      In this feature, we will explore this approach, starting with the Bridge Operator in particular but possibly migrating to other solutions if needed.

      The goal is to understand how to build an interface that allows K8s-based applications to run their containerised services on HPC batch systems. We will use CANFAR to guide this exploration and to provide a practical view of how to build this interface.
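
      As a rough illustration of the loose coupling described above, the sketch below mirrors a Kubernetes-style job onto an external batch system through a small backend interface. It is not the Bridge Operator's or FirecREST's actual API: `BatchBackend`, `FakeBackend`, `BridgedJob`, `reconcile` and the `run-container` launcher are all hypothetical names chosen for this example, and the state mapping is an assumption based on Slurm-style state names.

```python
from dataclasses import dataclass, field
from typing import Dict, Protocol


class BatchBackend(Protocol):
    """Hypothetical interface; a real one might wrap FirecREST or Slurm."""

    def submit(self, script: str) -> str: ...
    def state(self, job_id: str) -> str: ...


@dataclass
class FakeBackend:
    """In-memory stand-in for a batch system, for illustration only."""

    _states: Dict[str, str] = field(default_factory=dict)

    def submit(self, script: str) -> str:
        job_id = str(len(self._states) + 1)
        self._states[job_id] = "PENDING"
        return job_id

    def state(self, job_id: str) -> str:
        return self._states[job_id]

    def advance(self, job_id: str, new_state: str) -> None:
        # Test helper: simulate the batch system changing the job state.
        self._states[job_id] = new_state


# Slurm-style state names mapped to Kubernetes-style phases (assumed mapping).
PHASES = {
    "PENDING": "Pending",
    "RUNNING": "Running",
    "COMPLETED": "Succeeded",
    "FAILED": "Failed",
    "CANCELLED": "Failed",
    "TIMEOUT": "Failed",
}


@dataclass
class BridgedJob:
    """Kubernetes-side view of a job that actually runs on the batch system."""

    image: str
    command: str
    external_id: str = ""
    phase: str = "Pending"


def reconcile(job: BridgedJob, backend: BatchBackend) -> BridgedJob:
    """One reconcile step: submit if needed, then mirror the remote state."""
    if not job.external_id:
        # "run-container" is a placeholder for whatever launcher the site uses.
        script = f"#!/bin/bash\nrun-container {job.image} {job.command}\n"
        job.external_id = backend.submit(script)
    job.phase = PHASES.get(backend.state(job.external_id), "Unknown")
    return job
```

      In this sketch the Kubernetes side only ever talks to the backend interface, so swapping the in-memory fake for a FirecREST-backed or Slurm-backed implementation would not change `reconcile`. A real operator would call such a step repeatedly from its control loop.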

              People

                r.bolton Bolton, Rosie
                P.Llopis Llopis, Pablo

                Feature Progress

                  Story Point Burn-up: (100.00%)

                  Feature Estimate: 3.0

                                Issues   Story Points
                  To Do         0        0.0
                  In Progress   0        0.0
                  Complete      3        9.0
                  Total         3        9.0
