SAFe Program / SP-3736

Implementing Secure Authentication for SRCNet Nodes in Shared HPC Environments


Details

    • SRCnet
    • The work from this feature will facilitate the usage of shared HPC services in a secure and efficient way.
    • AC1: Produce documentation that addresses:
        1. The problem, including a description of the context in which this issue is relevant.
        2. The approach taken to find a solution. Where multiple solutions are available, the trade-offs between them.
        3. The chosen solution, described in a way that enables reproducibility.
    • AC2: Demonstrate a solution that allows a user session (such as JupyterHub, Dask, Nextflow) to be spawned on a shared HPC service in a secure way.
    • 2
    • 2
    • 0
    • Team_CORAL
    • Sprint 5
    • AC1: Produce documentation
      • This documentation will be populated during PI21 (Issues). CoP on HPC and Cloud: https://confluence.skatelescope.org/pages/viewpage.action?pageId=251371898
    • AC2: Demo
      • Demo was not presented in SRCNet ART. Created a CoP on HPC.
    • PI23 - UNCOVERED

    • PI20-PB PI21 PI21-PB

    Description

      In some cases, an SRCNet node will not own the HPC compute and storage resources that it runs on, but will be one tenant among many sharing these resources. This pattern is common in large HPC centres, where the hosting organisation chooses a sharing model (often involving a batch system) for its HPC services because this maximises resource utilisation. As a result, the complete SRCNet stack (from SRCNet services down to the HPC services and hardware) is not owned by a single organisation.

      At the same time, SRCNet operators will deploy services that need to talk to the batch system, for instance to spawn jobs as part of a JupyterHub or Dask session. These services therefore need a way to authenticate to the batch system in order to perform actions such as job submission. Slurm, a widely used HPC batch system, is one example of a batch system that requires authentication for job submission.

      This poses an issue because most tools that submit to a batch system (Nextflow, JupyterHub, Dask, etc.) assume that they already hold a key that authenticates them to it. That is, they assume a single security domain (e.g. SRCNet operators own all the infrastructure), whereas the case described here involves two security domains (SRCNet on one hand, and the HPC services on the other).

      Additional technical details:
      For instance, Slurm supports authentication via either:

      • a shared symmetric MUNGE key
      • JSON Web Tokens (JWT)

      Sharing the MUNGE key with the SRCNet services poses a security risk: anybody holding the MUNGE key can act as root on the whole batch system, affecting not just SRCNet resources but the entire HPC service. HPC service operators will therefore want to avoid sharing the MUNGE key with the operators running SRCNet services.

      On the other hand, Slurm doesn't support OIDC tokens (Slurm's use of JWT isn't interoperable with the OIDC tokens used in SRCNet).
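
      To make the JWT option concrete, the sketch below builds (without sending) a job submission request for Slurm's REST daemon, slurmrestd, which accepts a per-user token in the `X-SLURM-USER-NAME` and `X-SLURM-USER-TOKEN` headers. The base URL, API version segment, job name and job fields are assumptions for illustration only; the exact payload schema depends on the slurmrestd version deployed at the site.

```python
import json
import urllib.request

# Assumptions (not from this ticket): placeholder host, port and API version.
SLURMRESTD_URL = "https://slurm.example.org:6820"
API_VERSION = "v0.0.40"


def build_submit_request(user: str, token: str, script: str) -> urllib.request.Request:
    """Build, but do not send, a slurmrestd job submission request."""
    payload = {
        "script": script,
        "job": {
            # Hypothetical job description; field names vary by API version.
            "name": "srcnet-session",
            "current_working_directory": f"/home/{user}",
            "environment": ["PATH=/usr/bin:/bin"],
        },
    }
    return urllib.request.Request(
        url=f"{SLURMRESTD_URL}/slurm/{API_VERSION}/job/submit",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # The token only authenticates this one user, unlike the MUNGE key.
            "X-SLURM-USER-NAME": user,
            "X-SLURM-USER-TOKEN": token,
        },
    )
```

      A service such as a JupyterHub spawner would pass the resulting request to `urllib.request.urlopen`. The point of the sketch is that the service only needs a short-lived, per-user token rather than the cluster-wide MUNGE key.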

      Therefore, a solution is needed in order to operate in such environments in a secure and efficient way. Solutions can range from Slurm-specific (obtaining Slurm JWTs from OIDC tokens and injecting them into the user sessions) to more generic (Compute API? FirecREST?) but potentially costlier ones (developing a new interface for the use cases that submit jobs).
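
      The Slurm-specific option can be sketched as a small exchange service that runs inside the HPC security domain. This is an illustrative sketch, not the chosen solution: it assumes the OIDC token has already been verified by a proper OIDC library, that the site maintains a mapping from OIDC subjects to HPC accounts (`account_map` is hypothetical), and that Slurm is configured for HS256 JWT authentication, where the `sun` claim carries the Slurm user name. The signing key never leaves the HPC side.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def map_to_local_user(oidc_claims: dict, account_map: dict) -> str:
    """Map already-verified OIDC claims to an HPC account name.

    Signature and issuer checks are assumed to have been done elsewhere.
    """
    if oidc_claims.get("exp", 0) <= time.time():
        raise PermissionError("OIDC token has expired")
    subject = oidc_claims["sub"]
    if subject not in account_map:
        raise PermissionError(f"no HPC account mapped for subject {subject!r}")
    return account_map[subject]


def mint_slurm_jwt(username: str, key: bytes, lifespan_s: int = 600) -> str:
    """Create a short-lived HS256 JWT naming the Slurm user in 'sun'."""
    now = int(time.time())
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"iat": now, "exp": now + lifespan_s, "sun": username}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

      The resulting token would then be injected into the user session (for example as an environment variable read by the job submission client). Sites that prefer not to handle the signing key in custom code could instead have the exchange service call `scontrol token` on the user's behalf.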

      The Swiss SRC node is one such node with these restrictions, and making progress in SRCNet prototyping will involve resolving this issue. We expect other sites to face the same issue sooner or later, especially those using shared HPC resources.

              People

                r.bolton Bolton, Rosie
                P.Llopis Llopis, Pablo

                Feature Progress

                  Story Point Burn-up: (100.00%)

                  Feature Estimate: 2.0

                  Status        Issues  Story Points
                  To Do         0       0.0
                  In Progress   0       0.0
                  Complete      5       4.0
                  Total         5       4.0
