Description
In some cases, SRCNet nodes will not own the HPC compute and storage resources that they run on, but will instead be one tenant among many sharing those resources. This pattern is common in large HPC centers, where the organisation hosting the HPC resources chooses a sharing model (often involving a batch system) for HPC services, since this maximises resource utilisation. As a result, the complete SRCNet stack (from SRCNet services down to the HPC services and hardware) is not owned by a single organisation.
At the same time, SRCNet operators will deploy services that talk to the batch system, for instance to spawn jobs as part of a JupyterHub or Dask session. These services therefore need a way to authenticate to the batch system in order to perform actions such as job submission. Slurm, a batch system widely used in HPC, is one example that requires such authentication for submissions.
This poses an issue because most solutions that make batch system submissions (Nextflow, JupyterHub, Dask, etc.) assume that they already hold a credential that authenticates them to the batch system. In other words, most solutions assume a single security domain (e.g. SRCNet operators owning all the infrastructure), whereas the case described here involves two security domains (SRCNet on one hand, HPC services on the other).
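To illustrate this assumption, consider how a tool like Dask-jobqueue drives Slurm: it simply shells out to `sbatch` on the host it runs on, presuming that host already sits inside the batch system's security domain. A minimal sketch (the queue and resource values are placeholders, not SRCNet settings):

```python
# Minimal sketch of Dask-jobqueue targeting Slurm. It assumes the host it
# runs on can already call `sbatch` successfully, i.e. that it holds valid
# batch-system credentials (for Slurm, typically the munge key).
# Queue/resource values below are placeholders.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    queue="normal",        # placeholder partition name
    cores=4,
    memory="8GB",
    walltime="01:00:00",
)
cluster.scale(jobs=2)      # triggers `sbatch` calls under the hood
client = Client(cluster)
```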
Additional technical details:
For instance, in Slurm you can authenticate via either:
- shared symmetric munge key
- JSON Web Tokens (JWT); a usage sketch follows below
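For illustration, a minimal sketch of the JWT path, assuming a deployment with auth/jwt enabled and slurmrestd running; the username, URL and REST API version below are placeholders that vary per site:

```python
# Minimal sketch of Slurm's JWT authentication path. Assumes auth/jwt is
# enabled and slurmrestd is reachable; the user, URL and API version are
# placeholders.
import subprocess
import requests

# A privileged account (root/SlurmUser) mints a token for a given user:
out = subprocess.run(
    ["scontrol", "token", "username=alice", "lifespan=3600"],
    capture_output=True, text=True, check=True,
).stdout.strip()
token = out.split("SLURM_JWT=", 1)[1]  # scontrol prints "SLURM_JWT=<token>"

# The token is then presented per request to slurmrestd:
resp = requests.get(
    "http://slurmctld.example.org:6820/slurm/v0.0.39/jobs",  # placeholder
    headers={"X-SLURM-USER-NAME": "alice", "X-SLURM-USER-TOKEN": token},
)
resp.raise_for_status()
```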
Sharing the munge key with the SRCNet services poses a security risk: anybody holding the munge key can effectively act as root on the whole batch system, affecting not just SRCNet resources but the entire HPC service. HPC service operators will therefore want to avoid this risk by not sharing the munge key with the operators running SRCNet services.
On the other hand, Slurm does not support OIDC tokens: its JWTs are signed with keys Slurm itself trusts and carry Slurm-specific claims, so they are not interoperable with the OIDC tokens used in SRCNet.
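To make the mismatch concrete, the following sketch fabricates two tokens with the typical claim shapes and prints them side by side. The claim values are invented, and HS256 is used only to keep the demo self-contained (real OIDC tokens are typically RS256-signed by the IdP):

```python
# Compare the claim shapes of an OIDC-style token and a Slurm-style token.
# All values are invented for illustration.
import time
import jwt  # PyJWT

now = int(time.time())
oidc_like = jwt.encode(
    {"iss": "https://iam.srcnet.example/", "sub": "user-uuid-1234",
     "aud": "srcnet-client", "iat": now, "exp": now + 3600,
     "scope": "openid profile"},
    "idp-secret", algorithm="HS256")
slurm_like = jwt.encode(
    {"sun": "alice", "iat": now, "exp": now + 3600},  # sun = Slurm user name
    "slurm-secret", algorithm="HS256")

for token in (oidc_like, slurm_like):
    print(jwt.decode(token, options={"verify_signature": False}))
# The issuers, signing keys and claims differ, so slurmctld will not
# accept the OIDC-style token.
```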
Therefore, a solution will be needed in order to operate in such environments in a secure and efficient way. Solutions can range from Slurm-specific (obtaining Slurm JWTs from OIDC tokens and injecting them into the user sessions, roughly sketched below) to more generic (a Compute API? FirecREST?) but potentially costlier ones (developing a new interface for the use cases that make job submissions).
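As a rough sketch of the Slurm-specific option: a small bridge service validates the incoming OIDC token against the IAM's published keys, maps the federated identity onto a local account, and mints a short-lived Slurm JWT for it. The issuer, audience and user mapping below are hypothetical, and `scontrol token` must run with root/SlurmUser privileges:

```python
# Hedged sketch of an OIDC -> Slurm JWT bridge. Issuer/audience/mapping
# are hypothetical, and `scontrol` must run as root/SlurmUser.
import subprocess
import jwt  # PyJWT

ISSUER = "https://iam.srcnet.example/"                # hypothetical issuer
JWKS_URL = ISSUER + "protocol/openid-connect/certs"   # hypothetical JWKS URL
AUDIENCE = "srcnet-compute"                           # hypothetical audience
USER_MAP = {"user-uuid-1234": "alice"}                # hypothetical mapping

def oidc_to_slurm_jwt(oidc_token: str, lifespan: int = 3600) -> str:
    # 1. Verify the OIDC token signature against the IdP's published keys.
    signing_key = jwt.PyJWKClient(JWKS_URL).get_signing_key_from_jwt(oidc_token)
    claims = jwt.decode(oidc_token, signing_key.key, algorithms=["RS256"],
                        audience=AUDIENCE, issuer=ISSUER)
    # 2. Map the federated identity onto a local batch-system account.
    local_user = USER_MAP[claims["sub"]]
    # 3. Mint a Slurm JWT for that account (privileged operation).
    out = subprocess.run(
        ["scontrol", "token", f"username={local_user}", f"lifespan={lifespan}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return out.split("SLURM_JWT=", 1)[1]
```

The returned token could then be injected into the user's JupyterHub or Dask session as the SLURM_JWT environment variable.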
The Swiss SRC node is one node with these restrictions, and making progress in SRCNet prototyping will require resolving this issue. We expect other sites to face the same issue sooner or later, especially those using shared HPC resources.