  1. SAFe Program
  2. SP-3723

Knowledge gathering spike about Workload Management Systems

Details

    • SRCnet

      By studying existing WMS solutions and how they are used in established projects such as the WLCG experiment deployments (ATLAS, CMS, ALICE, LHCb), this exploratory work can help us identify solutions that are both scalable and suited to the specific requirements of SRCNet. It should improve our understanding of existing solutions (their strengths and weaknesses for SRCNet purposes, and how well they fit within our existing architecture) and give us an initial sense of the amount of work required to implement a prototype.

       


      Gather findings into at least one document (such as a Confluence page or a Google Doc) that details the following aspects for at least two WMS:

      • We must identify at least three existing workload management systems that are suitable for the SRCNet.
      • We must gather information about the following aspects of each system:
        • Assumptions made for WLCG use cases that may differ from SRCNet use cases. This may include, but isn't limited to, the WLCG data model, and the requirements placed on the data management system.
        • Level of maturity regarding OIDC token support.
        • Architectural aspects:
          • Architecture diagram detailing the services that run in the ecosystem (central services, local site services, enabling services)
          • Comment on how well it fits with existing SRCNet components, especially the SRCNet compute and data APIs.
          • Comment on how well the WMS solutions could be made to work for astronomy use cases, such as those requiring IVOA interfaces.
        • Ease of deployment: how many moving parts there are, how easy they are to set up and configure, and how modular the deployment design is.
        • Receptiveness/willingness of the maintainers to assist us with:
          • prototyping, learning and getting started
          • collaboration in driving the technology forward jointly in the future (bug fixing, feature contributions, etc).
        • How easy would it be to fix and contribute improvements to the project? (including code/language, project lifecycle, contribution methodology).
    • Team_CORAL
    • Sprint 5
    • Document: https://docs.google.com/document/d/18M6CyKSe6Qes8DtxMyWAGNZF3KVp8S36I9V9IRkBsZE/edit#heading=h.cme6y2nk4ifx
    • Demo: https://confluence.skatelescope.org/display/SRCSC/2023-12-14+SRC+ART+System+Demo+21.1+Part+1+AM
    • 21.1
    • Stories Completed, Demonstrated, Satisfies Acceptance Criteria, Accepted by FO
    • PI20-PB

    Description

      The Workload Management System (WMS) is a fundamental part of a scientific compute network such as the SRCNet. It is essential for job submission, job scheduling, job monitoring, and job accounting. It also benefits both users and tools by providing a unified submission interface and scalability, and by keeping job submission reliable, efficient, and secure at SRCNet scale.
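      To make the four roles named above (submission, scheduling, monitoring, accounting) concrete, here is a minimal Python sketch of what a unified submission interface could look like. Everything in it is an assumption made for illustration: `WorkloadManager`, `InMemoryWMS`, `Job`, `JobState` and `run_next` are hypothetical names, and none of this is the API of PanDA, GlideinWMS/HTCondor, AliEn, DIRAC, or any SRCNet component.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import itertools


class JobState(Enum):
    QUEUED = "queued"
    DONE = "done"


@dataclass
class Job:
    job_id: int
    owner: str
    command: str
    state: JobState = JobState.QUEUED
    cpu_seconds: float = 0.0  # recorded when the job finishes (accounting)


class WorkloadManager(ABC):
    """Hypothetical unified interface; not the API of any real WMS."""

    @abstractmethod
    def submit(self, owner: str, command: str) -> int:
        """Job submission: accept a job and return its identifier."""

    @abstractmethod
    def status(self, job_id: int) -> JobState:
        """Job monitoring: report the current state of a job."""

    @abstractmethod
    def usage(self, owner: str) -> float:
        """Job accounting: total CPU seconds consumed by an owner."""


class InMemoryWMS(WorkloadManager):
    """Toy backend: FIFO scheduling with caller-supplied job cost."""

    def __init__(self) -> None:
        self._jobs = {}  # job_id -> Job, kept in submission order
        self._ids = itertools.count(1)

    def submit(self, owner: str, command: str) -> int:
        job = Job(next(self._ids), owner, command)
        self._jobs[job.job_id] = job
        return job.job_id

    def status(self, job_id: int) -> JobState:
        return self._jobs[job_id].state

    def run_next(self, cpu_seconds: float) -> Optional[int]:
        """Job scheduling: complete the oldest queued job, if any."""
        for job in self._jobs.values():
            if job.state is JobState.QUEUED:
                job.state = JobState.DONE
                job.cpu_seconds = cpu_seconds
                return job.job_id
        return None

    def usage(self, owner: str) -> float:
        return sum(j.cpu_seconds for j in self._jobs.values() if j.owner == owner)
```

      The point of the abstract class is that users and tools would code against one interface while the backend behind it can change; a real WMS would add authentication, data placement, and site selection, none of which is modelled here.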

      Some of the largest and most established workload management systems are those deployed for the WLCG experiments: ATLAS uses PanDA, CMS uses GlideinWMS/HTCondor, ALICE uses AliEn, and LHCb uses DIRAC.

      Given the relevance of the WMS in the SRCNet, it would be useful to study existing solutions. We could gather information about the following aspects:

      • Assumptions made for WLCG use cases that may differ from SRCNet use cases. This may include, but isn't limited to, the WLCG data model, and the requirements placed on the data management system.
      • Level of maturity regarding OIDC token support.
      • Architectural aspects:
        • Architecture diagram detailing the services that run in the ecosystem (central services, local site services, enabling services)
        • Comment on how well it fits with existing SRCNet components, especially the SRCNet compute and data APIs.
        • Comment on how well the WMS solutions could be made to work for astronomy use cases, such as those requiring IVOA interfaces.
      • Ease of deployment: how many moving parts there are, how easy they are to set up and configure, and how modular the deployment design is.
      • Receptiveness/willingness of the maintainers to assist us with:
        • prototyping, learning and getting started
        • collaboration in driving the technology forward jointly in the future (bug fixing, feature contributions, etc).
      • How easy would it be to fix and contribute improvements to the project? (including code/language, project lifecycle, contribution methodology).
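
      One way to keep the findings comparable across systems is to record each aspect listed above as a field of a single structure, so that unfilled aspects are easy to spot. The sketch below is only an illustration: the `WmsAssessment` class, its field names, and the "TBD" placeholder are assumptions, and the example note is not a finding from this spike.

```python
from dataclasses import dataclass, fields

TBD = "TBD"  # placeholder for an aspect that has not been investigated yet


@dataclass
class WmsAssessment:
    """One record per WMS; one field per aspect from the list above."""
    name: str
    wlcg_assumptions: str = TBD
    oidc_token_maturity: str = TBD
    architecture: str = TBD
    ease_of_deployment: str = TBD
    maintainer_receptiveness: str = TBD
    ease_of_contribution: str = TBD


def open_items(assessment):
    """Return the names of the aspects that are still marked TBD."""
    return [f.name for f in fields(assessment)
            if getattr(assessment, f.name) == TBD]


def comparison_rows(assessments):
    """Build a header row plus one row per WMS, ready to paste into a table."""
    header = [f.name for f in fields(WmsAssessment)]
    return [header] + [[getattr(a, h) for h in header] for a in assessments]


# Hypothetical usage: the note below is placeholder text, not a real result.
panda = WmsAssessment("PanDA", oidc_token_maturity="(notes go here)")
dirac = WmsAssessment("DIRAC")
rows = comparison_rows([panda, dirac])
```

      Calling `open_items(panda)` then lists the five aspects still to be filled in for that system.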

      The Coral and Magenta teams have the interest and ability to drive this work and to arrange conversations with the relevant technical people in the WLCG community.

              People

                Jesus.Salgado Salgado, Jesus
                P.Llopis Llopis, Pablo
                Votes: 0
                Watchers: 7

                Feature Progress

                  Story Point Burn-up: (100.00%)

                  Feature Estimate: 0.0

                                Issues   Story Points
                  To Do         0        0.0
                  In Progress   0        0.0
                  Complete      7        11.0
                  Total         7        11.0

