  1. SAFe Program
  2. SP-4266

Enable Job queueing for CANFAR Science Platform


Details

    • SRCnet
    • BH: For SRCNet to use the Canadian instance of CANFAR for v0.1 testing without impacting the science community, job queueing must be implemented within the CANFAR Science Platform.

      BH: This could allow managed use of all CANFAR deployments in SRCNet, optimizing resource usage so that science use and testing can run concurrently with interactive and batch workflows.
    • AC: Demonstrate submission and queue execution of multiple jobs, without impacting interactive users or overwhelming the system.
    • 3.5
    • 3.5
    • 0
    • Team_RED
    • Sprint 5
    • Demonstration: https://confluence.skatelescope.org/pages/viewpage.action?pageId=280401971
      Documentation: https://confluence.skatelescope.org/display/SRCSC/Kueue+in+CANFAR+Notes?src=contextnavpagetreemode
      Pull request: https://github.com/opencadc/science-platform/pull/665
    • 24.3
    • Stories Completed, Outcomes Reviewed
    • PI24 - UNCOVERED

    • SRC23-PB SRCNet0.1 science-platform-services

    Description

      This Feature builds on the lessons learned from SP-3885, with an emphasis on low-level scheduling at the Kubernetes level.

      This will involve having Kueue execute Jobs on the cluster and prioritize interactive Jobs over headless Jobs.
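As a rough illustration of how a Job is handed to Kueue, the sketch below builds a `batch/v1` Job manifest as a Python dict. It relies on Kueue's `kueue.x-k8s.io/queue-name` label and on the Job being created suspended. The Job name, image, and queue name (`skaha-queue`) are hypothetical and are not taken from the skaha code base.

```python
import json

# Label Kueue reads to decide which LocalQueue a Job is submitted to.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def queued_job(name, image, queue, cpu="1", memory="1Gi"):
    """Build a batch/v1 Job manifest that is submitted to the given Kueue queue."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": {QUEUE_LABEL: queue}},
        "spec": {
            # Created suspended; Kueue unsuspends the Job once it is admitted.
            "suspend": True,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "main",
                            "image": image,
                            # Resource requests are counted against the queue's quota.
                            "resources": {"requests": {"cpu": cpu, "memory": memory}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    job = queued_job("headless-example", "example/image:latest", "skaha-queue")
    print(json.dumps(job, indent=2))
```

The printed JSON can be applied with `kubectl apply -f -`, since Kubernetes accepts JSON manifests as well as YAML.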

      Single kueue

      An additional queue may be needed to support preemptive behaviour, so that interactive Jobs can launch with no wait time.
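For reference, one way Kueue expresses preemption is the `preemption` stanza on a ClusterQueue. The sketch below is an assumption about how that could look, not the configuration adopted for this Feature. It assumes the `kueue.x-k8s.io/v1beta1` API, and the queue, flavor, and namespace names and the quotas are hypothetical.

```python
import json

API_VERSION = "kueue.x-k8s.io/v1beta1"  # assumed Kueue API version


def cluster_queue(name, flavor, cpu_quota, memory_quota):
    """ClusterQueue that lets higher-priority workloads evict lower-priority ones."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            "namespaceSelector": {},  # empty selector: accept all namespaces
            "resourceGroups": [
                {
                    "coveredResources": ["cpu", "memory"],
                    "flavors": [
                        {
                            "name": flavor,
                            "resources": [
                                {"name": "cpu", "nominalQuota": cpu_quota},
                                {"name": "memory", "nominalQuota": memory_quota},
                            ],
                        }
                    ],
                }
            ],
            # Pending workloads may preempt admitted workloads of lower priority.
            "preemption": {"withinClusterQueue": "LowerPriority"},
        },
    }


def local_queue(name, namespace, cluster_queue_name):
    """Namespaced LocalQueue that forwards Jobs to the ClusterQueue."""
    return {
        "apiVersion": API_VERSION,
        "kind": "LocalQueue",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"clusterQueue": cluster_queue_name},
    }


if __name__ == "__main__":
    # Hypothetical names and quotas, for illustration only.
    cq = cluster_queue("skaha-cluster-queue", "default-flavor", 16, "64Gi")
    lq = local_queue("skaha-queue", "skaha-workload", "skaha-cluster-queue")
    print(json.dumps([cq, lq], indent=2))
```

With `withinClusterQueue` set to `LowerPriority`, a pending interactive workload could evict an admitted headless one when quota is exhausted, provided the two carry different priorities.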

      The Job description properties/labels are already supported by the skaha Jobs.

      This will require Helm chart changes.

      Dynamic queue creation at deployment time is out of scope for this story.

       

      There are 2 possible approaches to ensure that interactive jobs have resource priority over batch jobs:

      1. Use a Workload Priority Class in Kueue to prioritise jobs. This requires at least 2 workload priority classes to be defined, each with a priority value; a higher value takes priority over a lower one. Example: a value of 90000 outranks 50000. Constraint: all jobs must be scheduled on a single queue.
      2. Schedule headless jobs using Kueue while leaving interactive jobs to the native Kubernetes scheduler. The resulting behaviour still needs to be observed.
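The two approaches can be sketched as the Kueue labels each kind of Job would carry. This is a minimal sketch, assuming the `kueue.x-k8s.io/v1beta1` API and Kueue's default behaviour of ignoring Jobs that have no queue-name label. The class names and the `kueue_labels` helper are hypothetical; the values 90000 and 50000 come from the example above.

```python
API_VERSION = "kueue.x-k8s.io/v1beta1"  # assumed Kueue API version
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"
PRIORITY_LABEL = "kueue.x-k8s.io/priority-class"


def workload_priority_class(name, value, description=""):
    """Cluster-scoped WorkloadPriorityClass; a higher value means higher priority."""
    return {
        "apiVersion": API_VERSION,
        "kind": "WorkloadPriorityClass",
        "metadata": {"name": name},
        "value": value,
        "description": description,
    }


# Hypothetical class names; the values follow the 90000 / 50000 example.
PRIORITY_CLASSES = {
    "interactive": workload_priority_class("interactive-priority", 90000),
    "headless": workload_priority_class("headless-priority", 50000),
}


def kueue_labels(job_kind, queue, approach):
    """Return the Kueue labels a Job of `job_kind` carries under each approach."""
    if job_kind not in PRIORITY_CLASSES:
        raise ValueError(f"unknown job kind: {job_kind!r}")
    if approach == 1:
        # Approach 1: every Job goes to the same queue, ranked by priority class.
        class_name = PRIORITY_CLASSES[job_kind]["metadata"]["name"]
        return {QUEUE_LABEL: queue, PRIORITY_LABEL: class_name}
    if approach == 2:
        # Approach 2: only headless Jobs are queued; interactive Jobs carry no
        # Kueue labels and go straight to the native Kubernetes scheduler.
        return {QUEUE_LABEL: queue} if job_kind == "headless" else {}
    raise ValueError(f"unknown approach: {approach!r}")


if __name__ == "__main__":
    for approach in (1, 2):
        for kind in ("interactive", "headless"):
            print(approach, kind, kueue_labels(kind, "skaha-queue", approach))
```

Under approach 2, Kueue's quota accounting does not include the interactive Jobs. This may be part of why the behaviour needs to be observed.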

              People

                Robert.Perry Perry, Robert
                s.goliath sharon goliath

                Feature Progress

                  Story Point Burn-up: (100.00%)

                  Feature Estimate: 3.5

                  IssuesStory Points
                  To Do00.0
                  In Progress   00.0
                  Complete1127.0
                  Total1127.0
