SAFe Program / SP-4282

Resource Allocation OpenStack: help users pick when the cloud has space


Details

    • SRCnet

      When we come to do federated compute, such as via the IVOA execution broker, we need a very clear answer on which clouds are full, which have space, and when.

      This feature builds on SP-4279, where Blazar reservations ensure we get a clear answer on whether the cloud has space right now.

      This feature is focused on the case when all clouds are currently full, and we need to find the next available slot for the requested compute job.

      For Azimuth, we are looking to create something similar to a ticket-booking website, where you get a visual way to pick the slot that best meets your need. It optimizes for finding contiguous blocks where the requested amount of resource is available.

      Within Azimuth, to reduce resource fragmentation, we are going to implement 8-hour reservation slots, aligned with the typical working day in the datacentre's local time zone. If resources are available right away, the first "slot" might be shorter than 8 hours. If the full 8 hours are not required, users are encouraged to delete their platforms early so those resources can be used by other systems.
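      As a sketch of the slot alignment described above (the 8-hour slot length, 09:00 working-day start, and Europe/London time zone are illustrative assumptions, not decided values; the real settings would be per-datacentre configuration):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Illustrative assumptions only: slot length, working-day anchor and
# time zone would come from per-datacentre configuration.
SLOT_HOURS = 8
DAY_START_HOUR = 9
SITE_TZ = ZoneInfo("Europe/London")

def next_slot_boundary(now: datetime) -> datetime:
    """Next slot boundary strictly after `now`, where boundaries fall
    every SLOT_HOURS starting from DAY_START_HOUR local time."""
    local = now.astimezone(SITE_TZ)
    anchor = local.replace(hour=DAY_START_HOUR, minute=0,
                           second=0, microsecond=0)
    if local < anchor:
        anchor -= timedelta(days=1)  # before 09:00: use yesterday's anchor
    slots_passed = (local - anchor) // timedelta(hours=SLOT_HOURS)
    return anchor + (slots_passed + 1) * timedelta(hours=SLOT_HOURS)

def first_slot(now: datetime) -> tuple[datetime, datetime]:
    """If capacity is free right away, the first slot runs from `now`
    to the next aligned boundary, so it may be shorter than 8 hours."""
    return now, next_slot_boundary(now)
```

      Subsequent slots then simply run boundary-to-boundary, which is what keeps reservations from fragmenting the calendar.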

      Note that while filling gaps between reservations with preemptible instances for batch jobs is being prototyped, it is out of scope for this feature.

      For clarity, these APIs and Azimuth's UI only consider the "local" cloud's resource availability; automation such as the IVOA execution broker is expected to talk to multiple candidate clouds, based on data availability and whether a cloud has the hardware types a specific workflow requires.


      AC1: Azimuth users requesting resources at times when the cloud is full are offered the option to request those resources at a future time when they are available.
      AC2: Azimuth is able to successfully create the requested platforms once Blazar confirms the reservation has started, and the platform is gracefully deleted just before the reservation ends.

    • 2
    • 2
    • 0
    • Team_DAAC
    • Sprint 5
      • Users are provided with an option to schedule work for a later date
      • Resources are reserved for these later slots and cannot be over-booked
      • Users' work is successfully scheduled and can be run at the later time slot
    • PI23 - UNCOVERED

    • SRC23-PB SRCNet0.x operations-and-infrastructure site-provisioning team_DAAC

    Description

      Using OpenStack quota is often too coarse-grained to enforce the resource allocations decided by resource allocation committees within shared e-infrastructure. This is particularly a problem for in-demand (and expensive) resources, such as high-memory nodes and GPU nodes.
      Success of this work looks like:
      Decreased user frustration when trying to get cloud resources and to understand OpenStack cloud capacity
      Fewer unexpected failures when creating platforms due to the cloud running out of capacity
      Better utilization of cloud resources, i.e. more science out of the same cloud capacity
      Easier to give users a small CPU-hour-based cloud resource allocation, e.g. for shorter-lived development and test activities
      When creating resources in OpenStack, it can fail with the rather opaque error "no valid host", which often means the cloud is full, and there is no space.
      OpenStack Blazar allows users to reserve capacity, and it can be used to get a clear answer about whether the cloud is currently full. There is also the option to reserve resources in the future, when there is space. Users of fixed-capacity clouds often start "squatting" on resources they don't need, because if they return them to the general pool they know there is a high chance they will not get those resources back. Allowing users to reserve resources for a future date should help them spend their compute credits more efficiently.
      These APIs can be used by many different OpenStack API clients. Azimuth has been created in a way that should make it easy to test and to get initial feedback on what API changes are needed to build a good user experience on top of these upstream APIs. StackHPC have been involved in related upstream discussions for some years, along with Chameleon cloud in the US and Nectar cloud in Australia.
      We will look at the following implementation phases for managing OpenStack resources:
      Demo the existing (IRIS funded) OpenStack cloud capacity exporter and its dashboards, and how access can be somewhat secured by putting SKA IAM in front of Grafana (demo via the existing Azimuth feature that exposes Grafana in this way).
      Test creating Blazar reservations for short-lived and long-lived platforms, and demo how that can give users quick feedback about when the cloud is full. Azimuth will be used to quickly prototype the end-to-end experience of asking users to specify the expected end date for their platforms.
      Use Blazar reservations to consume CPU- and GPU-hour credits, to better implement resource allocation committee decisions compared with the limitations of OpenStack quota (which assumes uniform resource usage and makes it impossible to give small "seedcorn" allocations for innovation and new-user testing). Azimuth will be used to prototype a user experience on top of Blazar where users get feedback on whether they are out of credits or the cloud is currently full.
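      A minimal sketch of the credit model this phase implies: CPU/GPU hours are debited up front when a lease is created, and running out of credits is a distinct outcome from the cloud being full. The names and the up-front-debit policy here are illustrative assumptions, not a settled design:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Allocation:
    """Hypothetical per-project credit pots; real accounting would sit
    alongside Blazar's lease records."""
    cpu_hours: float
    gpu_hours: float

def try_charge(alloc: Allocation, vcpus: int, gpus: int,
               start: datetime, end: datetime) -> bool:
    """Debit the allocation up front when the lease is created; refuse
    the reservation if either pot would go negative. A False here means
    "out of credits", distinct from Blazar reporting the cloud is full."""
    hours = (end - start).total_seconds() / 3600.0
    cpu_cost, gpu_cost = vcpus * hours, gpus * hours
    if cpu_cost > alloc.cpu_hours or gpu_cost > alloc.gpu_hours:
        return False
    alloc.cpu_hours -= cpu_cost
    alloc.gpu_hours -= gpu_cost
    return True
```

      Unlike a quota, this naturally supports small "seedcorn" allocations: a 10 CPU-hour pot can be spent as one burst or as many short reservations.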
      When the cloud is full but you have available credits, one extra option Blazar gives us is to try a future time when the cloud might have space. We propose an additional Blazar API to suggest alternative reservations where there is space. Azimuth will be used to prototype the overall user experience around booking a reservation in the future, when the cloud is too full for the platform to start right away.
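      The proposed "suggest an alternative reservation" API amounts to a search over the reservation calendar. A simple O(n²) sketch, assuming capacity is counted in a single unit (e.g. hosts) and noting that free capacity only increases when a reservation ends:

```python
from datetime import datetime, timedelta

def usage_at(reservations, t):
    """Units reserved at instant t (start inclusive, end exclusive)."""
    return sum(amount for start, end, amount in reservations
               if start <= t < end)

def earliest_slot(capacity, requested, duration, reservations, not_before):
    """Earliest start >= not_before at which `requested` units stay free
    for the whole `duration`. Candidate starts are not_before itself and
    each reservation end time, since free capacity only increases there.
    Returns None only if the request can never fit (requested > capacity)."""
    candidates = [not_before] + sorted(
        end for _, end, _ in reservations if end > not_before)
    for start in candidates:
        window_end = start + duration
        # usage is piecewise constant, so checking the window start and
        # every reservation start inside the window is sufficient
        checks = [start] + [s for s, _, _ in reservations
                            if start < s < window_end]
        if all(capacity - usage_at(reservations, t) >= requested
               for t in checks):
            return start
    return None
```

      Azimuth could then render each candidate start as a bookable slot, ticket-site style.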
      While reservations bring many benefits, they can leave gaps between reservations. On one hand, we can look to reduce resource fragmentation by using good flavour sizes and limiting the choices for when reservations end. More interestingly, we should explore backfilling the gaps between reservations with preemptible instances for batch jobs, potentially Kubernetes-based CNCF Armada and/or Slurm.
      Currently, Cambridge within UKSRC is offering to test this solution on the Cambridge Arcus OpenStack cloud and its associated Azimuth instance. Ideally we would work in parallel with other sites across the wider SRCNet that would like to test the solution, both with and without Azimuth. Ideally they would also be running the Yoga release or newer of OpenStack, deployed using Kayobe/kolla-ansible.

              People

                b.mort Mort, Ben
                D.Watson Watson, Duncan

                Feature Progress

                  Story Point Burn-up: (0%)

                  Feature Estimate: 2.0

                                 Issues   Story Points
                  To Do          0        0.0
                  In Progress    0        0.0
                  Complete       0        0.0
                  Total          0        0.0

