SAFe Program / SP-4035

Resource Allocation for OpenStack Sites


Details

    • SRCnet

      We believe that, to offer a better user experience when creating servers in an OpenStack environment, users should get a clear yes/no answer about whether the cloud is full. Secondly, when the cloud is full, users should be able to reserve future capacity, for a time when the cloud is no longer full.

      To test this, we will run requests at times when we know the cloud to be full.

      Following success, we will document how this can be configured at other OpenStack sites.

      The current alternative is OpenStack Quotas. Quotas only limit the high-water mark of resource usage, checked at the point in time when each resource is created. There is no user agreement to ever delete the resources they create.

      If the sum of all project Quotas is less than or equal to the available cloud resources, Quotas will give you good feedback on cloud capacity. However, this leads to massive underutilization of your cloud, and lots of busy work carefully adjusting Quotas.

      When the sum of all project Quotas is larger than the available cloud capacity, users typically start "squatting" on any resources they manage to get, even when they don't need them, because it's a simple way to ensure those resources are available when they do need them.
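The trade-off between the two Quota regimes above comes down to simple arithmetic. The following is a minimal sketch, assuming capacity and quota are both measured in a single unit (vCPUs here); the function names and the example numbers are hypothetical and not part of any OpenStack API.

```python
def quota_overcommit_ratio(capacity_vcpus: int, project_quotas: dict) -> float:
    """Return (sum of all project quotas) / (cloud capacity).

    A ratio <= 1.0 means every project could use its full quota at once,
    so quota alone tells users whether they will get capacity.
    A ratio > 1.0 means quota is oversubscribed and gives no such guarantee.
    """
    if capacity_vcpus <= 0:
        raise ValueError("capacity_vcpus must be positive")
    return sum(project_quotas.values()) / capacity_vcpus


def quota_guarantees_capacity(capacity_vcpus: int, project_quotas: dict) -> bool:
    """True when the sum of quotas fits inside the cloud's capacity."""
    return quota_overcommit_ratio(capacity_vcpus, project_quotas) <= 1.0


if __name__ == "__main__":
    # Hypothetical numbers: a 1000 vCPU cloud shared by three projects.
    quotas = {"project-a": 400, "project-b": 400, "project-c": 400}
    print(quota_overcommit_ratio(1000, quotas))
    print(quota_guarantees_capacity(1000, quotas))
```

With a ratio at or below 1.0 the cloud is predictable but likely underused; above 1.0 a request can fail even though the project is within quota, which is the situation that encourages squatting.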


      AC1: When requesting resources at times when the cloud is full, users get clear feedback that the cloud is currently full.

      AC2: Blazar needs to enforce limits on the resources users can reserve.

      AC3: Prove that resources requested for future slots are reserved successfully.

      AC4: When requesting resources at times when the cloud is full, users are provided with an option to request resources at a future time when the resources they have requested are available.

      AC5: Work allocated to future time slots runs successfully.

    • Team_TEAL
    • Sprint 5
      • Users no longer receive "No Valid Host" messages
      • Users are provided with an option to schedule work for a later date
      • Resources are reserved for these later slots and cannot be over-booked
      • Users' work is successfully scheduled and can be run at the later time slot
    • 22.6
    • PI23 - UNCOVERED

    • Teal-D operations-and-infrastructure site-provisioning

    Description

      OpenStack quota is often too coarse-grained to enforce the resource allocations decided by resource allocation committees within shared e-infrastructure. This is particularly a problem for resources that are in high demand (and expensive), such as high memory nodes and GPU nodes.

      Success of this work looks like:

      • Decrease in user frustration trying to get cloud resources, and understanding OpenStack cloud capacity
      • Reduced unexpected failures when creating platforms due to the cloud running out of capacity
      • Better utilization of cloud resources, i.e. more science out of the same cloud capacity
      • Easier to give users a small CPU hour based cloud resource allocation, e.g. for shorter lived development and test activities

      Creating resources in OpenStack can fail with the rather opaque error "no valid host", which often means the cloud is full and there is no space.

      OpenStack Blazar allows users to reserve capacity, and it can be used to get a clear answer about whether the cloud is currently full. In addition, there is the option to reserve resources in the future, when there is space. Users of fixed capacity clouds often start "squatting" on resources when they don't need them, because if they give them back to the general resource pool, they know there is a high chance they will not get those resources back. Allowing users to reserve the resources for a future date should help users spend their compute credits more efficiently.
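The clear yes/no answer described above can be illustrated with a small reservation calendar. This is a minimal sketch and not Blazar's implementation or API: `Reservation`, `peak_usage`, and `can_reserve` are hypothetical names, capacity is reduced to a single vCPU count, and reservation end times are treated as exclusive.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Reservation:
    start: datetime
    end: datetime  # exclusive: capacity is free again at this instant
    vcpus: int


def peak_usage(reservations, start: datetime, end: datetime) -> int:
    """Highest number of reserved vCPUs at any instant in [start, end)."""
    # Usage only rises when a reservation starts, so it is enough to check
    # the window start and every reservation start inside the window.
    points = [start] + [r.start for r in reservations if start < r.start < end]
    return max(
        sum(r.vcpus for r in reservations if r.start <= t < r.end)
        for t in points
    )


def can_reserve(capacity: int, reservations, start: datetime,
                end: datetime, vcpus: int) -> bool:
    """Yes/no answer: does the request fit for the whole window?"""
    if end <= start:
        raise ValueError("end must be after start")
    return peak_usage(reservations, start, end) + vcpus <= capacity


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)  # hypothetical date
    booked = [Reservation(t0, t0 + timedelta(hours=2), 80)]
    # Cloud of 100 vCPUs, 80 already reserved for the first two hours.
    print(can_reserve(100, booked, t0, t0 + timedelta(hours=1), 32))
    print(can_reserve(100, booked, t0 + timedelta(hours=2),
                      t0 + timedelta(hours=3), 32))
```

Because every accepted reservation is checked against the whole requested window, a future slot that has been granted cannot be over-booked by a later request.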

      These APIs can be used by many different OpenStack API clients. Azimuth has been created in a way that should make this easy to test and get initial feedback on what API changes are needed to build a good user experience on top of these upstream APIs. StackHPC have been involved in related upstream discussions, along with Chameleon cloud in the US and Nectar cloud in Australia, for some years.

      We will look at the following implementation phases for managing OpenStack resources:

      • Demo the existing (IRIS funded) OpenStack cloud capacity exporter and its dashboards, and how you can somewhat secure access using SKA IAM in front of Grafana (demo via the existing Azimuth feature to expose Grafana in this way).
      • Test creating Blazar reservations for short-lived and long-lived platforms, and demo how that can help give users quick feedback about when the cloud is full. Azimuth will be used to quickly prototype the end-to-end experience of asking users to specify the expected end date for their platforms.
      • Use Blazar reservations to consume CPU and GPU hour credits, to better implement resource allocation committee decisions compared to the limitations you have with OpenStack Quota (such as assuming a uniform resource usage, and making it impossible to give small "seedcorn" allocations for innovation and new user testing). Azimuth will be used to prototype a user experience on top of Blazar where users get feedback on whether they are out of credits or the cloud is currently full.
      • When a cloud is full, but you have available credits, one extra option we have with Blazar is to try a future time when the cloud might have space. We propose an additional Blazar API to help propose alternative reservations where there is space. Azimuth will be used to prototype the overall user experience around booking a reservation in the future, when the cloud is too full to allow the platform to start right away.
      • While reservations bring many benefits, they can leave gaps between the reservations. On one hand we can look to reduce resource fragmentation, making use of good flavour sizes, and limiting the choices for when reservations end. But more interestingly, we should explore using batch jobs to backfill the gaps between reservations with preemptible instances, potentially using the K8s-based CNCF Armada and/or Slurm.
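The credit and alternative-reservation phases above can be sketched together as one decision: reject if the user is out of credits, accept if the request fits, and otherwise propose a later start. This is a minimal illustration only, assuming credits are counted in vCPU hours and capacity in vCPUs; `cpu_hours`, `fits`, `propose_start`, and `check_request` are hypothetical names and do not represent the existing or proposed Blazar API.

```python
from datetime import datetime, timedelta
from typing import Optional

# Each booking is (start, end, vcpus); the end time is exclusive.


def cpu_hours(vcpus: int, duration: timedelta) -> float:
    """Credits a reservation would consume, in vCPU hours."""
    return vcpus * duration.total_seconds() / 3600


def fits(capacity, bookings, start, end, vcpus) -> bool:
    """True if `vcpus` extra fit under `capacity` for all of [start, end)."""
    # Usage only rises at booking starts, so check those instants.
    points = [start] + [s for s, _, _ in bookings if start < s < end]
    peak = max(sum(v for s, e, v in bookings if s <= t < e) for t in points)
    return peak + vcpus <= capacity


def propose_start(capacity, bookings, earliest, duration, vcpus) -> Optional[datetime]:
    """Earliest start at or after `earliest` where the request fits."""
    if vcpus > capacity:
        return None  # can never fit, however long the user waits
    # Space only opens up when a booking ends, so the only candidate
    # starts are `earliest` itself and later booking end times.
    candidates = sorted({earliest} | {e for _, e, _ in bookings if e > earliest})
    for start in candidates:
        if fits(capacity, bookings, start, start + duration, vcpus):
            return start
    return None


def check_request(credits_left, capacity, bookings, start, duration, vcpus):
    """Return (status, start): tells apart 'out of credits' from 'cloud full'."""
    if cpu_hours(vcpus, duration) > credits_left:
        return ("out of credits", None)
    if fits(capacity, bookings, start, start + duration, vcpus):
        return ("ok", start)
    return ("cloud full", propose_start(capacity, bookings, start, duration, vcpus))
```

For example, on a hypothetical 100 vCPU cloud with 80 vCPUs booked for the next four hours, a 32 vCPU, two-hour request made now would come back as "cloud full" together with a start time four hours later, provided the user holds at least 64 vCPU hours of credit.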

      Currently Cambridge within UKSRC are offering to test this solution on the Cambridge Arcus OpenStack cloud, and its associated Azimuth instance. Ideally we work in parallel with other sites across the wider SRCNet that would like to test the solution, both with and without Azimuth. Those sites would preferably also be running the Yoga release or newer of OpenStack, and using Kayobe/kolla-ansible to deploy their cloud.

              People

                r.bolton Bolton, Rosie
                D.Watson Watson, Duncan