SAFe Program · SP-4280

Resource Allocation OpenStack: Cloud credits limit size of resource reservations


Details

    • SRCnet

      This feature builds on SP-4279, where Blazar reservations ensure we get a clear answer on whether the cloud has space right now.

      Without limits, a Blazar user is free to reserve any currently unreserved cloud capacity for a duration that is longer than the expected lifetime of the supporting hardware. Note that Blazar does not check or respect any quota that might further restrict what a user could do with their reservation.

      For Blazar to work within a production e-Infrastructure environment, we need to add some limits to ensure a fair share of resources.

      Firstly, we are modelling the resource limits on what is done within Slurm, which has limits based on CPU hours, GPU hours, maximum resource request and maximum duration.

      Secondly, we are assuming allocations are specified by something similar to the IRIS or DiRAC resource allocation process, co-ordinating with the local cloud provider. We are assuming these limits will be applied across a group of OpenStack projects; for example, GAIA makes good use of dev, staging and production projects all sharing the same pool of resources. Typically these allocations expire at a particular time, e.g. quarterly or annually. There are also small "seedcorn" allocations that can be made for those wanting to better understand what amount of resources to request, or for users with a much smaller, but sometimes urgent, resource need.
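      Purely to illustrate the allocation model described above, the sketch below expresses such an allocation as a hypothetical Python dataclass; the field names and values are assumptions for illustration, not a schema from Blazar or any existing credit service.

      from dataclasses import dataclass
      from datetime import datetime

      @dataclass
      class CreditAllocation:
          """Hypothetical model of a resource committee's award.

          Mirrors the Slurm-style limits described above: pools of CPU and
          GPU hours, caps on any single request, and an expiry date. One
          allocation can cover several OpenStack projects (e.g. dev, staging
          and production projects sharing the same pool).
          """
          account: str                   # e.g. "gaia"
          openstack_projects: list[str]  # projects that draw from this pool
          cpu_hours: float               # total CPU-hours granted
          gpu_hours: float               # total GPU-hours granted
          max_cpus_per_request: int      # cap on a single reservation's size
          max_duration_hours: int        # cap on a single reservation's length
          expires_at: datetime           # e.g. end of the quarter or year

          def covers(self, project_id: str) -> bool:
              return project_id in self.openstack_projects

      # A small "seedcorn" allocation for a group still sizing its needs.
      seedcorn = CreditAllocation(
          account="new-group-seedcorn",
          openstack_projects=["new-group-dev"],
          cpu_hours=5_000,
          gpu_hours=100,
          max_cpus_per_request=64,
          max_duration_hours=7 * 24,
          expires_at=datetime(2025, 3, 31),
      )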

      Currently, these sorts of limits are typically implemented using project/tenant isolation filters, ring-fencing hypervisors for a particular research group based on the resource allocation committee's allocation. This assumes the need for resources is uniform across a year, which is very far from the reality of most groups' needs. It would be better to instead limit what can be reserved over the allocation year.

      Implementing this feature will involve adding a new enforcement plugin into the existing Blazar enforcement system, which calls out to an external REST API to check the cloud credits. For more details please see:
      https://stackhpc.github.io/coral-credits/mission/
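
      As a rough sketch of the shape such a plugin could take: the class below is illustrative only, the method names are assumptions rather than Blazar's actual enforcement filter interface, and the endpoints are placeholders rather than the real coral-credits API.

      import requests

      class CloudCreditsFilter:
          """Sketch of an enforcement plugin that consults an external
          credit service before a lease is created.

          A real implementation would subclass Blazar's enforcement filter
          base class and be enabled through Blazar's configuration; the
          names here are illustrative only.
          """

          def __init__(self, credits_url: str, timeout: float = 10.0):
              self.credits_url = credits_url  # e.g. the credit service API root
              self.timeout = timeout

          def check_create(self, project_id: str, lease: dict) -> None:
              """Reject the lease if it would exceed the project's cloud credits."""
              # Hypothetical endpoint and payload; see the coral-credits
              # documentation for the actual API.
              resp = requests.post(
                  f"{self.credits_url}/consumer/check",
                  json={"project_id": project_id, "lease": lease},
                  timeout=self.timeout,
              )
              if resp.status_code != 200 or not resp.json().get("approved", False):
                  # Surface a clear message so the user knows why it failed (AC1).
                  raise ValueError(
                      "Reservation rejected: it would exceed the cloud credits "
                      f"available to project {project_id}."
                  )

          def on_end(self, project_id: str, lease: dict) -> None:
              """Tell the credit service a lease ended early so unused credit
              can be refunded (AC2)."""
              requests.post(
                  f"{self.credits_url}/consumer/release",
                  json={"project_id": project_id, "lease": lease},
                  timeout=self.timeout,
              )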


      AC1: A reservation request that would exceed your cloud credits is rejected, with a clear error message stating that the request exceeded your available cloud credits.

      AC2: Deleting a reservation early, or before it starts, will refund some of the unused credit to the cloud credit account (see the worked credit-accounting sketch after this list).

      AC3: Clear docs on how to assign cloud credits to a specific account, and how to map that account to specific OpenStack projects in various globally distributed clouds.

      AC4: Clear docs on how federation managers can review the amount of cloud credits assigned to the accounts, how many have been consumed already, how many are reserved and how many are left unused.

      AC5: Clear docs on how end users can understand their existing cloud credit balance, and what has consumed their cloud credits.

      AC6: Document how other sites and countries can adopt this solution.
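
      To make AC1 and AC2 concrete, here is a worked example of one plausible credit-accounting scheme; the exchange rate and refund policy below are assumptions for illustration, not decisions made by this feature.

      CPU_HOUR_CREDIT = 1.0  # illustrative exchange rate: 1 credit per CPU-hour

      def reservation_cost(cpus: int, hours: float) -> float:
          """Credits consumed by reserving `cpus` cores for `hours` hours."""
          return cpus * hours * CPU_HOUR_CREDIT

      def refund_on_early_delete(cpus: int, reserved_hours: float,
                                 used_hours: float) -> float:
          """Credits returned when a reservation is deleted before it ends.

          Only the unused tail of the reservation is refunded; time already
          elapsed stays spent.
          """
          unused_hours = max(reserved_hours - used_hours, 0.0)
          return unused_hours * cpus * CPU_HOUR_CREDIT

      # Reserving 64 CPUs for 48 hours costs 64 * 48 = 3072 credits; deleting
      # the reservation after 12 hours refunds the remaining 36 hours,
      # i.e. 64 * 36 = 2304 credits.
      cost = reservation_cost(cpus=64, hours=48)               # 3072.0
      refund = refund_on_early_delete(64, 48, used_hours=12)   # 2304.0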

    • 3.5
    • 2
    • 0
    • Team_DAAC
    • Sprint 5
    • PI24 - UNCOVERED

    • PI24 PI24-PB SRC-CompPlat SRC23-PB operations-and-infrastructure site-provisioning team_DAAC

    Description

      Implementing this feature will involve adding a new enforcement plugin into the existing Blazar enforcement system, which calls out to an external REST API to check the cloud credits. For more details please see:
      https://stackhpc.github.io/coral-credits/mission/
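
      Purely to illustrate the kind of exchange the enforcement plugin would perform, a hypothetical credit-check request and its possible responses might look like the following; the field names are assumptions for illustration and do not describe the actual coral-credits API, which is documented at the link above.

      # Hypothetical credit-check exchange; all field names are illustrative.
      check_request = {
          "resource_provider": "arcus",       # which cloud the lease targets
          "project_id": "gaia-production",    # OpenStack project making the lease
          "resources": {"VCPU": 64, "MEMORY_MB": 262144},
          "start": "2025-09-01T09:00:00Z",
          "end": "2025-09-03T09:00:00Z",
      }

      # Allowed: the account still has enough unreserved credit.
      check_response_ok = {"approved": True, "remaining_credits": 12500.0}

      # Rejected: carries the clear error message required by AC1.
      check_response_rejected = {
          "approved": False,
          "reason": "Reservation would exceed the cloud credits available "
                    "to this account for the current allocation period.",
      }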

      Currently Cambridge, within UKSRC, are offering to test this solution on the Cambridge Arcus OpenStack cloud and its associated Azimuth instance. Ideally we would work in parallel with other sites across the wider SRCNet that would like to test the solution, both with and without Azimuth, preferably running the Yoga release or newer of OpenStack and using Kayobe/kolla-ansible to deploy their cloud.

      This Feature should have been delivered in PI23. However, due to a dependency on SP-4279, which did not complete in PI23, this Feature has been delayed to PI24. In addition, significant documentation is required to complete the delivery and, due to contractual constraints with the supplier, this could not be achieved in PI23 either. To ensure that higher-priority work on UKSRCNet0.1 is prioritised, this Feature will now not be delivered until PI24, Sprint 5.

              People

                Llopis, Pablo (P.Llopis)
                Watson, Duncan (D.Watson)
                Votes: 0
                Watchers: 1

                Feature Progress

                  Story Point Burn-up: (100.00%)

                  Feature Estimate: 3.5

                  Status        Issues  Story Points
                  To Do         0       0.0
                  In Progress   0       0.0
                  Complete      5       28.0
                  Total         5       28.0

                  Dates

                    Created:
                    Updated:
