SAFe Program / SP-4279

Resource Allocation OpenStack: clarity about when the cloud is full


Details

    • SRCnet

      When we come to do federated compute, such as via the IVO execution broker, we need a very clear answer on which clouds are full, and which have space.

      This feature focuses on OpenStack providing APIs that let users get a clear answer on cloud capacity. This is done by requesting a resource reservation with a clear start time, end time, and the required server flavors.
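
      As a rough illustration of the kind of request involved (a sketch, not the project's actual automation), the snippet below asks Blazar for a lease over its REST API. The service URL and token handling are assumptions, and the payload follows Blazar's documented virtual:instance reservations, where the requested size is expressed as vCPU/RAM/disk; the flavor-based reservation workflow this feature targets may use a slightly different payload.

      ```python
      # Sketch: request a Blazar lease for a fixed window and a given server size.
      # BLAZAR_URL and OS_TOKEN are assumed to be provided by the environment.
      import os
      import requests

      blazar_url = os.environ["BLAZAR_URL"]    # Blazar v1 endpoint (assumed)
      token = os.environ["OS_TOKEN"]           # a valid Keystone token (assumed)

      lease = {
          "name": "platform-capacity-check",
          "start_date": "2025-01-06 09:00",    # clear start time
          "end_date": "2025-01-10 17:00",      # clear end time
          "reservations": [{
              "resource_type": "virtual:instance",
              "amount": 2,                     # two servers of this size
              "vcpus": 8,
              "memory_mb": 32768,
              "disk_gb": 100,
              "affinity": None,
              "resource_properties": "",
          }],
          "events": [],
      }

      resp = requests.post(f"{blazar_url}/leases", json=lease,
                           headers={"X-Auth-Token": token})
      # Success means the capacity exists and is now held for the window;
      # a failure here is the clear "this cloud is full for that request" answer.
      resp.raise_for_status()
      print(resp.json()["lease"]["id"])        # response layout assumed from the Blazar v1 API
      ```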

      We are integrating this with Azimuth, so that we have a clear worked example of how to build automation around these APIs. In addition, Azimuth users will now get clear feedback on whether there is space available when they ask Azimuth to create a platform. The key challenge of the Azimuth and Blazar integration work, at this stage, is ensuring the OpenTofu automation is fed the private flavor UUIDs dynamically created by Blazar for each reservation, rather than the public flavor UUIDs used in the Blazar reservation request.
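
      The sketch below illustrates that hand-off under the same assumptions: read the lease back, take the reservation id that identifies the private flavor Blazar has created, and pass it to an OpenTofu run as a variable. The variable name and the use of an environment variable are purely illustrative.

      ```python
      # Sketch: feed the private flavor created by Blazar into OpenTofu.
      # Assumes the private flavor can be identified via the reservation id on the
      # lease (common for Blazar instance reservations); verify against your Blazar.
      import os
      import subprocess
      import requests

      blazar_url = os.environ["BLAZAR_URL"]
      token = os.environ["OS_TOKEN"]
      lease_id = os.environ["LEASE_ID"]        # lease created earlier (assumed)

      lease = requests.get(f"{blazar_url}/leases/{lease_id}",
                           headers={"X-Auth-Token": token}).json()["lease"]

      # It is this reservation id, not the public flavor named in the request,
      # that the platform must boot against once the lease starts.
      reservation_id = lease["reservations"][0]["id"]

      # Hand it to OpenTofu as a variable (name "reserved_flavor_id" is hypothetical).
      env = dict(os.environ, TF_VAR_reserved_flavor_id=reservation_id)
      subprocess.run(["tofu", "apply", "-auto-approve"], env=env, check=True)
      ```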

      The alternative approach would be to use quotas to dedicate chunks of resource to different projects; however, that has been shown to have various issues, including: no defined date when the resources are returned for others to use, underutilization if you don't "overcommit" your quota, and resource squatting when you do overcommit your quota.


      AC1: Azimuth users creating a platform must now specify when they are going to stop using the resources (Done in PI22)

      AC2: Azimuth can automatically destroy platforms at the time the user has requested when creating the platform (Done in PI22)

      AC3: Azimuth users will get a visual warning when their platform is going to be automatically deleted (Done in PI22)

      AC4: Azimuth users get clear feedback that the cloud is currently full. This is implemented by creating a Blazar reservation for the requested duration, and consuming those reserved resources when creating a platform (see the sketch after this list).

      AC5: Document the underlying OpenStack APIs Azimuth is using to create resource reservations, to determine if the cloud has space for that request or not.
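
      A minimal sketch of AC4's "consume the reserved resources" step, assuming the reservation id from the Blazar lease doubles as the private flavor to boot against; the cloud name, image, and network names are placeholders.

      ```python
      # Sketch: boot a platform node against the Blazar reservation's private flavor.
      import os
      import openstack

      reservation_id = os.environ["RESERVATION_ID"]    # from the Blazar lease (assumed)

      conn = openstack.connect(cloud="arcus")          # clouds.yaml entry (placeholder)

      server = conn.compute.create_server(
          name="platform-node-0",
          flavor_id=reservation_id,                    # private flavor, not the public one
          image_id=conn.image.find_image("ubuntu-22.04").id,
          networks=[{"uuid": conn.network.find_network("project-net").id}],
      )
      conn.compute.wait_for_server(server)
      ```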

    • 2.5
    • 2
    • 0
    • Team_DAAC
    • Sprint 5
      • Users no longer receive "No Valid Host" messages
    • PI24 - UNCOVERED

    • PI24 PI24-PB SRC23-PB operations-and-infrastructure site-provisioning team_DAAC

    Description

      Using OpenStack quota is often too coarse-grained to enforce the resource allocations decided by resource allocation committees within shared e-infrastructure. This is particularly a problem for very in-demand (and expensive) resources, such as high-memory nodes and GPU nodes.

      Success of this work looks like:

      • Decrease in user frustration trying to get cloud resources, and understanding OpenStack cloud capacity
      • Reduced unexpected failures when creating platforms due to the cloud running out of capacity
      • Better utilization of cloud resources, i.e. more science out of the same cloud capacity
      • Easier to give users a small CPU hour based cloud resource allocation, e.g. for shorter lived development and test activities

      When creating resources in OpenStack, the request can fail with the rather opaque error "No valid host", which often means the cloud is full and there is no space.
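
      For contrast, a small illustration (names are placeholders) of how that failure surfaces today without reservations: the boot request is accepted, and the instance then lands in ERROR with a "No valid host was found" fault once the scheduler cannot place it.

      ```python
      # Sketch: observe the opaque "No valid host" fault on a failed boot.
      import openstack
      from openstack import exceptions

      conn = openstack.connect(cloud="arcus")

      server = conn.compute.create_server(
          name="capacity-probe",
          flavor_id=conn.compute.find_flavor("gpu.large").id,   # in-demand flavor (placeholder)
          image_id=conn.image.find_image("ubuntu-22.04").id,
          networks=[{"uuid": conn.network.find_network("project-net").id}],
      )
      try:
          conn.compute.wait_for_server(server)
      except exceptions.ResourceFailure:
          # Typically reports: "No valid host was found. There are not enough hosts available."
          print(conn.compute.get_server(server.id).fault)
      ```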

      Currently Cambridge, within UKSRC, is offering to test this solution on the Cambridge Arcus OpenStack cloud and its associated Azimuth instance. Ideally we would work in parallel with other sites across the wider SRCNet that would like to test the solution, both with and without Azimuth. Ideally they would also be running the Yoga release or newer of OpenStack, and using Kayobe/kolla-ansible to deploy their cloud.

      This Feature was originally expected to deliver in PI23. However, an unexpected error late in testing could not be fixed within the remaining time available, given contractual resource constraints with the supplier, StackHPC. Completion of the work has therefore been planned into PI24, where the fix (now identified) for the defect will be applied and retested. The original estimate had also not allowed for deployment into production; for PI24, we have included deployment to production. To enable higher priority work to proceed on UKSRCNet0.1, however, this deployment into production will not happen until Sprint 4.

    People

      P.Llopis (Llopis, Pablo)
      D.Watson (Watson, Duncan)

    Feature Progress

      Story Point Burn-up: 60.00%

      Feature Estimate: 2.5

                     Issues    Story Points
      To Do               2             4.0
      In Progress         0             0.0
      Complete            5             6.0
      Total               7            10.0
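
      (For reference, the 60.00% burn-up follows directly from the table: 6.0 story points complete out of 10.0 total.)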
