Details

Type: Feature
Priority: Must have
Fix Version/s: PI23, PI24
Component/s: SRCnet Federated Execution
Labels:

ARTs:

SRCnet
Benefit hypothesis:

When we come to do federated compute, such as via the IVO execution broker, we need a very clear answer on which clouds are full, and which have space.

This feature focuses on OpenStack providing APIs for users to get a clear answer on cloud capacity. This is done by requesting a resource reservation, with a clear start time, end time, and the flavors of servers that are needed.
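As an illustration of what such a reservation request might contain, here is a minimal Python sketch that builds the request body for a lease. It is not taken from the Azimuth code. The `flavor:instance` resource type, the field names and the date format are assumptions about the Blazar lease API and should be checked against the Blazar release deployed on the cloud; `build_lease_request` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumed date format for lease start and end times (UTC, minute precision).
BLAZAR_DATE_FORMAT = "%Y-%m-%d %H:%M"


def build_lease_request(name, start, end, flavor_counts):
    """Build a JSON-serialisable body for a reservation (lease) request.

    start and end must be timezone-aware datetimes.
    flavor_counts maps a flavor ID to the number of servers required.
    """
    if end <= start:
        raise ValueError("end time must be after start time")
    reservations = [
        {
            # Assumed resource type and field names for flavor-based reservations.
            "resource_type": "flavor:instance",
            "flavor_id": flavor_id,
            "amount": amount,
        }
        for flavor_id, amount in sorted(flavor_counts.items())
    ]
    return {
        "name": name,
        "start_date": start.astimezone(timezone.utc).strftime(BLAZAR_DATE_FORMAT),
        "end_date": end.astimezone(timezone.utc).strftime(BLAZAR_DATE_FORMAT),
        "reservations": reservations,
    }


# Example: two servers of one flavor for an eight hour window.
start = datetime(2024, 11, 1, 9, 0, tzinfo=timezone.utc)
body = build_lease_request("demo-platform", start, start + timedelta(hours=8), {"flavor-a": 2})
```

If the cloud cannot satisfy the request for the whole window, the reservation service is expected to reject it, which is what gives the caller a clear yes or no on capacity.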

We are integrating this with Azimuth, so that we have a clear worked example of how to build automation around these APIs. In addition, Azimuth users will now get clear feedback on whether space is available when they request Azimuth to create a platform. The key challenge for the Azimuth and Blazar integration work, at this stage, is ensuring the OpenTofu automation is fed the private flavour UUIDs dynamically created by Blazar for each reservation, rather than the public flavour UUIDs used in the Blazar reservation request.
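The flavour substitution described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the Azimuth implementation: the keys `requested_flavor_id` and `reserved_flavor_id` are hypothetical names for whatever the reservation response actually exposes, and the variable name `node_flavor_id` is invented for the example. The output is JSON of the kind that can be passed to OpenTofu as a `*.tfvars.json` variables file.

```python
import json


def private_flavor_vars(requested_flavors, reservations):
    """Map each automation variable to the private flavour for its reservation.

    requested_flavors: variable name -> public flavour UUID used in the request.
    reservations: list of dicts with the hypothetical keys
    "requested_flavor_id" and "reserved_flavor_id".
    """
    private_by_public = {
        r["requested_flavor_id"]: r["reserved_flavor_id"] for r in reservations
    }
    missing = sorted(set(requested_flavors.values()) - set(private_by_public))
    if missing:
        # Fail early rather than let the automation fall back to a public flavour.
        raise KeyError(f"no reserved flavour for: {', '.join(missing)}")
    return {name: private_by_public[public] for name, public in requested_flavors.items()}


def to_tfvars_json(variables):
    """Serialise variables as JSON suitable for a *.tfvars.json file."""
    return json.dumps(variables, indent=2, sort_keys=True)


# Example with placeholder IDs.
reservations = [{"requested_flavor_id": "pub-1", "reserved_flavor_id": "priv-1"}]
tfvars = to_tfvars_json(private_flavor_vars({"node_flavor_id": "pub-1"}, reservations))
```

Raising an error when a mapping is missing is a deliberate choice in this sketch: silently using the public flavour would create servers outside the reservation.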

The alternative approach would be to use quotas to dedicate chunks of resource to different projects. However, that has been shown to have various issues, including: no defined date when the resources are returned for others to use, underutilization if you don't "overcommit" your quota, and resource squatting when you do overcommit your quota.

Acceptance criteria:

AC1: Azimuth users creating a platform must now specify when they are going to stop using the resources (Done in PI22)

AC2: Azimuth can automatically destroy platforms at the time the user has requested when creating the platform (Done in PI22)

AC3: Azimuth users will get a visual warning when their platform is going to be automatically deleted (Done in PI22)

AC4: Azimuth users get clear feedback when the cloud is currently full. This is implemented by creating a Blazar reservation for the requested duration, and consuming those reserved resources when creating a platform.

AC5: Document the underlying OpenStack APIs Azimuth is using to create resource reservations and to determine whether the cloud has space for a given request.

Feature Points:
2.5
Initial Size:
2
WSJF:
0
Epic Link:
Distributed Data Computing v0.1 - Roadmap
Agile Teams:

Team_DAAC
Due Sprint:
Sprint 5
Story Point Burn-up:
Overdue:
Outcomes:
- Users no longer receive "No Valid Host" messages

Requirement Status:

PI24 - UNCOVERED
Labels_MIRO:
PI24 PI24-PB SRC23-PB operations-and-infrastructure site-provisioning team_DAAC

Description

OpenStack quota is often too coarse-grained to enforce the resource allocations decided by resource allocation committees within shared e-infrastructure. This is particularly a problem for in-demand (and expensive) resources, such as high memory nodes and GPU nodes.

Success of this work looks like:

- Decrease in user frustration trying to get cloud resources, and understanding OpenStack cloud capacity
- Reduced unexpected failures when creating platforms due to the cloud running out of capacity
- Better utilization of cloud resources, i.e. more science out of the same cloud capacity
- Easier to give users a small CPU hour based cloud resource allocation, e.g. for shorter lived development and test activities
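To make the CPU hour based allocation concrete, the arithmetic can be sketched as below. This is an illustrative calculation only, assuming a reservation is charged as vCPUs per server times number of servers times duration in hours; the function names are hypothetical and no particular accounting scheme is implied by the source.

```python
from datetime import datetime


def reservation_cpu_hours(vcpus_per_server, server_count, start, end):
    """Return the CPU hours a reservation would consume over its full window."""
    seconds = (end - start).total_seconds()
    if seconds <= 0:
        raise ValueError("end time must be after start time")
    return vcpus_per_server * server_count * seconds / 3600


def fits_allocation(cost_cpu_hours, remaining_cpu_hours):
    """Check whether a reservation fits within the remaining allocation."""
    return cost_cpu_hours <= remaining_cpu_hours


# Example: 2 servers with 8 vCPUs each, reserved for 12 hours.
cost = reservation_cpu_hours(
    8, 2, datetime(2024, 11, 1, 8, 0), datetime(2024, 11, 1, 20, 0)
)
```

In this example the cost is 2 x 8 x 12 = 192 CPU hours, so it would fit within a 200 CPU hour allocation but not a 100 CPU hour one.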

When creating resources in OpenStack, it can fail with the rather opaque error "no valid host", which often means the cloud is full, and there is no space.
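A minimal sketch of how automation might turn that opaque error into clearer feedback is shown below. It assumes the failure text contains the phrase "no valid host" in some capitalisation; the function name and the wording of the messages are hypothetical and not part of Azimuth or OpenStack.

```python
def explain_failure(fault_message):
    """Return a user-facing explanation for a resource creation failure."""
    if "no valid host" in fault_message.lower():
        # Assumption: this phrase usually indicates the cloud has run out of capacity.
        return (
            "The cloud has no free capacity for this request right now. "
            "Try a smaller platform or a later start time."
        )
    # Unknown failures are passed through so that no detail is hidden.
    return f"Platform creation failed: {fault_message}"


# Example with a typical scheduler failure message.
message = explain_failure("No valid host was found.")
```

With reservations in place this translation should rarely be needed, since a lack of capacity is reported when the reservation is requested rather than when servers are created.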

Currently, Cambridge within UKSRC is offering to test this solution on the Cambridge Arcus OpenStack cloud and its associated Azimuth instance. Ideally we would work in parallel with other sites across the wider SRCNet that would like to test the solution, both with and without Azimuth. Ideally they would also be running the Yoga release of OpenStack or newer, deployed using Kayobe/kolla-ansible.

This Feature was originally expected to deliver in PI23. However, an unexpected error found late in testing could not be fixed in the remaining time available, given contractual resource constraints with the supplier, StackHPC. Completion of the work has therefore been planned into PI24, where the fix (now identified) for the defect will be applied and retested. The original estimate had also not allowed for deployment into production. For PI24, we have included deployment to production. However, to enable higher priority work to proceed on UKSRCNet0.1, this deployment into production will not happen until Sprint 4.

Issue Links

Child Of

SP-4868 Distributed Data Computing v0.1 - Roadmap

Implementing

is required by

SP-4282 Resource Allocation OpenStack: help users pick when the cloud has space

Program Backlog

SP-4280 Resource Allocation OpenStack: Cloud credits limit size of resource reservations

Implementing

People

Assignee: Llopis, Pablo

Reporter: Watson, Duncan

Votes: 0

Watchers: 1

Feature Progress

Story Point Burn-up: (60.00%)

Feature Estimate: 2.5

	Issues	Story Points
To Do	2	4.0
In Progress	0	0.0
Complete	5	6.0
Total	7	10.0

Dates

Created: 10/May/24 8:44 AM

Updated: 1 week ago 11:52 PM

Due Sprint Date: 19/Nov/24

Resource Allocation OpenStack: clarity about when the cloud is full