Details
-
Feature
-
Must have
-
SRCnet
-
-
-
2.5
-
2
-
0
-
Team_DAAC
-
Sprint 5
-
-
-
- Users no longer receive "No Valid Host" messages
-
-
PI24 PI24-PB SRC23-PB operations-and-infrastructure site-provisioning team_DAAC
Description
OpenStack quotas are often too coarse-grained to enforce the resource allocations decided by resource allocation committees within shared e-infrastructure. This is a particular problem for high-demand (and expensive) resources, such as high-memory nodes and GPU nodes.
Success of this work looks like:
- Less user frustration when trying to obtain cloud resources and to understand OpenStack cloud capacity
- Fewer unexpected failures when creating platforms because the cloud has run out of capacity
- Better utilization of cloud resources, i.e. more science out of the same cloud capacity
- Easier to give users a small CPU-hour-based cloud resource allocation, e.g. for shorter-lived development and test activities
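To illustrate what a CPU-hour-based allocation means in practice, here is a minimal sketch of the arithmetic. It assumes the simplest possible model, in which a platform consumes its vCPU count multiplied by the hours it is held. The function names and the example figures are hypothetical and are not taken from the solution under test.

```python
def cpu_hours(vcpus: int, hours: float) -> float:
    """CPU hours consumed by holding `vcpus` cores for `hours` hours (assumed model)."""
    return vcpus * hours


def fits_allocation(remaining_cpu_hours: float, vcpus: int, hours: float) -> bool:
    """Return True if the request fits within the remaining allocation."""
    return cpu_hours(vcpus, hours) <= remaining_cpu_hours


# Hypothetical example: a 500 CPU-hour development allocation.
print(fits_allocation(500, vcpus=16, hours=24))  # 16 * 24 = 384 CPU hours
print(fits_allocation(500, vcpus=32, hours=24))  # 32 * 24 = 768 CPU hours
```

Under this model the first request fits within the allocation and the second does not, which is the kind of up-front answer a vCPU-count quota alone cannot give.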
Creating resources in OpenStack can fail with the rather opaque error "no valid host", which often means that the cloud is full and has no space for the request.
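As a sketch of how that opaque error could be turned into something more useful for users, the helper below maps a fault message onto a plainer explanation. It is only an illustration: the function name, the wording of the explanation, and the sample messages are assumptions, and this is not how the solution under test works.

```python
def explain_fault(message: str) -> str:
    """Return a friendlier explanation for a server-creation fault message.

    Hypothetical helper: matches on the "no valid host" text only and
    passes any other message through unchanged.
    """
    if "no valid host" in message.lower():
        return (
            "The cloud most likely has no free capacity for this request. "
            "Try a smaller flavor or retry later."
        )
    return message


# Assumed sample messages, for illustration only.
print(explain_fault("No valid host was found. There are not enough hosts available."))
print(explain_fault("Quota exceeded for cores"))
```

Matching on the message text is fragile, so this only softens the symptom. The aim of this Feature is to stop users reaching the error in the first place.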
Cambridge, within UKSRC, is currently offering to test this solution on the Cambridge Arcus OpenStack cloud and its associated Azimuth instance. Ideally we would work in parallel with other sites across the wider SRCNet that would like to test the solution, both with and without Azimuth. Those sites would preferably be running the Yoga release of OpenStack or newer, deployed using Kayobe/kolla-ansible.
This Feature was originally expected to deliver in PI23. However, an unexpected error late in testing could not be fixed in the remaining time available, given contractual resource constraints with the supplier, StackHPC. Completion of the work has therefore been planned into PI24, where the fix (now identified) for the defect will be applied and retested. The original estimate also did not allow for deployment into production; for PI24, we have included it. However, to enable higher-priority work to proceed on UKSRCNet0.1, this deployment into production will not happen until Sprint 4.