Details
Feature
Could have
SRCnet
2
2
0
- Users are provided with an option to schedule work for a later date
- Resources are reserved for these later slots and cannot be over-booked
- Users' work is successfully scheduled and can be run at the later time slot
SRC23-PB SRCNet0.x operations-and-infrastructure site-provisioning team_DAAC
Description
OpenStack quotas are often too coarse-grained to enforce the resource allocations decided by resource allocation committees within shared e-infrastructure. This is particularly a problem for in-demand (and expensive) resources, such as high-memory nodes and GPU nodes.
Success of this work looks like:
- Less user frustration when trying to get cloud resources and when trying to understand OpenStack cloud capacity
- Fewer unexpected failures when creating platforms due to the cloud running out of capacity
- Better utilization of cloud resources, i.e. more science out of the same cloud capacity
- Making it easier to give users a small CPU-hour based cloud resource allocation, e.g. for shorter-lived development and test activities
When creating resources in OpenStack, requests can fail with the rather opaque error "no valid host", which often simply means the cloud is full and there is no space.
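As a rough illustration of how that failure surfaces to API clients, the sketch below uses openstacksdk to boot a server and inspect the fault when scheduling fails; the cloud, flavor, image and network names are placeholders, not real site values.

```python
import openstack

# "arcus" is a placeholder clouds.yaml entry; flavor/image/network names are
# also just examples.
conn = openstack.connect(cloud="arcus")

flavor = conn.compute.find_flavor("gpu.large")
image = conn.compute.find_image("Ubuntu-22.04")
net = conn.network.find_network("project-net")

server = conn.compute.create_server(
    name="capacity-probe",
    flavor_id=flavor.id,
    image_id=image.id,
    networks=[{"uuid": net.id}],
)
try:
    conn.compute.wait_for_server(server)
    print("scheduled OK")
except openstack.exceptions.ResourceFailure:
    server = conn.compute.get_server(server.id)
    # On a full cloud the fault message is typically the opaque
    # "No valid host was found" scheduler error described above.
    print(server.status, (server.fault or {}).get("message"))
```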
OpenStack Blazar allows users to reserve capacity, and it can be used to get a clear answer about whether the cloud is currently full. There is also the option to reserve resources for a future time, when there will be space. Users of fixed-capacity clouds often start "squatting" on resources when they do not need them, because they know that if they return them to the general resource pool there is a high chance they will not get those resources back. Allowing users to reserve resources for a future date should help them spend their compute credits more efficiently.
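To make the Blazar side concrete, here is a minimal sketch of creating a lease with python-blazarclient. The credentials, lease name and reservation sizes are placeholders, and the exact field names of the virtual:instance reservation plugin should be checked against the Blazar version deployed at a site.

```python
from datetime import datetime, timedelta, timezone

from blazarclient import client as blazar_client
from keystoneauth1 import loading, session

# Placeholder credentials; any Keystone auth method could be used here.
loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="https://keystone.example.org:5000/v3",
    username="demo", password="secret", project_name="demo",
    user_domain_name="Default", project_domain_name="Default",
)
sess = session.Session(auth=auth)

blazar = blazar_client.Client(session=sess, service_type="reservation")

start = datetime.now(timezone.utc) + timedelta(days=7)
end = start + timedelta(days=3)

# "Give me 2 VMs of this shape for 3 days, starting next week."
lease = blazar.lease.create(
    name="analysis-platform-demo",
    start=start.strftime("%Y-%m-%d %H:%M"),
    end=end.strftime("%Y-%m-%d %H:%M"),
    reservations=[{
        "resource_type": "virtual:instance",
        "amount": 2,
        "vcpus": 8,
        "memory_mb": 32768,
        "disk_gb": 100,
        "affinity": False,
        "resource_properties": "",
    }],
    events=[],
)
print(lease["id"], lease["status"])
```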
These APIs can be used by many different OpenStack API clients. Azimuth has been created in a way that should make it easy to test them and to get initial feedback on what API changes are needed to build a good user experience on top of these upstream APIs. StackHPC have been involved in related upstream discussions for some years, along with Chameleon Cloud in the US and the Nectar cloud in Australia.
We will look at the following implementation phases for managing OpenStack resources:
Demo the existing (IRIS-funded) OpenStack cloud capacity exporter and its dashboards, and show how access can be reasonably secured by putting SKA IAM in front of Grafana (demoed via the existing Azimuth feature that exposes Grafana in this way).
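For context, the numbers such an exporter publishes boil down to free versus used capacity per hypervisor (and per flavour). The sketch below shows the kind of calculation involved, using openstacksdk hypervisor statistics; it assumes admin access, the field names vary between Nova microversions, and it is not the exporter's actual code.

```python
import openstack

conn = openstack.connect(cloud="arcus")  # placeholder clouds.yaml entry

free_vcpus = 0
free_ram_mb = 0
# Hypervisor usage statistics are admin-only, and some fields were dropped in
# newer Nova microversions, so treat this as an illustration only.
for hv in conn.compute.hypervisors(details=True):
    free_vcpus += (hv.vcpus or 0) - (hv.vcpus_used or 0)
    free_ram_mb += (hv.memory_size or 0) - (hv.memory_used or 0)

print(f"free vCPUs: {free_vcpus}, free RAM: {free_ram_mb} MB")
```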
Test creating Blazar reservations for short-lived and long-lived platforms, and demo how that can help give users quick feedback about when the cloud is full. Azimuth will be used to quickly prototype the end-to-end experience of asking users to specify the expected end date for their platforms.
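A simple way to prototype that feedback loop is to treat a rejected lease-create call as the signal that the requested window has no capacity. A hedged sketch, reusing the blazar client from the earlier example (the exception type and message text differ between Blazar releases):

```python
from blazarclient import exception as blazar_exc

def reserve_or_report(blazar, name, start, end, reservations):
    """Try to create a lease; rejection is immediate feedback that the
    requested capacity is not available for that window."""
    try:
        return blazar.lease.create(name=name, start=start, end=end,
                                   reservations=reservations, events=[])
    except blazar_exc.BlazarClientException as exc:
        # Azimuth could surface this as "the cloud is full for these dates,
        # please pick a different end date or size".
        print(f"Reservation rejected: {exc}")
        return None
```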
Use Blazar reservations to consume CPU-hour and GPU-hour credits, to better implement resource allocation committee decisions compared with the limitations of OpenStack quotas (such as assuming uniform resource usage, and making it impossible to give small "seedcorn" allocations for innovation and new-user testing). Azimuth will be used to prototype a user experience on top of Blazar where users get feedback on whether they are out of credits or the cloud is currently full.
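To illustrate the credit idea, here is a minimal sketch of how a CPU/GPU-hour model could price a reservation; the rates, names and the flat per-hour model are assumptions for illustration, not an agreed accounting policy.

```python
from datetime import datetime

# Illustrative rates (assumptions, not policy): 1 credit per vCPU-hour,
# 100 credits per GPU-hour.
VCPU_HOUR_CREDITS = 1.0
GPU_HOUR_CREDITS = 100.0

def lease_cost(start: datetime, end: datetime, amount: int,
               vcpus: int, gpus: int = 0) -> float:
    """Credits a reservation would consume: duration x resources x rate.

    Unlike a quota, this charges for the reserved window rather than a
    permanent core limit, so small "seedcorn" allocations become a small
    credit budget instead of a tiny always-on quota."""
    hours = (end - start).total_seconds() / 3600.0
    return hours * amount * (vcpus * VCPU_HOUR_CREDITS + gpus * GPU_HOUR_CREDITS)

def within_budget(remaining_credits: float, cost: float) -> bool:
    # The prototype UX would distinguish "out of credits" (this check fails)
    # from "cloud currently full" (the Blazar call fails).
    return cost <= remaining_credits
```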
When the cloud is full but you have available credits, one extra option we have with Blazar is to try a future time when the cloud might have space. We propose an additional Blazar API to suggest alternative reservations where there is space. Azimuth will be used to prototype the overall user experience of booking a reservation in the future when the cloud is too full to let the platform start right away.
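Until such an API exists upstream, the behaviour can be approximated client-side by probing later start times until Blazar accepts the lease. The sketch below is that brute-force stand-in; the probe step, search horizon and broad exception handling are all assumptions.

```python
from datetime import timedelta

def find_next_slot(blazar, reservations, duration, earliest,
                   horizon_days=28, step=timedelta(hours=6)):
    """Walk forward in time and return the first lease Blazar will accept.

    This client-side probe stands in for the proposed server-side Blazar API
    that would suggest alternative reservations directly."""
    start = earliest
    deadline = earliest + timedelta(days=horizon_days)
    while start < deadline:
        try:
            return blazar.lease.create(
                name=f"probe-{start:%Y%m%d%H%M}",
                start=start.strftime("%Y-%m-%d %H:%M"),
                end=(start + duration).strftime("%Y-%m-%d %H:%M"),
                reservations=reservations,
                events=[],
            )
        except Exception:
            start += step  # no capacity in this window, try a later one
    return None  # nothing available within the horizon
```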
While reservations bring many benefits, they can leave gaps between reservations. On one hand we can look to reduce resource fragmentation by using well-chosen flavour sizes and limiting the choices for when reservations end. More interestingly, we should explore using batch jobs to backfill the gaps between reservations with preemptible instances, potentially K8s-based CNCF Armada and/or Slurm.
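As a starting point for that exploration, the gaps themselves are easy to compute once reservations are reduced to (start, end) intervals. The deliberately simplified sketch below shows the windows a backfill scheduler (preemptible instances driven by Armada or Slurm) could fill with opportunistic work.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]

def reservation_gaps(leases: List[Interval],
                     min_gap: timedelta = timedelta(hours=1)) -> List[Interval]:
    """Idle windows between reservations on a host (simplified: real capacity
    is multi-dimensional and spread over many hosts)."""
    gaps = []
    ordered = sorted(leases)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end >= min_gap:
            gaps.append((prev_end, next_start))
    return gaps
```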
Currently Cambridge, within UKSRC, is offering to test this solution on the Cambridge Arcus OpenStack cloud and its associated Azimuth instance. Ideally we would work in parallel with other sites across the wider SRCNet that would like to test the solution, both with and without Azimuth. Ideally they would also be running the Yoga release of OpenStack or newer, preferably deployed with Kayobe/kolla-ansible.