Details
-
Enabler
-
Should have
-
None
-
Data Processing
-
-
-
1
-
1
-
0
-
Team_PSS
-
Sprint 5
-
-
-
-
21.6
-
Stories Completed, Outcomes Reviewed, Satisfies Acceptance Criteria, Accepted by FO
Description
The work on this feature in PI20 revealed that we cannot use the Nvidia Driver Container Toolkit to support testing the installation of Nvidia drivers in containers. More details are available in the comments on AT4-1161. After some discussion we concluded that we will not test hardware-related components using the container component tests in the ska-pss-ci-systems repository; these tests will be skipped when the container tests run. As part of this work we will investigate grouping the hardware tests so that we can manage them separately.
Suggested steps for PI21:
- Investigate grouping hardware tests
- Remove hardware tests from ska-pss-ci-systems tool component testing
- Agree on how to test/verify the Ansible scripts for hardware installations (including the Nvidia drivers)
- Manually run the hardware installation tests
- Check in the Ansible scripts
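As a starting point for the grouping investigation, here is a minimal sketch of how tests could carry a group label and how the hardware group could be skipped during container runs. The test framework used in ska-pss-ci-systems is not stated here, so this is plain Python; the names `ComponentTest`, `HARDWARE`, `select_tests` and the example test names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Hypothetical group label for tests that need real hardware (GPUs, drivers).
HARDWARE = "hardware"


@dataclass
class ComponentTest:
    """A test name plus the set of groups it belongs to (illustrative only)."""
    name: str
    groups: Set[str] = field(default_factory=set)


def select_tests(
    tests: List[ComponentTest], in_container: bool
) -> Tuple[List[ComponentTest], List[ComponentTest]]:
    """Split tests into (selected, skipped).

    Hardware-group tests are skipped only when running in a container;
    everything else is always selected.
    """
    selected: List[ComponentTest] = []
    skipped: List[ComponentTest] = []
    for test in tests:
        if in_container and HARDWARE in test.groups:
            skipped.append(test)
        else:
            selected.append(test)
    return selected, skipped


if __name__ == "__main__":
    tests = [
        ComponentTest("install_cmake"),
        ComponentTest("install_nvidia_driver", {HARDWARE}),
    ]
    selected, skipped = select_tests(tests, in_container=True)
    print("run: ", [t.name for t in selected])
    print("skip:", [t.name for t in skipped])
```

If the tests turn out to be pytest-based, the same idea maps onto a custom marker (for example `@pytest.mark.hardware`) deselected in container jobs with `pytest -m "not hardware"`.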
OLD DESCRIPTION:
Our CI/CD for the ska-ci-systems repo uses containers. If we want to continue using containers for this, we will need to run them on a machine that has the Nvidia Container Toolkit installed. This is because installing CUDA requires kernel headers that support (within a range) the version of CUDA being installed. Normally, on bare metal, we would install a version of CUDA that works with the OS, and the kernel headers would match the running kernel. In a container, however, the OS can differ from the host OS, so the kernel headers available in the container might not match the host kernel. Nvidia supports this case with the Nvidia Container Toolkit.
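The mismatch can be illustrated with a small check. A container shares the host kernel, so the release reported inside it is the host's, while the header packages come from the container's own distribution. The sketch below assumes the conventional Linux layout in which `/lib/modules/<release>/build` points at the kernel build tree. The function names are hypothetical and this is not part of the existing CI code.

```python
import platform
from pathlib import Path


def kernel_headers_present(
    release: str, modules_root: Path = Path("/lib/modules")
) -> bool:
    """Return True if a kernel build tree exists for the given release.

    Assumes the conventional layout <modules_root>/<release>/build.
    """
    return (modules_root / release / "build").exists()


def describe_header_state(modules_root: Path = Path("/lib/modules")) -> str:
    """Report whether headers exist for the kernel that is actually running."""
    # Inside a container this is the HOST kernel release, not a container one.
    release = platform.release()
    if kernel_headers_present(release, modules_root):
        return f"headers found for running kernel {release}"
    return f"no headers found for running kernel {release}"


if __name__ == "__main__":
    print(describe_header_state())
```

Run inside a container whose distribution differs from the host's, a check like this would typically report missing headers, which is the situation described above.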
SKAO have machines in the k8s cluster that have GPUs and the required toolkit installed. We should investigate whether we can use these. It appears this could simply mean changing our runner tags.
If not, we will need to investigate installing the Nvidia Container Toolkit on our own machines with runners.