SAFe Program / SP-3776

Investigate container support for Nvidia Drivers


Details

    • Enabler
    • Should have
    • PI20, PI21
    • COM PSS SW
    • None
    • Data Processing
    • PSS will be able to support the installation of an Nvidia Driver via our Ansible scripts. This software is required to support Cheetah, so we need the capability to install it on our machines.
    • Installation of the Nvidia Driver on a real machine using the ska-pss-ci-systems Ansible scripts is successful, and the relevant scripts are merged to the main branch.
    • 1
    • 1
    • 0
    • Team_PSS
    • Sprint 5
    • We investigated grouping low-level/'hardware-related' software in Ansible, but discovered that the easiest approach is to use an Ansible fact to determine whether the tasks are running in a Docker container and, based on this, to skip the tasks we do not want to run in containers.

      Because of this, we did not need to remove the tests: they still run successfully but, as expected, do nothing inside containers.

      We agreed a plan for testing the Nvidia Driver on Dokimi and executed it successfully. This involved first testing the Nvidia Driver in isolation in a playbook, and then testing it as part of a full installation of all our dependencies. As this was successful, we merged the Nvidia Driver changes into the dev branch.

      The MR for the ska-pss-ci-systems changes is here:
      https://gitlab.com/ska-telescope/pss/ska-pss-ci-systems/-/merge_requests/36
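
      A container check of this kind can be sketched with Ansible's gathered virtualization facts. This is a hypothetical illustration, not the actual MR content: the task name, package name, and the exact condition used in ska-pss-ci-systems are assumptions.

      ```yaml
      # Hypothetical sketch: skip a hardware-related task when the play
      # is running inside a Docker container, using gathered facts.
      - name: Install Nvidia driver (skipped inside containers)
        ansible.builtin.package:
          name: nvidia-driver  # illustrative package name
          state: present
        when: ansible_facts['virtualization_type'] != 'docker'
      ```

      With a condition like this, the container component tests still execute the play; the guarded tasks simply report as skipped, which matches the behaviour described above.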

    • 21.6
    • Stories Completed, Outcomes Reviewed, Satisfies Acceptance Criteria, Accepted by FO
    • PI22 - UNCOVERED

    Description

      The work in PI20 on this feature revealed that we cannot use the Nvidia Container Toolkit to support testing the installation of Nvidia Drivers in containers. More details are available in the comments for AT4-1161. After some discussion we concluded that we will not test hardware-related components using the container component tests in the ska-pss-ci-systems repository; we will skip these tests when the container tests run. As part of this we will investigate grouping the hardware tests so we can manage them separately.

      Suggested steps for PI21:

      • Investigate grouping hardware tests
      • Remove hardware tests from ska-pss-ci-systems tool component testing
      • Agree on how to test/verify the Ansible scripts for hardware installations (including the Nvidia drivers)
      • Manually run the hardware installation tests
      • Check in the Ansible scripts
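
      One common way to group hardware-related tasks so they can be managed separately is Ansible tags. This is a hedged sketch under that assumption; the tag name, task, and package are hypothetical, not taken from the repository.

      ```yaml
      # Hypothetical sketch: tag hardware-related tasks so they can be
      # excluded as a group when running against containers, e.g.:
      #   ansible-playbook site.yml --skip-tags hardware
      - name: Install Nvidia driver
        ansible.builtin.package:
          name: nvidia-driver  # illustrative package name
          state: present
        tags:
          - hardware  # hypothetical tag grouping all hardware tasks
      ```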

      OLD DESCRIPTION:
      Our CI/CD for the ska-pss-ci-systems repo uses containers. If we want to continue using containers for this, we will need to run them on a machine that has the Nvidia Container Toolkit installed. This is because installing CUDA requires kernel headers that support (within a range) the version of CUDA being installed. On bare metal, a version of CUDA that works with the OS would be installed, and the kernel headers would match the actual kernel. But because the OS inside a container can differ from the host OS, the kernel headers in the container might not match the host kernel. Nvidia supports this scenario with their Nvidia Container Toolkit.

      SKAO have machines in the k8s cluster that have GPUs and the required toolkit installed. We should investigate whether we can use these; it appears this could simply mean changing our runner tags.

      If not, we will need to investigate installing the Nvidia Container Toolkit on our own machines with runners.
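
      If the SKAO GPU runners turn out to be usable, routing a job to them could indeed be a matter of changing the runner tags in .gitlab-ci.yml. A minimal sketch, assuming a hypothetical job name, runner tag, and playbook invocation:

      ```yaml
      # Hypothetical .gitlab-ci.yml fragment: select a runner that has a
      # GPU and the Nvidia Container Toolkit installed via its tags.
      test-ansible-scripts:
        tags:
          - gpu  # hypothetical tag; use the actual SKAO GPU runner tag
        script:
          - ansible-playbook -i inventory site.yml
      ```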

            People

              Noutsos, Aristeidis
              Levin-Preston, Lina
              Votes: 0
              Watchers: 1

              Feature Progress

                Story Point Burn-up: (100.00%)

                Feature Estimate: 1.0

                              Issues   Story Points
                To Do              0            0.0
                In Progress        0            0.0
                Complete          11           25.0
                Total             11           25.0
