  1. SAFe Program
  2. SP-4368

Improved Environment Management for CICD and integration


Details

    • Feature
    • Must have
    • PI23
    • COM CICD
    • None
    • Services

      The proposed feature aims to significantly enhance the developer experience by providing a consolidated view of CI and integration environments, reducing the time developers spend searching for critical information and expediting debugging and resolution processes. Automated management and fair distribution of runner workloads will optimise resource usage and ensure a stable CI/CD pipeline. By collecting and analysing metrics on resource usage and pipeline failures, teams will gain actionable insights, leading to continuous improvement of CI/CD processes. Encouraging the setup of persistent or better-isolated testing environments will result in more reliable and reproducible test results, thereby improving software quality. Ultimately, identifying and eliminating bottlenecks will streamline the integration process, reducing the overall time to integrate for new features and updates, thereby increasing efficiency and productivity across the board.

      • Improved View for CI and Integration Environments:
        • Developers can access a unified dashboard that displays all relevant information about their CI and integration tests, including:
          • Logs
          • Resource usage
          • Pod status
          • Links to Test results
        • Navigation within these dashboards is intuitive and well documented, so specific information is easy to locate.
      • Automated Management of CI and Integration Environments:
        • All CI and integration environments are automatically annotated/tagged upon creation.
        • Runner workloads are visible, so that fair distribution across environments can be investigated.
        • Runner and k8s resources are visible, so that fair sharing between deployments can be investigated.
        • Metrics on resource usage are generated and shared with teams and business management. It should be possible to group them by repository and by how many times a job has run.
        • A system is in place to automatically collect and report on resource usage, network issues, and pipeline failures when they occur.
      • Improved Pipeline Configuration:
        • GitLab configuration is reviewed and optimised to improve how pipeline runs are stopped. Repositories are configured automatically.
        • Persistent or better-isolated environments for testing are set up and maintained as a self-service system for teams.
        • Data on pipeline failures (including resource usage, network issues, and successful re-runs) is collected and analysed for insights. It should be possible to group the data by repository and by how many times a job has run.
      • Bottleneck Identification and Elimination:
        • Bottlenecks related to integration environment usage (and pipeline machinery in general) and unnecessary time spent are identified and addressed.
        • Pipeline machinery and GitLab metrics are used to streamline and speed up integration.
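
      The grouping requirement above (metrics grouped by repository and by how many times a job has run) could be sketched roughly as follows. This is only an illustration: the JobRun record shape, its field names, and both helper functions are hypothetical, and none of it is taken from GitLab or from the deployed dashboards.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRun:
    """One finished CI job. Hypothetical record shape, not a GitLab API type."""
    repository: str
    job_name: str
    status: str          # assumed values: "success" or "failed"
    cpu_seconds: float   # assumed resource-usage figure for the run


def summarise_by_repository(runs):
    """Group job runs by repository: total runs, failures, and CPU seconds."""
    summary = defaultdict(lambda: {"runs": 0, "failures": 0, "cpu_seconds": 0.0})
    for run in runs:
        entry = summary[run.repository]
        entry["runs"] += 1
        entry["cpu_seconds"] += run.cpu_seconds
        if run.status == "failed":
            entry["failures"] += 1
    return dict(summary)


def runs_per_job(runs):
    """Count how many times each (repository, job name) pair has run."""
    return Counter((run.repository, run.job_name) for run in runs)
```

      In practice the records would come from whatever source feeds the metrics; the two functions only show the two groupings the acceptance criteria ask for.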
    • 3
    • 3
    • 0
    • Team_SYSTEM
    • Sprint 5
      • New Grafana GitLab dashboards are generated:
        • https://monitoring.skao.int/d/gitlab_ci_jobs/gitlab-ci-jobs?orgId=1
        • https://monitoring.skao.int/d/a87fb0d919ec0ea5f6543124e16c42a5/kubernetes-compute-resources-namespace-workloads?orgId=1
      • New namespace management dashboards are generated with a new namespace manager (which we can extend to cover more complex scenarios, such as cleaning CI namespaces only after a pod has been failing for 10 minutes): https://monitoring.skao.int/d/e374e7bb-e223-4398-aaa8-f15845755fd6/namespace-manager-namespaces?orgId=1
      • New metrics for environment management are generated with accompanying dashboards: https://monitoring.skao.int/d/edvfag7wkt7nke/namespace-manager-overall?orgId=1
      • GitLab-related settings are reviewed and automated (FF is turned off for now): https://developer.skao.int/en/latest/tools/ci-cd/gitlab-settings.html
      • Deployed Headlamp
      • Pipeline Machinery Metrics are implemented and deployed: https://monitoring.skao.int/dashboards/f/ddvpmaqhx3klcb/
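
      The possible extension mentioned above (clean a CI namespace only after a pod has been failing for 10 minutes) amounts to a simple time-threshold rule. A minimal sketch is given below, assuming the caller already knows when the pod started failing. The function name, its parameters, and the constant are hypothetical and do not describe how the namespace manager is actually implemented.

```python
from datetime import datetime, timedelta
from typing import Optional

# Grace period from the note above; the constant name is an assumption.
FAILURE_GRACE_PERIOD = timedelta(minutes=10)


def should_clean_namespace(
    failing_since: Optional[datetime],
    now: datetime,
    grace: timedelta = FAILURE_GRACE_PERIOD,
) -> bool:
    """Return True only if a pod has been failing for at least `grace`.

    `failing_since` is None when no pod in the namespace is failing,
    in which case the namespace is left alone.
    """
    if failing_since is None:
        return False
    return now - failing_since >= grace
```

      Keeping the rule as a pure function of two timestamps makes it easy to test without a cluster; how the failure start time is obtained is left open here.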
    • PI24 - UNCOVERED

    Description

      Note: this feature may be too big and may need splitting.

      • Improve DX by making related information for CI and integration environments (logs, resources, documentation, dependencies, tests) easier to find and navigate. That is, developers should have a consolidated, unified view of where their CI and integration tests are running, where to get the information needed to debug easily (logs, kubeconfig, etc.), how much resource their pods are using, and what their status is.

      • Enable automated, auditable management of CI and integration environments in the cloud: annotate/tag each created environment, distribute runner workloads fairly, and create and share resource-usage metrics for teams and business management (BM).
      • Identify and eliminate the bottlenecks behind the lack of integration environment usage and the unnecessary time spent (follow-up of the TMC workshop), using pipeline machinery and GitLab metrics to enable faster integration.

       

      IP Workshop Notes:

      • Review the GitLab configuration and see whether any improvement can be applied to how pipeline runs are stopped.
      • Encourage teams to set up persistent or better-isolated environments for testing, with the help of the System team.
      • Collect better data about pipeline failures (resource usage, network issues, successful re-runs, ...).

              People

                m.bartolini Bartolini, Marco
                v.allan Allan, Verity

                Feature Progress

                  Story Point Burn-up: (100.00%)

                  Feature Estimate: 3.0

                  Status        Issues   Story Points
                  To Do              0            0.0
                  In Progress        0            0.0
                  Complete          20           39.0
                  Total             20           39.0
