SAFe Solution / SS-82

Make sure SKAMPI is stable for developers and users


Details

    • Enabler
    • Must have
    • PI12
    • None
    • None
    • Data Processing, Services, Obs Mgt & Controls
    • It takes less time for teams to integrate Features in SKAMPI
    • SKAMPI Tests work reliably as integration tests
    • SKAMPI infrastructure status is decoupled from SKAMPI software quality
    • Team morale is increased, as they are not hampered by SKAMPI
    • Teams are shown that we care about software quality above pure output

      In PI12 we will focus on rebuilding the SKAMPI integration from the ground up. This will start with only a subset of components (see description) and proceed in inverse dependency order. At any stage we will: 

      • Make available a SKAMPI release that satisfies these criteria:
        • SKAMPI deployment is predictable, testable and repeatable 
        • The CI pipeline is GREEN, and tests can be reliably executed to identify bugs and errors
        • Tests are meaningful and their results can correctly predict system behaviour
        • Stress tests are executed correctly
        • The components to be re-integrated are: 
          • platform
          • tango-base 
          • test harness
          • central node  
          • Logging database and centralised logging capability
          • (uncommitted)
            • skuid
            • EDA
            • landing page
        • For all components re-integrated in the integration environment:
          • use best practices for container images, tools, and test processes (and update/disseminate docs when the docs aren't current)
          • re-align with SKA CI/CD policies
          • artefacts are released and published in the CAR following the Release Process
          • Code Ownership is correctly defined within SKAMPI, assigning clear responsibility for the different areas of the integration repository

      Note that this may include only a subset of the subsystems and of the test base.


      The end goal of this capability is to reach a state where, at both component and integrated SKAMPI level, we have:

      • A common deployment process that is flexible and scales with the environment and with partial deployments of SKAMPI
      • Common pipeline machinery, including gitlab-ci steps and Makefile targets
      • Common machinery for runtime applications (e.g. device servers), including command line options, environment variable handling/standardisation, mem/CPU/resource handling, and built-in health checkers, to support robust deployment and management
      • A testing framework that is flexible, with easy-to-understand output to support diagnosing problems
      • Testing that focuses equally on unhappy and happy paths
      • A suite of tests for SKAMPI software that must pass for any SKAMPI version to be accepted
      • A suite of tests for SKAMPI infrastructure (pipelines, virtual machines, attached storage…) that indicates when the underlying infrastructure is failing
      • A demonstration that all included tests run reliably and are not flaky; flaky tests should be isolated for investigation
      • A dashboard, accessible by everyone, that shows the status of SKAMPI infrastructure and SKAMPI tests for a number of versions (at least those from the last PI demo up to the current ones; at most those from the last 6 months)
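One way to make the "not flaky" goal checkable is to run the same suite several times against one SKAMPI version and compare outcomes per test. The following is a minimal sketch under that assumption; the function name, the input shape, and the test names are all hypothetical and do not come from SKAMPI or skallop.

```python
from collections import defaultdict


def classify_tests(runs):
    """Classify tests by their outcomes over repeated runs of one version.

    `runs` is a list of dicts mapping test name -> True (pass) / False (fail).
    A test is "flaky" when it both passed and failed across the runs.
    """
    outcomes = defaultdict(set)
    for run in runs:
        for name, passed in run.items():
            outcomes[name].add(passed)

    report = {"stable_pass": [], "stable_fail": [], "flaky": []}
    for name in sorted(outcomes):
        seen = outcomes[name]
        if seen == {True}:
            report["stable_pass"].append(name)
        elif seen == {False}:
            report["stable_fail"].append(name)
        else:
            report["flaky"].append(name)
    return report


# Hypothetical results from three runs against the same SKAMPI version.
runs = [
    {"test_deploy": True, "test_logging": True, "test_central_node_on": False},
    {"test_deploy": True, "test_logging": False, "test_central_node_on": False},
    {"test_deploy": True, "test_logging": True, "test_central_node_on": False},
]
report = classify_tests(runs)
print(report)
```

Tests reported as flaky would be the ones to isolate for investigation, while consistently failing tests point at a genuine defect in either the software or the test itself.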
    • 13
    • PI24 - UNCOVERED

    Description

      SKAMPI is brittle: it is easy for SKAMPI to get into a state in which tests do not pass and in which, even after reverting to a past configuration, tests still do not pass. We need to fix this before more development on SKAMPI features can happen.

      In order to do that, we need to: a) provide tests that are robust and that prove whether SKAMPI is working or not; b) fix SKAMPI so that tests can pass, or temporarily fix the test and raise an SKB against it to make sure that the underlying cause is fixed; c) provide a well-known version of SKAMPI to work against.
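As an illustration of point b), a test that is known to fail can be kept in the suite but marked as an expected failure that carries its ticket reference. This is only a sketch using Python's standard `unittest` module; the `known_failure` helper, the test class, and the `SKB-000` identifier are placeholders invented for the example, not SKAMPI code or a real ticket.

```python
import unittest


def known_failure(ticket):
    """Mark a test as an expected failure tracked by `ticket` (hypothetical helper)."""
    def decorate(test_func):
        # Record the ticket so that reports can list which SKBs are still open.
        test_func.tracked_by = ticket
        return unittest.expectedFailure(test_func)
    return decorate


class CentralNodeTests(unittest.TestCase):
    @known_failure("SKB-000")  # placeholder ticket id, not a real SKB
    def test_telescope_on(self):
        # Stand-in for a real integration check that currently fails.
        self.assertEqual("OFF", "ON")

    def test_deployment_reachable(self):
        # Stand-in for a check that currently passes.
        self.assertTrue(True)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CentralNodeTests)
result = unittest.TestResult()
suite.run(result)
print(len(result.expectedFailures), result.wasSuccessful())
```

With `unittest`, an expected failure that starts passing is reported as an unexpected success and makes the run unsuccessful, which prompts removal of the annotation once the underlying cause is fixed.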

      Many initiatives have been promoted to try to address this issue programmatically. These relate to improved processes for testing and bug fixing, better release management for better coordination, and refactoring of some core aspects within the control system, but the problem persists. After careful evaluation, the obvious choice seems to be to refocus on the internals of the SKAMPI integration and make sure that it conforms to the quality standards needed for smoother integration and testing.

      To this end, it has been decided that in PI12 a task team, composed of members from different teams, will be formed to rebuild the SKAMPI integration to a level where it is reliable and controllable. The team will proceed by rebuilding the integration from the ground up and verifying some core properties. Its activity will focus on the integration of some key components:

      • Test harness: skallop, organisation of test code and .feature files
      • Integration platform: deployment and monitoring of the k8s cluster and additional services 
      • TMC (common, TANGO DB, central node, EDA), skuid - possibly starting with a version that only deploys the central node. 

      INTEGRATION ORDER: 

      1. platform (scope to be confirmed: k8s, Elastic, MariaDB, TimescaleDB)
      2. tango-base (tango-cpp, tango-db, tango-dsconfig) 
      3. TMC (central node) 
      4. skuid
      5. Verifying that logging and transaction ID are correctly implemented in the integrated components
      6. EDA 
      7. landing page
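The planned order can be checked against a dependency map, so that no component is integrated before the things it relies on. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The dependency map is an assumption made purely for illustration; the ticket lists the order but does not state which component depends on which.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: component -> components it depends on.
# These edges are illustrative assumptions, not taken from the ticket.
dependencies = {
    "platform": set(),
    "tango-base": {"platform"},
    "TMC (central node)": {"tango-base"},
    "skuid": {"platform"},
    "logging and transaction ID checks": {"TMC (central node)", "skuid"},
    "EDA": {"tango-base"},
    "landing page": {"platform"},
}

# Integration order as listed in the ticket.
planned_order = [
    "platform",
    "tango-base",
    "TMC (central node)",
    "skuid",
    "logging and transaction ID checks",
    "EDA",
    "landing page",
]


def respects_dependencies(order, deps):
    """Return True if every component appears after all of its dependencies."""
    position = {name: index for index, name in enumerate(order)}
    return all(
        position[dep] < position[name]
        for name, needed in deps.items()
        for dep in needed
    )


# One valid order computed from the map (dependencies always come first).
computed_order = list(TopologicalSorter(dependencies).static_order())

print(respects_dependencies(planned_order, dependencies))
print(computed_order)
```

If the real dependency graph were recorded this way, a pipeline step could reject a proposed integration order that violates it, and `TopologicalSorter` would raise a `CycleError` if the map contained a circular dependency.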

      Other activities related to SS-82 can proceed in parallel to this integration effort in order to improve future integrations of components that are not touched in this initial effort. In particular: 

      • ruthless bug fixing activity
      • skallop code refactoring and test standardisation
      • Taranta - resource usage and deployability 
      • Archiver - refactoring 
      • SDP - updates to the component level testing to be more robust to failure modes
      • TMC - updates to the component level testing to be more robust to failure modes
      • OET - updates to the component level testing to be more robust to failure modes
      • Work already planned in relation to the refactoring of the Control System guidelines and their implementation 

       

      Please note that training of users so that they can make better use of SKAMPI is not part of the scope of this Enabler/Capability.

              People

                m.bartolini Bartolini, Marco
                j.santander-vela Santander-Vela, Juande

                Feature Progress

                  Story Point Burn-up: (98.96%)

                  Feature Estimate: 13.0

                  Status        Issues   Story Points
                  To Do              0            0.0
                  In Progress        1            1.0
                  Complete          38           95.5
                  Total             39           96.5

                  Capability Progress

                    Feature Point Burn-up: (93.33%)

                    Capability Estimate: 13

                    Status        Count   Feature Points
                    Todo              0                0
                    In Progress       1                2
                    Done             38               14
                    Total            39               15
