SAFe Program / SP-1884

Ensure all SDP components can be (re)deployed or be restarted independently and will auto-heal


Details

    • Enabler
    • Must have
    • PI12
    • COM SDP SW
    • None
    • Data Processing
    • As a core part of SKA software, SDP must be robust, resilient, and flexible to meet the demands of being part of a complex distributed system.

      Keeping all SDP components loosely coupled and independently deployable is important for a system that must stay responsive in the face of failure.

    • Demonstrate (preferably using automated testing) that all controllers (processing, Helm) and the Tango interface can be restarted independently and recover from the restart without manual intervention and without putting SDP into an unresponsive or unrecoverable failure state.
    • Tests should ensure that:
      • if a controller component is unavailable due to failure, the others continue to behave in a predictable and well-defined way and remain responsive;
      • if a controller component goes into an unexpected fault or failure state, it can be restarted and recover without the entire SDP system having to be restarted.
    • A detailed assessment of the additional steps needed to reach the robustness goal for all SDP services should be provided.
    • 2
    • 2
    • 100
    • 50
    • Team_ORCA
    • Sprint 2
    • The investigations into the effect of SDP component failures are described in SDP Component Failures and its sub-pages. The good news is that SDP components are already fairly robust to failures: they store their state in the configuration database, so after a restart they can read that state back and resume from it. Configuration DB failures are much more problematic, because the entire SDP loses its state. This risk could be mitigated by using an etcd deployment with redundant servers or a backing store on disk.

      We have implemented a test in the SDP integration pipeline for a failure of a subarray. The investigation into the different ways to implement the test is described in Integration tests for SDP internal failures.

      We have drafted some features for further work to improve the robustness of the system, and to add reporting of internal failures at the application level (e.g. via the Tango devices). We have also added some smaller items to the Orca team backlog as improvements, including some background investigations for the features. The proposed features and improvement items are in Proposed features to make SDP more robust.

    • 12.5
    • Stories Completed, Integrated, Solution Intent Updated, BDD Testing Passes (no errors), Outcomes Reviewed, NFRS met, Demonstrated, Satisfies Acceptance Criteria, Accepted by FO
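
      The recovery behaviour reported above (components keep their state in the configuration database and read it back after a restart) can be illustrated with a minimal sketch. It is not the SDP implementation: InMemoryConfigDB, Controller, the "/state/<name>" key layout and the "processed" counter are all hypothetical names chosen for illustration, and a plain dictionary stands in for the real configuration database.

```python
import json


class InMemoryConfigDB:
    """Hypothetical stand-in for the configuration database (a plain dict)."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        # Values are serialised, as they would be in an external key-value store.
        self._store[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._store.get(key)
        return default if raw is None else json.loads(raw)


class Controller:
    """Hypothetical controller that holds no state of its own across restarts."""

    def __init__(self, name, db):
        self.name = name
        self.db = db
        # On (re)start, read back whatever state was last persisted.
        self.state = db.get(f"/state/{name}", default={"processed": 0})

    def step(self):
        # Do one unit of work and persist the new state before returning.
        self.state["processed"] += 1
        self.db.put(f"/state/{self.name}", self.state)


def simulate_restart():
    """Fail and restart one controller while another keeps running."""
    db = InMemoryConfigDB()
    processing = Controller("processing", db)
    helm = Controller("helm", db)

    processing.step()
    processing.step()
    helm.step()

    # "Crash" the processing controller: drop the object, keep the database.
    del processing

    # The other controller keeps working while its peer is down.
    helm.step()

    # Restart: a fresh instance resumes from the persisted state.
    processing = Controller("processing", db)
    processing.step()
    return processing.state["processed"], helm.state["processed"]
```

      In this sketch a restarted controller picks up where it left off only because the database object survives the restart. Constructing a controller against a fresh InMemoryConfigDB starts again from zero, which mirrors the observation that a configuration DB failure loses the state of the entire SDP.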

    Description

      Contributes to SS-82 by making sure that SDP is independently resilient before attempting further tests of the resilience of the entire SKA software deployment.

       

              People

                p.wortmann Wortmann, Peter
                f.graser Graser, Ferdl

                Feature Progress

                  Story Point Burn-up: (100.00%)

                  Feature Estimate: 2.0

                                Issues   Story Points
                  To Do         0        0.0
                  In Progress   0        0.0
                  Complete      8        14.0
                  Total         8        14.0
