SAFe Program / SP-2045

SDP reports component status and failures


Details

    • Data Processing
    • See "Why?" in description.
    • See "What?" in description.
    • 3
    • 3
    • 0
    • 0
    • Team_ORCA
    • Sprint 5
    • Demo: at System Demo 17.6 (slides)

      We investigated various ways of improving the SDP's stability and how each component should report its status. These are documented on Confluence:

      • https://confluence.skatelescope.org/display/SE/Kubernetes+liveness%2C+readiness+and+startup+probes
      • https://confluence.skatelescope.org/display/SE/Analysis+and+Implementation+Plan

      We discussed these with the Feature Owner (notes on the second page) and decided to implement Kubernetes liveness probes, exit handlers, and lease entries in the Configuration database for the following components:

      • Controller device
      • Subarray device(s)
      • Processing Controller
      • Helm Deployer

      The controller device uses the lease entries to report component statuses. In addition, we added a BDD test (MR should be merged soon) to the integration repository for this new feature. The following releases contain the updates:

      • Configuration Library: ska-sdp-config==0.4.4
      • SDP LMC: ska-sdp-lmc==0.21.0
      • Processing Controller: ska-sdp-proccontrol==0.11.4
      • Helm Deployer: ska-sdp-helmdeploy==0.11.4
      • SDP: 0.14.0 (https://jira.skatelescope.org/browse/REL-367)

      Documentation of component status reporting on the controller device: https://developer.skao.int/projects/ska-sdp-lmc/en/latest/sdp-controller.html
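
As an illustration of the lease idea mentioned in this summary, the following is a minimal in-memory sketch, not the ska-sdp-config API. It assumes that each component periodically renews a lease with a fixed time-to-live and that a controller treats a component as offline once its lease has expired. The class `LeaseStore`, its methods, the status strings and the TTL value are all hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class LeaseStore:
    """In-memory stand-in for lease entries in a configuration database."""

    ttl: float = 5.0  # assumed: seconds a lease stays valid without renewal
    _expiry: Dict[str, float] = field(default_factory=dict, init=False, repr=False)

    def renew(self, component: str, now: Optional[float] = None) -> None:
        """Called periodically by a running component to keep its lease alive."""
        now = time.monotonic() if now is None else now
        self._expiry[component] = now + self.ttl

    def revoke(self, component: str) -> None:
        """Called from a component's exit handler on a clean shutdown."""
        self._expiry.pop(component, None)

    def status(self, components: Iterable[str], now: Optional[float] = None) -> Dict[str, str]:
        """Report ONLINE for components whose lease has not yet expired."""
        now = time.monotonic() if now is None else now
        return {
            name: "ONLINE" if self._expiry.get(name, float("-inf")) > now else "OFFLINE"
            for name in components
        }


# Hypothetical usage with explicit timestamps so the behaviour is reproducible.
store = LeaseStore(ttl=5.0)
store.renew("proccontrol", now=0.0)
store.renew("helmdeploy", now=0.0)
store.renew("proccontrol", now=4.0)  # only proccontrol keeps renewing
print(store.status(["proccontrol", "helmdeploy"], now=6.0))
```

In this sketch a component that stops renewing (for example because it crashed) is reported as offline after at most one TTL, without the controller having to contact it directly.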
    • 17.6
    • Stories Completed, Integrated, BDD Testing Passes (no errors), Outcomes Reviewed, NFRS met, Demonstrated, Satisfies Acceptance Criteria, Accepted by FO
    • PI23 - UNCOVERED

    • LOW_SUT1 LOW_SUT2 MID_SUT1 MID_SUT3

    Description

      Introduction

      When an SDP component goes offline for a period of time (i.e. it is not simply restarted), in most cases there is no indication that something is wrong. This is especially the case for the processing controller and the Helm deployer. In other cases, the error messages are vague and hard to decode. Providing a robust system that is able to monitor itself and self-heal will be necessary to ensure continuity and availability of SDP processing. Reporting of state and failures will enable operators to understand the state of the system and take any required action should issues arise.

      We need to implement a monitoring and error reporting system, which informs the other SDP components, the other SKA subsystems and the operators of a component failure. This could include:

      • When a component is offline, reporting the appropriate status via the attributes of the Tango controller and subarray devices (state, health state, and observing state).
      • Reporting state when the ska-sdp CLI is used (not just that a processing block started).
      • In the case of a config DB error, providing a more descriptive / user-friendly method of reporting in the CLI.
      • Developing a component state dashboard as part of the operator interface.
      • Where suitable, making use of Kubernetes liveness probes to restart a component when it has failed.
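
One common way to support a Kubernetes liveness probe is a heartbeat file that the component's main loop touches regularly, combined with an exec probe that checks the file's age. Kubernetes treats an exec probe exit code of 0 as success and any other code as failure. The sketch below shows that pattern under stated assumptions: the function names, the heartbeat file, and the choice to remove the file in an exit handler are hypothetical and are not taken from the SDP code.

```python
import atexit
import time
from pathlib import Path
from typing import Optional


def touch_heartbeat(path: Path) -> None:
    """Called from the component's main loop; updates the file's mtime."""
    path.touch()


def heartbeat_is_fresh(path: Path, max_age: float, now: Optional[float] = None) -> bool:
    """Return True if the heartbeat file exists and was touched recently."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return False
    now = time.time() if now is None else now
    return (now - mtime) <= max_age


def probe_exit_code(path: Path, max_age: float) -> int:
    """Exit code for an exec probe: 0 means alive, 1 means not alive."""
    return 0 if heartbeat_is_fresh(path, max_age) else 1


def register_exit_handler(path: Path) -> None:
    """Assumed clean-up: remove the heartbeat file when the process exits."""
    atexit.register(path.unlink, missing_ok=True)
```

A probe script would call `sys.exit(probe_exit_code(path, max_age))`. The maximum age should be longer than the interval at which the main loop touches the file, otherwise a healthy component would be restarted.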

      Who?

      • AIV engineers
      • AA Operators

      What?

      • Component failures are reported in an appropriate way on the Tango devices.
      • CLI shows the component status and more informative error messages about failures.
      • Kubernetes probes are implemented to restart components when they fail.
      • Stretch: Web interface displays information about the status of the components.
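
To make the first item concrete, the sketch below shows one possible way of combining per-component statuses into a single health value that a device could report. The rule used here is an assumption made for illustration only: any critical component offline gives FAILED, any other component offline gives DEGRADED, and everything online gives OK. The ticket does not specify the mapping, and the `Health` enum and `overall_health` function are hypothetical names.

```python
from enum import Enum
from typing import Dict, Set


class Health(Enum):
    """Hypothetical health values for this example."""

    OK = "OK"
    DEGRADED = "DEGRADED"
    FAILED = "FAILED"


def overall_health(online: Dict[str, bool], critical: Set[str]) -> Health:
    """Combine per-component online flags into one health value.

    Assumed rule: a critical component being offline means FAILED,
    any other offline component means DEGRADED, otherwise OK.
    """
    offline = {name for name, is_online in online.items() if not is_online}
    if offline & critical:
        return Health.FAILED
    if offline:
        return Health.DEGRADED
    return Health.OK


# Hypothetical usage: the Helm deployer is offline but not marked critical.
print(overall_health({"proccontrol": True, "helmdeploy": False}, critical={"proccontrol"}))
```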

      Why?

      • Reporting the status of the SDP to other subsystems and to operators will improve its usability and allow faster recovery from faults.

      References

      • SP-1884 for suggested approaches from previous investigations.

              People

                m.ashdown Ashdown, Mark

                Feature Progress

                  Story Point Burn-up: (100.00%)

                  Feature Estimate: 3.0

                                Issues   Story Points
                  To Do         0        0.0
                  In Progress   0        0.0
                  Complete      13       27.0
                  Total         13       27.0
