Details
Feature
Must have
Data Processing
3
3
0
0
REL-367 SDP 0.14.0
Team_ORCA
Sprint 5
17.6
Stories Completed, Integrated, BDD Testing Passes (no errors), Outcomes Reviewed, NFRS met, Demonstrated, Satisfies Acceptance Criteria, Accepted by FO
LOW_SUT1 LOW_SUT2 MID_SUT1 MID_SUT3
Description
Introduction
When an SDP component goes offline for a period of time (i.e. it is not simply restarted), in most cases there is no indication that something is wrong. This is especially true of the processing controller and the Helm deployer. In other cases, the error messages are vague and hard to decode. A robust system that can monitor itself and self-heal is necessary to ensure the continuity and availability of SDP processing. Reporting state and failures will let operators understand the state of the system and take any required action when issues arise.
We need to implement a monitoring and error reporting system, which informs the other SDP components, the other SKA subsystems and the operators of a component failure. This could include:
- When a component is offline, reporting the appropriate status via the attributes of the Tango controller and subarray devices (state, health state, and observing state).
- Reporting state when the ska-sdp CLI is used (not only that a processing block has started).
- In the case of a config DB error, providing a more descriptive / user-friendly method of reporting in the CLI.
- Developing a component state dashboard as part of the operator interface.
- Where suitable, using Kubernetes liveness probes to restart a component when it has failed.
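As a starting point for discussion, the status reporting described above could be sketched as a function that turns per-component heartbeat timestamps into one aggregate health state. This is a minimal sketch under stated assumptions, not the SDP implementation: the `HealthState` enum is a local stand-in, the component names (`proccontrol`, `helmdeploy`), the 10-second timeout, and the idea that each component publishes a heartbeat timestamp are all hypothetical.

```python
import enum
import time


class HealthState(enum.IntEnum):
    """Local stand-in for a health state attribute (assumption, not a real API)."""

    OK = 0
    DEGRADED = 1
    FAILED = 2
    UNKNOWN = 3


def component_status(last_heartbeat, now, timeout=10.0):
    """Classify one component from its last heartbeat timestamp (seconds)."""
    if last_heartbeat is None:
        # The component has never reported, so nothing can be said about it.
        return "unknown"
    return "online" if now - last_heartbeat <= timeout else "offline"


def aggregate_health(
    heartbeats,
    now=None,
    timeout=10.0,
    critical=("proccontrol", "helmdeploy"),  # hypothetical component names
):
    """Return (HealthState, per-component status) for a dict of heartbeats."""
    now = time.time() if now is None else now
    statuses = {
        name: component_status(ts, now, timeout)
        for name, ts in heartbeats.items()
    }
    offline = [name for name, status in statuses.items() if status == "offline"]
    if any(name in critical for name in offline):
        # Losing a critical component means processing cannot continue.
        return HealthState.FAILED, statuses
    if offline:
        return HealthState.DEGRADED, statuses
    if any(status == "unknown" for status in statuses.values()):
        return HealthState.UNKNOWN, statuses
    return HealthState.OK, statuses
```

The per-component dictionary returned alongside the aggregate state is the kind of information the CLI or a dashboard could display, so one function could serve both the Tango attributes and the operator-facing views.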
Who?
- AIV engineers
- AA Operators
What?
- Component failures are reported in an appropriate way on the Tango devices.
- CLI shows the component status and more informative error messages about failures.
- Kubernetes probes are implemented to restart components when they fail.
- Stretch: Web interface displays information about the status of the components.
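One possible way to support the Kubernetes probes listed above is a heartbeat file: the component refreshes a file from its main loop, and an exec liveness probe runs a small check that fails when the file is stale. The sketch below assumes this pattern; the file path, the 30-second threshold, and the function names are hypothetical and are not taken from the SDP code base. Kubernetes treats an exec probe as healthy when the command exits with status 0 and as failed otherwise.

```python
import os
import time


def touch_heartbeat(path):
    """Called by the component from its main loop to signal it is alive."""
    with open(path, "a"):
        pass
    os.utime(path, None)  # set the modification time to the current time


def is_alive(path, max_age=30.0, now=None):
    """Return True if the heartbeat file was refreshed within max_age seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file means the component never reported.
        return False
    return age <= max_age


def main(argv):
    """Probe entry point: returns 0 when healthy, 1 when the heartbeat is stale."""
    path = argv[1] if len(argv) > 1 else "/tmp/heartbeat"  # hypothetical default
    return 0 if is_alive(path) else 1
```

A probe script would finish with `sys.exit(main(sys.argv))` so that the exit status reaches the kubelet. Checking a heartbeat written by the main loop, rather than only checking that the process exists, catches the case where a component is running but stuck.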
Why?
- Reporting the status of the SDP to other subsystems and to operators will improve its usability and allow faster recovery from faults.
References
See SP-1884 for suggested approaches from previous investigations.