SAFe Program / SP-3280

TMC should be robust against the absence of subsystems


Details

    • Obs Mgt & Controls

      The TMC is the overall telescope control and monitoring system. It is important that it remains "live" and able to cope with the absence or presence of the other systems, allowing the user to observe the state of the whole system.

      1. When starting the system the TMC will start cleanly and be able to report the absence of any key subsystem (dishes, CSP, SDP, MCCS) if that subsystem does not start (either deliberately or because of failure).
      2. The TMC should cleanly show that it has not been able to read attributes from the missing subsystem and reflect this in any aggregated report.
      3. The TMC should reject attempts to command the missing subsystem.
      4. STRETCH - the TMC should be able to detect the loss of a key subsystem that was present and then act as in the points above.
      5. STRETCH - the TMC should be able to detect the successful starting of a key subsystem that was not present and then act normally.

      Clarification for point 3 above: There is a subtlety for commands such as "assign resources" and "configure". There may be reasons to take individual subsystems through the Observing State Machine without all being present, or even if all are present. A particular example is instructing the SDP to run a pipeline on data it already has, for re-processing. In this example only SDP resources are assigned to a subarray and only the SDP is commanded. The JSON requesting the resources and configurations will contain only an SDP structure. In this example the TMC should not instruct Dish/MCCS and CSP, because no resources and no configuration are requested of them. This subtlety may need more discussion during development.

      Note: this can be tested in the integration repo.
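The clarification for point 3 can be illustrated with a minimal sketch: a command is rejected only when the request actually addresses a subsystem that is unavailable. This is not the TMC implementation. The key names in `SUBSYSTEM_KEYS`, the exception class, and both function names are hypothetical, and the real request schema is not given in this ticket.

```python
import json

# Hypothetical top-level keys; the real request schema is not given in this ticket.
SUBSYSTEM_KEYS = ("dish", "csp", "sdp", "mccs")


class SubsystemUnavailableError(Exception):
    """Raised when a request addresses a subsystem that is absent."""


def targeted_subsystems(request_json):
    """Return the subsystems that the request asks something of."""
    request = json.loads(request_json)
    return [key for key in SUBSYSTEM_KEYS if key in request]


def check_command_allowed(request_json, availability):
    """Reject the command only if an addressed subsystem is unavailable.

    `availability` maps subsystem name to True/False; a subsystem with
    no entry is treated as unavailable.
    """
    targets = targeted_subsystems(request_json)
    missing = [name for name in targets if not availability.get(name, False)]
    if missing:
        raise SubsystemUnavailableError(
            "cannot command unavailable subsystem(s): " + ", ".join(missing)
        )
    return targets
```

With this approach an SDP-only re-processing request is accepted even when Dish, MCCS and CSP are absent, while a request that also contains a CSP structure is rejected and names the missing subsystem.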

    • 2
    • 2
    • 0
    • Team_SAHYADRI
    • Sprint 3

      Design:
      The design for availability reporting was discussed with the FO and is documented on the Confluence page:
      https://confluence.skatelescope.org/display/SWSI/Availability+Attribute
      It covers availability reporting for the CSP and SDP subarray devices by their respective leaf nodes, and aggregation of those reports by the TMC subarray node.
      The Dish leaf nodes are not considered in the subarray availability reporting (this may need further consideration).
      Implementation:
      TMC subarray leaf nodes detect the unavailability of the CSP and SDP subarray devices. This is reported to the subarray node, which in turn reports an aggregated 'availability' value.
      TMC master leaf nodes detect the unavailability of the CSP and SDP master devices. This is reported to the central node, which in turn reports it on its 'availability' attribute.
      TMC blocks command execution depending on the unavailability of the SDP and CSP systems.
      The implementation is verified on the individual TMC node repos. The related merged MRs are listed below.
      CentralNode:
      https://gitlab.com/ska-telescope/ska-tmc/ska-tmc-centralnode/-/merge_requests/103
      SubarrayNode:
      https://gitlab.com/ska-telescope/ska-tmc/ska-tmc-subarraynode/-/merge_requests/102
      https://gitlab.com/ska-telescope/ska-tmc/ska-tmc-subarraynode/-/merge_requests/104
      TMC leaf nodes:
      https://gitlab.com/ska-telescope/ska-tmc/ska-tmc-cspleafnodes/-/merge_requests/59
      https://gitlab.com/ska-telescope/ska-tmc/ska-tmc-sdpleafnodes/-/merge_requests/338
      https://gitlab.com/ska-telescope/ska-tmc/ska-tmc-dishleafnode/-/merge_requests/39
      A System Demo was provided for the TMC reporting unavailability of the CSP and SDP subsystems.

      Integration Updates/Issues (TMC integration repo):
      It is observed that, on startup, the Central node misses the subarray node's availability events and hence does not report the subarray availability correctly.
      During subsequent operation, if the CSP/SDP devices become unavailable, this is reflected correctly in the subarray node's aggregated 'availability'. In this case the Central node also receives the events and reports the subarray 'availability' correctly.
      Because of the startup issue described above, the integration of the implemented functionality is not complete. This integration activity needs to continue in the next PI, so the feature may be carried forward or cloned, or a new feature can be created to take the work to closure.
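The aggregation described under Implementation can be sketched as follows. This is an illustration only, not the code from the merge requests above. The class name, the method names and the device labels `csp_subarray` and `sdp_subarray` are hypothetical. The sketch assumes that a device which has not yet reported counts as unavailable, and it leaves Dish leaf nodes out of the aggregate, as the design notes state.

```python
class SubarrayAvailability:
    """Aggregates leaf-node availability reports for one subarray."""

    # Hypothetical labels for the devices that feed the aggregate.
    REQUIRED = ("csp_subarray", "sdp_subarray")

    def __init__(self):
        # Latest report per device; empty until the first report arrives.
        self._reported = {}

    def on_leaf_node_report(self, device, is_available):
        """Record a report from a leaf node and return the new aggregate."""
        self._reported[device] = bool(is_available)
        return self.aggregated

    @property
    def aggregated(self):
        """True only if every required device has reported itself available."""
        return all(self._reported.get(name, False) for name in self.REQUIRED)
```

Treating "no report yet" as unavailable means the aggregate starts out False and only becomes True once both required devices have reported in.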
    • 19.6
    • Stories Completed, Integrated, BDD Testing Passes (no errors), Outcomes Reviewed, Demonstrated, Satisfies Acceptance Criteria
    • PI24 - UNCOVERED

    • Team_SAHYADRI
    • OMC-G1 SOL-G4

    Description

      The TMC should be able to handle the absence of key subsystems (CSP, SDP, MCCS, Dish,...) in a graceful manner.

      1. If a key subsystem is not present then the TMC should report this to the user and be able to show the absence cleanly in a dashboard.
      2. When a subsystem is not present then the TMC should handle the failure to obtain key attributes gracefully, e.g. attributes that feed into an aggregated status.
      3. If the TMC is asked (by the OET or by a user) to issue commands to an absent subsystem, it should handle this cleanly and report the problem to the user (but note the "clarification" in the acceptance criteria; this may need further discussion).
      4. The TMC should be able to detect that a key subsystem has disappeared while the whole system is running, and then act as above. (stretch)
      5. The TMC should be able to detect the appearance of a key subsystem and absorb it cleanly into the system. (stretch)

      Note: testing these scenarios can be done in the integration repo only.
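Items 4 and 5 amount to noticing changes between two successive views of subsystem availability. A minimal sketch of that comparison is shown below. The function name and the event labels "lost" and "appeared" are hypothetical, and how the TMC obtains each view (polling or events) is not specified in this ticket.

```python
def availability_transitions(previous, current):
    """Compare two availability snapshots and list what changed.

    Each snapshot maps subsystem name to True/False; a subsystem that is
    missing from a snapshot is treated as unavailable.
    """
    events = []
    for name in sorted(set(previous) | set(current)):
        was_available = previous.get(name, False)
        is_available = current.get(name, False)
        if was_available and not is_available:
            events.append((name, "lost"))
        elif is_available and not was_available:
            events.append((name, "appeared"))
    return events
```

A caller could apply the handling from items 1 to 3 for each "lost" entry and resume normal operation for each "appeared" entry.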

              People

                g.leroux Le Roux, Gerhard [X] (Inactive)
                a.bridger Bridger, Alan

                Feature Progress

                  Story Point Burn-up: (100.00%)

                  Feature Estimate: 2.0

                  Status        Issues   Story Points
                  To Do         0        0.0
                  In Progress   0        0.0
                  Complete      8        37.0
                  Total         8        37.0
