  1. SAFe Program
  2. SP-3749

SRCNet Operations Health Monitor UI for all services


Details

    • Feature
    • Should have
    • PI20
    • None
    • SRCnet

      In the distributed, federated SRCNet environment, diagnosing a problem as it happens will be an enormous challenge. The number and types of failures depend on the number of components and where they execute. In addition, any update to the SRCNet, which will happen regularly, can introduce a new type of failure.

      In general, observability is a measure of how well a system’s internal states can be inferred from knowledge of its external outputs. SRCNet observability will use the data and insights that monitoring produces to provide a holistic understanding of the system, including its health and performance. That understanding will come partly from identifying which monitoring metrics should be used to interpret SRCNet health, and from deciding how to collect, collate, and effectively present the information.

      Operators, engineers, analysts, and other team members will benefit because observability offers a shared view of the environment, providing a more comprehensive understanding of its architecture, health and performance over time. Access to the same insights about services, users and other system elements will support more accurate post-incident reviews, as all parties can examine documented records of real-time system behaviour instead of piecing events together from siloed, individual sources. The data will help teams understand why incidents occurred, improving prevention and future incident handling.

      SRCNet observability will allow developers to understand the SRCNet’s internal state at any given time. It should give operators access to more accurate information about SRCNet faults in the distributed production environments. It should enable developers to fix and eventually prevent problems more easily, and it should foster a greater understanding of SRCNet performance and how it shapes user experience.


      AC1: SRCNet Health Monitor UI deployed and operating to monitor SRCNet node/service health.

      AC2: Operations community feedback on the deployment and operations of the SRCNet Health Monitor UI is captured.

      AC3: Gap analysis to determine what could be added to service APIs. 

    • Team_RED
    • Sprint 4
    • Overdue
    • PI23 - UNCOVERED

    • PI20-PB

    Description

      Design and implement a prototype UI that provides:

      • a health overview of each service, both globally and at each SRCNet node
      • per-service drill-down, e.g. 24-hour trend tracking

      Also add the necessary availability endpoints to services lacking them (e.g. GMS).
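
A minimal sketch of the data roll-up such a UI might sit on, offered as an illustration rather than the agreed design. It assumes each availability check is stored as a record with a node, a service, a timestamp and a healthy/unhealthy flag, with all timestamps in one timezone. The record shape and the function names (HealthSample, latest_status, global_overview, hourly_trend) are hypothetical and are not taken from any SRCNet service API.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class HealthSample:
    """One availability check result (hypothetical record shape)."""
    node: str            # SRCNet node name (illustrative)
    service: str         # service name, e.g. "GMS"
    checked_at: datetime
    healthy: bool


def latest_status(samples):
    """Return {(node, service): healthy} using the newest sample per pair."""
    newest = {}
    for s in samples:
        key = (s.node, s.service)
        if key not in newest or s.checked_at > newest[key].checked_at:
            newest[key] = s
    return {key: s.healthy for key, s in newest.items()}


def global_overview(status):
    """Roll per-node status up to {service: (healthy_nodes, total_nodes)}."""
    counts = defaultdict(lambda: [0, 0])
    for (_node, service), healthy in status.items():
        counts[service][0] += int(healthy)
        counts[service][1] += 1
    return {service: tuple(c) for service, c in counts.items()}


def hourly_trend(samples, node, service, now, hours=24):
    """Fraction of healthy checks per hour over the last `hours` hours.

    Returns a list of length `hours`, oldest bucket first. A bucket with
    no samples is None so the UI can show a gap rather than a false 0%.
    """
    start = now - timedelta(hours=hours)
    buckets = [[0, 0] for _ in range(hours)]
    for s in samples:
        if s.node != node or s.service != service:
            continue
        if not (start <= s.checked_at < now):
            continue
        # Whole hours elapsed since the start of the window.
        idx = (s.checked_at - start) // timedelta(hours=1)
        buckets[idx][0] += int(s.healthy)
        buckets[idx][1] += 1
    return [ok / total if total else None for ok, total in buckets]
```

Keeping the per-node status separate from the global roll-up lets the same samples feed both the overview and the per-service drill-down.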

            People

              r.bolton Bolton, Rosie
              s.goliath sharon goliath

              Feature Progress

                Story Point Burn-up: (0%)

                Feature Estimate: 0.0

                Status       Issues  Story Points
                To Do        0       0.0
                In Progress  0       0.0
                Complete     0       0.0
                Total        0       0.0
