Details

Type: Feature
Priority: Should have
Fix Version/s: PI20
Component/s: None
Labels:
- PI20-PB

ARTs:

SRCNet
Benefit hypothesis:

In the distributed and federated SRCNet environment, understanding a problem as it occurs will be an enormous challenge, because the number and types of failures depend on the number of parts and where they execute. In addition, any update to the SRCNet, and updates will happen regularly, can create a new type of failure.

In general, observability is a measure of how well a system’s internal states can be inferred from knowledge of its external outputs. SRCNet observability will use the data and insights that monitoring produces to provide a holistic understanding of the system, including its health and performance. That understanding will arise partly from identifying which monitoring metrics should be used to interpret SRCNet health, and how to collect, collate, and effectively present the information.

Operators, engineers, analysts, and other team members will benefit because observability offers a shared view of the environment, providing a more comprehensive understanding of its architecture, health and performance over time. Access to the same insights about services, users and other system elements will support more accurate post-incident reviews, as all parties can examine documented records of real-time system behaviour instead of piecing events together from siloed, individual sources. The data will help teams understand why incidents occurred, improving both prevention and the handling of future incidents.

SRCNet Observability should allow developers to understand the SRCNet’s internal state at any given time. It should give operators access to more accurate information about SRCNet faults in distributed production environments. It should enable developers to fix problems more easily and eventually prevent them, and it should foster a greater understanding of SRCNet performance and how it shapes user experience.

Acceptance criteria:

AC1: SRCNet Health Monitor UI deployed and operating to monitor SRCNet node/service health.

AC2: Operations community feedback on the deployment and operations of the SRCNet Health Monitor UI is captured.

AC3: A gap analysis is completed to determine what could be added to service APIs.

Epic Link:
Mini SRCNet Demonstrator
Agile Teams:

Team_RED
Due Sprint:
Sprint 4
Story Point Burn-up:
Overdue:
Overdue

Requirement Status:

PI23 - UNCOVERED
Labels_MIRO:
PI20-PB

Description

Design and implement a prototype UI to track:

- health overview of each service deployed globally and at each SRCNet node
- per-service drill-down, e.g. 24-hour trend tracking

Add the necessary Availability endpoints to services lacking them (e.g. GMS).
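
The two views described above (a health overview per service and per node, and a 24-hour per-service trend) could be derived from a single stream of availability samples. The following is a minimal sketch under stated assumptions, not the planned implementation: the `HealthSample` record, the function names, the node names and the "up"/"degraded"/"down" labels are all hypothetical, and it assumes each service's Availability endpoint can be reduced to a boolean per check.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class HealthSample:
    """One availability check result (hypothetical record shape)."""
    service: str          # service name, e.g. "GMS"
    node: str             # SRCNet node identifier (illustrative)
    timestamp: datetime   # when the check was made
    available: bool       # True if the Availability endpoint reported OK


def latest_status(samples):
    """Return {(service, node): available} from the newest sample per pair."""
    latest = {}
    for s in sorted(samples, key=lambda s: s.timestamp):
        latest[(s.service, s.node)] = s.available
    return latest


def global_overview(samples):
    """Roll per-node status up into one label per service."""
    per_service = defaultdict(list)
    for (service, _node), ok in latest_status(samples).items():
        per_service[service].append(ok)
    overview = {}
    for service, flags in per_service.items():
        if all(flags):
            overview[service] = "up"
        elif any(flags):
            overview[service] = "degraded"   # up at some nodes only
        else:
            overview[service] = "down"
    return overview


def trend_24h(samples, service, node, now):
    """Hourly availability fraction for the 24 hours ending at `now`.

    Returns 24 values, oldest hour first; None where no samples exist.
    """
    start = now - timedelta(hours=24)
    buckets = [[] for _ in range(24)]
    for s in samples:
        if s.service == service and s.node == node and start <= s.timestamp < now:
            idx = (s.timestamp - start) // timedelta(hours=1)
            buckets[idx].append(s.available)
    return [sum(b) / len(b) if b else None for b in buckets]


if __name__ == "__main__":
    now = datetime(2024, 1, 2, tzinfo=timezone.utc)
    demo = [
        HealthSample("GMS", "node-a", now - timedelta(minutes=30), False),
        HealthSample("GMS", "node-b", now - timedelta(minutes=10), True),
    ]
    print(global_overview(demo))
    print(trend_24h(demo, "GMS", "node-a", now)[-1])
```

Keeping the raw samples and computing both views from them means the overview and the drill-down cannot disagree; a real deployment would more likely read these samples from an existing monitoring store than hold them in memory.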

People

Assignee: Bolton, Rosie

Reporter: sharon goliath

Votes: 0

Watchers: 1

Feature Progress

Story Point Burn-up: (0%)

Feature Estimate: 0.0

	Issues	Story Points
To Do	0	0.0
In Progress	0	0.0
Complete	0	0.0
Total	0	0.0

Dates

Created: 23/Aug/23 9:55 AM

Updated: 26/Feb/24 2:00 PM

SRCNet Operations Health Monitor UI for all services
