Details

Type: Feature
Priority: Not Assigned
Fix Version/s: PI25
Component/s: SRCnet Federated Execution
Labels:
None

ARTs:

SRCnet
Benefit hypothesis:

In our early prototype (~~SP-4326~~) we have already demonstrated scraping metrics for SRCs to build dashboards.

While the current solution is the easiest implementation possible, when scaling to an SRCNet-wide solution, multiple aspects need to be considered, such as agreeing on an interface for sharing metrics, security, maintainability, scalability, etc.

The current approach has the disadvantage that it would require exposing many HTTP services on the internet so that they can be scraped. It would be desirable to agree on a better interface at the SRCNet level for sharing metrics and alerts.

Epic Link:
Distributed Data Computing v0.2 - Roadmap

Requirement Status:

PI24 - UNCOVERED

Description

In our early prototype (~~SP-4326~~) we have already demonstrated scraping metrics for SRCs to build dashboards.

While the current solution is the easiest implementation possible, when scaling to an SRCNet-wide solution, multiple aspects need to be considered, such as agreeing on an interface for sharing metrics, security, maintainability, scalability, etc.

The current approach has the disadvantage that it would require exposing many HTTP services on the internet so that they can be scraped. It would be desirable to agree on a better interface at the SRCNet level for sharing metrics and alerts.

A low-hanging fruit here might be Prometheus's federation capabilities:
https://prometheus.io/docs/prometheus/latest/federation/

This would allow us to implement a hierarchical structure in which, for instance, each SRC runs a Prometheus instance that only needs to expose the /federate endpoint, and one (or multiple) global dashboards aggregate a subset of those metrics (only the ones that are relevant from a global SRCNet viewpoint). This approach has several potential advantages (I can't confirm all of them, since I'm not a Prometheus expert, but they sound reasonable enough to share):

Security: sites only need to expose a single /federate endpoint, as opposed to a "wider" opening.
Scalability: we only need to scrape a subset of relevant metrics from the /federate endpoint, and if even those become too much, we can add layers to the hierarchy. I suspect adding layers will also increase latency, but that might be acceptable at the global scale, especially if we keep alerting at the local sites. I suspect this would be much harder to tackle with a "single" instance.
Availability and resiliency: it should be possible to have a second Prometheus providing a similar global SRCNet view by scraping the same federated endpoints. This would give us a second instance to fall back on if our primary instance or metrics store goes down.
Collaboration: individual sites can add and expose their own monitoring metrics, and we scrape (using match expressions) and aggregate those metrics.
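
As a rough illustration of the hierarchy described above, the sketch below generates a scrape job for a global Prometheus that pulls a filtered subset of series from each site's /federate endpoint. The `metrics_path`, `honor_labels` and `match[]` settings follow the Prometheus federation documentation linked above; the site hostnames, the job name and the `srcnet_` metric prefix are hypothetical placeholders, not agreed SRCNet conventions.

```python
import json

# Hypothetical site endpoints; real SRC hostnames are not specified in this ticket.
SITE_TARGETS = [
    "prometheus.src-a.example.org:443",
    "prometheus.src-b.example.org:443",
]

# Hypothetical selector: only series whose name starts with "srcnet_" are federated.
MATCHERS = ['{__name__=~"srcnet_.*"}']


def federation_job(targets, matchers, job_name="srcnet-federate"):
    """Return a scrape job dict for a global Prometheus federating from each site."""
    return {
        "job_name": job_name,
        "scheme": "https",
        "honor_labels": True,  # keep the labels set by the site-level Prometheus
        "metrics_path": "/federate",
        "params": {"match[]": list(matchers)},
        "static_configs": [{"targets": list(targets)}],
    }


# The dict mirrors the shape of a scrape_configs entry in prometheus.yml.
config = {"scrape_configs": [federation_job(SITE_TARGETS, MATCHERS)]}
print(json.dumps(config, indent=2))
```

The result is printed as JSON only to avoid a YAML dependency. A second global instance, as suggested under availability, could presumably be given the same job so that both scrape the same site endpoints.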

Of course this warrants further discussion. One implication is that we would need the SRCs to agree to implement the /federate endpoint. This wasn't explicitly part of v0.1, but then again, AFAIK we never had a proposal for how exactly SRCNet-wide monitoring would work.
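
To make concrete what agreeing on the /federate endpoint would mean for a site, here is a minimal sketch of the request a global collector would issue. Prometheus's /federate endpoint takes one or more `match[]` query parameters containing series selectors; the site URL and the selectors below are hypothetical and only for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical site URL and selectors, for illustration only.
url = federate_url(
    "https://prometheus.src-a.example.org/",
    ['{job="node"}', '{__name__=~"srcnet_.*"}'],
)
print(url)

# Round-trip check: the selectors survive URL encoding unchanged.
print(parse_qs(urlsplit(url).query)["match[]"])
```

Under this model a site's obligation reduces to serving this one path, ideally behind whatever authentication SRCNet settles on, which this sketch does not cover.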

Another option is a more purely push-based approach, such as a message queue or similar, which might also entail more infrastructure and work.

Issue Links

Child Of

SP-4781 Distributed Data Computing v0.2 - Roadmap

People

Assignee: Mort, Ben

Reporter: Llopis, Pablo

Votes: 0

Watchers: 0

Feature Progress

Story Point Burn-up: (0%)

Feature Estimate: 0.0

	Issues	Story Points
To Do	0	0.0
In Progress	0	0.0
Complete	0	0.0
Total	0	0.0

Dates

Created: 26/Aug/24 11:07 AM

Updated: 6 days ago 2:37 AM

Due Sprint Date: 11/Mar/25

Explore SRCNet-wide monitoring infra model