Details
- Feature
- Not Assigned
- None
Description
In our early prototype (SP-4326) we have already demonstrated scraping metrics for SRCs to build dashboards.
While the current solution is the easiest implementation possible, when scaling to an SRCNet-wide solution, multiple aspects need to be considered, such as agreeing on an interface for sharing metrics, security, maintainability, scalability, etc.
The current approach has the disadvantage that it would require exposing many HTTP services on the internet so that they can be scraped. It would be desirable to agree on a better SRCNet-level interface for sharing metrics and alerts.
A low-hanging fruit here might be Prometheus's federation capabilities:
https://prometheus.io/docs/prometheus/latest/federation/
This would allow us to implement a hierarchical structure: for instance, each SRC runs a Prometheus instance that only needs to expose the /federate endpoint, and one (or multiple) global dashboards aggregate a subset of those metrics (only the ones that are relevant from a global SRCNet viewpoint). There are potential advantages to this approach (I can't confirm all of these, since I'm not a Prometheus expert, but they sound reasonable enough to share):
- Security: sites only need to expose a single /federate endpoint, as opposed to a "wider" opening.
- Scalability: we only need to scrape the subset of relevant metrics from each /federate endpoint. If even that becomes too much, we can add layers to the hierarchy. I suspect each extra layer adds latency, but that might be acceptable at the global scale, especially if alerting stays at the local sites. I suspect this is much harder to address with a "single" instance.
- Availability and resiliency: it should be possible to run a second Prometheus that provides a similar global SRCNet view by scraping the same /federate endpoints. This would give us a fallback in case our primary instance or metrics store goes down.
- In terms of collaboration: Individual sites can add/expose their own monitoring metrics and we scrape (using match expressions) and aggregate those metrics.
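To make the "scrape using match expressions" idea concrete, here is a minimal sketch of how a global collector could build the request it sends to a site's /federate endpoint. Prometheus expects one or more `match[]` URL parameters, each holding a series selector. The host name, the `srcnet` job label, and the `srcnet:` metric prefix below are hypothetical placeholders, not agreed SRCNet conventions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url: str, selectors: list[str]) -> str:
    """Build a /federate URL with one match[] parameter per series selector."""
    if not selectors:
        # Prometheus requires at least one match[] parameter on /federate.
        raise ValueError("at least one match[] selector is required")
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical site URL and selectors, for illustration only.
    url = federate_url(
        "https://prometheus.src-a.example",
        ['{job="srcnet"}', '{__name__=~"srcnet:.*"}'],
    )
    print(url)
    # Decode the query string again to show which selectors the site receives.
    print(parse_qs(urlsplit(url).query)["match[]"])
```

In a real deployment the global Prometheus would normally do this itself through a scrape job pointed at /federate, so the sketch only illustrates what each site would be asked for: the selectors define the agreed interface, and everything else a site collects stays local.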
Of course this warrants further discussion. One implication is that we would have to agree that every SRC exposes a /federate endpoint. This wasn't explicitly part of v0.1, but then again, AFAIK we never had a proposal for how exactly SRCNet-wide monitoring would work.
Another option is a purely push-based approach, such as a message queue or similar, which might also entail more infrastructure and work.
Issue Links
- Child Of
- SP-4781 Distributed Data Computing v0.2 - Roadmap
- Funnel