SAFe Program / SP-1089

Explore performance issues of Grafana & Prometheus


Details

    • Spike
    • Not Assigned
    • None
    • None
    • Obs Mgt & Controls

      We learn whether Grafana + Prometheus, used for monitoring, has an acceptable throughput (greater than 1000 updates/s) as a solution for building UIs for real-time monitoring.

      A second aspect is to measure the round-trip latency of commands sent by a dashboard and received and acted upon by a Tango device. This should be of the order of 100 ms.

      If so, then the Grafana + Prometheus solution could constitute a valid alternative to WebJive. It has the added advantage of being a mature product, with many features that we still need to implement in WebJive.

      If it isn't, it is still likely to be very attractive as a framework for a Web-based Engineering Archive interface (another thing on the Tango to-do list), but if the two functions can be combined in one architecture, there will clearly be additional synergies. Hence, this spike will also contribute to an understanding of the Grafana + Prometheus architecture for these other areas.


      A performance assessment report similar to the one that was done for Webjive.

    • 2
    • 7
    • 11.5

    Description

      After d-carlo.matteo's lightning talk about using Grafana as a possible alternative to WebJive, we need to assess the extent to which the architecture underlying Grafana poses performance risks to us.
      This work entails a performance test of the architecture described by Matteo di Carlo in https://docs.google.com/presentation/d/1H1fd5Arkl7b93nbgebzqxcNy0WyRDMTMcO-7Vy3ET6A/edit#slide=id.p1 against the performance requirements that were stated for WebJive, namely 1000 changes in a dashboard in less than 1 s.
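The throughput requirement can be checked mechanically once the test records a timestamp for every change that becomes visible in the dashboard. The following is a minimal sketch under that assumption; the function names and the way timestamps are collected are hypothetical and not part of the architecture described in the slides.

```python
from typing import Sequence

REQUIRED_CHANGES = 1000  # requirement stated for WebJive
WINDOW_SECONDS = 1.0     # the changes must all land within this window


def peak_changes_in_window(timestamps: Sequence[float],
                           window: float = WINDOW_SECONDS) -> int:
    """Largest number of changes seen in any interval shorter than `window`."""
    ordered = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(ordered):
        # Drop changes that are `window` seconds or more older than the current one.
        while t - ordered[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


def meets_throughput(timestamps: Sequence[float],
                     required: int = REQUIRED_CHANGES,
                     window: float = WINDOW_SECONDS) -> bool:
    """True if at least `required` changes were displayed in under `window` seconds."""
    return peak_changes_in_window(timestamps, window) >= required
```

A run passes if `meets_throughput(observed_timestamps)` is true, where `observed_timestamps` holds the display times in seconds gathered by whatever instrumentation the test uses.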

      Underlying this throughput specification is a latency requirement that was assumed in the event-driven architecture of WebJive but may not be achievable in a polled monitoring system like Grafana. We need to be able to make control changes and see the monitoring feedback immediately, with no discernible lag. Hence we need to assess how long a round trip takes: a command travelling from a dashboard to a Tango device, and the associated monitoring response returning to the dashboard. This should be in the 100 ms range for the lag to stay below the detection threshold.
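One way to time such a round trip is sketched below. It assumes the test harness can supply two callables: one that issues the command and one that reports whether the monitoring response has become visible. Both callables, the timeout, and the polling step are hypothetical placeholders; the polling step itself limits the resolution of the measurement.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_round_trip(send_command: Callable[[], None],
                       response_seen: Callable[[], bool],
                       timeout_s: float = 2.0,
                       poll_s: float = 0.001) -> float:
    """Seconds from issuing a command until its monitoring response is observed."""
    start = time.perf_counter()
    send_command()
    while not response_seen():
        if time.perf_counter() - start > timeout_s:
            raise TimeoutError("no monitoring response within timeout")
        time.sleep(poll_s)
    return time.perf_counter() - start


def summarize(latencies_s: List[float]) -> Dict[str, float]:
    """Median, 95th percentile (nearest rank) and maximum, in milliseconds."""
    ordered = sorted(latencies_s)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "median_ms": statistics.median(ordered) * 1000,
        "p95_ms": ordered[p95_index] * 1000,
        "max_ms": ordered[-1] * 1000,
    }
```

Repeating `measure_round_trip` many times and passing the results to `summarize` gives figures that can be compared directly with the 100 ms target.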

      One way to do this could be:

      • Set up a Tango device that is capable of producing that many changes per second across several attributes.
      • Have these changes triggered by a command or attribute.
      • Create a Grafana dashboard that has widgets for displaying attribute values and a basic control mechanism.
      • Run a test for each combination of settings.
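The last step, one test per combination of settings, could be driven by a small test matrix such as the sketch below. The setting names and values are illustrative assumptions only; the ticket does not say which settings should be varied.

```python
import itertools
from typing import Dict, Iterator, List

# Hypothetical settings to vary; replace with the ones that matter for the spike.
SETTINGS: Dict[str, List[float]] = {
    "attributes": [10, 100],
    "changes_per_attribute_per_s": [10, 100],
    "scrape_interval_s": [0.1, 1.0],
    "dashboard_refresh_s": [1.0, 5.0],
}


def combinations(settings: Dict[str, List[float]]) -> Iterator[Dict[str, float]]:
    """Yield one dict per combination of setting values (full Cartesian product)."""
    names = list(settings)
    for values in itertools.product(*(settings[name] for name in names)):
        yield dict(zip(names, values))


def offered_load(run: Dict[str, float]) -> float:
    """Total attribute changes per second the device is asked to produce."""
    return run["attributes"] * run["changes_per_attribute_per_s"]


if __name__ == "__main__":
    for run in combinations(SETTINGS):
        print(run, "offered load:", offered_load(run), "changes/s")
```

Each yielded dict describes one test run, and `offered_load` shows which runs reach the 1000 changes/s target.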

      See what was done for https://jira.skatelescope.org/browse/SP-296

       


            People

              g.brajnik Brajnik, Giorgio

              Feature Progress

                Story Point Burn-up: (0%)

                Feature Estimate: 2.0

                Status        Issues   Story Points
                To Do         0        0.0
                In Progress   0        0.0
                Complete      0        0.0
                Total         0        0.0
