SAFe Program / SP-2085

DALiuGE implementation of distributed, I/O-intensive pipeline


Details

    • Enabler
    • Should have
    • PI13
    • COM SDP SW
    • None
    • Data Processing
    • One of the major selling points of the SKA SDP architecture is that it would allow us to scale with the telescope by running state-of-the-art algorithms in scalable execution frameworks. This is meant to start gathering experience with some of the problems SDP will be facing in this space.
    • DALiuGE pipeline developed that distributes both image and frequency space while demonstrating the indicated scaling behaviour.
      • This can be either forward (grid-to-image) or backward (image-to-grid)
      • For "frequency space" the standard is that we have visibilities at some arbitrary UVW positions (choosing on-grid positions is easiest, as it does not require [de]gridding).
    • Initial investigation into scaling behaviour:
      • image size: Can we show that image data is fairly distributed across nodes with no (or little) redundancy?
      • grid/visibility space: Can we demonstrate that we can coordinate a "scan" through grid space across nodes, touching every region only once on one node?
    • Stretch: Look into (de)gridding and the required visibility binning steps.
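
      A minimal sketch (plain NumPy, outside DALiuGE) of the degridding-free forward case mentioned in the criteria above: visibilities placed at on-grid UVW positions are scatter-added onto a grid and FFTed to an image, so no convolution kernel is needed. The array sizes and random layout are illustrative assumptions only.

        # Toy forward (grid-to-image) step for the degridding-free case:
        # on-grid visibilities make gridding a plain scatter-add followed
        # by an inverse FFT. All sizes below are illustrative assumptions.
        import numpy as np

        N = 256                            # image/grid side length (assumed)
        n_vis = 10_000                     # number of on-grid visibilities (assumed)

        rng = np.random.default_rng(0)
        u = rng.integers(0, N, n_vis)      # on-grid u indices
        v = rng.integers(0, N, n_vis)      # on-grid v indices
        vis = rng.normal(size=n_vis) + 1j * rng.normal(size=n_vis)

        grid = np.zeros((N, N), dtype=complex)
        np.add.at(grid, (v, u), vis)       # scatter-add, no (de)gridding kernel

        image = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(grid))).real
        print(image.shape)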
    • 5
    • 5
    • 5
    • 1
    • Team_YANDA
    • Sprint 5
    • Scaling Behavior of IO-intense Workflows: https://confluence.skatelescope.org/pages/viewpage.action?pageId=165623315
    • DALiuGE Scaling Behavior and Workflow Developer Support: https://confluence.skatelescope.org/pages/viewpage.action?pageId=168664182
    • 14.1
    • Stories Completed, Integrated, Solution Intent Updated, Outcomes Reviewed, NFRs met, Demonstrated, Satisfies Acceptance Criteria, Accepted by FO

    Description

      The overall goal here should be to make a first attempt at assessing how execution frameworks can (or cannot) help us address SDP scaling challenges. Following the consortia work [1] on this topic, we are still working off the assumption that our toughest problem is managing

      • storage I/O
      • internal I/O
      • local memory residency
      • workload balance

      in a not-yet-entirely-determined priority order. Imaging is best understood in this context, so we should focus on that at first. The overall challenge would be to implement a pipeline with the following properties:

      1. Computational scaling of ~O(n_vis + log(n_image) n_image^3 + n_source), where n_image is the total image resolution (scaling with maximum baseline length).
      2. Work with image sizes bigger than what fits into individual node memory
      3. Work with (many) more visibilities than fit into collective node memory, i.e. use background storage for loading them (>= 10 TB class)
      4. Load every visibility from storage only once per major loop iteration
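
      For orientation, a back-of-envelope reading of properties (2)-(4); every number below is a placeholder assumption chosen for illustration, not an SKA sizing figure.

        # Rough sizing illustration for properties (2)-(4).
        # All parameters are placeholder assumptions, not SKA1 sizing figures.
        n_stations  = 512
        n_baselines = n_stations * (n_stations - 1) // 2
        n_channels  = 4_096
        n_dumps     = 1_800
        n_pol       = 4
        bytes_per_vis = 8                        # complex64

        vis_bytes = n_baselines * n_channels * n_dumps * n_pol * bytes_per_vis
        print(f"visibilities: {vis_bytes / 1e12:.0f} TB")   # >= 10 TB class -> (3), (4)

        n_image = 131_072                        # assumed image side length in pixels
        grid_bytes = n_image ** 2 * 16           # one complex128 grid/image plane
        print(f"grid/image plane: {grid_bytes / 1e9:.0f} GB")
        # comparable to or larger than a single node's memory -> (2)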

      This is a tough combination of challenges. There are basically two known approaches that fit the bill:

      • Facet imaging [2] - obviously fulfills (2), can fulfill (1) by reducing facet visibilities using baseline-dependent averaging (BDA), can do (3)+(4) by designating nodes to load visibility chunks, phase rotate them and distribute them to (de)gridding nodes (see the sketch below)
      • Distributed FFT - either using the standard approach or [3] - fulfills (1) natively, mostly fulfills (2) with somewhat increased overhead, can do (3)+(4) with better quality than BDA
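
      A rough sketch of the per-facet step in the first approach: a node loads a visibility chunk and phase-rotates it towards each facet centre before handing it to that facet's (de)gridding task. The sign convention and small-field handling below are one common choice, assumed here rather than taken from [2].

        # Phase-rotate a visibility chunk towards a facet centre (l0, m0),
        # given UVW coordinates in wavelengths. Sign convention assumed.
        import numpy as np

        def rotate_to_facet(vis, uvw, l0, m0):
            n0 = np.sqrt(1.0 - l0**2 - m0**2)
            u, v, w = uvw.T
            phase = 2j * np.pi * (u * l0 + v * m0 + w * (n0 - 1.0))
            return vis * np.exp(phase)

        # One loaded chunk, fanned out to four (assumed) facet centres:
        rng = np.random.default_rng(1)
        uvw = rng.uniform(-1000.0, 1000.0, (5_000, 3))
        vis = rng.normal(size=5_000) + 1j * rng.normal(size=5_000)
        centres = [(-0.01, -0.01), (-0.01, 0.01), (0.01, -0.01), (0.01, 0.01)]
        per_facet = {c: rotate_to_facet(vis, uvw, *c) for c in centres}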

      What both approaches have in common is that they put quite a bit of strain on the execution framework: in either case we need to load visibilities on a node and then relate them to the image data - which means scheduling many related tasks across multiple nodes.
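
      To make that scheduling pressure concrete: every loaded visibility chunk has to meet every facet (or subgrid) it contributes to, so the task graph per major cycle is roughly a chunk-by-facet product. The counts below are arbitrary illustrations, not derived from any sizing document.

        # Schematic of the task fan-out the execution framework must schedule:
        # each visibility-load task feeds every facet/subgrid task it overlaps.
        # Counts are arbitrary illustrations.
        n_chunks = 1_000       # visibility chunks loaded from storage (assumed)
        n_facets = 64          # facets or subgrid regions (assumed)

        edges = [(f"load_vis[{c}]", f"image_region[{f}]")
                 for c in range(n_chunks)
                 for f in range(n_facets)]
        print(len(edges), "producer->consumer dependencies per major cycle")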

      [1] http://ska-sdp.org/sites/default/files/attachments/pipeline-working-sets.pdf

      [2] https://www.aanda.org/articles/aa/abs/2018/03/aa31474-17/aa31474-17.html

      [3] https://gitlab.com/ska-telescope/sdp/ska-sdp-exec-iotest, https://gitlab.com/scpmw/crocodile/-/blob/io_benchmark/examples/notebooks/facet-subgrid-impl-new.ipynb and https://arxiv.org/abs/2108.10720 [under review]

    Attachments

    Issue Links

    Structure

    Activity

    People

      p.wortmann Wortmann, Peter
      f.graser Graser, Ferdl
      Votes: 0
      Watchers: 2

    Feature Progress

      Story Point Burn-up: (100.00%)

      Feature Estimate: 5.0

                    Issues   Story Points
      To Do              0            0.0
      In Progress        0            0.0
      Complete           7           12.0
      Total              7           12.0

    Dates

      Created:
      Updated:
      Resolved:
