Details
- Enabler
- Should have
- None
- Data Processing
- 5
- 5
- 5
- 1
- Team_YANDA
- Sprint 5
- 14.1
- Stories Completed, Integrated, Solution Intent Updated, Outcomes Reviewed, NFRs met, Demonstrated, Satisfies Acceptance Criteria, Accepted by FO

SPO-1578
Description
The overall goal is to make a first attempt at assessing how execution frameworks can (or cannot) help us address SDP scaling challenges. Following the consortium work [1] on this topic, we are still working under the assumption that our toughest problem is managing
- storage I/O
- internal I/O
- local memory residency
- workload balance
in a not-yet-entirely-determined priority order. Imaging is best understood in this context, so we should focus on it first. The overall challenge is to implement a pipeline with the following properties:
1. Computational scaling of ~O(n_vis + log(n_image) n_image^3 + n_source), where n_image is the total image resolution (scaling with maximum baseline length).
2. Work with image sizes larger than fit into an individual node's memory.
3. Work with (many) more visibilities than fit into collective node memory, i.e. use background storage for loading them (>= 10 TB class).
4. Load every visibility from storage only once per major loop iteration.
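Property (4) in particular constrains the pipeline's control flow: each major loop iteration must make exactly one pass over the visibility store. A minimal sketch of that access pattern, with toy stand-ins for the actual (de)gridding and FFT work (the function and variable names here are illustrative, not from any SDP codebase):

```python
import numpy as np

def major_loop(vis_chunks, n_iter=2):
    """Toy major loop: every visibility chunk is read exactly once per
    iteration and accumulated into a (trivially small) stand-in image."""
    image = np.zeros((8, 8))
    reads = 0
    for _ in range(n_iter):
        for chunk in vis_chunks:        # one pass over storage per iteration
            reads += len(chunk)
            image[0, 0] += chunk.sum()  # stand-in for (de)gridding + FFT
    return image, reads

# Two chunks of four "visibilities" each, three major loop iterations:
chunks = [np.ones(4), np.ones(4)]
image, reads = major_loop(chunks, n_iter=3)
assert reads == 3 * 8  # property (4): each visibility read once per iteration
```

At real scale `vis_chunks` would be a lazy iterator over >= 10 TB of storage, which is what makes the single-pass constraint bite.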
This is a tough combination of challenges. There are basically two known approaches that fit the bill:
- Facet imaging [2] - obviously fulfills (2); can fulfill (1) by reducing facet visibilities using BDA; can do (3)+(4) by designating nodes to load visibility chunks, phase rotate them, and distribute them to (de)gridding nodes
- Distributed FFT - either using the standard approach or [3] - fulfills (1) natively, mostly fulfills (2) with somewhat increased overhead, and can do (3)+(4) with better quality than BDA
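The per-facet step in the first approach is a phase rotation: shifting the phase centre of the visibilities to each facet's centre before gridding. A self-contained numpy sketch (sign conventions vary between packages; this is one consistent choice, and the function name is illustrative):

```python
import numpy as np

def phase_rotate(vis, uvw, dl, dm):
    """Shift the visibility phase centre by (dl, dm) in direction cosines,
    the core per-facet operation in facet imaging: each facet grids
    visibilities rotated to its own centre (after BDA reduction)."""
    u, v, w = uvw.T
    dn = np.sqrt(1.0 - dl**2 - dm**2) - 1.0
    return vis * np.exp(-2j * np.pi * (u * dl + v * dm + w * dn))
```

With this convention, a unit point source at offset (dl, dm) rotates to a constant visibility of 1, i.e. the source lands exactly on the new phase centre.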
What both approaches have in common is that they put quite a bit of strain on the execution framework: in either case we need to load visibilities on a node and then relate them to the image data, which means scheduling many related tasks across multiple nodes.
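The scheduling shape an execution framework would have to handle can be sketched with stdlib futures as a stand-in for a real framework such as Dask: loader tasks read chunks, and each chunk then fans out to gridding tasks for every facet, so the number of dependent tasks per chunk multiplies quickly (all functions and numbers below are toy placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def load_chunk(i):
    """Stand-in for a node reading one visibility chunk from storage."""
    return np.full(4, float(i))

def grid_facet(chunk, facet):
    """Stand-in for (de)gridding one facet's share of a chunk."""
    return chunk.sum() * (facet + 1)

with ThreadPoolExecutor(max_workers=4) as pool:
    chunk_futures = [pool.submit(load_chunk, i) for i in range(3)]
    # Fan-out: every loaded chunk feeds a gridding task for every facet,
    # so the scheduler must track chunks x facets dependent tasks.
    grid_futures = [pool.submit(grid_facet, cf.result(), f)
                    for cf in chunk_futures for f in range(2)]
    totals = [g.result() for g in grid_futures]
```

With 3 chunks and 2 facets this already yields 6 dependent tasks; at SDP scale the chunk and facet counts are orders of magnitude larger, which is exactly the strain on the framework described above.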
[1] http://ska-sdp.org/sites/default/files/attachments/pipeline-working-sets.pdf
[2] https://www.aanda.org/articles/aa/abs/2018/03/aa31474-17/aa31474-17.html
[3] https://gitlab.com/ska-telescope/sdp/ska-sdp-exec-iotest, https://gitlab.com/scpmw/crocodile/-/blob/io_benchmark/examples/notebooks/facet-subgrid-impl-new.ipynb and https://arxiv.org/abs/2108.10720 [under review]