Details
- Enabler
- High
- True
- Data Processing
- 8 / 8 / 0
- AA2
Description
Profiles of our runs show that we have largely succeeded in distributing the expensive operations of ICAL. However, a lot of time is still spent in phases where little parallel work is happening at all - in some cases we spend hours using just a few (or even a single) core on the master node. The net result is that we are likely using less than 5% of the compute available to us.
What?
- Identify all major phases where CPU utilisation drops below ~50% for the 3-node run. This might require improved instrumentation / logging. Ideally we would use more representative dataset sizes where possible, and investigate why we sometimes see different results despite using the same parameters.
- Determine (informally) why these phases currently take as long as they do, and why they don't use more nodes (or threads).
- Resolve the most significant bottleneck by reworking and possibly re-distributing processing functions - by doing (at least) one of the following:
  - Take a serious look at sky model filtering, and whether it can be sped up or distributed effectively (or whether an existing solution can be integrated)
  - Attempt to distribute the deconvolution processing functions (thinking about how to make this work with RADLER would be very valuable long-term)
  - Reduce the memory usage of calibration to prevent swapping (average visibilities / normal equations?). Ideally have a mechanism to balance gridding efficiency (which favours large time and frequency intervals) against memory usage (which favours short time and frequency intervals).
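As a starting point for the first item, low-utilisation phases could be flagged automatically from sampled cluster CPU utilisation. A minimal sketch, assuming we can export `(timestamp, utilisation)` samples from the profiler or node monitoring; the sample format, threshold, and minimum duration are illustrative assumptions:

```python
def low_utilisation_phases(samples, threshold=0.5, min_duration=60.0):
    """Return (start, end) pairs where utilisation stays below `threshold`
    for at least `min_duration` seconds.

    `samples` is a time-sorted list of (t, util) tuples, util in [0, 1].
    """
    phases = []
    start = None
    end = None
    for t, util in samples:
        if util < threshold:
            if start is None:
                start = t  # a low-utilisation phase begins
            end = t
        elif start is not None:
            # phase ended; keep it only if it lasted long enough
            if end - start >= min_duration:
                phases.append((start, end))
            start = None
    # handle a phase still open at the end of the run
    if start is not None and end - start >= min_duration:
        phases.append((start, end))
    return phases
```

This would give a shortlist of phases to cross-reference against the pipeline's own logging, which is where the improved instrumentation comes in.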
See frame on DP ART board: https://miro.com/app/board/uXjVK6Lrdw4=/?moveToWidget=3458764597687428890&cot=14
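For the gridding-efficiency / memory balance in the last bullet, a rough sizing helper could guide the choice of solution interval. This is a sketch under assumed data layout (complex64 visibilities, 4 polarisations); the sizes and budget are illustrative, not measured values:

```python
def visibility_memory_bytes(n_baselines, n_channels, n_times,
                            bytes_per_vis=8 * 4):
    """Rough footprint of one visibility buffer: complex64 (8 bytes)
    times 4 polarisations per visibility (assumed layout)."""
    return n_baselines * n_channels * n_times * bytes_per_vis

def largest_interval_fitting(n_baselines, n_channels, mem_budget_bytes):
    """Largest number of time samples per solution interval that keeps
    the buffer under the budget. Longer intervals favour gridding
    efficiency; the budget caps memory to avoid swapping."""
    per_time = visibility_memory_bytes(n_baselines, n_channels, 1)
    return max(1, mem_budget_bytes // per_time)
```

A balancing mechanism could then pick the longest interval that fits each node's memory budget, rather than a fixed interval for all dataset sizes.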