SAFe Program / SP-4276

Find and access AA0.5 Data Products for further analysis


Details

    • Data Processing
      • This feature is intended to integrate and bring to fruition the data management products that have been prototyped so far: https://gitlab.com/ska-telescope/ska-data-lifecycle , https://gitlab.com/ska-telescope/sdp/ska-sdp-dataproduct-dashboard . The objectives we had in PI21 have not been reached, and the intention is to resolve the blockers that prevented that from happening.
      • The immediate benefit hypothesis is to enable discoverability and access to data for the Low AA0.5 telescope (July 2024). Initially this applies to data products generated by the SDP (ingested visibilities, calibration solutions) as well as to station-level data products generated by the MCCS (station visibilities, unchannelised data, channelised data).
      • This provides a fundamental service enabling the ingestion, tracking, tracing, finding and accessing of data products of any kind within the SKAO. It will initially use a generic data management schema, but allow for product-specific schemas as well.
      • For SDP final and intermediate data products, generated by the SDP processing pipelines, descriptions of data products are available on confluence (one confluence page per product, following the standard template under https://confluence.skatelescope.org/display/SWSI/Data+Products)
      • Details about how to deploy the ska-data-lifecycle solution are sorted out with respect to deployment locations, database access and configuration. Deployment of the ska-data-lifecycle and all auxiliary services is described on confluence in the Solution Intent space.
      • Introduce a service that can be used to track intermediate and final data products by ID (as per ADR-54). The DLM API allows one to (a hypothetical usage sketch follows this list):
        • Add new data products
        • Register new storage locations / remove storage locations for these data products (i.e. also notify of moves/copies)
        • Update data products on their status (e.g. finished)
        • Obtain location information about a specific data product (should allow a number of storage backends, including informal ones such as storage on an HPC system like CSD3, the STFC cloud or Pawsey)
        • Obtain lifecycle information for a specific data product (at minimum its lifetime)
        • (stretch) Update user annotations
      • Integrate into SDP
        • the data management services are available in the DP Integration platform
        • have the data product dashboard show information from it
        • consider authentication (this would need to go via execution block ID metadata)
      • Enable data migration from the MCCS server
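
      A minimal sketch of how a client might exercise the DLM API capabilities listed above, assuming a REST-style interface reachable over HTTP. The base URL, endpoint paths, payload fields and storage identifier are hypothetical illustrations, not the actual ska-data-lifecycle interface.

```python
# Hypothetical DLM client calls; endpoint names and fields are illustrative only.
import requests

DLM_URL = "https://dlm.example.skao.int"  # hypothetical deployment URL

# Register (initialise) a new data product under an execution-block-scoped name.
resp = requests.post(
    f"{DLM_URL}/data_item/init",
    json={
        "item_name": "eb-m001-20240701-00001/vis.ms",
        "uri": "/mnt/data/eb-m001-20240701-00001/vis.ms",
        "storage_id": "dp-ceph",  # hypothetical storage location ID
        "metadata": {"eb_id": "eb-m001-20240701-00001"},
    },
    timeout=10,
)
resp.raise_for_status()
item_id = resp.json()["item_id"]

# Update the product's status to "finished" so downstream consumers can pick it up.
requests.patch(f"{DLM_URL}/data_item/{item_id}", json={"state": "finished"}, timeout=10)

# Query storage location and lifecycle information for the product.
location = requests.get(f"{DLM_URL}/data_item/{item_id}/location", timeout=10).json()
lifecycle = requests.get(f"{DLM_URL}/data_item/{item_id}/lifecycle", timeout=10).json()
print(location, lifecycle)
```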

      Based on https://confluence.skatelescope.org/display/SE/PI23+Planning+-+DP+-+DLM+design+session

      YANDA

      • Review with SDP the necessary APIs to Initialise and Register data products.
      • Configure the DLM for the necessary storage locations: Ceph storage on the data processing cluster and AA0.5, plus the MCCS (see the sketch after this list)
      • Develop and integrate the workflow that connects to the DPD (as a starting point, when a data product is "finished")
      • Update the RCV pipeline to integrate with the DLM service
      • Demonstrate the DLM functionalities in the DP integration environment
      • Create documentation about how to publish new data products to the DLM
      • Define the list of groups that need to be implemented in the AAA system for integration with the DLM
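
      A minimal sketch of the storage-location configuration task above, under the assumption that locations are registered through a REST-style admin call. The endpoint path, field names and location identifiers are hypothetical; the real DLM configuration mechanism may differ.

```python
# Hypothetical registration of the storage locations named in the YANDA task above.
import requests

DLM_URL = "https://dlm.example.skao.int"  # hypothetical deployment URL

storage_locations = [
    {"storage_id": "dp-ceph", "type": "ceph", "root": "cephfs://dp-cluster/data"},  # DP cluster Ceph
    {"storage_id": "low-aa05", "type": "posix", "root": "/mnt/aa05/data"},          # Low AA0.5
    {"storage_id": "mccs-server", "type": "posix", "root": "/srv/mccs/data"},       # MCCS server
]

for location in storage_locations:
    # Endpoint path and payload are illustrative; the real DLM admin API may differ.
    response = requests.post(f"{DLM_URL}/storage/init", json=location, timeout=10)
    response.raise_for_status()
    print(f"registered {location['storage_id']}: {response.json()}")
```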

      NALEDI

      • Implement and expose Update Metadata API endpoints as needed (see the sketch after this list).
      • Integrate the Elastic persistence layer
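
      A minimal sketch of an "Update Metadata" endpoint of the kind named above, written with FastAPI. The route, request model and in-memory stand-in store are hypothetical and are not the actual ska-sdp-dataproduct-api implementation.

```python
# Hypothetical FastAPI endpoint for updating data product metadata.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Stand-in persistence layer; the real service would write to PostgreSQL/Elasticsearch.
METADATA_STORE: dict[str, dict] = {}


class MetadataUpdate(BaseModel):
    execution_block: str
    metadata: dict


@app.patch("/data_products/{product_id}/metadata")
def update_metadata(product_id: str, update: MetadataUpdate) -> dict:
    """Merge new metadata fields into the record for a data product."""
    record = METADATA_STORE.setdefault(product_id, {"execution_block": update.execution_block})
    record.update(update.metadata)
    return {"product_id": product_id, "metadata": record}
```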

      BANG

      • Support YANDA in configuring the storage locations
      • Provide PGSQL instances to be used in the DP integration cluster as requested by teams.
      • Support DLM integration on the LOW AA0.5 PGSQL instance
      • Provide DLM and DPD with DNS name entries in the different environments
      • Provide a dedicated Elastic index for DPD integration (see the sketch after this list)
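
      A minimal sketch of creating a dedicated Elasticsearch index for DPD metadata, as in the last BANG item above, using the official Python client (8.x API assumed). The host, index name and mapping fields are hypothetical.

```python
# Hypothetical creation of a dedicated index for DPD metadata (elasticsearch-py 8.x).
from elasticsearch import Elasticsearch

es = Elasticsearch("https://elastic.example.skao.int:9200")  # hypothetical host

INDEX_NAME = "sdp-dataproduct-metadata"  # hypothetical dedicated index for the DPD

if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(
        index=INDEX_NAME,
        mappings={
            "properties": {
                "execution_block": {"type": "keyword"},
                "date_created": {"type": "date"},
                "metadata_file": {"type": "keyword"},
            }
        },
    )
```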

      SST

      • Initiate the conversation about the need and schedule to expand our storage capacity
    • 5
    • 5
    • 0
    • Team_NALEDI, Team_YANDA
    • Sprint 5

      Final, end of PI23, status of this feature:

      Very significant progress has been achieved, but not all of the acceptance criteria have been met, mostly due to integration issues and uncertainties around the deployment infrastructure. In summary, the individual teams have done their part of the work as far as possible, but the integration onto the actual LOW platform, as well as the integration between the various subsystems (DLM, DPD, SDP receive, MCCS), has not been done; this points to a shortage of system integration activities, or at least of support for them. Since features cannot cross PI boundaries and the follow-up features are already in PI24, we are releasing this one now.

      There are multiple sets of acceptance criteria for this feature: one overall, and one for each of the teams involved (YANDA, NALEDI, BANG and SST). The following tables address the status of each:

      Overall Feature Acceptance Criteria

      • Acceptance criterion: For SDP final and intermediate data products, generated by the SDP processing pipelines, descriptions of data products are available on confluence (one confluence page per product, following the standard template under https://confluence.skatelescope.org/display/SWSI/Data+Products).
        Status: The definition of data products is the responsibility of the subsystem generating those data products, not the DLM or any of the supporting technical teams. Currently the page only lists a single product.
      • Acceptance criterion: Details about how to deploy the ska-data-lifecycle solution are sorted out with respect to deployment locations, database access and configuration. Deployment of the ska-data-lifecycle and all auxiliary services is described on confluence in the Solution Intent space.
        Status: Not finalized, but we made a very good start for SKA-LOW, identifying the actual target locations. There are far too many unknowns and planned changes to the SKA-LOW platform to finalize this. The cluster will be expanded and amended with AAVS machines, and even the simple question of where the DLM and the Dashboard would run on the cluster could not be resolved. This is the main goal for the new features in PI24.
      • Acceptance criterion: Introduce a service that can be used to track intermediate and final data products by ID (as per ADR-54).
        Status: We have not tackled this at all, since there are far more fundamental issues to be solved before we can even think about that level of tracking detail.
      • Acceptance criterion: Integrate into SDP (the data management services are available in the DP Integration platform; have the data product dashboard show information from it; consider authentication, which would need to go via execution block ID metadata).
        Status: The DLM is deployed on the DP cluster in the YANDA namespace. The integration with the Data Product Dashboard failed due to integration issues, but we have now marked that as technical debt and will resolve it as soon as possible.
      • Acceptance criterion: Enable data migration from the MCCS server.
        Status: This is fully related to the

      YANDA Acceptance Criteria

      • Acceptance criterion: Review with SDP the necessary APIs to Initialise and Register data products.
        Status: Done, but no final agreement reached: https://confluence.skatelescope.org/display/SWSI/ADR-101+SDP+buffer+management+and+interface+to+Data+Lifecycle+Management
      • Acceptance criterion: Configure the DLM for the necessary storage locations: Ceph storage on the data processing cluster and AA0.5, plus the MCCS.
        Status: Pending; targets unclear/undefined.
      • Acceptance criterion: Develop and integrate the workflow that connects to the DPD (as a starting point, when a data product is "finished").
        Status: Done; the DLM is sending metadata to the DPD.
      • Acceptance criterion: Update the RCV pipeline to integrate with the DLM service.
        Status: Done. However, there is no automatic integration test for this system-level integration.
      • Acceptance criterion: Demonstrate the DLM functionalities in the DP integration environment.
        Status: Partially done; the DLM is on the DP cluster, but we have not done an official demo. We now have a ticket to write a guideline on how to use the DLM inside the DP cluster.
      • Acceptance criterion: Create documentation about how to publish new data products to the DLM.
        Status: Done; the documentation is auto-generated. There is likely still room for improvement, since the current documentation is still fairly low level and there are multiple ways to 'publish' data products to the DLM.
      • Acceptance criterion: Define a list of groups that need to be implemented in the AAA system for integration with the DLM.
        Status: Done; pointed again to the extended group system used by ALMA, which is almost certainly also applicable to SKA. The actual implementation is not a DLM issue, but an observatory-wide activity.

      NALEDI Acceptance Criteria

      • Acceptance criterion: Implement and expose Update Metadata API endpoints as needed.
        Status: Done.
      • Acceptance criterion: Integrate the Elastic persistence layer.
        Status: ??

       

      BANG Acceptance Criteria

      • Acceptance criterion: Support YANDA in configuring the storage locations.
        Status: Pending.
      • Acceptance criterion: Provide PGSQL instances to be used in the DP integration cluster as requested by teams.
        Status: Done.
      • Acceptance criterion: Support DLM integration on the LOW AA0.5 PGSQL instance.
        Status: Pending.
      • Acceptance criterion: Provide DLM and DPD with DNS name entries in the different environments.
        Status: Partially done.
      • Acceptance criterion: Provide a dedicated Elastic index for DPD integration.
        Status: ??

       

      SST Acceptance Criteria

      • Acceptance criterion: Initiate the conversation about the need and schedule to expand our storage capacity.
        Status: Done, but the actual implementation is still pending, which blocks the registration of the DLM target endpoints.

       

      NALEDI

      In this PI, the way that metadata is handled in the DPD has been changed. It now makes use of a persistent PostgreSQL database for storage of metadata where available; the in-memory functionality has been updated and maintained to assist users during the changeover to PostgreSQL. The search implementation has also been updated to make use of the Elasticsearch instance made available by the !Bang team. (Again, the in-memory search has been updated and maintained to assist users during the changeover to Elasticsearch.)

      To enable these changes, restructuring of the API and Dashboard was needed. This included updates to the data structure used to serve data to the MUI DataGrid component on the dashboard, a rework of the API project structure to improve the logical separation between different modes of operation (in-memory, or using the databases when available), updates to all the supporting methods to align with the requirements of saving data in PostgreSQL and Elasticsearch, improved handling of environment variables, and various other improvements.
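
      A minimal sketch of the mode-of-operation selection described above: use a persistent backend when it is configured, otherwise fall back to the in-memory implementation. Class names and the environment variable are hypothetical, not the actual ska-sdp-dataproduct-api code.

```python
# Hypothetical backend selection for DPD metadata storage: persistent if configured,
# in-memory otherwise. Names are illustrative, not the actual application code.
import os


class InMemoryMetadataStore:
    """Fallback store used when no database backend is available."""

    def __init__(self) -> None:
        self._items: dict[str, dict] = {}

    def save(self, product_id: str, metadata: dict) -> None:
        self._items[product_id] = metadata


class PostgresMetadataStore:
    """Persistent store; connection details come from the environment."""

    def __init__(self, dsn: str) -> None:
        self.dsn = dsn

    def save(self, product_id: str, metadata: dict) -> None:
        ...  # write to PostgreSQL (omitted in this sketch)


def build_metadata_store():
    # Environment variable name is illustrative only.
    dsn = os.getenv("SDP_DATAPRODUCT_POSTGRES_DSN")
    if dsn:
        return PostgresMetadataStore(dsn)
    return InMemoryMetadataStore()


store = build_metadata_store()
store.save("eb-m001-20240701-00001", {"obscore": {"dataproduct_type": "visibility"}})
```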

      For a full record of all the changes made as part of this feature, please see the respective change logs of each application:

       

      Updates to the SDP Data Product Dashboard API:

      Changelog — ska-sdp-dataproduct-api 0.8.0 documentation (skao.int)

      Updates to the SDP Data Product Dashboard:

      Changelog — ska-sdp-dataproduct-dashboard 0.8.2 documentation (skao.int)

       

      Releases:

      <PENDING improvements of test coverage as part of NAL-1157>

      A temporary deployment can be accessed here: https://sdhp.stfc.skao.int/dp-naledi-andre/dashboard/

      https://sdhp.stfc.skao.int/dp-naledi-andre/api/status

    • PI24 - UNCOVERED


            People

              m.bartolini Bartolini, Marco
              b.mort Mort, Ben
               Votes: 0
               Watchers: 1

              Feature Progress

                Story Point Burn-up: (88.57%)

                Feature Estimate: 5.0

                              Issues   Story Points
                 To Do             2            4.0
                 In Progress       0            0.0
                 Complete         15           31.0
                 Total            17           35.0
