SAFe Program / SP-4654

prepareData Operation: Data Environment With References To Repository Data for SRCNet Users


Details

    • prepareData Functionality
    • SRCnet

      BH1: Replicas of data into SRCNet nodes are controlled in time, including extra copies of popular data into efficient storage and close to the computing units

      BH2: Multiple replicas of the same data product are not needed at the same SRCNet node in the same storage

      BH3: The efficiency of accessing local data products will be better than a preliminary download by-stream approach for products already present at this SRCNet node

      BH4: Users will be able to select data into the science gateway and open an analysis environment (e.g. notebook) where the selected data is visible and prepared for analysis


    Description

      Introduction

      Granting users direct access to data from repositories managed by the Distributed Data Management (DDM) system would significantly benefit SRCNet. This approach avoids creating local copies of large SKA data products, addressing the following concerns:

      • Storage Quotas: Replicas created outside the DDM must be counted against user quotas. Since SKA data cubes can reach several terabytes, users might require quotas comparable to or exceeding this size across nodes for efficient scientific workflows. This translates to reserving a significant amount of storage space for user areas, potentially 10-100 TB per user per node, multiplied by the expected user base.
      • Lifecycle Management: Replicas outside the DDM lack lifecycle management by SRCNet services. DDM-created replicas could utilise timestamps for deletion after a period of inactivity. Local copies (outside the DDM) are difficult to delete as they might be genuine user data, leading to near-permanent storage usage.
      • Redundant Replicas: Local copies can lead to users accessing different replicas of the same data product, even when a replica already exists in hot storage near the computing area. DDM-created replicas could be shared read-only with multiple users.
      • Improved Efficiency: Linking to existing data allows immediate access, skipping the "download to user area" step (see the sketch after this list). However, this step might still be necessary for multi-tier storage nodes (explained in Note 3).
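
      To make the contrast concrete, the following is a minimal sketch of the two approaches, assuming the local RSE storage is POSIX-mounted on the node; all paths and function names are illustrative, not part of any SRCNet API.

      import os
      import shutil

      # Hypothetical paths: a DDM-managed replica on a POSIX-accessible RSE
      # and a user workspace.
      REPLICA_PATH = "/rse/hot-storage/ska/obs-1234/cube.fits"
      USER_AREA = "/home/alice/workspace"

      def make_visible_by_copy(replica_path: str, user_area: str) -> str:
          """Current approach: duplicates terabytes and counts against the user quota."""
          dest = os.path.join(user_area, os.path.basename(replica_path))
          shutil.copy2(replica_path, dest)  # full physical copy, outside DDM control
          return dest

      def make_visible_by_link(replica_path: str, user_area: str) -> str:
          """Proposed approach: a read-only reference to the DDM-managed replica."""
          dest = os.path.join(user_area, os.path.basename(replica_path))
          os.symlink(replica_path, dest)  # no extra storage; lifecycle stays with the DDM
          return dest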

       By avoiding the creation of local copies for large SKA data products, we expect the following benefits:

      • Cost Implications. These issues can significantly impact node costs. Data copies might be created multiple times for the same file with undefined lifetimes, controlled only by user quotas. When quotas are full, users are forced to clean up their areas. DDM-managed replicas accessible in read-only mode wouldn't contribute to user storage quotas, resulting in more controlled node data management and a system optimized for user analysis and scientific production. While performance improvements might be an immediate benefit, cost reduction remains the primary consideration for SRCNet.
      • Performance Considerations. Performance implications exist (e.g., reading local vs. streamed data). Although local data access is generally faster than streaming, this might be less relevant than the abovementioned cost benefits.
      • Optimized Cache Storage with DDM Control. DDM control over replicas facilitates an optimized storage system with multiple replicas of popular data near users and lifespans managed by last-access timestamps. This process resembles the approach used by Content Delivery Networks (CDNs) (https://en.wikipedia.org/wiki/Content_delivery_network) for efficient website content delivery across time zones; a minimal eviction sketch follows this list.
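
      As a rough illustration of lifespan management by last-access timestamps, the sketch below selects eviction candidates; in practice the replica metadata would come from the DDM catalogue (e.g. Rucio's access accounting), and the grace period is an assumed policy value.

      import time

      GRACE_PERIOD = 30 * 86400  # assumed policy: 30 days without access

      # Illustrative replica records; real data would come from the DDM catalogue.
      replicas = [
          {"did": "ska:obs-1234/cube.fits", "rse": "NODE_HOT", "last_access": time.time() - 5 * 86400},
          {"did": "ska:obs-5678/cube.fits", "rse": "NODE_HOT", "last_access": time.time() - 90 * 86400},
      ]

      def eviction_candidates(replicas, now=None):
          """Return replicas whose last access is older than the grace period."""
          now = now if now is not None else time.time()
          return [r for r in replicas if now - r["last_access"] > GRACE_PERIOD]

      for replica in eviction_candidates(replicas):
          print(f"evict or demote {replica['did']} from {replica['rse']}")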

       

      Additional Notes:

      1. Downloading data to user areas via streams is already supported by SRCNet services (TAP->DataLink->Download); the first sketch after this list illustrates that streamed flow. This feature aims to optimize cost and performance.
      2. Processes requiring read/write access to the original data necessitate downloading it to user areas. The proposed approach primarily focuses on read-only data access. Although read/write copies will still be allowed, we discourage adding additional layers onto large datasets, promoting instead the production of new products that reference the original data product IDs, a process more efficient for SKA data.
      3. For multi-tier storage systems (e.g., with RSEs using high-latency technologies), it is recommended to have an RSE near the computing area. Data is then moved to this storage during the "prepare data" process; in other words, the staging area is also part of the Rucio DDM. The DDM can remove replicas of popular data from the staging area when the data is no longer accessed, or move them to a lower-performance RSE. For instance, using a layer like Cavern/VOSpace, the reference to the symbolic link would be replaced with a reference to a long-term replica in a less efficient RSE (a different tier). The second sketch after this list shows how such a staging replica could be requested.
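
      For reference, Note 1's streamed flow could look roughly like the following pyvo-based sketch; the TAP endpoint, ADQL query and output path are hypothetical placeholders.

      import pyvo
      import requests

      # Hypothetical SRCNet node endpoint; substitute the real TAP service URL.
      tap = pyvo.dal.TAPService("https://example-srcnet-node.org/tap")

      results = tap.search(
          "SELECT TOP 1 * FROM ivoa.obscore WHERE dataproduct_type = 'cube'"
      )

      # Follow the DataLink of each result and stream the '#this' product into
      # the user area -- the copy step the link-based prepareData would avoid.
      for datalink in results.iter_datalinks():
          for record in datalink.bysemantics("#this"):
              with requests.get(record.access_url, stream=True) as resp:
                  resp.raise_for_status()
                  with open("/home/alice/workspace/product.fits", "wb") as out:
                      for chunk in resp.iter_content(chunk_size=1 << 20):
                          out.write(chunk)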
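
      And Note 3's staging step could be expressed as a Rucio replication rule with a finite lifetime, so the DDM itself places a copy on the near-compute RSE and expires it later; the DID, RSE name and lifetime below are illustrative.

      from rucio.client import Client

      client = Client()

      # Ask the DDM to place one replica on the staging RSE close to the
      # computing area; the rule's lifetime lets Rucio clean it up afterwards.
      client.add_replication_rule(
          dids=[{"scope": "ska", "name": "obs-1234/cube.fits"}],
          copies=1,
          rse_expression="NODE_STAGING",  # illustrative RSE name
          lifetime=7 * 86400,             # seconds: expire after 7 days
      )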

       

      EPIC Definition

      This EPIC covers a minimal implementation of the prepareData function. The steps involved are:

      • Data selection for analysis at the gateway (shopping basket already implemented)
      • Gateway invokes a REST service to request data availability for the user (already implemented in the gateway and DDM; may require updates)
      • DM API checks user access rights for the selected data (partial implementation, additional work needed for a basic prototype)
      • DM API (with DDM system) ensures data presence at the local node (if necessary, by invoking replica creation - mostly done)
      • DM API invokes the "set data available for the user into user area" method (the core of this EPIC), e.g. by creating a link to the repository (see the sketch after this list)
      • User is informed of process completion (using UWS) (already done)
      • User can invoke the analysis button to open a notebook interface (where data can be explored) (mostly done, pending a notebook example to make use of the data)
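
      A minimal end-to-end sketch of steps 3-6 is shown below. Every name in it (check_access, ensure_local_replica, the in-memory job table) is a hypothetical stand-in for the DM API, the DDM and UWS, under the assumption that the local RSE is POSIX-mounted.

      import os
      import uuid

      JOBS = {}  # stand-in for a UWS job registry

      def check_access(user: str, did: str) -> bool:
          """Step 3 (sketch): check the user's access rights for the selected data."""
          return True  # placeholder: a real DM API would query the permission service

      def ensure_local_replica(did: str) -> str:
          """Step 4 (sketch): ask the DDM for a local replica, creating one if
          needed, and return its POSIX path on this node."""
          return f"/rse/hot-storage/{did}"  # placeholder path

      def prepare_data(user: str, dids: list, user_area: str) -> str:
          """Steps 5-6 (sketch): link replicas into the user area and expose
          completion through a UWS-style job record."""
          job_id = str(uuid.uuid4())
          JOBS[job_id] = {"phase": "EXECUTING", "results": []}
          for did in dids:
              if not check_access(user, did):
                  JOBS[job_id]["phase"] = "ERROR"
                  return job_id
              replica = ensure_local_replica(did)
              link = os.path.join(user_area, os.path.basename(replica))
              if not os.path.lexists(link):
                  os.symlink(replica, link)  # core of this EPIC: a reference, not a copy
              JOBS[job_id]["results"].append(link)
          JOBS[job_id]["phase"] = "COMPLETED"
          return job_id

      The gateway would call something like this behind the REST endpoint from step 2 and poll the UWS job phase before enabling the analysis button.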

       
      Breakdown for this EPIC
      This operation is further divided into at least four features that can be implemented by different teams in coordination:
