SAFe Program / SP-4277

Demonstrate using CephRGW S3 as storage within the Data lake and evaluate viability


Details

    • SRCnet

      For environments where users have full control to develop novel workflows (e.g. the user has root access), we cannot use POSIX permissions on a shared filesystem to isolate data, because the user would be able to bypass these restrictions (although submounts could work around some of those problems).

      Object storage is an industry-standard way of securing access to large amounts of data. Ceph is already known to scale well.

      We need to do work to better understand whether Ceph RGW S3 can provide a secure data lake solution compatible with standard tooling that assumes a filesystem, such as CASA and CARTA.

      We need to understand and document how this compares to other approaches to accessing a data lake, such as VOSpaces, OneData.org and a local shared file system (e.g. NFS).

    •

      Timeboxed to 1 FP.

      SPIKE AC: Document and communicate back to the ART all findings gathered, such as:

      How to manually mount S3 storage, using user login via SKA IAM

      Permissions for public read-only and group-specific read-only bucket setups.

      Permissions that work for writing outputs (ADPs) that are not publicly readable until the end of the competitive phase agreed for that data.

      How to share some outputs (ADPs) with other groups of users, ahead of the data being made public.

      The performance of CASA and CARTA using these S3 buckets.

      Ability to automatically mount S3 storage buckets into Azimuth Workstation, Slurm platforms and JupyterHub, such that the "data lake" is available to all these platforms, using permissions that match the chosen project.

      Storage types that can be made available across sites
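
      The permission levels listed above (public read-only, group-restricted read-only, and outputs that stay private until release) map naturally onto S3 bucket policies. A minimal sketch follows; the bucket and user names are hypothetical placeholders, and the exact principal format depends on how Ceph RGW is integrated with Keystone/SKA IAM (plain RGW uses the `arn:aws:iam:::user/<uid>` convention shown here):

```python
import json

def public_read_policy(bucket):
    """S3 bucket policy granting anonymous read-only access.

    Bucket names are placeholders, not real SRCNet buckets.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": {"AWS": ["*"]},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
        }],
    }

def group_read_policy(bucket, group_users):
    """Read-only access restricted to a list of RGW user principals.

    With Keystone integration the principal form may differ; this
    assumes the plain RGW user-ARN convention.
    """
    principals = [f"arn:aws:iam:::user/{u}" for u in group_users]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "GroupRead",
            "Effect": "Allow",
            "Principal": {"AWS": principals},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
        }],
    }

# Example: emit the policy JSON for a hypothetical public ADP bucket.
print(json.dumps(public_read_policy("srcnet-adp-public"), indent=2))
```

      The resulting JSON could then be applied with `s3cmd setpolicy` or `aws s3api put-bucket-policy` against the RGW endpoint; sharing outputs early with another group would be the same pattern with that group's principals added to a read statement.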

    • Team_DAAC
    • Sprint 2
    • SRC23-PB data-lifecycle team_DAAC

    Description

      The online storage at the largest SRCNet sites is expected to need a minimum of around 200 PB.

      We are assuming that data received into the data lake goes directly into the online storage, and is then written to tape (or similar) at the next available opportunity. From that point onwards the data can be deleted to make room for other data, with the option to restore it from tape later.

      Currently the most widely adopted storage technology for this scale of data is object storage. However, the currently available libraries (e.g. casa) generally assume a POSIX-like filesystem interface (although there have been some efforts to look at HDF5).

      For now we are assuming:

      • Data written into S3 by Rucio/FTS
      • Appropriate bucket permissions can be set up by Rucio/FTS
      • Data discovery interface helps locate the dataset within the S3 based "datalake"
      • Ceph RGW S3 is OpenStack Keystone integrated (the alternative is OIDC direct to SKA IAM, but there are no plans to test that in this feature)

      The specific things we want to test:

      • Link SKA IAM group membership to access token used to access S3 object storage, documenting how to mount the datalake using a user that can authenticate with SKA IAM
      • Azimuth can mount the appropriate S3 storage, with the correct access token, ahead of a user authenticating (e.g. https://github.com/yandex-cloud/k8s-csi-s3)
      • Test the following access levels: (1) public read (2) group restricted read (3) group restricted write once (versioned buckets) for outputs
      • Test allowing early access to results with another group
      • Test performance of POSIX-style access using a FUSE-mounted filesystem and a casa storage benchmark
      • Test both local and remote access (i.e. test storage is hosted in one UK site, and can be pulled from another UK site)

      People

                b.mort Mort, Ben
                D.Watson Watson, Duncan
                Votes: 0
                Watchers: 3

                Feature Progress

                  Story Point Burn-up: (66.67%)

                  Feature Estimate: 1.0

                  Status        Issues  Story Points
                  To Do         0       0.0
                  In Progress   1       2.0
                  Complete      2       4.0
                  Total         3       6.0

                  Dates

                    Created:
                    Updated:
