SAFe Program / SP-4277

Demonstrate using CephRGW S3 as storage within the Data lake and evaluate viability


Details

    • SRCnet

      For environments where users have full control to develop novel workflows (e.g. the user has root access), we cannot use POSIX permissions on a shared filesystem to isolate data, because the user would be able to bypass these restrictions (although submounts could work around some of those problems).

      Object storage is an industry-standard way of securing access to large amounts of data. Ceph is already known to scale well.

      We need to do work to better understand whether Ceph RGW S3 can provide a secure data lake solution compatible with standard tooling that assumes a filesystem, such as CASA and CARTA.

      We need to understand and document how this compares to other approaches to accessing a data lake, such as VOSpaces, OneData.org and a local shared file system (e.g. NFS).

    •

      Timeboxed to 1 FP.

      SPIKE AC: Document and communicate back to the ART all findings gathered, such as:

      How to manually mount S3 storage, using user login via SKA IAM

      Permissions for public read-only and group-specific read-only bucket setups.

      Permissions that work for writing outputs (ADPs) that are not publicly readable until the end of the competitive phase agreed for that data.

      How to share some outputs (ADPs) with other groups of users, ahead of the data being made public.

      The performance of CASA and CARTA using these S3 buckets.

      Ability to automatically mount S3 storage buckets into Azimuth Workstation, Slurm platforms and JupyterHub, such that the "data lake" is available to all these platforms, using permissions that match the chosen project.

      Storage types that can be made available across sites
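
      The permission levels listed above (public read-only, group-restricted read-only, and outputs that stay private until release) map naturally onto S3 bucket policies. A minimal sketch follows; the bucket and user names are hypothetical placeholders, and the exact principal format depends on how Ceph RGW is integrated with Keystone/SKA IAM (plain RGW uses the `arn:aws:iam:::user/<uid>` convention shown here):

```python
import json

def public_read_policy(bucket):
    """S3 bucket policy granting anonymous read-only access.

    Bucket names are placeholders, not real SRCNet buckets.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": {"AWS": ["*"]},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
        }],
    }

def group_read_policy(bucket, group_users):
    """Read-only access restricted to a list of RGW user principals.

    With Keystone integration the principal form may differ; this
    assumes the plain RGW user-ARN convention.
    """
    principals = [f"arn:aws:iam:::user/{u}" for u in group_users]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "GroupRead",
            "Effect": "Allow",
            "Principal": {"AWS": principals},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
        }],
    }

# Example: emit the policy JSON for a hypothetical public ADP bucket.
print(json.dumps(public_read_policy("srcnet-adp-public"), indent=2))
```

      The resulting JSON could then be applied with `s3cmd setpolicy` or `aws s3api put-bucket-policy` against the RGW endpoint; sharing outputs early with another group would be the same pattern with that group's principals added to a read statement.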

    • Team_DAAC
    • Sprint 2
    • SRC23-PB data-lifecycle team_DAAC

    Description

      The online storage at the largest SRCNet sites is expected to need a minimum of around 200 PB.

      We are assuming that data received into the data lake goes directly into the online storage, and is then written to tape (or similar) at the next available opportunity. From that point onwards the data can be deleted to make room for other data, with the option to restore it from tape later.

      Currently the most widely adopted storage technology for this scale of data is object storage. However, the currently available libraries (e.g. casa) generally assume a POSIX-like filesystem interface (although there have been some efforts to look at HDF5).

      For now we are assuming:

      • Data written into S3 by Rucio/FTS
      • Appropriate bucket permissions can be set up by Rucio/FTS
      • Data discovery interface helps locate the dataset within the S3 based "datalake"
      • Ceph RGW S3 is OpenStack Keystone integrated (the alternative is OIDC direct to SKA IAM, but there are no plans to test that in this feature)

      The specific things we want to test:

      • Link SKA IAM group membership to access token used to access S3 object storage, documenting how to mount the datalake using a user that can authenticate with SKA IAM
      • Azimuth can mount the appropriate S3 storage, with the correct access token, ahead of a user authenticating (e.g. https://github.com/yandex-cloud/k8s-csi-s3)
      • Test the following access levels: (1) public read (2) group restricted read (3) group restricted write once (versioned buckets) for outputs
      • Test allowing early access to results with another group
      • Test performance of POSIX-style access using a FUSE-mounted filesystem and a casa storage benchmark
      • Test both local and remote access (i.e. test storage is hosted in one UK site, and can be pulled from another UK site)

      People

                b.mort Mort, Ben
                D.Watson Watson, Duncan
                Votes: 0
                Watchers: 3

                Feature Progress

                  Story Point Burn-up: (66.67%)

                  Feature Estimate: 1.0

                  Status        Issues  Story Points
                  To Do         0       0.0
                  In Progress   1       2.0
                  Complete      2       4.0
                  Total         3       6.0

                  Dates

                    Created:
                    Updated:
