Details
- Spike
- Should have
- None
- SRCnet
- 1
- 1
- 0
- Team_DAAC
- Sprint 2
- SRC23-PB data-lifecycle team_DAAC
Description
The online storage at the largest SRCNet sites is expected to need a minimum of around 200PB.
We are assuming that data received into the data lake goes directly into the online storage, and is then written to tape (or similar) at the next available opportunity. From that point onwards the data can be deleted to make room for other data, with the option to restore it from tape later.
Currently the most widely adopted storage technology for this scale of data is object storage. However, the currently available libraries, e.g. casa, generally assume a POSIX-like filesystem interface (although there have been some efforts to look at HDF5).
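In practice, giving a POSIX-like view over object storage means a FUSE layer such as s3fs or goofys in front of the bucket. A minimal sketch with s3fs (the bucket name, mount point, and endpoint URL are placeholders, not actual SRCNet values):

```shell
# Mount a (placeholder) datalake bucket read-only via s3fs;
# credentials are taken from the usual AWS environment variables.
s3fs src-datalake /mnt/datalake \
  -o url=https://rgw.example-src-site.org \
  -o use_path_request_style \
  -o ro
```

This is the kind of mount the casa storage benchmark below would run against; whether its metadata-heavy access patterns perform acceptably over FUSE is exactly what the spike should establish.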
For now we are assuming:
- Data written into S3 by Rucio/FTS
- Appropriate bucket permissions can be setup by Rucio/FTS
- Data discovery interface helps locate the dataset within the S3 based "datalake"
- Ceph RGW S3 is OpenStack Keystone integrated (the alternative is OIDC direct to SKA IAM, but there are no plans to test that in this feature)
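Under the Keystone-integration assumption, S3-style credentials can be minted from an OpenStack session and handed to any generic S3 client. A sketch of that flow (the endpoint URL and bucket name are placeholders):

```shell
# Mint EC2-style access/secret credentials backed by Keystone;
# a Keystone-integrated Ceph RGW will validate requests signed with these.
openstack ec2 credentials create -f shell

# Export the printed keys for a generic S3 client:
export AWS_ACCESS_KEY_ID=<access from above>
export AWS_SECRET_ACCESS_KEY=<secret from above>

# List a (placeholder) datalake bucket on the site RGW endpoint:
aws s3 ls s3://src-datalake/ --endpoint-url https://rgw.example-src-site.org
```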
The specific things we want to test:
- Link SKA IAM group membership to the access token used to access S3 object storage, documenting how to mount the datalake as a user that can authenticate with SKA IAM
- Azimuth can mount the appropriate S3 storage, with the correct access token, ahead of a user authenticating (e.g. https://github.com/yandex-cloud/k8s-csi-s3)
- Test the following access levels: (1) public read (2) group restricted read (3) group restricted write once (versioned buckets) for outputs
- Test allowing early access to results with another group
- Test performance of POSIX style access using a fuse mounted filesystem and a casa storage benchmark
- Test both local and remote access (i.e. storage hosted at one UK site can be pulled from another UK site)
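The three access levels above map naturally onto S3 bucket policies. A minimal sketch that builds the policy documents as plain JSON (bucket names and principal ARNs are illustrative; Ceph RGW accepts AWS-style principals, but the exact tenant/user form depends on the Keystone integration). An admin script could apply these via a standard S3 client's `put_bucket_policy` call:

```python
import json


def public_read_policy(bucket):
    """Level 1: anyone may GET objects in the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        }],
    }


def group_read_policy(bucket, group_arn):
    """Level 2: only principals in the given group/role may list and read."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": [group_arn]},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
        }],
    }


def group_write_once_policy(bucket, group_arn):
    """Level 3: the group may write and read, but never delete; on a
    versioned bucket every PUT is retained, so overwrites lose nothing."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": [group_arn]},
                "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}",
                             f"arn:aws:s3:::{bucket}/*"],
            },
            {
                "Effect": "Deny",
                "Principal": {"AWS": [group_arn]},
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Example: render the write-once policy for a hypothetical outputs bucket.
    print(json.dumps(
        group_write_once_policy("src-outputs",
                                "arn:aws:iam:::user/team-daac"),
        indent=2))
```

Early access for another group (the fourth test above) would then just mean adding that group's principal to the read statement of an otherwise restricted bucket.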