SP-3892

Tuning the Japanese SRC CANFAR platform infrastructure to support 10-15 external users

Details

    • Feature
    • Not Assigned
    • PI21
    • None
    • None
    • SRCnet

      This feature will identify potential risks in deploying the SRCNet science platform, particularly the cost of deployment, such as the number of engineer working days. A precise estimate of this cost is critical to meeting the SRCNet v0.1 schedule.

      This feature will demonstrate the optimization of science platform deployments for relatively small SRCNet nodes that host 20 users on average, or about a 2% share assuming 1,000 users, based on the number of SWG members.

      This feature will provide details of the latency in pulling the images and caching them. The latency is a measure for judging whether a local repository is useful or required.


      AC1: OpenStack control and compute nodes are deployed on the JP servers used for the tutorial school, and the cost is reported.
      AC2: K8s is deployed on the VMs on the JP servers, and the cost is reported.
      AC3: CANFAR is deployed on the K8s cluster in a stand-alone manner, without incorporating SRC IAM or distributed file systems (e.g., Rucio), and the cost is reported.
      AC4: CANFAR is successfully used by workshop participants.
      AC5: Documentation of what was done is provided, including the latency information.

      For AC1 to AC3, "the cost" should include the time and labor required to complete the work, and should describe which parts of the process were automated and which were manual.

    • 1
    • 1
    • 0
    • Team_LAVENDER
    • Sprint 5

      AC1 is met and AC2 is partly met. K8s is deployed on the JP servers, but the load balancer does not work properly. The team has addressed this issue. A report is available here: https://confluence.skatelescope.org/display/SRCSC/%28LAV-254%29OpenStack+and+K8s+implementations+on+JPSRC+Mark+I+and+cost+report

      AC3-AC5 are not met.

    • 24.3
    • PI24 - UNCOVERED

    Description

      The CANFAR deployment of the science platform can support many users simultaneously, but the number of nodes and cores and the amount of memory have evolved over time. Given a new setup, there will be many iterations as we solve deployment issues and tune the deployment of both k8s and user storage. The cost (ease) of these iterations is not yet clear, which leaves it as a risk. JPSRC now has a timely opportunity to host 10-15 users. As the first site to deploy the portable CANFAR at scale, it will be useful for JPSRC to share its experience of tuning the infrastructure to support external users at this scale. It is also useful to investigate whether the latency in pulling the images and caching them is reasonable. If it is not, we would have to look into deploying a local repository.
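
One way the image-pull latency question could be approached is sketched below. This is a minimal illustration, not the procedure the team used: the helper names `time_command` and `compare_cold_warm`, the placeholder command, and the 60-second threshold are all assumptions introduced here. The idea is to time a pull with an empty cache ("cold") and again with the image cached ("warm"), then compare the medians against whatever cold-pull time the site considers acceptable.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(
        cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start


def compare_cold_warm(cold_samples, warm_samples, threshold_s):
    """Summarize pull timings and flag whether a local repository looks worthwhile.

    threshold_s is a hypothetical acceptable cold-pull time chosen by the site;
    it is not a value taken from this ticket.
    """
    cold = statistics.median(cold_samples)
    warm = statistics.median(warm_samples)
    return {
        "cold_median_s": cold,
        "warm_median_s": warm,
        "cache_saving_s": cold - warm,
        "consider_local_repository": cold > threshold_s,
    }


if __name__ == "__main__":
    # Placeholder command that does nothing; replace it with the site's
    # actual image pull command to obtain meaningful timings.
    cmd = [sys.executable, "-c", "pass"]
    samples = [time_command(cmd) for _ in range(3)]
    print(compare_cold_warm(samples, samples, threshold_s=60.0))
```

Using the median rather than the mean keeps a single slow pull from dominating the summary when only a few samples are taken.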

              People

                r.bolton Bolton, Rosie
                T.Akahori Akahori, Takuya

                Feature Progress

                  Story Point Burn-up: (100.00%)

                  Feature Estimate: 1.0

                                Issues  Story Points
                  To Do         0       0.0
                  In Progress   0       0.0
                  Complete      4       6.0
                  Total         4       6.0
