SAFe Program / SP-4609

Make CAR and Harbor (Software Repositories) resilient to failovers

Details

    • Services
      Given the recent CAR outage (https://confluence.skatelescope.org/display/SE/2024-08-12+CAR+Upgrade+Failure+and+Outage), we need a way to get unstuck and bring CAR and Harbor into an HA state so that a failed change does not cause further outages. By addressing some technical debt (updating packages and code), creating new deployments, and migrating the CAR as part of this feature, we can reduce the risk of a total outage during future CAR and Harbor upgrades.
      • Use AWS S3 as a CAR blob store and for Harbor storage
      • Have a synced backup of CAR and Harbor so that we can do the upgrades on the blue machine and promote it or use it when things go wrong
      • Have Harbor as a HA deployed configuration in a k8s cluster
      • Have Caches and Nexus deployed as containers in datacentres
      • Run a chaos monkey test scenario for incident management and to verify restore capabilities
        • take one of the caches or Harbor/CAR down and see how we fix it

      Note: the k8s cluster parts are open to discussion; we still need to decide whether new clusters are required.
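
      The chaos monkey scenario above could be driven by a small harness like the following. This is a minimal sketch, not an agreed procedure: `run_drill`, the `stop` and `is_healthy` callables, and the target names are all hypothetical placeholders for whatever tooling actually stops a service and checks that it is back.

```python
import random
import time

# Hypothetical target names; the real inventory is not listed in this ticket.
TARGETS = ["oci-cache", "apt-cache", "harbor", "car"]


def run_drill(stop, is_healthy, targets=TARGETS, rng=None,
              timeout_s=3600.0, poll_s=5.0,
              clock=time.monotonic, sleep=time.sleep):
    """Take one random target down and time how long it takes to be restored.

    stop(target) and is_healthy(target) are supplied by the caller; this
    function only picks the victim, polls, and records the outcome.
    """
    rng = rng or random.Random()
    target = rng.choice(targets)
    stop(target)
    started = clock()
    while clock() - started < timeout_s:
        if is_healthy(target):
            return {"target": target, "restored": True,
                    "seconds": clock() - started}
        sleep(poll_s)
    return {"target": target, "restored": False,
            "seconds": clock() - started}
```

      The returned record (target, whether it was restored, elapsed seconds) is the kind of data an incident review would want after each drill.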

    • 2
    • 2
    • 0
    • Team_SYSTEM
    • Sprint 3
    • Overdue
    • PI24 - UNCOVERED

    • Team_SYSTEM

    Description

      https://confluence.skatelescope.org/display/SE/2024-08-12+CAR+Upgrade+Failure+and+Outage 

      Given the recent upgrade processes and the issues we have seen, this feature was created to follow up on the actions listed on the page above.

      The objective is to have CAR and Harbor in an HA setup, with a maintenance schedule that is well communicated and agreed, minimal downtime, and rollback capability in place.

      A nice blog on HA system design: https://blog.bytebytego.com/p/how-do-we-design-for-high-availability 

      There have been discussions on this in the P&I Workshop (INS-69).

      Have a Green/Blue deployment of CAR, working as follows:

      1. Keep the two deployments in their own completely isolated environments
      2. Green is the deployment that the load balancer (LB) points to as prod
      3. When maintenance work is scheduled:
        1. Set the LB to read-only (disable PUT, PUSH requests)
        2. Do the upgrade on the blue deployment
        3. Test it
        4. Switch the LB to the blue one
        5. NOW: blue is green and vice versa
      4. If things go wrong, revert the LB to the blue one (the previous green)
      5. Repeat this process for the next maintenance
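
      The switch-over steps above can be sketched as a small in-memory model. This is illustrative only and assumes nothing about the real load balancer: `LoadBalancer`, `run_maintenance`, `rollback`, and the deployment names `car-a`/`car-b` are hypothetical, and treating "PUT, PUSH" as the HTTP write methods listed below is an assumption.

```python
from dataclasses import dataclass

# Methods treated as writes while the LB is read-only. The ticket says
# "PUT, PUSH"; mapping that to these HTTP methods is an assumption.
WRITE_METHODS = {"PUT", "POST", "PATCH", "DELETE"}


@dataclass
class LoadBalancer:
    """Hypothetical model of the LB in front of two CAR deployments."""
    live: str = "car-a"      # currently "green" (prod)
    standby: str = "car-b"   # currently "blue" (upgrade target)
    read_only: bool = False

    def allows(self, method: str) -> bool:
        """Reject write requests while a maintenance window is open."""
        return not (self.read_only and method.upper() in WRITE_METHODS)

    def switch(self) -> None:
        """Swap roles: blue becomes green and vice versa."""
        self.live, self.standby = self.standby, self.live


def run_maintenance(lb, upgrade, smoke_test) -> bool:
    """Steps 3.1 to 3.5: freeze writes, upgrade blue, test, then switch.

    Returns True if traffic moved to the upgraded deployment.
    """
    lb.read_only = True
    try:
        upgrade(lb.standby)
        if not smoke_test(lb.standby):
            return False     # the live deployment was never touched
        lb.switch()
        return True
    finally:
        lb.read_only = False


def rollback(lb) -> None:
    """Step 4: point the LB back at the previous green."""
    lb.switch()
```

      Because the upgrade only ever runs against the standby, a failed smoke test leaves prod untouched, and a rollback after a successful switch is the same single LB change in reverse.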

      From ST-2106: 

      Caveats and comments:
      1. ansible-thoteam.nexus3-oss is no longer being updated; certain configurations don't work and were commented out.
      2. All previous tasks still work, but some won't be set up on new installs because of 1.
      3. A backup task for the H2 DB was created manually because of 1.
      4. oci-cache and apt-cache on the datacentres aws2 don't seem to be doing anything. Consider removing them.
      5. What is the purpose of the HAProxy here? It is a single instance, and I don't think it's doing anything. Consider removing it.

            People

              Harding, Piers
              Yilmaz, Ugur

              Feature Progress

                Story Point Burn-up: (0%)

                Feature Estimate: 2.0

                              Issues  Story Points
                To Do         0       0.0
                In Progress   0       0.0
                Complete      0       0.0
                Total         0       0.0
