Details

Type: Feature
Priority: High
Fix Version/s: PI24
Component/s: COM Software Services
Labels:
- Team_SYSTEM

ARTs:

Services
Benefit hypothesis:

Given the recent CAR outage (https://confluence.skatelescope.org/display/SE/2024-08-12+CAR+Upgrade+Failure+and+Outage), we need a way to get unstuck and to run CAR and Harbor in a high-availability (HA) state, so that a failed change does not cause further outages. By addressing some technical debt (updating packages and code), creating new deployments, and migrating the CAR as part of this feature, we can reduce the risk of a total outage during future CAR and Harbor upgrades.
Acceptance criteria:
- Use AWS S3 as the CAR blob store and for Harbor storage.
- Have a synced backup of CAR and Harbor, so that we can do the upgrades on the blue machine and then promote it, or fall back to it when things go wrong.
- Have Harbor deployed in an HA configuration in a k8s cluster.
- Have Caches and Nexus deployed as containers in the datacentres.
- Run a chaos monkey test scenario for incident management and to verify restore capabilities:
  - take one of the caches, or Harbor/CAR, down and see how we fix it.

Note: the k8s cluster parts are open to discussion, as we still need to decide whether new clusters are required.
Feature Points:
2
Initial Size:
2
WSJF:
0
Epic Link:
CICD usability
Agile Teams:

Team_SYSTEM
Due Sprint:
Sprint 3
Story Point Burn-up:
Overdue:
Overdue

Requirement Status:

PI24 - UNCOVERED
Labels_MIRO:
Team_SYSTEM

Description

https://confluence.skatelescope.org/display/SE/2024-08-12+CAR+Upgrade+Failure+and+Outage

Given the recent upgrade processes and the issues we have seen, this feature was created to follow up on the actions listed on the page above.

The objective is to have CAR and Harbor in an HA setup, with a well-communicated and agreed maintenance schedule, minimal downtime, and rollback capability in place.

A nice blog post on HA system design: https://blog.bytebytego.com/p/how-do-we-design-for-high-availability

There have been discussions on this in the P&I Workshop (INS-69).

Have a Green/Blue deployment of CAR such that:

- Each deployment runs in its own completely isolated environment.
- Green is the one the LB points to as prod.
- When maintenance work is scheduled:
  1. Set the LB to read-only (disable PUT, PUSH requests)
  2. Do the upgrade on the blue deployment
  3. Test it
  4. Switch the LB to the blue one
  5. Now blue is green and vice versa
- If things go wrong, revert the LB to the blue one (the previous green).
- Repeat this process for the next maintenance.
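
The maintenance steps above can be sketched as a small state model. This is only an illustration under stated assumptions, not the real LB configuration: the `LoadBalancer` class, the deployment names `car-a` and `car-b`, the `run_maintenance` helper, and the choice to lift read-only mode once the procedure ends are all hypothetical. The set of blocked HTTP methods is also an assumption about what "disable PUT, PUSH requests" would mean in practice.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: "read only" means rejecting these HTTP methods at the LB.
WRITE_METHODS = {"PUT", "POST", "PATCH", "DELETE"}


@dataclass
class LoadBalancer:
    """Toy stand-in for the LB; green is prod, blue is the standby."""
    green: str = "car-a"   # hypothetical deployment name
    blue: str = "car-b"    # hypothetical deployment name
    read_only: bool = False

    def allows(self, method: str) -> bool:
        # In read-only mode, only non-writing requests get through.
        return not (self.read_only and method.upper() in WRITE_METHODS)

    def swap(self) -> None:
        # Step 4/5: point the LB at blue; blue becomes green and vice versa.
        self.green, self.blue = self.blue, self.green

    def rollback(self) -> None:
        # Reverting is the same swap: the previous green is now "blue".
        self.swap()


def run_maintenance(
    lb: LoadBalancer,
    upgrade: Callable[[str], None],
    smoke_test: Callable[[str], bool],
) -> bool:
    """Run steps 1-5; return True if the LB was switched to the upgraded side."""
    lb.read_only = True                  # step 1
    try:
        upgrade(lb.blue)                 # step 2
        if not smoke_test(lb.blue):      # step 3
            return False                 # green is untouched and stays prod
        lb.swap()                        # steps 4 and 5
        return True
    finally:
        lb.read_only = False             # assumption: writes resume afterwards
```

For example, `run_maintenance(LoadBalancer(), upgrade=my_upgrade, smoke_test=my_check)` (with `my_upgrade` and `my_check` supplied by the caller) leaves the original green deployment in place whenever the test step fails, and a later `rollback()` points the LB back at the previous green.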

From ST-2106:

Caveats and comments:
1. ansible-thoteam.nexus3-oss is no longer being updated; certain configurations don't work and were commented out.
2. All previous tasks still work, but some won't be set up on new installs because of 1.
3. Created a backup H2 DB task manually because of 1.
4. oci-cache and apt-cache on the datacentres aws2 don't seem to be doing anything. Consider removing them.
5. What is the purpose of the HAProxy here? It is a single instance, and I don't think it's doing anything. Consider removing it.

People

Assignee: Harding, Piers

Reporter: Yilmaz, Ugur

Votes: 0

Watchers: 1

Feature Progress

Story Point Burn-up: 0%

Feature Estimate: 2.0

Status	Issues	Story Points
To Do	0	0.0
In Progress	0	0.0
Complete	0	0.0
Total	0	0.0

Dates

Created: 19/Aug/24 9:21 AM

Updated: 04/Sep/24 8:02 AM

Due Sprint Date: 22/Oct/24

Make CAR and Harbor (Software Repositories) resilient to failovers