Details
Feature
High
Services
2
2
0
Team_SYSTEM
Sprint 3
Overdue
Team_SYSTEM
Description
https://confluence.skatelescope.org/display/SE/2024-08-12+CAR+Upgrade+Failure+and+Outage
Given the recent upgrade issues, this feature has been created to follow up on the actions listed on the above page.
The objective is to have CAR and Harbor in an HA setup, with a maintenance schedule that is well communicated and agreed, minimal downtime, and a rollback capability in place.
A nice blog on HA system design: https://blog.bytebytego.com/p/how-do-we-design-for-high-availability
There have been discussions on this in the P&I Workshop (INS-69).
Have a Green/Blue deployment of CAR so that:
- Each deployment runs in its own completely isolated environment
- Green is the one the LB points to as prod
- When maintenance work is scheduled:
  - Set the LB to read-only (disable PUT and push requests)
  - Do the upgrade on the blue deployment
  - Test it
  - Switch the LB to the blue one
  - NOW: blue is green and vice versa
  - If things go wrong, revert the LB to the blue one (the previous green)
- Repeat this process for the next maintenance
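The maintenance steps above can be sketched as a small simulation. This is only an illustration of the intended control flow, not the actual LB configuration: `LoadBalancer`, `run_maintenance`, and the `upgrade` and `smoke_test` callbacks are hypothetical names, and treating only GET, HEAD and OPTIONS as read requests is an assumption. The sketch keeps the deployment names fixed and tracks which one is live, rather than relabelling blue as green after the switch.

```python
from dataclasses import dataclass, field

# Assumption: these methods only read from the repository; all others are writes.
READ_ONLY_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


@dataclass
class LoadBalancer:
    """Toy model of the LB in front of two isolated CAR deployments."""
    live: str = "green"      # deployment currently serving prod
    standby: str = "blue"    # deployment that receives the upgrade
    read_only: bool = False
    history: list = field(default_factory=list)

    def allows(self, method: str) -> bool:
        """Return True if a request with this HTTP method may pass."""
        return not self.read_only or method.upper() in READ_ONLY_METHODS

    def switch(self) -> None:
        """Point prod at the standby deployment; the two roles swap."""
        self.live, self.standby = self.standby, self.live
        self.history.append(self.live)


def run_maintenance(lb, upgrade, smoke_test) -> bool:
    """Run one maintenance window; return True if the upgrade went live."""
    lb.read_only = True              # block writes for the whole window
    try:
        upgrade(lb.standby)          # upgrade the deployment that is not serving prod
        if not smoke_test(lb.standby):
            return False             # LB never moved, prod is untouched
        lb.switch()                  # standby becomes live
        if not smoke_test(lb.live):
            lb.switch()              # rollback: point the LB at the previous live
            return False
        return True
    finally:
        lb.read_only = False         # reopen writes whatever the outcome
```

A real implementation would replace the callbacks with the actual upgrade playbook and health checks. The point of the sketch is that the previously live deployment is never modified during the window, so rollback is a single LB switch.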
From ST-2106:
Caveats and comments:
1. ansible-thoteam.nexus3-oss is no longer being updated; certain configurations don't work and were commented out.
2. All previous tasks still work, but some won't be set up on new installs because of item 1.
3. A backup H2 DB task was created manually because of item 1.
4. oci-cache and apt-cache on the datacentres aws2 don't seem to be doing anything. Consider removing them.
5. What is the purpose of the HAProxy here? It is a single instance, so it does not appear to be doing anything. Consider removing it.