Our goal is to provide reliable CI. To that end, we have made several deliberate choices to reduce instability.
Deployment
Nothing is more frustrating than a tool breaking or changing behavior while you are in the middle of working. In software engineering, it can be hard to tell right away that a failure was caused by an external change. For that reason, we avoid stealth changes:
- deployments take place outside of core North American working hours
- changes are communicated; at minimum, we announce that a change occurred
- when possible, customers control rollout
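
As an illustration of the first rule, a deployment gate could check the current time against a core-hours window before allowing a rollout. This is a minimal sketch, not our actual tooling: the exact definition of "core North American working hours" is not given here, so the zones, the 09:00 to 17:00 Monday to Friday window, and the names `CORE_ZONES`, `in_core_hours`, and `deployment_allowed` are all assumptions made for the example.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

# Assumption: core hours are 09:00-17:00, Monday-Friday, in either of
# these zones. The real window may be defined differently.
CORE_ZONES = ("America/New_York", "America/Los_Angeles")
CORE_START = time(9, 0)
CORE_END = time(17, 0)


def in_core_hours(moment: datetime) -> bool:
    """Return True if `moment` falls in core hours in any assumed zone."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    for zone_name in CORE_ZONES:
        local = moment.astimezone(ZoneInfo(zone_name))
        is_weekday = local.weekday() < 5  # Monday is 0, Sunday is 6
        if is_weekday and CORE_START <= local.time() < CORE_END:
            return True
    return False


def deployment_allowed(moment: datetime) -> bool:
    """Hypothetical gate: allow deployments only outside core hours."""
    return not in_core_hours(moment)
```

A pipeline step could call `deployment_allowed(datetime.now(timezone.utc))` and stop the rollout when it returns `False`. Checking each zone separately, rather than hard-coding a UTC range, keeps the window correct across daylight saving changes.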
Monitoring
A variety of monitoring systems, alarms, and manual checks are in place to ensure our systems are working as intended. These systems are continuously evaluated for effectiveness and potential expansion.
When an issue is identified, a dedicated on-call rotation is responsible for triaging it. Once verified, problems are escalated as appropriate. Generally, if a change caused the problem, the change is reverted immediately and the underlying issue is then resolved in a development environment.
Recovery
In the event of a catastrophic failure, we have backups in distinct physical locations and can redeploy everything from Infrastructure as Code (IaC). We make a point of never deploying outside IaC, so that the IaC definitions always reflect what is actually running.
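
To show why deploying only through IaC matters, the sketch below compares a set of declared resources with what is actually running and reports any drift. It is an illustration under stated assumptions, not a description of our tooling: the function `find_drift`, the resource names, and the dictionary shape are all hypothetical.

```python
def find_drift(
    declared: dict[str, dict], running: dict[str, dict]
) -> dict[str, list[str]]:
    """Compare IaC-declared resources with running ones (hypothetical shape).

    Both arguments map a resource name to its configuration.
    """
    declared_names = set(declared)
    running_names = set(running)
    return {
        # Declared in IaC but not running.
        "missing": sorted(declared_names - running_names),
        # Running but not declared: something was deployed outside IaC.
        "unmanaged": sorted(running_names - declared_names),
        # Present in both, but the configuration differs.
        "changed": sorted(
            name
            for name in declared_names & running_names
            if declared[name] != running[name]
        ),
    }
```

If every list in the result is empty, the IaC definitions are representative, and a rebuild from them should reproduce the running system.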