About / Reliability

Our goal is to provide reliable CI. To that end, we have made a number of choices to reduce instability.

Deployment 

Nothing is more frustrating than a tool breaking or changing behavior while you are in the middle of working. In software engineering, it can be hard to tell right away that a failure was caused by an external change. With this in mind, we avoid stealth changes:

Deployments take place outside of core North American working hours.
Changes are communicated; at minimum, we announce that a change occurred.
When possible, customers control the rollout.

Monitoring 

A variety of monitoring systems, alarms, and manual checks are in place to ensure our systems are working as intended. These systems are continuously evaluated for effectiveness and potential expansion.

When an issue is identified, a dedicated on-call rotation is responsible for triaging it. Once verified, problems are escalated as appropriate. Generally, if a change caused the problem, that change is reverted immediately and the fix is worked out in a development environment.

Recovery 

In the event of a catastrophic failure, we have backups in distinct physical locations and can redeploy everything from Infrastructure as Code (IaC). We make a point of never deploying outside IaC, so that the IaC always represents what is actually running.