Our goal is to provide reliable CI. To that end, we have made several choices to reduce instability.

    Deployment 

    Nothing is more frustrating than a tool breaking or changing behavior while you are in the middle of working. In software engineering, it can be hard to tell right away that a failure was caused by an external change. For that reason, we avoid stealth changes.

    • deployments take place outside of core North American working hours
    • changes are communicated; at minimum, we announce that a change occurred
    • when possible, customers control rollout
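
    As an illustration of the first point, a deployment pipeline could gate releases on a time-window check. The sketch below is a minimal example under stated assumptions, not a description of our tooling: the exact window (09:00 Eastern to 17:00 Pacific, Monday to Friday), the fixed standard-time offsets, and the function names in_core_hours and deployment_allowed are all hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

# Assumptions (not taken from the policy itself): core hours run from
# 09:00 Eastern to 17:00 Pacific, Monday to Friday. Fixed standard-time
# offsets are used, so daylight saving time is ignored in this sketch.
EASTERN = timezone(timedelta(hours=-5))
PACIFIC = timezone(timedelta(hours=-8))
CORE_START_EASTERN = time(9, 0)
CORE_END_PACIFIC = time(17, 0)


def in_core_hours(moment: datetime) -> bool:
    """Return True if `moment` falls inside the assumed core working hours."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    east = moment.astimezone(EASTERN)
    west = moment.astimezone(PACIFIC)
    return (
        east.weekday() < 5                      # Monday=0 .. Friday=4
        and east.time() >= CORE_START_EASTERN   # workday has started in the east
        and west.time() < CORE_END_PACIFIC      # workday has not ended in the west
    )


def deployment_allowed(moment: datetime) -> bool:
    """Deployments are allowed only outside the assumed core hours."""
    return not in_core_hours(moment)
```

    A scheduler could call deployment_allowed(datetime.now(timezone.utc)) before starting a rollout and defer the job when it returns False. A production version would more likely use named time zones (for example via Python's zoneinfo module) so that daylight saving time is handled.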

    Monitoring 

    A variety of monitoring systems, alarms, and manual checks are in place to ensure our systems are working as intended. These are continuously evaluated for effectiveness and potential expansion.

    When an issue is identified, a dedicated on-call rotation is responsible for triaging it. Once verified, problems are escalated as appropriate. Generally, if a change caused the problem, the change is immediately reverted and the problem is then resolved in a development environment.

    Recovery 

    In the event of a catastrophic failure, we have backups in distinct physical locations and can redeploy everything from Infrastructure as Code (IaC). We make a point of never deploying outside IaC, so that the IaC remains representative of what is actually deployed.