This incident was caused by a combination of changes in our Kubernetes cluster. On Friday, we were working on routing certain requests to happo.io through a caching layer (to improve image loading performance). That work was never finished, and it left our load balancer in an unfortunate state where it couldn’t update its configuration. At that point, though, the service was working fine (and it kept working well over the weekend).

On Monday, we deployed a new version of the main happo.io server. This update caused the load balancer routing to fail, because the load balancer was still holding on to references to old services that were now being killed. It took us a while to figure this out, and one of our attempts to get the service back up involved restarting the load balancer. This, however, gave the whole cluster a new IP address, with no way of recovering the old one. Updating our DNS records (pointing happo.io to the new load balancer IP address) made things work again, but it took a few hours before the DNS update had propagated to all parts of the world.
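One way to guard against the last part of this failure (the cluster getting a new IP address when the load balancer is recreated) is to reserve a static address with the cloud provider and pin the load balancer to it. Here's a minimal sketch of what that can look like, assuming a Kubernetes `Service` of type `LoadBalancer` on a provider that honors the `loadBalancerIP` field; the service name, labels, ports, and IP below are hypothetical, not our actual configuration:

```yaml
# Hypothetical Service manifest: pinning the load balancer to a
# pre-reserved static IP, so that recreating the load balancer
# does not hand out a new public address.
apiVersion: v1
kind: Service
metadata:
  name: happo-web # hypothetical service name
spec:
  type: LoadBalancer
  # A static IP previously reserved with the cloud provider.
  # Support for this field is provider-dependent; some providers
  # use an annotation instead.
  loadBalancerIP: 203.0.113.10
  selector:
    app: happo-web
  ports:
    - port: 443
      targetPort: 8443
```

A complementary mitigation on the DNS side is to lower the TTL on the record ahead of risky infrastructure changes, so that if the address does change, the update propagates in minutes rather than hours.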
Happo is a required step in the development workflow of many organizations, and the fact that it was unavailable for a few hours definitely caused disruption for many users. We’re truly sorry this happened, and we’ll do our best to prevent similar scenarios in the future.
If you have questions, don’t hesitate to reach out directly to me at firstname.lastname@example.org
– Henric Trotzig