Happo.io cluster unavailable

Incident Report for Happo

Postmortem

This incident was caused by a combination of changes in our Kubernetes cluster. On Friday, we were working on routing certain requests to happo.io through a caching layer (to improve image loading performance). That work was never finished, and it left our load balancer in a state where it couldn’t update its configuration. At that point the service was still working fine, and it kept working over the weekend. On Monday, we deployed a new version of the main happo.io server. This update caused the load balancer routing to fail, because the load balancer was holding on to references to old services that were now being killed.

It took us a while to figure this out, and one of our attempts to get the service back up involved restarting the load balancer. This, however, led to the whole cluster getting a new IP address, with no way of recovering the old one. An update to our DNS (pointing happo.io to the new load balancer IP address) made things work again, but it took a few hours for the DNS update to propagate to different parts of the world.
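For anyone who wants to verify from their own network whether a DNS change like this one has reached their resolver, here is a minimal sketch. It is an illustration only, not a tool we ship: the function names `resolve_ipv4` and `dns_has_propagated` are hypothetical, and the IP addresses in the usage note are placeholders from reserved documentation ranges, not the real happo.io addresses.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the local resolver currently gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def dns_has_propagated(
    hostname: str,
    expected_ips: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> bool:
    """True if every address the resolver returns is one of the expected new IPs.

    The resolver is injectable so the check can be exercised without network access.
    """
    resolved = resolver(hostname)
    return bool(resolved) and resolved <= set(expected_ips)
```

Calling `dns_has_propagated("happo.io", {"203.0.113.10"})` (with the placeholder replaced by the real new address) returns `False` for as long as the local resolver still serves a cached old record. Because resolvers cache records for their time to live, the result can differ between networks during the propagation window.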

Happo is a required step for many organizations, and the fact that it wasn’t available for a few hours caused disruption for many users. We’re truly sorry this happened, and we’ll do our best to prevent similar scenarios in the future.

If you have questions, don’t hesitate to reach out to me directly at henric.trotzig@happo.io.

– Henric Trotzig

Posted Jun 07, 2021 - 15:07 UTC

Resolved

As far as our monitoring can detect, this issue is now fully resolved. Reach out to support@happo.io if you're still affected by this!

Posted Jun 07, 2021 - 12:20 UTC

Update

DNS updates are still propagating, but the service is back up and running in most areas. We're still seeing some workers struggling (especially ios-safari and ipad-safari), but we're hoping these will be back up within the next hour as well.

Posted Jun 07, 2021 - 11:54 UTC

Update

All services are slowly getting back to normal. One part of the fix involved updating DNS entries for the happo.io domain. These updates can take a while to propagate through different networks, so things will continue to be a bit bumpy for another hour or so.

Posted Jun 07, 2021 - 10:59 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jun 07, 2021 - 10:51 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Jun 07, 2021 - 10:35 UTC

Investigating

We are currently investigating downtime on our whole happo.io cluster.

Posted Jun 07, 2021 - 09:00 UTC

This incident affected: API and Web UI.