Unusually high build queues on macOS machines

Incident Report for Bitrise CI

Postmortem

As more developers rely on Bitrise, and as they trust our platform with more of the tasks involved in testing and deploying their apps, our responsibility to keep it working at all times grows as well. In that regard, we failed you and we’re sorry.

What Happened

On September 12th around 16:00 UTC, as many of you were preparing for the Apple Event, we experienced an issue with our build system: actions related to our virtual machines started to fail. After some digging, we found that the cause was the storage volume backing the database. The disk was full, and the automatic measures that prune unnecessary information from the database weren’t aggressive enough to prevent this. After we extended the volume and restarted the appliance, the API came back online.
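As an illustration of the kind of early-warning check that can catch a filling volume before it is full, here is a minimal sketch. It is not the check we run in production; the function names, the 80% threshold and the 48-hour lead time are all assumptions chosen for the example.

```python
import shutil
from typing import Optional


def usage_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of the volume holding `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return used / total


def hours_until_full(used_then: float, used_now: float,
                     total: float, hours_elapsed: float) -> Optional[float]:
    """Estimate hours until the volume is full from two usage samples.

    Assumes linear growth between the samples. Returns None when usage
    is flat or shrinking, because no fill time can be projected.
    """
    growth_per_hour = (used_now - used_then) / hours_elapsed
    if growth_per_hour <= 0:
        return None
    return (total - used_now) / growth_per_hour


def should_alert(used_fraction: float, eta_hours: Optional[float],
                 max_fraction: float = 0.80,        # hypothetical threshold
                 min_eta_hours: float = 48.0) -> bool:  # hypothetical lead time
    """Alert when the volume is already too full OR is projected to fill soon."""
    if used_fraction >= max_fraction:
        return True
    return eta_hours is not None and eta_hours <= min_eta_hours
```

Alerting on the projected time until full, and not only on a fixed percentage, is what gives an operator room to react when growth accelerates.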

How Did We Get Here?

We’re currently performing a root cause analysis to find that out. As you may know, we experienced similar issues on August 14th, which prompted us to institute manual and automatic checks to catch early warning signs. However, these measures did not flag this occurrence early enough for us to prevent the problem from escalating.

To ensure we do better, we are investigating the underlying cause, its impact and additional preventive measures. We will share our findings with you through https://status.bitrise.io/

Added Transparency

Starting today, we commit to being more transparent. This means regular public posts, including our notes and insights, as our investigation and plans progress, as well as a commitment to publish postmortems on all issues with customer impact going forward.

For now, we want to thank you for your patience, your support and your trust as we get back to helping you build amazing apps.

Posted Sep 13, 2018 - 14:45 UTC

Resolved

The system remained stable, the build queues have cleared, and new builds can start without any delay. We're terribly sorry for the inconvenience.

Posted Sep 12, 2018 - 19:07 UTC

Monitoring

The system remained stable, with every part working as expected. Builds that were running during the incident might be in a hanging state (shown as still running in the UI, but with no logs being generated). Our system will abort these automatically within the next two hours. If you find one, you can also abort it manually and then initiate a rebuild, which will go through immediately.

Posted Sep 12, 2018 - 18:14 UTC

Update

We rolled out another fix, and the system seems to be recovering again. We're checking every part of the system to ensure everything is working as expected.

Posted Sep 12, 2018 - 17:49 UTC

Update

We identified another issue and are working on a fix.

Posted Sep 12, 2018 - 17:32 UTC

Update

The fix has been applied and we see the system recovering. We're still checking the details and making sure that everything is working as expected.

Posted Sep 12, 2018 - 17:18 UTC

Update

We're in the process of rolling out the fix and will provide an update as soon as possible.

Posted Sep 12, 2018 - 16:57 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Sep 12, 2018 - 16:12 UTC

Investigating

We detected unusually high build queues on our macOS systems, which are causing build delays. We're investigating the root cause.

Posted Sep 12, 2018 - 15:43 UTC

This incident affected: Build Processing (OS X Stack (iOS)).