
Cloudflare 40-Hour Outage: A Rookie Night Shift Technician on Duty for Just One Week

“You can’t blame me, a rookie who has only been on the job for a week, right?” Cloudflare, one of the world’s best-known network service providers, experiences occasional interruptions, as any provider does.

Typically, Cloudflare has various redundancy strategies in place, and even when an outage occurs, the impact is relatively small.

However, the recent technical glitch at Cloudflare lasted a whopping 40 hours, making it the longest interruption in Cloudflare’s history.

Consequently, Cloudflare swiftly released a blog post to analyze the causes and consequences of this incident.

The outage lasted from 11:44 UTC on November 2, 2023, to 04:25 UTC on November 4, 2023. All times mentioned below are in UTC.

Direct Causes: Data Center Power Failure and a High-Voltage Ground Fault

The interruption affected several of Cloudflare’s products, with the most severe impact on the Cloudflare Control Panel and Analytics Service. The Control Panel is where customers log in to operate Cloudflare, and the Analytics Service provides logs and analysis reports.

Although Cloudflare had redundancy in place, the immediate trigger was an unplanned maintenance event on the utility power feeding the Flexential data center that Cloudflare rents, which interrupted the facility’s utility supply. However, data centers normally have backup generators and uninterruptible power supplies (UPS) to ride through exactly this kind of situation.

Flexential’s data center holds Tier III certification, but several things went wrong after the unplanned maintenance event by the utility provider, Portland General Electric (PGE). When the utility feed went down, Flexential started its backup generators but did not notify its customers, including Cloudflare, so Cloudflare was unaware of the power problems at one of its core data centers.

Unusually, Flexential kept the remaining utility feed and its own generators running in parallel. In such situations it is generally advisable to switch fully to the backup generators, because the remaining utility feed may itself be affected in the aftermath of a supply problem. Flexential did not inform its customers of this decision, and it remains unclear why the remaining utility feed was kept in use.

Unfortunately, at around 11:40 UTC the remaining utility feed suffered a ground fault, just minutes before the Cloudflare outage began (at that point there was no outage yet, because the backup generators were still running). The fault affected the feed’s incoming transformer, a 12 kV high-voltage system, and a ground fault at that voltage is a serious event.

The ground fault triggered automatic protective shutdowns designed to safeguard the electrical equipment. Unfortunately, this protection also shut down all of the generators, so the data center lost both its utility power and its backup generator supply.

Fortunately, a bank of UPS batteries could supply power for about 10 minutes. If utility power or the generators had been restored within that window, the UPS would have bridged the gap with minimal disruption. However, the UPS batteries gave out after just 4 minutes, and Flexential had not yet brought the generators back online. As a result, the data center lost power completely.

Three factors hindered restoring the generators:

  1. Because of the high-voltage ground fault, breakers had tripped throughout the facility, and the affected equipment had to be physically accessed and restarted by hand.
  2. Flexential’s access control system had no backup power and could not operate offline, which slowed physical access.
  3. Flexential’s night shift consisted only of security personnel and a technician who had been on the job for just one week, with no experienced operators or electrical specialists on site.

Cloudflare received its first alert at 11:44 UTC, around the time the UPS batteries gave out. At this point, Cloudflare became aware of the issue and began contacting Flexential, hoping to send its own local engineers into the data center.

By 12:28 UTC, Flexential finally sent its first notification to customers, indicating that the data center had experienced an issue, and engineers were actively working on a solution.

At 12:48 UTC, Flexential restarted the generators and parts of the facility began receiving power again. However, the circuit breakers feeding Cloudflare’s equipment turned out to be damaged, whether by the ground fault or by a surge is unclear. Flexential set out to replace them, but the damage was extensive enough that new breakers had to be purchased, which took time.

Since Flexential couldn’t provide an estimated recovery time, Cloudflare decided to activate its European backup site at 13:40 UTC to restore services.

For a large system to recover quickly by failing over to a redundant site, the failover itself must be tested thoroughly; otherwise, problems surface during the transition. At this point, the problems were on Cloudflare’s side.
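As a rough illustration of what automatic failover to a redundant site can look like, here is a minimal sketch assuming hypothetical health-check endpoints, probe intervals, and thresholds; it is not Cloudflare’s actual mechanism.

```python
# Minimal failover sketch (hypothetical endpoints and thresholds): periodically probe
# the primary site and switch to a backup site after several consecutive failures.
import time
import urllib.request

PRIMARY = "https://api.primary.example.com/health"  # hypothetical primary control-plane endpoint
BACKUP = "https://api.backup.example.com/health"    # hypothetical backup site in another region
FAILURES_BEFORE_FAILOVER = 3
PROBE_INTERVAL_SECONDS = 30

def is_healthy(url, timeout=5):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def monitor():
    failures = 0
    active = PRIMARY
    while True:
        if is_healthy(active):
            failures = 0
        else:
            failures += 1
            if active == PRIMARY and failures >= FAILURES_BEFORE_FAILOVER:
                print("Primary unhealthy, failing over to backup site")
                active = BACKUP
                failures = 0
        time.sleep(PROBE_INTERVAL_SECONDS)

if __name__ == "__main__":
    monitor()
```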


Cloudflare’s Issues:

While the direct cause was related to the data center, there were also indirect factors.

Cloudflare’s fast pace of innovation meant that teams could ship new features without extensive disaster-recovery testing.

During the failover, failed API calls surged because of the outage, and the volume grew high enough that Cloudflare had to rate-limit requests until around 17:57 UTC, when the backup site was mostly operational.
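Request rate limiting of this kind is commonly implemented with a token bucket. The sketch below only illustrates the general technique, with made-up rate and burst values; it is not Cloudflare’s actual limiter.

```python
# Minimal token-bucket rate limiter: each request consumes one token, and tokens
# refill at a fixed rate up to a burst capacity. Values here are illustrative only.
import time
import threading

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = burst           # maximum burst size
        self.tokens = burst
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        """Return True if a request may proceed, False if it should be rejected (e.g. HTTP 429)."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Example: allow roughly 100 API calls per second with bursts of up to 200.
limiter = TokenBucket(rate_per_sec=100, burst=200)
accepted = sum(limiter.allow() for _ in range(500))
print(f"accepted {accepted} of 500 requests")
```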

Some newer products had not undergone full disaster recovery testing, which meant that certain services remained unavailable.

At 22:48 UTC on November 2, Flexential finally replaced the circuit breakers and restored utility power. Cloudflare’s team, which had been working non-stop, decided to take a break, since the backup site could now handle most services.

Starting on November 3, Cloudflare began restoring the Flexential data center. This meant physically powering up the network equipment, booting thousands of servers, and bringing services back online. The servers also had to be reconfigured, and rebuilding the configuration management servers alone took an additional three hours. Because some services depend on others, the order of operations was crucial.
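One common way to respect such dependencies during a restore is to topologically sort the service dependency graph so each service only starts after everything it relies on. The following is a minimal sketch with hypothetical service names, not Cloudflare’s actual tooling.

```python
# Minimal sketch: start services in dependency order using a topological sort.
# Service names and dependencies are hypothetical.
from graphlib import TopologicalSorter  # Python 3.9+

# Each service maps to the set of services it depends on.
dependencies = {
    "config-management": set(),
    "database": {"config-management"},
    "api-gateway": {"database", "config-management"},
    "dashboard": {"api-gateway"},
    "analytics": {"database"},
}

def start(service):
    print(f"starting {service} ...")

# static_order() yields each service only after all of its dependencies.
for service in TopologicalSorter(dependencies).static_order():
    start(service)
```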

Once the configuration servers were operational, engineers started working on other servers, with each server taking between 10 minutes and 2 hours to rebuild. The entire service was finally restored at 04:25 UTC on November 4.

Readers interested in operations should read Cloudflare’s original post-mortem for the full lessons learned: Cloudflare Post-Mortem on Control Plane and Analytics Outage.
