The API Guys
[Figure: cascading service failures radiating outward from a single failed storage dependency]

Cloudflare's Workers KV Outage - A Single Point of Failure That Took Down Everything

Cloud Infrastructure · Incident Analysis · Architecture · DevOps · Resilience

On 12 June 2025, Cloudflare suffered a significant service outage lasting 2 hours and 28 minutes. This was not a DDoS attack, not a security breach, and not a botched deployment. A third-party cloud provider experienced a failure, and because Cloudflare's Workers KV service depended on that provider as its sole central data store, the blast radius was enormous. Workers KV powers configuration, authentication, and asset delivery across dozens of Cloudflare products - and when it went down, nearly everything went with it.

This is the third major Cloudflare incident we have covered in 2025, following the two R2 outages in February and March. Each has a different proximate cause, but they share a common thread: architectural decisions that allow a single failure to cascade far beyond its origin.

What is Workers KV and Why Does It Matter?

Workers KV is Cloudflare's globally distributed key-value store. It is designed as a "coreless" service, meaning it runs independently in each of Cloudflare's data centre locations worldwide. There is no single server or region that acts as a master node. In theory, this makes it highly resilient - if one location fails, others continue operating.

However, Workers KV relies on a central data store as its source of truth. Every location caches data locally, but when a cache miss occurs - what Cloudflare calls a "cold read" - the request goes back to that central store. Writes always go to the central store first. This is the architectural reality beneath the "coreless" marketing: the distributed edge layer is only as reliable as its centralised backend.
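This read path can be sketched as a read-through cache. The names here (`EdgeCache`, `Fetcher`) are illustrative assumptions, not Cloudflare's actual implementation, but they capture the failure mode: warm reads survive a backend outage, cold reads do not.

```typescript
// Illustrative sketch of a "coreless" edge layer over a centralised backend.
// All names are hypothetical; this is not Cloudflare's implementation.
type Fetcher = (key: string) => Promise<string>;

class EdgeCache {
  private cache = new Map<string, string>();

  // The central store is the single source of truth: warm reads are served
  // locally, but every cold read depends on the backend being available.
  constructor(private centralRead: Fetcher) {}

  async get(key: string): Promise<string> {
    const hit = this.cache.get(key);
    if (hit !== undefined) return hit; // warm read: survives a backend outage

    // Cold read: if the central store is down, this throws and the
    // request fails - the failure mode seen on 12 June.
    const value = await this.centralRead(key);
    this.cache.set(key, value);
    return value;
  }
}
```

Once a key is cached at the edge, a backend outage is invisible for that key; the first cold read during the outage fails outright.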

What makes Workers KV critical is not just its own functionality but how deeply embedded it is across Cloudflare's platform. Access uses it for authentication. Gateway uses it for policy configuration. WARP uses it for device registration. Turnstile uses it for challenge verification. The dashboard uses it for session management. When Workers KV fails, it does not just affect key-value storage customers - it takes down the tools Cloudflare's own teams and customers need to manage their infrastructure.

What Happened on 12 June

At 17:52 UTC, Cloudflare's internal monitoring detected that new device registrations in WARP were failing. Within minutes, error rates began spiking across multiple services. By 18:06, the engineering team had traced the root cause to Workers KV - specifically, to a failure in the third-party cloud provider that backed KV's central data store.

The incident was escalated to P1 at 18:05 and upgraded to P0 - Cloudflare's highest severity level - at 18:21. The scale of impact became clear quickly: 91% of all Workers KV requests were failing. Every service that depended on KV for cold reads or writes was either completely down or severely degraded.

What followed was a scramble to decouple critical services from the failing KV backend. The Access team began re-architecting their service to avoid KV dependency at 18:43. The Gateway team started removing their KV dependencies at 19:09. Load-shedding began at 19:32, with non-critical KV traffic being deliberately dropped to protect whatever capacity remained for essential operations.

Recovery did not begin until 20:23, when the third-party storage provider came back online. Access and Device Posture resumed normal operation at 20:25, and the incident was fully resolved at 20:28. The total duration was 2 hours and 28 minutes of global impact.

The Cascade of Failures

The list of affected services illustrates just how deeply Workers KV is woven into Cloudflare's platform:

  • Workers KV: 91% of requests failed. Only cached content continued to serve normally.
  • Access: 100% failure for all identity-based logins, including SaaS applications, self-hosted services, and SSH connections. SCIM directory synchronisation returned 500 errors.
  • WARP: No new devices could register. Existing sessions with active tokens continued working, but any session requiring re-authentication failed.
  • Gateway: DNS resolution continued for non-identity queries, but proxy services, TLS decryption, and identity-based DNS-over-HTTPS queries all failed.
  • Workers AI: Every inference request failed. AutoRAG was unavailable as a consequence.
  • AI Gateway: 97% of requests failed at peak.
  • Stream: Video playlists were unreachable. Stream Live experienced 100% failure.
  • Turnstile: CAPTCHA verification failed entirely. Kill switches were activated to prevent users being locked out, though this introduced a temporary risk of token reuse.
  • Dashboard: Login was blocked due to cascading failures in Turnstile, Access, and KV.
  • Pages: 100% build failure rate. Asset delivery saw minor error spikes.
  • Zaraz: 100% failure. Configuration updates made during the incident were lost for at least one customer.

Core network services - DNS resolution, CDN caching, WAF, DDoS protection, and Magic Transit - continued operating normally. These services do not depend on Workers KV, which is precisely why they survived. The contrast is instructive: the services that were architected independently remained available, while every service coupled to KV fell over together.

The Architectural Problem

Cloudflare's own post-mortem is admirably transparent about the underlying issue. Workers KV was in the process of being migrated to more resilient infrastructure - specifically, to Cloudflare's own R2 object storage - but the migration was incomplete. During the transition, one of KV's original third-party storage providers had been removed to prevent data consistency issues and to support data residency requirements. This left KV depending on a single remaining third-party provider.

The engineering team knew this was a risk. The migration to R2 was actively in progress. But on 12 June, the gap in coverage was exposed. The remaining provider failed, and there was no fallback.

This is a pattern we see repeatedly in production systems. An architecture is designed with redundancy, but during a transition or migration period, that redundancy is temporarily reduced. The "temporary" state becomes the actual state for longer than intended, and eventually the risk materialises. It is not that anyone made a reckless decision - it is that the timeline for removing the risk was longer than the time it took for the risk to appear.

Dogfooding as a Double-Edged Sword

Cloudflare has a principle of building on their own platform wherever possible. In normal circumstances, this is a strength - it means Cloudflare's engineers experience the same product their customers use, which creates strong incentives to keep it reliable. Workers KV is a prime example: dozens of internal services use it, which normally ensures it receives serious engineering attention.

But dogfooding also means that a failure in one foundational service cascades through everything built on top of it. When your authentication system, your dashboard, your AI inference platform, and your CAPTCHA service all depend on the same key-value store, a failure in that store does not produce isolated outages - it produces a platform-wide collapse.

This is not unique to Cloudflare. Any organisation that centralises shared services - databases, authentication providers, configuration stores, message queues - faces the same risk. The convenience and consistency of a shared service comes at the cost of a shared failure mode.

What Cloudflare Is Doing About It

Cloudflare has committed to several remediation workstreams. They are accelerating the migration of Workers KV's backend to their own R2 infrastructure, removing the dependency on any single external provider. They are building tooling to progressively restore services during storage outages, preventing the traffic surges that can overwhelm recovering systems and cause secondary failures. And they are auditing service dependencies across the platform to reduce blast radius when any single component fails.

They have also declared a "Code Orange: Fail Small" initiative, focused on ensuring that the architectural pattern behind their recent global outages is eliminated permanently. This includes making individual products resilient to failures in underlying components like Workers KV, so that a KV outage degrades rather than destroys dependent services.

What This Means for Your Architecture

You do not need to be operating at Cloudflare's scale for this incident to be relevant to your architectural decisions. The principles at play apply to any system that depends on shared services.

Map your actual dependency chain, not your intended one. If your application uses a service that depends on another service, you have a transitive dependency. It does not matter that you chose Cloudflare for reliability - if Cloudflare depends on Google Cloud, and Google Cloud goes down, your reliability guarantee is only as strong as Google's. Understand what is beneath the services you rely on.
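Mapping the transitive chain is a straightforward graph traversal. The dependency graph below is a made-up example for illustration:

```typescript
// Sketch: compute the full transitive dependency set of a service from a
// declared dependency graph. The graph below is illustrative only.
const deps: Record<string, string[]> = {
  "app": ["cloudflare-access"],
  "cloudflare-access": ["workers-kv"],
  "workers-kv": ["third-party-store"],
};

function transitiveDeps(service: string, graph: Record<string, string[]>): Set<string> {
  const seen = new Set<string>();
  const stack = [...(graph[service] ?? [])];
  while (stack.length > 0) {
    const next = stack.pop()!;
    if (seen.has(next)) continue;
    seen.add(next);
    stack.push(...(graph[next] ?? []));
  }
  return seen;
}
```

Here `transitiveDeps("app", deps)` includes `third-party-store` even though the application never chose that provider directly - which is precisely the dependency that failed on 12 June.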

Ask what happens when your central data store is unavailable. For Laravel applications, this typically means your database. Can your application serve cached responses? Can it queue writes for later? Does your authentication system fail open (allowing access but losing audit logging) or fail closed (blocking access but maintaining security)? There is no universally correct answer, but you need to have made a deliberate choice rather than discovering the behaviour during an incident.
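The fail-open versus fail-closed choice can be made explicit in code rather than left implicit in exception handling. This is a minimal sketch with hypothetical names, not a production auth layer:

```typescript
// Sketch of a deliberate fail-open / fail-closed choice when the auth
// backend is unreachable. All names are hypothetical.
type AuthMode = "fail-open" | "fail-closed";

async function authorise(
  checkToken: (token: string) => Promise<boolean>,
  token: string,
  mode: AuthMode,
): Promise<boolean> {
  try {
    return await checkToken(token);
  } catch {
    // Backend unavailable: the behaviour here must be a deliberate,
    // documented choice, not something discovered during an incident.
    // fail-open: allow access but lose the audit trail.
    // fail-closed: block access but preserve the security boundary.
    return mode === "fail-open";
  }
}
```

Turnstile's kill switches during this incident were effectively a fail-open decision made under pressure, with the token-reuse risk that choice carries.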

Be cautious during migration periods. The gap between removing an old dependency and fully establishing a new one is when your system is most vulnerable. If you are migrating databases, switching cloud providers, or replacing a third-party service, the transition period is when redundancy is at its lowest. Plan for the possibility that the risk materialises before the migration is complete.
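Where consistency requirements allow it, one way to keep redundancy through a migration window is to treat the old backend as a read fallback until the new one is proven. A minimal sketch, with illustrative names - note that Cloudflare removed their old provider precisely because of consistency and residency constraints, so this pattern is not always available:

```typescript
// Sketch: during a storage migration, keep the old provider as a read
// fallback until the new backend is fully proven. Names are illustrative.
type Store = { get(key: string): Promise<string> };

function migratingReader(newStore: Store, oldStore: Store): Store {
  return {
    async get(key: string): Promise<string> {
      try {
        return await newStore.get(key); // prefer the migration target
      } catch {
        // The fallback preserves redundancy through exactly the window in
        // which the 12 June gap existed: one backend removed, the other
        // not yet battle-tested.
        return oldStore.get(key);
      }
    },
  };
}
```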

Separate your control plane from your data plane. One of the most damaging aspects of this outage was that Cloudflare's dashboard - the tool customers needed to manage the incident from their side - was itself affected. If your monitoring, deployment, or management tools share dependencies with your production application, a failure in those dependencies takes away your ability to respond. Keep your operational tooling on a separate stack where possible.

Design for degradation, not just availability. The services that weathered this outage best were those with graceful degradation paths. Gateway could continue serving DNS while proxy failed. WARP kept existing sessions alive while blocking new registrations. The services that failed completely were those with an all-or-nothing dependency on KV. When designing your own systems, consider what partial functionality looks like and whether that is acceptable for your users during an incident.

The Broader Pattern

This is Cloudflare's third significant outage of 2025. The February and March incidents hit R2 storage due to manual operational errors. The June incident hit Workers KV due to a third-party dependency failure. The causes are different, but the outcome is the same: a single point of failure, whether human or architectural, took down services far beyond its immediate scope.

Cloudflare deserves credit for the transparency of their post-mortems. They consistently take responsibility for their architectural decisions rather than deflecting to third-party vendors, and they publish detailed technical accounts that the wider engineering community can learn from. But transparency after the fact does not help your business during the 2 hours and 28 minutes when your authentication, your API gateway, and your dashboard are all unreachable.

The takeaway is not to avoid Cloudflare - their core network services remained rock solid throughout this incident. The takeaway is to understand what you are depending on, what that dependency depends on, and what your plan is when the chain breaks. Because it will break. The question is whether you have designed your systems to bend rather than shatter when it does.

Ready to Start Your Project?

Get in touch with our Leeds-based team to discuss your Laravel or API development needs.