The API Guys

Cloudflare vs AWS - The Congestion Incident (21 August)

Cloud Infrastructure · Cloudflare · AWS · Multi-Region · DevOps · Resilience

On 21 August 2025, a traffic surge from a single customer saturated every available peering connection between Cloudflare and Amazon Web Services us-east-1. For nearly four hours, customers with origin servers in that region experienced high latency, packet loss, and connection failures. This was not a DDoS attack and it was not a BGP hijack. It was legitimate traffic that simply overwhelmed the physical links connecting two of the internet's largest infrastructure providers.

If your applications sit behind Cloudflare with origins in AWS, this incident deserves your attention. Not because it was catastrophic on a global scale - Cloudflare's wider network continued operating normally - but because it exposes a class of risk that most teams never think about: what happens when the connection between your providers becomes the bottleneck.

What actually happened

At approximately 16:27 UTC, a customer began pulling a large volume of cached objects from Cloudflare to servers in AWS us-east-1. The response traffic generated by these requests was enough to saturate all direct peering connections between Cloudflare and AWS in the Ashburn, Virginia data centre region.

This alone would have been manageable had conditions been ideal. But they were not. One of the direct peering links was already operating at half capacity due to a pre-existing hardware fault. A separate Data Centre Interconnect link that connected Cloudflare's edge routers to an offsite peering switch was overdue for a capacity upgrade.

When AWS noticed the congestion on their side, they attempted to help by withdrawing BGP route advertisements from the most congested links. The intention was to redirect traffic onto less loaded paths. Instead, this pushed traffic onto the secondary interconnect paths - the very ones that were already capacity-constrained. Those links promptly saturated as well, making the situation significantly worse.

The result was a cascading congestion event. Network queues on Cloudflare's edge routers grew until they began dropping packets, including high-priority traffic. Customers saw their requests timing out, responses arriving slowly, or connections failing entirely.
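
The mechanics of that cascade can be sketched with a toy capacity model. All link names, loads, and capacities below are hypothetical, chosen only to illustrate why withdrawing routes from a congested link can saturate every remaining path:

```python
# Toy model of the cascade: withdrawing routes from one congested link
# spreads its load over the remaining paths. All figures are invented
# for illustration, not taken from the incident.

def redistribute(links, withdrawn):
    """Spread each withdrawn link's traffic evenly across the links
    still advertising routes, and report which ones saturate."""
    active = {name: load for name, (load, cap) in links.items()
              if name not in withdrawn}
    moved = sum(links[name][0] for name in withdrawn)
    share = moved / len(active)
    result = {}
    for name, load in active.items():
        cap = links[name][1]
        new_load = load + share
        result[name] = (new_load, cap, new_load > cap)
    return result

# Primary peering links already near their limits; the secondary
# interconnect degraded (reduced capacity from a hardware fault).
links = {
    "direct-1": (90, 100),   # (load Gbps, capacity Gbps)
    "direct-2": (95, 100),
    "secondary": (40, 50),   # under-provisioned backup path
}

# Withdrawing routes from the most congested link pushes its load
# onto paths with no headroom, so they saturate as well.
after = redistribute(links, withdrawn={"direct-2"})
for name, (load, cap, saturated) in after.items():
    print(f"{name}: {load:.1f}/{cap} Gbps {'SATURATED' if saturated else 'ok'}")
```

Running this shows both remaining links pushed past capacity, which is the shape of the failure Cloudflare described: the mitigation removed headroom precisely where headroom was already scarce.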

How it was resolved

Cloudflare's network team was alerted to internal congestion at 16:44 UTC, roughly 17 minutes after the surge began. However, resolving the issue was complicated by the BGP prefix withdrawals from AWS, which had removed the very routing paths that could have been used to spread the load more evenly.

The fix required close coordination between both providers. Cloudflare rate-limited the single customer responsible for the traffic surge, which began reducing congestion from around 19:05 UTC. Additional traffic engineering actions from Cloudflare's network team resolved the remaining congestion by 19:27 UTC. AWS then gradually restored the BGP prefix advertisements they had withdrawn, completing the process by 20:07 UTC. Residual latency continued until 20:18 UTC as routing tables stabilised.

From first impact to full resolution: just under four hours.

The invisible risk: inter-provider dependencies

Most teams think about resilience in terms of their own infrastructure. Is the application healthy? Are the servers responding? Is the database replicating correctly? These are important questions, but they only cover what you directly control.

This incident highlights a different category of failure entirely. Both Cloudflare and AWS were individually operating normally. Cloudflare's global network was fine. AWS us-east-1 was fine. The problem existed solely in the physical links connecting them - links that neither provider fully controls and that most customers never think about.

When you place a CDN in front of an origin server, you are implicitly depending on the peering arrangements between those two providers. These are physical connections with finite bandwidth, and they are shared across every customer whose traffic flows over the same paths. A single large customer can, as this incident proved, consume enough of that shared capacity to affect everyone else.

This is not a theoretical risk. It happened, in production, to one of the most robust CDN providers on the internet.

Why us-east-1 specifically

It is worth noting that AWS us-east-1, located in northern Virginia, is by far the most popular AWS region. It was the first region launched, it hosts the widest range of services, and many customers default to it without considering alternatives. This concentration means that peering links into us-east-1 carry disproportionate traffic loads compared to other regions.

The incident did not affect customers using other AWS regions. If your origin servers had been in eu-west-1, eu-west-2, or us-west-2, for example, you would have experienced no impact whatsoever. The congestion was entirely localised to the Cloudflare-to-AWS interconnect in Ashburn.

This is a straightforward argument for geographic distribution. If all your eggs are in us-east-1, you are exposed not only to region-level AWS outages but also to congestion events on the links feeding into that region from upstream providers.

What made this worse than it needed to be

Several factors compounded the initial congestion into a prolonged incident. A pre-existing hardware fault meant one peering link was already at reduced capacity before the surge began. Deferred infrastructure upgrades left the secondary interconnect paths without enough headroom to absorb overflow traffic. When AWS withdrew BGP routes as a mitigation measure, it inadvertently redirected traffic onto paths that were least able to handle it.

Individually, none of these would have caused a significant incident. Together, they created a situation where every available path between the two providers was simultaneously overwhelmed. This is a pattern we see repeatedly in infrastructure failures: single causes rarely bring systems down, but the combination of degraded capacity, deferred maintenance, and well-intentioned but counterproductive mitigations creates cascading failures.

We covered similar cascading dynamics in our earlier write-up on the Cloudflare R2 outage in February, where a single operator action brought down an entire service because recovery tooling depended on the very system that had failed.

Practical steps for your architecture

If this incident concerns you - and it should if you are running production workloads behind Cloudflare on AWS - there are concrete steps you can take to reduce your exposure.

Distribute your origins across regions. Rather than concentrating everything in us-east-1, spread your origin servers across multiple AWS regions. Use us-east-1 as your primary if you must, but have warm standby origins in a secondary region like us-west-2 or eu-west-2. Cloudflare's load balancing can route traffic to healthy origins automatically if one region becomes unreachable.
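
Cloudflare's load balancer implements this natively, but the decision logic is simple enough to sketch. The origin hostnames and health data below are hypothetical; this is the priority-with-failover pattern, not Cloudflare's API:

```python
# Minimal origin-failover selection: prefer the primary region, fall
# back to the first healthy standby. Hostnames are illustrative.

from typing import Optional

ORIGINS = [
    # (origin hostname, priority) - lower number wins
    ("origin-us-east-1.example.com", 0),
    ("origin-us-west-2.example.com", 1),
    ("origin-eu-west-2.example.com", 2),
]

def pick_origin(health: dict) -> Optional[str]:
    """Return the highest-priority origin that is currently healthy."""
    for host, _prio in sorted(ORIGINS, key=lambda o: o[1]):
        if health.get(host, False):
            return host
    return None  # total outage: nothing healthy to route to

# us-east-1 unreachable (e.g. its peering links are congested),
# so traffic shifts to the warm standby in us-west-2.
health = {
    "origin-us-east-1.example.com": False,
    "origin-us-west-2.example.com": True,
    "origin-eu-west-2.example.com": True,
}
print(pick_origin(health))  # -> origin-us-west-2.example.com
```

The point of the sketch: failover only works if the standby origins exist and pass health checks before the incident, which is why "warm" matters.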

Consider multi-CDN strategies for critical services. If your revenue depends on uptime, having a secondary CDN provider configured and ready to receive traffic gives you an escape route when your primary CDN's connectivity to your cloud provider degrades. DNS-based failover can switch traffic between CDN providers within minutes.
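
"Within minutes" has a concrete budget: detection lag plus DNS propagation. The numbers below are assumptions to substitute with your own TTLs and health-check settings:

```python
# Back-of-envelope worst-case switchover time for DNS-based CDN
# failover. All figures are assumptions; plug in your own config.

dns_ttl = 60            # seconds - TTL on the record pointing at the CDN
check_interval = 30     # seconds between health checks
failures_to_trip = 3    # consecutive failures before declaring an outage

detection = check_interval * failures_to_trip   # worst-case detection lag
propagation = dns_ttl                           # resolvers cache up to TTL
worst_case = detection + propagation

print(f"worst-case switchover: ~{worst_case}s ({worst_case / 60:.1f} min)")
```

With these numbers the worst case is about two and a half minutes; a long TTL on the CDN record is the usual culprit when real-world failover takes far longer.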

Monitor the paths, not just the endpoints. Standard uptime monitoring checks whether your application responds. It does not tell you whether the network path between your CDN and your origin is healthy. Tools like Cloudflare's own analytics, combined with synthetic monitoring from multiple geographic locations, can reveal latency increases and packet loss before they become outages.
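
A path check needs to look at latency and loss against a baseline, not just reachability. A minimal sketch of that logic, with illustrative thresholds to tune per service:

```python
# Flag path degradation from synthetic probe samples rather than a
# simple up/down check. Thresholds here are illustrative.

from statistics import median

def path_degraded(samples_ms, baseline_ms,
                  latency_factor=2.0, max_loss=0.05):
    """Degraded if probe loss (None = dropped probe) exceeds 5%,
    or median latency is more than double the baseline."""
    lost = sum(1 for s in samples_ms if s is None)
    if lost / len(samples_ms) > max_loss:
        return True
    answered = [s for s in samples_ms if s is not None]
    return median(answered) > baseline_ms * latency_factor

# Healthy: latency near the 40 ms baseline, no loss.
print(path_degraded([38, 41, 40, 39, 42], baseline_ms=40))         # False
# Congested interconnect: latency spikes and probes drop.
print(path_degraded([180, None, 210, 195, None], baseline_ms=40))  # True
```

Run the probes from several geographic locations against the origin directly and through the CDN; a divergence between the two is the signature of an interconnect problem like this one.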

Understand your CDN's peering arrangements. This is harder to action but worth investigating. Where does your CDN peer with your cloud provider? Are those peering points in the same facility as your origin servers? If you are using a niche region, is there direct peering at all, or does traffic transit through intermediate networks? The answers affect your exposure to exactly this kind of incident.

Build with Laravel's queue system for resilience. If your application makes API calls or processes data that can tolerate slight delays, pushing work onto queues rather than handling it synchronously means a period of elevated latency does not translate directly into user-facing errors. Laravel's queue system with database or Redis drivers gives you a buffer against transient connectivity problems.
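
In Laravel this means queued jobs with retries and backoff. The same buffering pattern, sketched in Python so it stands alone (the in-memory queue stands in for a Redis or database driver, and the retry policy is an assumption):

```python
# Buffering pattern behind queued jobs: enqueue the work and retry
# with exponential backoff, so transient origin latency does not
# surface as a user-facing error. In-memory queue for illustration.

import time
from collections import deque

queue = deque()

def enqueue(job, max_attempts=3):
    queue.append({"job": job, "attempts": 0, "max_attempts": max_attempts})

def work(backoff_base=0.1):
    """Drain the queue, retrying failed jobs with exponential backoff."""
    results = []
    while queue:
        entry = queue.popleft()
        try:
            results.append(entry["job"]())
        except Exception:
            entry["attempts"] += 1
            if entry["attempts"] < entry["max_attempts"]:
                time.sleep(backoff_base * 2 ** entry["attempts"])
                queue.append(entry)  # retry later, don't fail the user
            else:
                results.append(None)  # dead-letter in a real system
    return results

# A flaky upstream call that succeeds on the second attempt,
# standing in for an origin behind a congested interconnect.
calls = {"n": 0}
def flaky_api_call():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("origin path congested")
    return "ok"

enqueue(flaky_api_call)
print(work())  # -> ['ok']
```

During an incident like this one, the queue absorbs the latency spike and the work completes a few seconds late instead of erroring in front of the user.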

What Cloudflare is doing about it

To their credit, Cloudflare published a detailed and transparent post-mortem within days of the incident. They have committed to developing mechanisms to deprioritise individual customer traffic when it begins congesting shared links, expediting the data centre interconnect upgrades that were already planned, and coordinating with AWS to ensure their respective BGP traffic engineering actions do not conflict during future incidents.

These are sensible steps, but they are mitigations on Cloudflare's side. They do not eliminate the fundamental risk that peering links between providers have finite capacity and can be overwhelmed. That risk is inherent to how the internet works, and it is your responsibility to architect around it.

The broader lesson

This incident is a reminder that the internet is not a cloud. It is a collection of physical cables, routers, and switches connecting distinct networks. When two networks meet, there is a finite amount of bandwidth available, and that bandwidth is shared. Your application's reliability depends not just on your code, your servers, and your cloud provider, but on every physical link your traffic traverses.

Multi-region architecture is not an enterprise luxury. It is a practical necessity for any business that takes uptime seriously. The cost of running warm standby infrastructure in a second region is almost certainly less than the cost of four hours of degraded service for your customers.

If you are unsure whether your current architecture would have weathered this incident, get in touch. We help businesses design and implement resilient infrastructure that survives not just individual provider failures, but the connections between them.

Ready to Start Your Project?

Get in touch with our Leeds-based team to discuss your Laravel or API development needs.