Cloudflare R2 Outage (6 Feb) - What Happened and What It Means for Your Infrastructure
On Thursday 6 February 2025, Cloudflare's R2 object storage service went completely dark for 59 minutes. Every single operation - uploads, downloads, metadata requests - returned errors. And it wasn't caused by a sophisticated attack or a catastrophic hardware failure. It was caused by a human being clicking the wrong thing during a routine abuse report.
If your applications depend on R2 (or any single cloud storage provider), this incident is worth understanding properly. Not just the "what", but the "so what" - because the lessons here apply to every team building on cloud infrastructure.
What actually happened
Cloudflare received a report about a phishing site hosted on R2. Standard stuff - their abuse team handles these regularly. But when an operator went to disable the specific endpoint associated with the phishing report, the system allowed them to disable the entire R2 Gateway service instead - the HTTP frontend that serves the whole R2 API.
At 08:12 UTC, the R2 Gateway was shut off. By 08:14 UTC, every R2 customer worldwide was experiencing a 100% failure rate. The service did not partially degrade - it fell over completely.
It took until 08:42 UTC for the on-call team to identify the root cause, primarily by reviewing deployment history and configuration changes. Then came a painful discovery: the internal admin tooling they would normally use to reverse the action itself depended on R2, so it was also down. The team had to escalate to an operations team with lower-level system access, who finally re-enabled the Gateway and triggered a redeployment at 09:09 UTC. Service recovered by 09:13 UTC.
The cascade effect
R2 going down didn't just affect R2. Cloudflare's own services that depend on R2 fell like dominoes. Stream and Images experienced 100% failure during the outage window. Cache Reserve customers saw increased requests hitting their origins directly. Log Delivery suffered delays of up to an hour and data loss of up to 13.6% for R2-destined jobs. Vectorize saw 75% query failures and 100% write failures.
Even after R2 came back online, the stampede of reconnecting clients caused a secondary spike in errors on Durable Objects, adding another 23 minutes of residual impact.
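That reconnect stampede is a classic thundering-herd problem, and the standard client-side mitigation is retrying with exponential backoff plus full jitter, so a freshly recovered service isn't hit by every client at the same instant. Here's a minimal sketch in Python (the function names and parameters are ours, purely for illustration):

```python
import random
import time


def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 6):
    """Yield "full jitter" delays: a random wait between 0 and
    min(cap, base * 2**attempt). Randomising the wait spreads
    reconnecting clients across a window instead of a single spike."""
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))


def reconnect_with_backoff(connect, attempts: int = 6,
                           base: float = 0.5, cap: float = 30.0) -> bool:
    """Retry `connect` (a callable returning True on success) with
    jittered exponential backoff. Returns False if all attempts fail."""
    for delay in backoff_delays(base, cap, attempts):
        if connect():
            return True
        time.sleep(delay)
    return False
```

In practice you would also want to honour `Retry-After` headers and give up with a user-visible error after a bounded total wait.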
This is a textbook example of cascading failure. When a foundational service goes down, everything built on top of it crumbles.
Why this matters for your architecture
Let's be clear - this could happen to any cloud provider. AWS S3 has had outages. Google Cloud Storage has had outages. Azure Blob Storage has had outages. The question isn't whether your storage provider will have an incident, it's what happens to your application when it does.
Here are the practical takeaways we think every team should consider.
Don't put all your eggs in one basket
If your entire application's file storage, asset delivery, logging, and caching all depend on a single service from a single provider, you have a single point of failure. R2 is excellent - we use Cloudflare services ourselves - but relying on it exclusively for every storage need means a single outage can take down everything.
Consider spreading critical storage across providers. Your primary asset storage might live on R2, but your backups could sit on S3 or a dedicated backup service. Your logs could be shipped to a separate destination. The goal isn't to avoid Cloudflare - it's to ensure no single failure takes down your entire operation.
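As a rough sketch of what cross-provider replication can look like, here's a hypothetical wrapper that writes to a primary store and mirrors each write to a backup, while reads fall back to the backup during a primary outage. The `put`/`get` interface is invented for illustration - real SDK calls (R2's S3-compatible API, boto3, etc.) will differ:

```python
class ReplicatedStore:
    """Replicate writes across two independent storage providers.

    `primary` and `backup` are any objects exposing put(key, data) and
    get(key) - e.g. thin wrappers around an R2 client and an S3 client.
    """

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def put(self, key: str, data: bytes) -> None:
        self.primary.put(key, data)     # must succeed, or the caller sees the error
        try:
            self.backup.put(key, data)  # best-effort mirror
        except Exception as exc:
            # In production: emit an alert and queue the object for re-mirroring.
            print(f"mirror write failed for {key}: {exc}")

    def get(self, key: str) -> bytes:
        try:
            return self.primary.get(key)
        except Exception:
            return self.backup.get(key)  # reads survive a primary outage
```

The design choice here is deliberate: a backup failure degrades durability but never blocks the write path, while a primary read failure is absorbed silently by the fallback.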
Build for graceful degradation
When R2 went down, Cloudflare's Cache Reserve actually handled it reasonably well - requests that couldn't be served from cache simply fell back to the origin server. That's graceful degradation in action. Your application should behave similarly.
Ask yourself: if your object storage returns a 500 error right now, what does the user see? If the answer is a broken page or a completely unusable application, you have work to do. Consider implementing local caching layers, fallback storage providers, or at minimum, user-friendly error states that communicate the issue without exposing a stack trace.
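One simple degradation pattern is "serve stale on error": keep a local copy of recently fetched objects and fall back to it when storage misbehaves, returning a friendly error only when there's nothing cached. A sketch, assuming an in-process dict as the cache purely for illustration (a real deployment would use something bounded and shared, like Redis or an on-disk cache):

```python
_cache: dict[str, bytes] = {}  # stale-while-error cache (illustrative only)


def fetch_asset(key: str, storage_get):
    """Return (body, status). `storage_get(key)` is assumed to raise on
    a storage outage, like most SDK clients do."""
    try:
        body = storage_get(key)
        _cache[key] = body           # refresh the local copy on every good read
        return body, 200
    except Exception:
        if key in _cache:
            return _cache[key], 200  # degrade: serve a possibly-stale copy
        # No fallback available: a friendly message, not a stack trace.
        return b"Temporarily unavailable, please retry shortly.", 503
```

Users served a slightly stale asset rarely notice; users served a raw 500 always do.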
Your recovery tools shouldn't depend on the thing that's broken
One of the most striking details in Cloudflare's post-mortem was that their admin tooling for re-enabling R2 depended on R2 itself. When the service went down, so did their ability to quickly fix it. This added precious minutes to the recovery time.
Audit your own recovery and deployment pipelines. If your CI/CD stores build artefacts in the same storage that just went down, can you still deploy a fix? If your monitoring dashboards depend on the service that's failing, can you still diagnose the problem? Keep your recovery tools independent of the systems they're designed to recover.
Safeguards against human error are not optional
This outage wasn't caused by a bug in R2's storage engine or a network partition. It was caused by a single operator action that the system should never have allowed. Cloudflare themselves acknowledged that their abuse processing systems lacked safeguards to identify internal production accounts and block destructive actions against them.
If you run infrastructure of any scale, think about what safeguards exist between a human operator and a catastrophic action. Two-person approval for destructive operations, clear labelling of production vs. development environments, and restricted permissions for high-risk actions are not bureaucratic overhead. They're your safety net.
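To make the idea concrete, here's a toy sketch of a two-person approval gate combined with production-account tagging. The account names, action IDs, and interface are all invented for illustration - the point is only that the check lives in code, between the operator and the destructive action:

```python
PRODUCTION_ACCOUNTS = {"r2-gateway", "billing-api"}  # hypothetical tagged accounts


class TwoPersonGuard:
    """Require two distinct approvers before a destructive action runs,
    and refuse destructive actions against tagged production accounts."""

    def __init__(self):
        self.approvals: dict[str, set[str]] = {}

    def approve(self, action_id: str, operator: str) -> None:
        # A set keyed by operator name means the same person
        # approving twice still counts as one approval.
        self.approvals.setdefault(action_id, set()).add(operator)

    def execute(self, action_id: str, target: str, action):
        if target in PRODUCTION_ACCOUNTS:
            raise PermissionError(f"{target} is a production account")
        if len(self.approvals.get(action_id, set())) < 2:
            raise PermissionError("two distinct approvers required")
        return action()
```

Note that the production-account check runs first and cannot be approved around - exactly the guardrail that was missing in this incident.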
Monitor and alert on dependencies, not just your own code
Many teams affected by this outage likely had green dashboards for their own application code while their users were seeing errors. If you depend on a third-party service, you need to monitor that dependency explicitly. Health checks against your R2 buckets, latency tracking on storage operations, and alerts on error rate spikes from external services should all be part of your observability stack.
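A minimal version of this is a wrapper that records the outcome and latency of every call to a dependency and flags when the recent error rate crosses a threshold. The class below is an illustrative sketch, not a substitute for a real observability stack, but it shows the shape of the thing:

```python
import time
from collections import deque


class DependencyMonitor:
    """Track success and latency of calls to an external dependency over
    a sliding window, and flag when the error rate crosses a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.5):
        self.results = deque(maxlen=window)  # (ok, latency_seconds) pairs
        self.threshold = threshold

    def record(self, fn, *args):
        """Call `fn(*args)`, recording outcome and latency either way.
        Exceptions are re-raised so callers still see the failure."""
        start = time.monotonic()
        ok = False
        try:
            value = fn(*args)
            ok = True
            return value
        finally:
            self.results.append((ok, time.monotonic() - start))

    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(1 for ok, _ in self.results if not ok) / len(self.results)

    def alerting(self) -> bool:
        return self.error_rate() >= self.threshold
```

Wrap your storage calls in something like this and you get a dashboard that turns red when R2 (or S3, or anything else) starts failing - regardless of how healthy your own code is.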
What Cloudflare are doing about it
To their credit, Cloudflare published a thorough and transparent post-mortem. They've committed to deploying additional guardrails in their Admin API, disabling high-risk manual actions in the abuse review interface, implementing two-party approval for critical actions, and improving internal account tagging to prevent production services from being accidentally disabled.
These are sensible steps. But as consumers of cloud services, we shouldn't rely solely on our providers getting it right every time. We need to build our own resilience.
The bottom line
The Cloudflare R2 outage is a reminder that cloud services - no matter how reliable their track record - can and will experience incidents. The 59-minute window might seem short, but for businesses running real-time applications, processing payments, or serving media, it's an eternity.
Build your infrastructure with the assumption that any single component can fail at any time. Diversify your dependencies, design for graceful degradation, keep your recovery tools independent, and invest in proper safeguards. Your future self will thank you when the next outage hits - and it will.
If you're unsure about the resilience of your current setup or want to discuss how to architect your API infrastructure for high availability, get in touch with us. It's what we do.
