
This morning, we experienced a service disruption that affected our customers. Our Load Balancer service, the component we designed to be our most reliable, failed to meet that promise. While several other services were impacted (Analytics, Logs, and Transaction Monitoring), I want to focus on the Load Balancer failure, as its reliability is fundamental to our service commitments.
Load Balancer Design Context
The Ironforge Load Balancer was architected with reliability as its foundation. Our design philosophy emphasized minimal dependencies for serving RPC requests, with the only external requirements being access to routing strategies and RPC endpoint data. All non-critical operations (analytics, logging, alerts, etc.) were intentionally decoupled from the core request-serving path.
Additionally, to ensure resilience, we implemented a caching system that stores routing strategies and RPC endpoint data in memory and in Cloudflare's regional data centers. Cached data is marked stale after 15 seconds and revalidated every 15 seconds, but remains usable for up to six hours in case our database is down.
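To make those numbers concrete, here is a minimal TypeScript sketch of the stale-while-revalidate pattern described above. It is illustrative only: the `RouteConfig` shape, the `fetchConfigFromDatabase` helper, and the endpoint URL are placeholders, not our production code.

```typescript
// Illustrative stale-while-revalidate cache (not our production code).
// Entries stay usable for up to six hours, but are considered stale after
// 15 seconds and are refreshed whenever they are read while stale.

type RouteConfig = { strategies: string[]; endpoints: string[] }; // hypothetical shape

interface CacheEntry {
  value: RouteConfig;
  fetchedAt: number; // epoch ms
}

const STALE_AFTER_MS = 15_000;           // marked stale after 15 seconds
const MAX_AGE_MS = 6 * 60 * 60 * 1_000;  // usable for up to six hours

const cache = new Map<string, CacheEntry>();

// Hypothetical fetcher standing in for our database API.
async function fetchConfigFromDatabase(apiKey: string): Promise<RouteConfig> {
  const res = await fetch(`https://db.example.com/config/${apiKey}`); // placeholder URL
  if (!res.ok) throw new Error(`config fetch failed: ${res.status}`);
  return (await res.json()) as RouteConfig;
}

async function getConfig(apiKey: string): Promise<RouteConfig | null> {
  const entry = cache.get(apiKey);
  const now = Date.now();

  if (entry && now - entry.fetchedAt < STALE_AFTER_MS) {
    return entry.value; // fresh: serve directly
  }

  if (entry && now - entry.fetchedAt < MAX_AGE_MS) {
    // Stale but usable: serve immediately, revalidate in the background.
    void revalidate(apiKey);
    return entry.value;
  }

  // No usable entry: fetch synchronously.
  return revalidate(apiKey);
}

async function revalidate(apiKey: string): Promise<RouteConfig | null> {
  try {
    const value = await fetchConfigFromDatabase(apiKey);
    cache.set(apiKey, { value, fetchedAt: Date.now() });
    return value;
  } catch {
    // Database unreachable: keep serving the stale entry rather than purging it.
    return cache.get(apiKey)?.value ?? null;
  }
}
```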
In recent weeks, we had been working to enhance this system further by replicating critical data across three regional clusters (NY, AMS, SIG). This improvement was meant to increase our theoretical reliability by removing single-region dependencies. In other words, if our AMS database cluster is down, we can still revalidate the cached data from our NY cluster.
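A rough sketch of that multi-region fallback, building on the types and cache from the previous snippet (the cluster URLs here are placeholders), could look like this:

```typescript
// Illustrative multi-region revalidation: try each cluster in order and fall
// back to the next if one is unreachable. URLs are hypothetical placeholders.
const CLUSTER_ENDPOINTS = [
  "https://db-ny.example.com",
  "https://db-ams.example.com",
  "https://db-sig.example.com",
];

async function fetchConfigWithFallback(apiKey: string): Promise<RouteConfig> {
  let lastError: unknown;
  for (const base of CLUSTER_ENDPOINTS) {
    try {
      const res = await fetch(`${base}/config/${apiKey}`);
      if (!res.ok) throw new Error(`cluster ${base} responded ${res.status}`);
      return (await res.json()) as RouteConfig;
    } catch (err) {
      lastError = err; // this cluster is down or misbehaving; try the next one
    }
  }
  // Every cluster failed; the caller keeps serving the stale cached entry.
  throw lastError;
}
```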
All of this meant that the Load Balancer could keep serving RPC requests with stale configuration data for up to six hours even if our database cluster were down, an event that, with three independent clusters, is extremely unlikely. Combined with our gradual blue/green deployment strategy for the Load Balancer codebase, this gives our customers a very reliable load balancing service, something we have demonstrated over the past year.
The Incident
During this morning's deployment of the new, more distributed database cluster solution, we made a critical error in our DNS configuration. This caused our database API to return 404 errors. Our system incorrectly interpreted these 404s as indicating invalid API keys, so when our background process attempted to refresh the cached data, it encountered these 404 errors and began systematically purging cache data for all API keys.
We failed to anticipate this scenario in our error handling logic. Our assumption that all 404 responses indicated invalid API keys was fundamentally flawed, and this oversight led to a cascading failure that affected our customers' operations.
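In simplified form, the flawed refresh logic behaved roughly like this. This is a reconstruction for illustration, built on the earlier sketch, not our actual code:

```typescript
// Simplified reconstruction of the flawed assumption: any 404 from the
// database API was treated as "this API key no longer exists", so the
// background refresh purged the cached entry for it.
async function refreshEntryFlawed(apiKey: string): Promise<void> {
  const res = await fetch(`https://db.example.com/config/${apiKey}`); // placeholder URL

  if (res.status === 404) {
    // WRONG: this morning's 404s came from a DNS misconfiguration in front of
    // the API, not from missing API keys, yet the cache entry was deleted anyway.
    cache.delete(apiKey);
    return;
  }

  if (res.ok) {
    cache.set(apiKey, { value: (await res.json()) as RouteConfig, fetchedAt: Date.now() });
  }
}
```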
Resolution
While the technical fix was straightforward – implementing proper error handling that explicitly identifies invalid API keys rather than relying on a generic Not Found response – the incident's duration was extended by DNS propagation times.
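In the same sketch form, the corrected handling only evicts a cache entry when the API explicitly says the key is invalid; the `INVALID_API_KEY` error code below is a placeholder for whatever explicit signal the API returns:

```typescript
// Sketch of the corrected handling: only an explicit, well-formed
// "invalid API key" response may evict a cache entry; any other failure
// (including a bare 404 from a misrouted request) leaves the cache untouched.
async function refreshEntrySafe(apiKey: string): Promise<void> {
  const res = await fetch(`https://db.example.com/config/${apiKey}`); // placeholder URL

  if (res.status === 404) {
    const body = await res.json().catch(() => null);
    // Hypothetical error shape: the API must explicitly say the key is invalid.
    if (body && body.error === "INVALID_API_KEY") {
      cache.delete(apiKey);
    }
    // Otherwise: treat it as an infrastructure error and keep the stale entry.
    return;
  }

  if (res.ok) {
    cache.set(apiKey, { value: (await res.json()) as RouteConfig, fetchedAt: Date.now() });
  }
  // Non-404 failures also keep the existing cache entry.
}
```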
In total, our Load Balancer was not serving requests for about 10 to 15 minutes, and our Analytics, Logs, and Transaction Monitoring services were affected for about 1 hour.
Learning and Commitment
This incident has taught us several humbling lessons about our system design and deployment practices:
Our error handling made dangerous assumptions. We failed to distinguish between different causes of seemingly identical errors.
Our cache invalidation system lacked crucial safeguards. We should have had mechanisms in place to prevent mass cache invalidation; a sketch of one such safeguard follows below.
Our deployment procedures for DNS changes need significant improvement. We underestimated the potential impact of DNS changes on our system's reliability.
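One example of the kind of safeguard mentioned above, sketched under the same assumptions as the earlier snippets, is a circuit breaker that refuses to evict more than a small fraction of cached keys in a single refresh cycle. The 1% threshold and the alerting mechanism here are illustrative choices, not decisions we have finalized.

```typescript
// Illustrative mass-invalidation guard: if a single refresh cycle tries to
// evict more than 1% of all cached keys, stop evicting and raise an alert
// instead of wiping the cache. Threshold and alerting are hypothetical.
const MAX_EVICTION_RATIO = 0.01;

function runRefreshCycle(evictionCandidates: string[]): void {
  const total = cache.size;
  const limit = Math.max(1, Math.floor(total * MAX_EVICTION_RATIO));

  if (evictionCandidates.length > limit) {
    // Something is almost certainly wrong upstream (e.g. the API itself is
    // failing), so keep serving stale data and page a human instead.
    console.error(
      `refusing to evict ${evictionCandidates.length}/${total} keys (limit ${limit})`
    );
    return;
  }

  for (const key of evictionCandidates) {
    cache.delete(key);
  }
}
```

The exact threshold matters less than having a hard stop that turns a would-be mass purge into an alert a human has to act on.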
We are immediately reviewing and implementing fixes across all the affected components and systems. All feature development has stopped until we are confident that we have done everything in our power to prevent this from happening again, which may take a few days.
Moving Forward
I deeply regret the impact this incident had on our customers. While we designed our Load Balancer to be our most reliable service, we failed to uphold that standard today. Our customers trusted us with their critical infrastructure, and we fell short of both their expectations and our own standards. We understand the gravity of this failure and are taking concrete steps to rebuild that trust through improved system design, better safeguards, and more robust deployment procedures. We are committed to learning from this failure and implementing systemic changes to prevent similar incidents in the future.
Italo.