Explained: The 'basic software bug' that caused the Amazon Web Services outage, bringing down half the internet worldwide

Amazon Web Services (AWS) – Amazon’s cloud unit – has published a detailed explanation, pinpointing a bug in its automation software as the root cause of the hours-long outage this week that took down thousands of popular sites and apps.

AWS confirmed that a cascading series of events was triggered by a basic software flaw in the system that manages its vast digital database. According to the company, the outage began when customers lost the ability to connect to DynamoDB, AWS's core database system. This was due to a “latent defect within the service’s automated Domain Name System (DNS) management system.”

Here’s an explanation of what went wrong with AWS’s infrastructure:

The outage actually involved multiple problems happening over a 15-hour period, as per the explanation. AWS says that DynamoDB relies on automation to manage hundreds of thousands of DNS records, ensuring that server capacity is constantly updated and traffic is efficiently distributed. Think of DynamoDB as an extremely fast, massive database that many other Amazon services (and customer applications) rely on to store and quickly look up crucial data. The core problem was a bug in DynamoDB's automated system that manages its thousands of internal addresses (like a digital phone book).
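As a rough, hypothetical sketch of what that "digital phone book" bookkeeping looks like (the endpoint name, addresses and functions below are invented for illustration, not AWS's actual code), the automation essentially keeps a mapping from an endpoint name to the set of server addresses currently behind it, and rewrites that mapping as capacity changes:

```python
# Illustrative only: a toy model of automated DNS bookkeeping for one service endpoint.
# The endpoint name, addresses and functions are hypothetical, not AWS's real code.

dns_records = {
    "dynamodb.us-east-1.example.internal": ["10.0.1.12", "10.0.1.57", "10.0.2.8"],
}

def add_server(records, endpoint, ip):
    """New capacity comes online: the automation adds its address to the record."""
    records.setdefault(endpoint, []).append(ip)

def remove_server(records, endpoint, ip):
    """A broken server is drained: the automation removes its address."""
    if ip in records.get(endpoint, []):
        records[endpoint].remove(ip)

add_server(dns_records, "dynamodb.us-east-1.example.internal", "10.0.3.44")
remove_server(dns_records, "dynamodb.us-east-1.example.internal", "10.0.1.57")
print(dns_records)
```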

The specific cause of the issue was an empty DNS record for the critical US-East-1 (Virginia) data centre region. DNS, or Domain Name System, is the internet's directory that converts human-readable domain names into machine-readable IP addresses, which allow computers to locate and connect to websites. Services like DynamoDB use a huge, constantly changing list of addresses, called DNS records, to help route traffic correctly to all their servers.
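To see why an empty record is so damaging, the short sketch below uses Python's standard socket module to do what every client does before connecting: resolve a hostname to IP addresses. If the lookup returns nothing, the client has nowhere to send its requests. (This is a generic illustration, not part of AWS's report; the hostname shown is the region's public DynamoDB endpoint.)

```python
import socket

def resolve(hostname):
    """Resolve a hostname to IP addresses, as any client does before opening a connection."""
    try:
        results = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({entry[4][0] for entry in results})
    except socket.gaierror:
        # A missing or empty DNS record surfaces here: no addresses, so no connection.
        return []

addresses = resolve("dynamodb.us-east-1.amazonaws.com")
if addresses:
    print("Endpoint resolves to:", addresses)
else:
    print("No addresses returned: clients cannot reach the service")
```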

Automated systems, called Enactors, are constantly running to keep this list fresh, adding new servers and removing broken ones. In AWS's systems, two of these automated Enactors were working on updating the address list at the same time. While Enactor 1 got stuck with unusual delays, Enactor 2 sped through and successfully applied a newer, correct plan to all the addresses. When it finished, it cleaned up the system by deleting the very old, outdated plans. The delayed Enactor 1 then applied its stale plan over the newer one, and the clean-up process removed that plan as well, leaving the regional endpoint with the empty DNS record described above.

AWS stated that the resulting failure did not self-repair automatically and ultimately required manual operator intervention to correct the error. In the wake of the disruption, AWS immediately disabled the DynamoDB DNS automation systems.

The second problem was with the Network Load Balancer (NLB). Consider it a "traffic cop" that distributes incoming network requests across a group of healthy servers or resources. During the outage, some NLBs started failing to properly check the health of their backend servers. When a load balancer thinks its servers are unhealthy (even if they aren't), it stops sending traffic to them, leading to increased connection errors for users trying to reach those services.

In short, the entire regional infrastructure suffered from a domino effect, starting with a subtle, timing-dependent bug in the core DynamoDB service that triggered massive instability.
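To make the race more concrete, here is a minimal, hypothetical sketch of how two concurrent updaters plus an aggressive clean-up step can leave an endpoint with no addresses at all. The names, plan versions and data structures are invented for illustration; this is not AWS's Enactor code:

```python
import threading
import time

# Toy model of the DNS record for one regional endpoint:
# endpoint name -> (plan_version, list of IP addresses). All names are hypothetical.
ENDPOINT = "dynamodb.us-east-1.example.internal"
records = {ENDPOINT: (1, ["10.0.1.12"])}
lock = threading.Lock()

def enactor(name, plan_version, ips, delay=0.0):
    """Apply a DNS update 'plan' for the endpoint, possibly after an unusual delay."""
    time.sleep(delay)
    with lock:
        records[ENDPOINT] = (plan_version, ips)
        print(f"{name} applied plan v{plan_version}")

def cleanup(latest_version):
    """Delete data belonging to plans older than the newest one this process applied."""
    with lock:
        version, _ips = records[ENDPOINT]
        if version < latest_version:
            records[ENDPOINT] = (version, [])  # stale plan removed -> empty record
            print(f"cleanup emptied the record (v{version} < v{latest_version})")

# Enactor 2 races ahead with the newer plan (v3); Enactor 1 is stuck holding an older plan (v2).
fast = threading.Thread(target=enactor, args=("enactor-2", 3, ["10.0.1.57"]))
slow = threading.Thread(target=enactor, args=("enactor-1", 2, ["10.0.1.12"], 0.2))
fast.start(); slow.start()
fast.join(); slow.join()      # the delayed, stale write lands after the newer one
cleanup(latest_version=3)     # clean-up now sees an 'old' plan and deletes its addresses

print(records[ENDPOINT])      # (2, []) -- the endpoint is left with no addresses
```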
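The load-balancer side of the failure can be pictured just as simply: when health checks start failing, the balancer withdraws those targets and callers see connection errors, whether or not the servers are actually broken. Again, this is a hypothetical sketch, not AWS's NLB implementation:

```python
def route_request(targets, health_check):
    """Send traffic only to targets that pass their health check."""
    healthy = [t for t in targets if health_check(t)]
    if not healthy:
        raise ConnectionError("no healthy targets: callers get connection errors")
    return healthy[0]  # a real balancer would spread load across all healthy targets

servers = ["10.0.3.10", "10.0.3.11"]

# Normal operation: the check reflects reality and traffic flows.
print(route_request(servers, health_check=lambda t: True))

# During the incident: checks fail even though the servers are fine,
# so the balancer stops sending traffic and clients see errors.
try:
    route_request(servers, health_check=lambda t: False)
except ConnectionError as err:
    print(err)
```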
