The cascading outages across the internet on December 7th hit so many organizations that they crippled businesses of every kind, from Disney amusement parks and Netflix videos to robot vacuums and Adele ticket sales.
As we reported, Amazon's cloud computing network, Amazon Web Services (AWS), suffered a major outage on Tuesday that significantly disrupted the services of major U.S. companies for more than five hours.
Amazon has now explained what caused the outage. According to the company, automated processes in its cloud computing business were chiefly to blame.
In a statement on Friday, the company said the outage began on December 7 when an automated computer program, designed to make its network more reliable, caused a “large number” of its systems to behave unexpectedly. Amazon added that this set off a surge of activity on its networks, ultimately preventing users from accessing some of its cloud services.
“Basically, a bad piece of code was executed automatically, and it caused a snowball effect,” Forrester analyst Brent Ellis said.
“The outage persisted because their internal controls and monitoring systems were taken offline by the storm of traffic caused by the original problem,” he noted.
The company further noted that the nature of the failure prevented its teams from quickly pinpointing and fixing the problem. Engineers had to rely on logs to work out what had happened, and internal tools were also affected. The response teams were “extremely deliberate” in restoring service to avoid breaking workloads that were still functioning, and they had to contend with a “latent issue” that prevented networking clients from backing off and giving systems a chance to recover.
Amazon Web Services (AWS) has temporarily disabled the scaling activity that caused the problem and will not re-enable it until fixes are in place.
The company said a fix for the latent issue would be deployed within two weeks, and that it would also add network configuration to protect its networking devices in the event of a similar failure.
Corey Quinn, a cloud economist at the Duckbill Group, said: “Amazon didn’t explain what this unexpected behaviour was, and they didn’t know what it was. So they were guessing when trying to fix it, which is why it took so long.”
The Amazon cloud division last suffered a major outage like this in 2017, when an employee mistakenly turned off more servers than intended during repairs of a billing system.
Amazon has recently been embroiled in controversy, most notably over the $1.28 billion fine imposed on the company by the Italian competition regulator AGCM, which accused Amazon of illegally abusing its market dominance.