This weekly feature from S&P Global Market Intelligence, in collaboration with internet-service monitoring company ThousandEyes, aims to give remote workers insights into internet service disruptions.
Please note this feature will be on holiday hiatus during the remainder of 2020; it will return Monday, Jan. 11.
The number of internet outages worldwide dropped by a third last week, and by 52% within the U.S.
That improvement did not mean smooth sailing for work-from-home employees, however, due to massive failures at Alphabet Inc.'s Google and Microsoft Corp.'s online Office applications.
Within the U.S., outages among ISPs dropped to 67 for the week ended Dec. 18, from 139 the week prior, according to data from ThousandEyes, a network-monitoring service owned by Cisco Systems Inc. U.S.-based outages accounted for 37% of the worldwide total last week, down from 52% a week earlier.
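As a rough consistency check, the percentages above can be reproduced from the reported counts. The sketch below uses only figures quoted in this article; the worldwide totals it derives are back-calculated estimates from the stated U.S. shares, not numbers published by ThousandEyes.

```python
def pct_drop(before, after):
    """Percentage decline from `before` to `after`."""
    return (before - after) / before * 100


# U.S. ISP outage counts reported for the two weeks.
us_prev, us_now = 139, 67
us_drop = pct_drop(us_prev, us_now)

# Assumption: worldwide totals are inferred from the reported U.S. shares
# (52% the week prior, 37% last week), so they are only approximate.
world_prev_est = us_prev / 0.52
world_now_est = us_now / 0.37
world_drop = pct_drop(world_prev_est, world_now_est)

print(f"U.S. drop: {us_drop:.0f}%")
print(f"Implied worldwide drop: {world_drop:.0f}%")
```

The U.S. figure works out to about 52%, and the implied worldwide decline lands near 32%, in line with the "dropped by a third" description.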
The drop in outages should have indicated a week of good performance from the point of view of employees working from home. A well-running network does not guarantee the applications people rely on are also working well, however, as was demonstrated last week, according to Angelique Medina, director of product marketing for ThousandEyes.
Despite the drop in ISP outages last week, outages in applications from two of the cloud providers on which work-from-home users depend caused a high rate of disruption during business hours.
On Dec. 14, Google suffered a 47-minute outage beginning at 3:43 a.m. PT that took down Gmail, Google Docs, Sheets, Classroom and every other service listed on Google's Workspace Status Dashboard that day.
The outage lasted from 3:45 a.m. PT to 4:35 a.m. PT, according to Google's incident-report page, which said the problem arose from the imperfect application of a new quota system.
A more detailed explanation Dec. 18 said that Google had flubbed the installation of the new quota system, which is used to make the operation of hyper-scale data centers more efficient by making sure every application gets only the amount of computing power it needs to run correctly. Google installed the system in October, but it did not remove all of the old one, and it did not realize that the two systems were not able to trade information about how much computing power was needed by the applications for which each system was responsible.
Google built in a grace period during which quotas were not enforced, but it did not anticipate that, once the grace period ran out, the conflict between the two systems would cause them to dial the amount of resources available for the Google User ID Service down close to zero. Google User ID gives each user account a unique identification number, which is attached to every request, email and bit of data sent through those accounts.
Most, if not all, of Google's customer-facing applications use the User ID Service, which also connects to other applications and cloud services to allow customers to log in once to all their applications, not just those from Google.
With the quota controlling resources available to Google's User ID service dialed down too far for the service to operate correctly, however, Google servers could not verify that any emails or user requests were genuine, even those that had been authenticated ahead of time.
So Google treated almost every user request and data packet as unauthenticated and illegitimate, which meant bouncing every email and refusing every user request from every application that depended on the User ID service, according to Google's Dec. 18 post-mortem discussion of the incident.
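The failure pattern described above, in which one shared identity service starved of quota causes every dependent application to refuse requests, can be illustrated with a toy model. This is a minimal sketch, not Google's implementation: the class `UserIdService`, the function `handle_request`, the account data and the quota numbers are all hypothetical.

```python
class UserIdService:
    """Hypothetical stand-in for a shared account-ID service with a quota."""

    def __init__(self, quota):
        self.quota = quota  # lookups still permitted by the quota system
        self.accounts = {"alice@example.com": 1001}  # made-up account data

    def lookup(self, email):
        if self.quota <= 0:
            # With no resources allotted, the service cannot answer at all.
            raise RuntimeError("quota exhausted")
        self.quota -= 1
        return self.accounts[email]


def handle_request(id_service, app, email):
    """Each app refuses any request it cannot tie to a verified account."""
    try:
        user_id = id_service.lookup(email)
    except (RuntimeError, KeyError):
        return f"{app}: rejected"
    return f"{app}: ok for user {user_id}"


apps = ["mail", "docs", "sheets"]

# One shared service instance per scenario: all apps depend on the same one.
healthy_service = UserIdService(quota=100)
starved_service = UserIdService(quota=0)

healthy = [handle_request(healthy_service, app, "alice@example.com") for app in apps]
starved = [handle_request(starved_service, app, "alice@example.com") for app in apps]

print(healthy)
print(starved)
```

In the starved case every application rejects the same valid user, even though none of the applications themselves has failed.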
Google reported the incident as resolved less than an hour after it started; reports of failed logins and bounced emails at services connected to Google's remained high for two days afterward, however.
Microsoft also had problems with user access and authentication last week. On Dec. 17, at around 1:45 a.m. ET, Microsoft experienced an outage that affected access to some Microsoft services, including Office 365. The outage lasted around 13 minutes, clearing around 2:00 a.m. ET, and centered on Microsoft infrastructure in Des Moines, Iowa, according to ThousandEyes data.
There was a significant U.S.-based internet outage on Dec. 14 as well. NTT America, the North American subsidiary of Japanese telecommunications provider Nippon Telegraph and Telephone Corp. (NTT), suffered an outage beginning at about 8:30 a.m. ET in parts of the NTT networks in Los Angeles and Seattle. The outage quickly spread to other portions of NTT's U.S. network and to Germany, Brazil, the U.K. and Canada. The outage lasted just over 19 minutes and was cleared at around 8:50 a.m. ET.
The NTT outage was a genuine, physical problem preventing data from flowing through the network, which could have completely cut off portions of the U.S.-based internet.
The physical internet is designed for failure, however, Medina said. There is almost never just one pathway between two points on the internet. When there is a blockage, routers automatically redirect internet traffic to avoid the problem, even if the data takes a few extra microseconds to reach its destination.
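The idea of routing around a blockage can be sketched with a small graph search. This is a simplified illustration only: real routers rely on routing protocols rather than the breadth-first search used here, and the five-node topology (A through E) is hypothetical.

```python
from collections import deque


def shortest_path(links, src, dst):
    """Breadth-first search over undirected links; returns a hop list or None."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        # Sorted so the result is deterministic when paths tie in length.
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical network: a short route A-B-E and a longer one A-C-D-E.
links = [("A", "B"), ("B", "E"), ("A", "C"), ("C", "D"), ("D", "E")]
normal_route = shortest_path(links, "A", "E")

# Simulate a blockage on the B-E link and search again.
degraded = [link for link in links if link != ("B", "E")]
detour_route = shortest_path(degraded, "A", "E")

print(normal_route)
print(detour_route)
```

With the B-E link removed, traffic from A still reaches E, only over a longer path through C and D.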
Cloud-based applications rarely fail because they typically run on a large number of servers and can keep running even when some of the hardware on which they depend fails, Medina said.
Cloud-based applications also depend on other applications owned by the cloud providers, however.
When one of those is so critical that other applications cannot function without it, as was the case with Google User ID, that single point of failure is able, for a short time at least, to counter all the resilience, fail-over capability and reliability functions built into both the internet and the cloud, Medina said.