When the Cloud Evaporates
Mon, October 31, 2016 at 15:59
generalnetworkerror in cloud, high availability
Recent reports of a DDoS attack against a major DNS provider have exposed the weakness of depending too heavily on a single provider of cloud services.  Even more apparent, and shocking to say the least, was how much major Internet content providers such as Twitter and Netflix, to name a couple, depended on these SaaS and IaaS providers without sufficient redundancy.  Assuming that major cloud services are too big to fail, and that their high availability measures will prevent catastrophic failure, was asking for trouble.  Just as real clouds form, drift along, then evaporate, we should not expect cloud providers to be there 100% of the time.
Dyn’s DNS infrastructure experienced a massive DDoS attack on Oct 21, 2016.  Major Internet content providers rely on Dyn’s name servers and their large, global footprint to provide the high availability you would expect.  However, a basic tenet of redundancy was ignored by engineers expecting this large-scale service to be too big to fail.  In the case of Twitter, all of their published name servers, easily obtained through WHOIS or dig, showed they were (and continue to be at the time of this writing) entirely dependent on Dyn’s name servers.  No one expected that a [near] record-setting DDoS against a single provider could have such a large impact across many large-scale sites.
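To make the single-provider check concrete, here is a minimal sketch in Python.  It assumes you have already copied a domain's NS hostnames out of WHOIS or dig output; the hostnames shown are hypothetical placeholders, and treating the last two labels of a name server's hostname as "the provider" is a rough heuristic of mine (it misjudges suffixes such as co.uk), not an authoritative method.

```python
from collections import defaultdict


def provider_of(ns_host: str) -> str:
    """Rough provider key: the last two labels of the name server's hostname."""
    labels = ns_host.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def group_by_provider(ns_hosts):
    """Map each provider key to the name servers that belong to it."""
    groups = defaultdict(list)
    for host in ns_hosts:
        groups[provider_of(host)].append(host)
    return dict(groups)


def is_single_provider(ns_hosts) -> bool:
    """True when every published name server sits with the same provider."""
    return len(group_by_provider(ns_hosts)) <= 1


if __name__ == "__main__":
    # Hypothetical NS records for illustration only.
    published = ["ns1.dns-a.example.", "ns2.dns-a.example.", "ns3.dns-a.example."]
    if is_single_provider(published):
        print("All name servers depend on one provider:", group_by_provider(published))
```

A site whose name servers all fall into one group has the same exposure described above, no matter how many servers that group contains.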


In retrospect, the attack made seven lessons apparent:

  1. Embrace the distributed design of the Internet
     The original design of the Internet, long before WWW was anything more than a typo caused by someone accidentally pressing two too many W’s on their keyboard, grew out of the desire to have multiple paths to reach services on the Internet.  Web sites should not do anything contrary to this design, and expecting large-scale services not to fail because of their size is contrary to it.

  2. Consolidation of services is a hacker’s target
     The old adage that the larger they are, the harder they fall played out perfectly on Oct 21, much to the delight of the hackers orchestrating the attack.  One large DNS provider affected so many sites that many bloggers and journalists falsely claimed half the Internet was down and the sky was falling.  To achieve maximum uptime, engineers know to eliminate single points of failure, but these same engineers often don’t consider that an entire cloud service could go offline.

  3. Major sites like Twitter shouldn’t use a single DNS provider
     Sites like Twitter need to use multiple providers.  I know it’s simple and elegant to deal with only one large service provider, where deeper discounts and leverage may exist, but the very issue with Dyn wiped Twitter off the map (depending on where you were on that map).

  4. For all the redundancy that a SaaS or IaaS provider’s availability zones or regions offer, it’s still vulnerable to actors who put their victims in the sights of their botnets
     Netflix, Amazon’s largest customer, saw impact on Oct 21 from Amazon’s reliance on third-party DNS providers like Dyn.  Netflix should learn from this experience that Amazon could not provide the level of service required, and should seek out a higher level of availability for both DNS itself and cloud computing.  Netflix is large enough that other cloud providers augmenting their services on Amazon may serve them well, but it’s possible they’ve already completed this exercise and concluded that it’s more economical for them to take an occasional hit from attacks than to maintain the additional overhead of multiple DNS, compute, and storage providers.

  5. Ticketing systems in the cloud can become victims, preventing internal communication on issues
     Some of the impacted sites were unable to communicate internally because their ticketing systems or chat services also relied on the same name servers.  During triage, it’s important to understand that a major event may knock offline not only your customers’ service but also your ability to manage that event with your internal staff.

  6. IoT devices need firmware bandwidth rate limiting and ACLs
     As recent attacks make increasingly apparent, IoT (Internet of Things) devices are prevalent, cheap, and easily hackable, which translates into a lot of them being available for use in a botnet.  The tech industry needs standards for how much bandwidth consumer devices should be allowed to consume, enforced either by firmware on the devices themselves or by upstream devices, or preferably both.  And ACLs (access control lists) should be easy to provision on home routers so that a web camera cannot send a flood of DNS queries out via udp/53 to a large list of addresses.  The camera’s DNS queries should be restricted to the resolver on the router.  I’d like to see a standard protocol developed where IoT devices maintain a public profile of the external services they need to operate (within the LAN and on the Internet), and use that protocol to ask home routers for permission, granted as ACLs, to reach them.

  7. DDoS needs to be stopped at the source
     A multi-pronged effort to stop DDoS at the source is needed; today, much of the mitigation effort occurs at the destinations.  Besides bandwidth rate limiting and ACLs, reverse path forwarding should be implemented to prevent spoofing of IP addresses, which helps more with TCP than UDP attacks.  For the cases where millions of IoT devices use their valid source addresses to target a single victim, we need more intelligence in the packets themselves to identify the type of application so that service rate limiting can be employed.  There’s no reason a web camera or a connected refrigerator needs to make so many DNS queries, as these devices usually have one or two names to query and those get cached for some period of time; any device not caching the results per the TTL or making too many queries should be suspect.
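The profile-and-permission idea from the IoT lesson above can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the `Rule` type, the `grant_acl` and `is_allowed` functions, the resolver address, and the camera's profile are all hypothetical, and no existing router API or standard protocol is implied.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

ROUTER_RESOLVER = "192.168.1.1"  # hypothetical address of the router's own DNS resolver


@dataclass(frozen=True)
class Rule:
    proto: str      # "udp" or "tcp"
    port: int       # destination port
    dest_cidr: str  # destination network in CIDR notation

    def matches(self, proto: str, port: int, dest_ip: str) -> bool:
        return (proto == self.proto and port == self.port
                and ip_address(dest_ip) in ip_network(self.dest_cidr))


def grant_acl(requested):
    """Router side: approve the device's requests, except DNS to anything but the router."""
    granted = []
    for rule in requested:
        if rule.port == 53 and rule.dest_cidr != f"{ROUTER_RESOLVER}/32":
            continue  # DNS must go through the router's resolver
        granted.append(rule)
    return granted


def is_allowed(acl, proto: str, port: int, dest_ip: str) -> bool:
    """Default deny: traffic passes only if some granted rule matches it."""
    return any(rule.matches(proto, port, dest_ip) for rule in acl)


# Hypothetical profile a web camera might publish.
camera_profile = [
    Rule("udp", 53, "192.168.1.1/32"),   # DNS via the router
    Rule("tcp", 443, "203.0.113.0/24"),  # vendor cloud (documentation address range)
    Rule("udp", 53, "0.0.0.0/0"),        # DNS to anywhere: should be refused
]
camera_acl = grant_acl(camera_profile)
```

With this ACL in place, a compromised camera trying to send udp/53 traffic to arbitrary addresses is dropped at the router, while its ordinary lookups through the router's resolver still work.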

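The closing point about DNS caching lends itself to a small sketch as well.  The Python below flags a device that asks for the same name again before the previous answer's TTL has expired, or that simply makes too many queries.  The class name, thresholds, device names, and hostnames are hypothetical, and a real monitor would count over a sliding time window instead of keeping a lifetime total.

```python
class DnsQueryMonitor:
    """Tracks per-device DNS behaviour as seen by an upstream resolver or router."""

    def __init__(self, max_early_requeries=3, max_total_queries=100):
        self.max_early = max_early_requeries  # hypothetical threshold
        self.max_total = max_total_queries    # hypothetical threshold
        self._cache_expiry = {}  # (device, name) -> time the cached answer expires
        self._early = {}         # device -> count of queries made before TTL expiry
        self._total = {}         # device -> count of all queries seen

    def observe(self, device, name, now, ttl):
        """Record one query; `now` and `ttl` are in seconds."""
        self._total[device] = self._total.get(device, 0) + 1
        key = (device, name)
        expiry = self._cache_expiry.get(key)
        if expiry is not None and now < expiry:
            # The device should still hold a cached answer for this name.
            self._early[device] = self._early.get(device, 0) + 1
        else:
            self._cache_expiry[key] = now + ttl

    def is_suspect(self, device) -> bool:
        return (self._early.get(device, 0) > self.max_early
                or self._total.get(device, 0) > self.max_total)


if __name__ == "__main__":
    monitor = DnsQueryMonitor()
    for second in range(10):  # a camera re-asking every second despite a 300 s TTL
        monitor.observe("camera", "target.example", now=second, ttl=300)
    print("camera suspect:", monitor.is_suspect("camera"))
```

A device that honors the TTL never increments the early counter, so a refrigerator checking one vendor hostname every few minutes stays well under both thresholds.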

Attacks of the scale and nature of the one seen on Oct 21 will continue to rise, especially as IoT devices continue to proliferate rapidly.  Our reliance on connected devices continues to encroach into our daily lives, and our dependence on technology is reaching a point of no return.  As such, the tech industry and government need to work together better than they have so far to ensure critical services go unaffected.  Perhaps we need an Internet of Internets so the Internet of web cams, for example, doesn’t take out the Internet of everything else.  I know this sounds far-fetched, but it could be implemented with overlay networks based on application types embedded in packets, set in firmware and not easily changed by hackers.


Article originally appeared on general network error (http://generalnetworkerror.com/).
See website for complete article licensing information.