… a basic tenet of redundancy was ignored by engineers expecting this large-scale service to be too big to fail.
In retrospect, the attacks made seven lessons apparent:
- Embrace the distributed design of the Internet
- Consolidation of services is a hacker’s target
- Major sites like Twitter shouldn’t use a single DNS provider
- For all the redundancy that a SaaS or IaaS provider’s availability zones or regions offer, the service is still vulnerable to actors who put their victims in the sights of their botnets
- Ticketing systems in the cloud can become victims, preventing internal communication on issues
- IoT devices need firmware bandwidth rate limiting and ACLs
- DDoS needs to be stopped at the source
The original design of the Internet, long before WWW was anything more than a typo caused by someone accidentally pressing two too many W’s on their keyboard, grew out of the desire to have multiple paths to reach services on the Internet. Web sites should not do anything contrary to this design, and expecting large-scale services not to fail simply because of their size does exactly that.
The old adage that the bigger they are, the harder they fall played out perfectly on Oct 21, much to the delight of the hackers orchestrating the attack. One large DNS provider affected so many sites that many bloggers and journalists falsely claimed half the Internet was down and the sky was falling. To achieve maximum uptime, engineers know to eliminate Single Points of Failure, but these same engineers often don’t think it’s possible that an entire cloud service could go offline.
Sites like Twitter need to use multiple providers. I know it’s simple and elegant to only have to deal with one large service provider where deeper discounts and leverage may exist, but that very reliance on Dyn alone wiped Twitter off the map (depending on where you were on that map).
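As a rough illustration of the point, here is a minimal Python sketch that checks whether a zone’s name servers all belong to one provider. The name server host names are made up (they use the reserved `.example` domain), and treating the last two labels of a host name as “the provider” is my own simplifying assumption, not a real registry lookup.

```python
from collections import defaultdict


def provider_of(nameserver: str) -> str:
    """Approximate the provider as the last two labels of the host name.

    Assumption for illustration only: this is wrong for suffixes such as
    co.uk, where a public-suffix list would be needed.
    """
    labels = nameserver.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def group_by_provider(nameservers: list[str]) -> dict[str, list[str]]:
    """Group name server host names by their approximate provider."""
    groups: dict[str, list[str]] = defaultdict(list)
    for ns in nameservers:
        groups[provider_of(ns)].append(ns)
    return dict(groups)


def has_single_dns_provider(nameservers: list[str]) -> bool:
    """True when every name server sits with the same provider (or none exist)."""
    return len(group_by_provider(nameservers)) < 2


# Hypothetical delegations for two sites.
single = ["ns1.dns-a.example.", "ns2.dns-a.example.", "ns3.dns-a.example."]
mixed = ["ns1.dns-a.example.", "ns2.dns-a.example.", "ns1.dns-b.example."]

print(has_single_dns_provider(single))  # one provider: a single point of failure
print(has_single_dns_provider(mixed))   # two providers
```

Three name servers at one provider still count as one point of failure here, which is exactly the trap the affected sites fell into.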
Netflix, Amazon’s largest customer, saw impact on Oct 21 from Amazon’s reliance on third-party DNS providers like Dyn. Netflix should learn from this experience that Amazon could not provide the level of service required and should seek out a higher level of availability both for DNS itself and for cloud computing. Netflix is large enough that other cloud providers augmenting their services on Amazon may serve them well, but it’s possible they’ve already completed this exercise and concluded that it’s more economical for them to take an occasional hit from attacks than to maintain the additional overhead of multiple DNS, compute, and storage providers.
Some of the impacted sites were unable to communicate internally because their ticketing systems or chat services also relied on the same name servers. During triage of events, it’s important to understand that a major event may knock offline not only your customers’ service, but also your ability to manage that event with your internal staff.
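One way to catch this before an incident is to write down which DNS providers each service depends on and ask what goes dark when one of them fails. The sketch below does that in Python; the service names, provider names, and the dependency map itself are hypothetical, and a real inventory would have to come from your own records.

```python
def unreachable_if_down(dependencies: dict[str, set[str]], failed: str) -> set[str]:
    """Return the services that rely on the failed provider and nothing else."""
    return {
        service
        for service, providers in dependencies.items()
        if providers and providers <= {failed}
    }


def incident_tools_lost(
    dependencies: dict[str, set[str]], incident_tools: set[str], failed: str
) -> set[str]:
    """Return the incident-response tools that go down with the failed provider."""
    return unreachable_if_down(dependencies, failed) & incident_tools


# Hypothetical inventory: service name -> DNS providers serving its zone.
dependencies = {
    "www": {"dns-a"},
    "ticketing": {"dns-a"},
    "chat": {"dns-a", "dns-b"},
    "status-page": {"dns-b"},
}
incident_tools = {"ticketing", "chat", "status-page"}

print(sorted(unreachable_if_down(dependencies, "dns-a")))
print(sorted(incident_tools_lost(dependencies, incident_tools, "dns-a")))
```

In this made-up inventory, losing `dns-a` takes down both the public site and the ticketing system, while chat survives because it is also served by `dns-b`.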
As recent attacks make increasingly apparent, IoT (Internet of Things) devices are prevalent, cheap, and easily hackable, which means there are a lot of them available to use in a botnet. The tech industry needs standards around how much bandwidth consumer devices should be allowed to consume, enforced either by the firmware on those devices or by upstream devices, and preferably by both. ACLs (access control lists) should also be easy to provision on home routers so that a web camera cannot send massive numbers of DNS queries out via udp/53 to a large list of addresses. The camera’s DNS queries should be restricted to the resolver on the router. I’d like to see a standard protocol developed where IoT devices maintain a public profile of what external services they need to operate (within the LAN and on the Internet), and where devices use this protocol to ask home routers for permission, granted via ACLs, to reach those services.
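To make the idea concrete, here is a small Python sketch of what such a device profile and a default-deny check on the router might look like. Everything in it is an assumption for illustration: the `Rule` and `Packet` types, the router address, and the vendor address range (taken from a documentation-only block) are invented and do not describe any existing protocol or product.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    protocol: str     # "udp" or "tcp"
    destination: str  # CIDR block the device may reach
    port: int


@dataclass(frozen=True)
class Packet:
    protocol: str
    destination: str  # destination IP address
    port: int


def is_allowed(packet: Packet, rules: list[Rule]) -> bool:
    """Default deny: a packet passes only if some rule in the profile matches."""
    dst = ip_address(packet.destination)
    return any(
        packet.protocol == rule.protocol
        and packet.port == rule.port
        and dst in ip_network(rule.destination)
        for rule in rules
    )


# Hypothetical profile a web camera might publish to the home router.
camera_profile = [
    Rule("udp", "192.168.1.1/32", 53),   # DNS only via the router's resolver
    Rule("tcp", "203.0.113.0/24", 443),  # made-up vendor cloud range
]

print(is_allowed(Packet("udp", "192.168.1.1", 53), camera_profile))  # router resolver
print(is_allowed(Packet("udp", "8.8.8.8", 53), camera_profile))      # outside resolver
```

With a profile like this, a hijacked camera firing DNS queries at arbitrary addresses would be dropped at the router, because nothing in its profile permits udp/53 to anything but the local resolver.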
A multi-pronged effort to stop DDoS at the source is needed; today, much of the mitigation effort occurs at the destinations. Besides bandwidth rate limiting and ACLs, reverse path forwarding checks should be implemented to prevent spoofing of source IP addresses, which helps against attacks that rely on forged addresses, such as TCP SYN floods and UDP reflection. For the cases where millions of IoT devices use their valid source addresses to target a single victim, we need more intelligence in the packets themselves to identify the type of application so that per-service rate limiting can be employed. There’s no reason for a web camera or a connected refrigerator to make so many DNS queries: these devices usually have one or two names to look up, and the answers get cached for some period of time. Any device that does not cache results per the TTL, or that makes too many queries, should be suspect.
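The caching test in particular is easy to sketch. The Python below is a minimal example of how a home router’s resolver might flag a device that keeps asking for a name while the previous answer is still within its TTL. The `DnsQueryMonitor` class, the threshold of three violations, and the device and host names are all assumptions of mine, chosen only to show the idea.

```python
from collections import defaultdict


class DnsQueryMonitor:
    """Counts queries a device repeats while the earlier answer is still cacheable."""

    def __init__(self, max_violations: int = 3) -> None:
        self.max_violations = max_violations
        self._expires: dict[tuple[str, str], float] = {}
        self._violations: dict[str, int] = defaultdict(int)

    def record(self, device: str, name: str, ttl: float, now: float) -> None:
        """Record one query; ttl is the TTL of the answer, now is in seconds."""
        key = (device, name)
        if now < self._expires.get(key, 0.0):
            # Asked again before the TTL ran out: the device is not caching.
            self._violations[device] += 1
        else:
            self._expires[key] = now + ttl

    def is_suspect(self, device: str) -> bool:
        return self._violations.get(device, 0) >= self.max_violations


monitor = DnsQueryMonitor(max_violations=3)

# A well-behaved fridge re-queries only after the 300-second TTL has expired.
monitor.record("fridge", "updates.vendor.example", ttl=300, now=0)
monitor.record("fridge", "updates.vendor.example", ttl=300, now=400)

# A camera hammers the same name every second.
for second in range(4):
    monitor.record("camera", "cloud.vendor.example", ttl=300, now=second)

print(monitor.is_suspect("fridge"), monitor.is_suspect("camera"))
```

A check like this only catches repeated lookups of the same name; the “too many queries” half of the rule would need a separate per-device rate limit on top of it.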
Attacks of the scale and nature of the one seen on Oct 21 will continue to rise, especially as the proliferation of IoT devices accelerates. Connected devices continue to encroach into our daily lives, and our dependence on technology is reaching a point of no return. As such, the tech industry and government need to come together better than we’ve done so far to ensure critical services go unaffected. Perhaps we need an Internet of Internets so the Internet of web cams, for example, doesn’t take out the Internet of everything else. I know this sounds far-fetched, but it could be implemented with overlay networks based on application types embedded in packets, set in firmware and not easily changed by hackers.