AWS Broke the Internet Again or, Better, a Typo

An AI-defined infrastructure can help to avoid service disruptions

Amazon Web Services (AWS) broke the Internet again, or rather, a typo did. On February 28, 2017, an Amazon S3 service disruption in US-EAST-1, AWS' oldest region, shut down several major websites and services, including Slack, Trello, Quora, Business Insider, Coursera and Time Inc. Other users reported that they could not control devices connected via the Internet of Things because IFTTT was also down. Disruptions of this kind are becoming more and more business critical for today's digital economy. To prevent these situations, cloud users should always consider the shared responsibility model in the public cloud. However, there are also ways in which Artificial Intelligence (AI) can help. This article describes how an AI-defined Infrastructure, or an AI-powered IT management system, can help to avoid service disruptions of public cloud providers.

Amazon S3 Service Disruption - What Happened
After every service disruption, AWS publishes a summary of what went on during the incident. This is what happened on the morning of February 28.

"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable."

Read more under "Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region".
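
To illustrate how a single mistyped input can remove far more capacity than intended, and how a simple check in front of the command could catch it, here is a minimal sketch. It is not AWS's tooling: the `RemovalGuard` class, its `max_fraction` and `min_remaining` limits, and the fleet sizes in the usage note are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request exceeds the configured safety limits."""


@dataclass(frozen=True)
class RemovalGuard:
    # Hypothetical limits, not taken from any real AWS playbook.
    max_fraction: float = 0.05  # remove at most 5% of a fleet per command
    min_remaining: int = 3      # never shrink a fleet below this size

    def check(self, fleet_size: int, requested: int) -> int:
        """Return the validated removal count or raise UnsafeRemovalError."""
        if requested <= 0:
            raise UnsafeRemovalError("requested count must be positive")
        if requested > fleet_size * self.max_fraction:
            raise UnsafeRemovalError(
                f"refusing to remove {requested} of {fleet_size} servers: "
                f"limit is {self.max_fraction:.0%} per command"
            )
        if fleet_size - requested < self.min_remaining:
            raise UnsafeRemovalError(
                f"removal would leave fewer than {self.min_remaining} servers"
            )
        return requested
```

With the default limits, `RemovalGuard().check(fleet_size=200, requested=5)` passes, while a request for 50 servers (for example, an extra digit typed by mistake) is rejected before anything is removed.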

Bottom line: a typo crashed the AWS-powered Internet! AWS outages already have a long history, and the more AWS customers run their web infrastructure on the cloud giant, the more issues end customers will experience in the future. According to SimilarTech, Amazon S3 alone is already used by 152,123 websites and 124,577 unique domains.

However, following the philosophy of "Everything fails all the time" (Werner Vogels, CTO Amazon.com) means that if you are using AWS, you must "Design for Failure". This is something that Netflix, the cloud role model and video-on-demand provider, does to perfection. To this end, Netflix has developed its Simian Army, an open source toolset everyone can use to run a highly available cloud infrastructure on AWS.

Netflix "simply" uses the two levels of redundancy AWS offers: multiple regions and multiple availability zones (AZs). Using multiple regions is the master class of using AWS. It is very complex and sophisticated, since you must build and manage entirely separate infrastructure environments within AWS' worldwide distributed cloud infrastructure. Multiple AZs are the preferred and "easiest" way to achieve high availability (HA) on AWS. In this case, the infrastructure is built within more than one data center (AZ). A single-region HA architecture is thus deployed in at least two AZs, with a load balancer in front of it controlling the data traffic.
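
The role of the load balancer in a multi-AZ setup can be sketched in a few lines. This is a simplified illustration, not the behavior of any AWS load balancing service: the `ZoneBalancer` class is hypothetical, health is set by hand rather than by real health checks, and the zone names are only examples.

```python
from typing import Dict, List


class NoHealthyZone(Exception):
    """Raised when every availability zone is marked unhealthy."""


class ZoneBalancer:
    """Round-robin over availability zones, skipping unhealthy ones."""

    def __init__(self, zones: List[str]) -> None:
        if not zones:
            raise ValueError("at least one zone is required")
        self._zones = list(zones)
        self._healthy: Dict[str, bool] = {zone: True for zone in zones}
        self._next = 0  # index of the zone to try first on the next pick

    def mark(self, zone: str, healthy: bool) -> None:
        """Record the result of a (hypothetical) health check for a zone."""
        if zone not in self._healthy:
            raise KeyError(zone)
        self._healthy[zone] = healthy

    def pick(self) -> str:
        """Return the next healthy zone, trying each zone at most once."""
        for _ in range(len(self._zones)):
            zone = self._zones[self._next]
            self._next = (self._next + 1) % len(self._zones)
            if self._healthy[zone]:
                return zone
        raise NoHealthyZone("no healthy availability zone left")
```

With two zones, traffic alternates between them. Once one zone is marked unhealthy, every request goes to the remaining zone, and only when both are down does the balancer give up. A multi-region setup follows the same idea one level up, at the cost of operating a separate environment per region.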

However, even if "typos" shouldn't happen, the recent incident shows that human error is still the biggest issue in running IT systems. In addition, you can blame AWS only to a certain extent, since the public cloud is about shared responsibility.

Shared Responsibility in the Public Cloud
An important detail of the public cloud is self-service. Depending on their DNA, providers take responsibility only for specific areas. The customer is responsible for the rest. In the public cloud, it is about sharing responsibilities, a model called Shared Responsibility. The provider and its customers divide the duties among themselves, and the customer's own responsibility plays a major role. In the context of IaaS utilization, the provider is responsible for the operations and security of the physical environment. It takes care of:

  • Set up and maintenance of the entire data center infrastructure.
  • Deployment of compute power, storage, network and managed services (like databases) and other micro services.
  • Provisioning the virtualization layer customers are using to demand virtual resources at any time.
  • Deployment of services and tools customers can use to manage their areas of responsibility.

The customer is responsible for the operations and security of the logical environment. This includes:

  • Set up of the virtual infrastructure.
  • Installation of operating systems.
  • Configuration of networks and firewall settings.
  • Operations of own applications and self-developed (micro) services.

Thus, the customer is responsible for the operations and security of their own infrastructure environment and the systems, applications, services, and stored data on top of it. However, providers like Amazon Web Services or Microsoft Azure provide comprehensive tools and services customers can use, e.g. to encrypt their data and to ensure identity and access controls. In addition, enablement services (micro services) exist that customers can adopt to develop their own applications more quickly and easily.

In this area of responsibility, the customer is on their own and must take responsibility themselves. However, this part of the shared responsibility can be handled by an AI-defined IT management system, or an AI-defined Infrastructure.

An AI-defined Infrastructure can help to avoid Service Disruptions
An AI-defined Infrastructure can help to avoid service disruptions in the public cloud. However, the basis of this kind of infrastructure is a General AI that combines three major human abilities that enable enterprises to tackle IT and business process challenges.

  • Understanding: By creating a semantic data map the General AI understands the world of the company in which its IT and business exists.
  • Learning: By creating Knowledge Items the General AI learns best practices and reasoning from experts. Knowledge is taught in atomic pieces of information (Knowledge Items) that represent separate steps of a process.
  • Solving: With machine reasoning problems are solved in ambiguous and changing environments. The General AI dynamically reacts to the ever-changing context, selecting the best course of action. Based on machine learning the results are optimized through experiments.

To put this into the context of an AWS service disruption:

  • Understanding: The General AI creates a semantic map of the AWS environment as part of the world in which the company exists.
  • Learning: IT experts create Knowledge Items while they are configuring and working with AWS, from which the General AI learns best practices. Thus, the experts teach the General AI contextual knowledge that includes what, when, where and why something needs to be done - for example when a specific AWS service is not responding.
  • Solving: The General AI dynamically reacts to incidents based on the learned knowledge. Thus, the AI (probably) knows what to do at this very moment - even if no high availability setup was considered from the beginning.

Frankly speaking, everything described above is no magic. Like every newborn organism, an AI-defined Infrastructure needs to be trained, but afterwards it can work autonomously, detecting anomalies and service disruptions in the public cloud and solving them. For this, you need the knowledge of experts who have a deep understanding of AWS and how the cloud works in general. These experts need to teach the General AI their contextual knowledge, which includes not only what, when and where but also why. They have to teach the AI in atomic pieces (Knowledge Items, KIs) that can be indexed and prioritized by the AI. Context and indexing enable these KIs to be combined to form many solutions.
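
How atomic, prioritized Knowledge Items might be combined into a solution can be sketched as a tiny rule engine. This is only an illustration of the idea, not Arago's implementation: the `KnowledgeItem` fields, the `solve` loop, the priorities, and the three example items (including the `"replica"` read target) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]


@dataclass(frozen=True)
class KnowledgeItem:
    """One atomic step: when it applies (context) and what it does."""
    name: str
    priority: int
    applies: Callable[[Context], bool]
    act: Callable[[Context], Context]


def solve(context: Context, items: List[KnowledgeItem],
          max_steps: int = 10) -> Tuple[Context, List[str]]:
    """Repeatedly apply the highest-priority applicable item."""
    state = dict(context)
    applied: List[str] = []
    for _ in range(max_steps):
        candidates = [ki for ki in items if ki.applies(state)]
        if not candidates:
            break  # nothing left to do in this context
        best = max(candidates, key=lambda ki: ki.priority)
        state = best.act(state)
        applied.append(best.name)
    return state, applied


# Hypothetical items an expert might teach for an unresponsive S3.
KNOWLEDGE = [
    KnowledgeItem(
        "open_incident", 10,
        lambda s: s.get("s3_status") == "unresponsive" and "incident" not in s,
        lambda s: {**s, "incident": True},
    ),
    KnowledgeItem(
        "reroute_reads", 5,
        lambda s: bool(s.get("incident")) and s.get("reads_from") == "us-east-1",
        lambda s: {**s, "reads_from": "replica"},
    ),
    KnowledgeItem(
        "notify_on_call", 1,
        lambda s: bool(s.get("incident")) and not s.get("notified"),
        lambda s: {**s, "notified": True},
    ),
]
```

Calling `solve({"s3_status": "unresponsive", "reads_from": "us-east-1"}, KNOWLEDGE)` chains the three items in priority order, even though no expert ever described that sequence as a whole. Each item only knows its own context, which is what allows the same items to be recombined for other incidents.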

KIs created by various IT experts create pooled expertise that is further optimized by machine selection of best knowledge combinations for problem resolution. This type of collaborative learning improves process time task by task. However, the number of possible permutations grows exponentially with added knowledge. Connected to a knowledge core, the General AI continuously optimizes performance by eliminating unnecessary steps and even changing routes based on other contextual learning. And the bigger the semantic graph and knowledge core gets, the better and more dynamically the infrastructure can act in terms of service disruptions.

On a final note, do not underestimate the "power of we"! Our research at Arago revealed that, with an overlap of 33 percent in basic knowledge, this knowledge can be, and is, used outside a specific organizational environment, i.e. across different client environments. The reuse of knowledge within a client is up to 80 percent. Thus, exchanging basic knowledge within a community becomes imperative from an efficiency perspective and improves the abilities of the General AI.

More Stories By Rene Buest

Rene Buest is Director of Market Research & Technology Evangelism at Arago. Prior to that he was Senior Analyst and Cloud Practice Lead at Crisp Research, Principal Analyst at New Age Disruption and a member of the worldwide Gigaom Research Analyst Network. At that time he was considered a top cloud computing analyst in Germany and one of the top analysts worldwide in this area. In addition, he was one of the world’s top cloud computing influencers and belongs to the top 100 cloud computing experts on Twitter and Google+. Since the mid-90s he has focused on the strategic use of information technology in businesses and the impact of IT on our society, as well as on disruptive technologies.

Rene Buest is the author of numerous professional technology articles. He regularly writes for well-known IT publications like Computerwoche, CIO Magazin, LANline and Silicon.de, and is cited in German and international media, including New York Times, Forbes Magazin, Handelsblatt, Frankfurter Allgemeine Zeitung, Wirtschaftswoche, Computerwoche, CIO, Manager Magazin and Harvard Business Manager. Furthermore, he is a speaker and a participant in expert panels. He is the founder of CloudUser.de and writes about cloud computing, IT infrastructure, technologies, management and strategies. He holds a diploma in computer engineering from the Hochschule Bremen (Dipl.-Informatiker (FH)) as well as an M.Sc. in IT-Management and Information Systems from the FHDW Paderborn.
