An open letter to AWS

Dear AWS,

Let me be clear. I make a living from AWS. As an independent consultant, I work all day, every day in the AWS space. I sing the praises of the Well-Architected Framework, help to write certification exams, and I've missed only 2 re:Invent conferences in 10 years. You could say I'm a fan; that I want you to be wildly successful. I've drunk more AWS Kool-Aid than you can possibly imagine. I get excited when I talk about all the capabilities and possibilities in your ecosystem. My clients look to me for AWS subject matter expertise and guidance.

So enough is enough. Even before 3 (but, really, 4) outages in less than a month, I had been struggling with what I see happening: an increasingly complex array of service offerings and a 'shiny new object' syndrome that is blinding you to what actually helps your customers.

You are already going to lose some market share over the outages these last few weeks.  More importantly, you are losing trust.  As I tell my teenagers:  Trust is easy to lose - and much, much harder to get back.

So it's time to stop chugging along like there is nothing wrong.  There is plenty that needs a fixin’, and a lot of trust to work on.

It's easy to pick out flaws, and armchair quarterback the last few weeks.  It's much harder to suggest corrective courses of action.  So here is what you, AWS, need to do:

Stop your feature development.  I'm serious. Stop it.  Innovation for innovation's sake is not progress.  Stop having teams step on each other to come up with variations of the same product, stop adding service offerings, and OMG, get some product managers to create something resembling a consistent and decent dashboard experience. Your customers are trying to run their businesses on your infrastructure.  They need stability, reliability and consistency.  Take your incredible pool of talent and focus on the massive backlog of refinements and needed improvements that are building up. 

Heal the tech wounds:  Forget calling it tech debt.  Wounds fester, and if ignored, only get worse.  Perhaps there's been a loss of institutional knowledge over the years as massive, unprecedented growth has collided with insanely increased complexity.   Step back, untangle your environment, figure out what you don't know and focus on mending the broken parts.

Be transparent: Here's one of many elephants in the room for you: All the outages started within days of the discovery of one of the most impactful zero-days in the history of the internet. AWS needs to communicate very clearly whether there is any connection between that and the outages. Why? Because so many AWS customers are talking about it and seeing a correlation. Suspicion is not a good foundation for customer loyalty. Address these issues head on. While you are at it, make your RCAs easier to find as soon as they are available, and allow customers to ask hard questions and to challenge your designs.

Eat your own dog food:  When architecting regional infrastructure, there are foundational principles long touted by AWS to insulate against failures, and AWS is violating those principles.  This is not only inexcusable.  It is dangerous and makes it extremely difficult if not impossible to design defensible resilient architectures if reality does not match the dogma.  

The recent outages make it clear that AWS has elements of infrastructure BOTH WITHIN REGIONS AND ACROSS REGIONS that represent single points of failure.  I saw dozens of examples of this over the course of these several outages:

December 7, 2021:  Entire region inaccessible due to network issues. Boy that was a long day. RCA analysis here.
According to the analysis, the root cause was:
"...an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network." 

This is the most concerning part:
"This code path has been in production for many years but the automated scaling activity triggered a previously unobserved behavior."

This sentence opens the door to a lot of questions, starting with:  How many more code paths have been untested in the face of the increased complexity of the environment?

December 15, 2021:  Outbound internet issue in us-west-1 (N. California) and us-west-2 (Oregon) simultaneously.  NO RCA provided to date.

While this was a short episode, it was disruptive.  Internet disruption across multiple regions that are supposed to be completely isolated from one another is anathema to the design principles AWS espouses.  It should not be possible.

December 22, 2021:  AWS reported an issue related to a power failure affecting EBS in one Availability Zone.  However, systems were failing across multiple Availability Zones, and API calls to the entire region were intermittently failing.  This completely violates the concept of isolation between Availability Zones.  Boy, that was another long day.  NO RCA provided to date.

If Regions aren't demonstrating both physical and logical isolation, and neither are Availability Zones, things are broken and you need to fix them.

Please don't tell us that the answer to any of the recent problems is to focus on a multi-region strategy.  Just…don't do that.  Instead, you need to dig up the roots of your Well-Architected pillars, practice some chaos engineering, find the gremlins and align practice with principle.

If your customers cannot trust the basic principles of boundary isolation in AWS, they will be forced to shift to another vendor or go multi-cloud, which is so much more costly and complicated than what should be needed for so many that just want a reliable platform for their workloads.

Many mantras in technology lose their impact with too much repetition.  Your idea of customer obsession is excellent, but in practice, perhaps you need to revisit the conversations you are having with your customers. I guarantee that if, right now, you gave customers a choice between building new features and ensuring a reliable environment on which they can build products and services, your decision points would look very different moving forward.

You have been the pioneer.  You have brought the cloud to the masses and transformed the technology landscape.  Your offerings and ideas have inspired generations of technologists to build better, think differently and find passion in their day to day roles.  You have certainly inspired me over the years.

Now it's time to recognize your shifting role and adapt accordingly.  No longer the pioneer or settler, you are now the town planner, and we need to trust you to improve the structures on which we can all depend. Do not miss this opportunity to reflect and take a step back so you can ultimately continue to move forward and regain our trust.  The fans among us would greatly appreciate it.

Darren Weiner