The Consumer Technical Risk Reduction (TRR) team seeks to continuously reduce the risks associated with the development operations required to build and run the technical systems that are critical to delivering the best online retail customer experience in the world, without interruption. We implement mechanisms to detect and rapidly recover from shopping experience anomalies, and we drive teams to adopt practices and mechanisms for preventing incidents from occurring in the first place. The work we do influences the way that Consumer technical teams deliver results to customers. If you’re interested in playing a high-visibility, high-impact role in helping Amazon become recognized as the world leader in large-scale development operational excellence, then TRR is the place for you.
We envision a future state in which Amazon teams continuously deliver new features that delight customers with no risk of interruption to a world-class customer experience. We envision that all build, deployment, test execution, problem detection and recovery, and fleet scaling and de-scaling activities are fully automated, thus eliminating human error, enabling business agility, and assuring business continuity. We envision that peak retail events do not require special treatment, and that the overall developer experience is vastly improved by focusing on new value delivery instead of reacting to problems.
Chaos Engineering is the discipline of intentionally injecting failures into distributed systems in order to improve the system's ability to withstand these failures in production. Many of the extended outages in distributed systems occur when each individual service is behaving as intended, but unexpected interactions between services lead to a chaotic outcome.
The Chaos Engineering team is building a world-class failure-injection tool that service owners use to run chaos experiments on their systems. Our tool creates real-world failures such as network disruption, resource exhaustion, or dependency failure in a controlled environment. This allows service owners to determine whether their mitigation strategies for these failures are effective and to continuously improve their systems until they are.
We are a software development team, not a testing team. Since our tool has the ability to disrupt production services, it is critical that it performs predictably, halts experiments automatically when things go wrong, and scales to support the thousands of hosts in the Amazon infrastructure.
If you work on Site Reliability Engineering in your current role or are interested in creating chaos, we would love to have you on our team. We want to leverage your expertise to make Amazon the most reliable e-commerce experience in the world.