Software Development Engineer, Chaos Engineering

Amazon Corporate LLC
Position Category: Software Development

Job Description

The Consumer Technical Risk Reduction (TRR) team seeks to continuously reduce the risks associated with the development operations required to build and run the technical systems that are critical to delivering the best online retail customer experience in the world, without interruption. We implement mechanisms to detect and rapidly recover from shopping experience anomalies, and we drive teams to adopt practices and mechanisms that prevent incidents from occurring in the first place. The work we do influences the way that Consumer technical teams deliver results to customers. If you’re interested in playing a high-visibility, high-impact role in helping Amazon become recognized as the world leader in large-scale development operational excellence, then TRR is the place for you.

We envision a future state in which Amazon teams continuously deliver new features that delight customers with no risk of interruption to a world-class customer experience. We envision that build, deployment, test execution, problem detection and recovery, and fleet scaling and de-scaling activities are all fully automated, thus eliminating human error, enabling business agility, and assuring business continuity. We envision that peak retail events do not require special treatment, and that the overall developer experience is vastly improved by focusing on new value delivery instead of reacting to problems.

Chaos Engineering is the discipline of intentionally injecting failures into distributed systems in order to improve the system's ability to withstand these failures in production. Many of the extended outages in distributed systems occur when each individual service is behaving as intended, but unexpected interactions between services lead to a chaotic outcome.

The Chaos Engineering team is building a world-class failure-injection tool which service owners use to run chaos experiments on their systems. Our tool creates real-world failures such as network disruption, resource exhaustion, or dependency failure in a controlled environment. This allows service owners to determine whether their mitigation strategies for these failures are effective and to continuously improve their systems until they are.

We are a software development team, not a testing team. Since our tool has the ability to disrupt production services, it is critical that it performs predictably, halts experiments automatically when things go wrong, and scales to support the thousands of hosts in the Amazon infrastructure.

If you work on Site Reliability Engineering in your current role and/or are interested in creating chaos, we would love to have you on our team. We want to leverage your expertise to make Amazon the most reliable e-commerce experience in the world.


Basic Qualifications

• Bachelor’s Degree in Computer Science or related field
• Equivalent experience to a Bachelor's degree based on 3 years of work experience for every 1 year of education
• 4+ years professional experience in software development
• Computer Science fundamentals in object-oriented design
• Computer Science fundamentals in data structures
• Computer Science fundamentals in algorithm design, problem solving, and complexity analysis
• Proficiency in at least one modern programming language such as C, C++, C#, Java, or Perl

Preferred Qualifications

• Experience taking a leading role in building complex software systems that have been successfully delivered to customers
• Experience with distributed computing and enterprise-wide systems
• Experience with agile software development practices
• Experience mentoring junior software engineers to improve their skills and make them more effective, productive software engineers
• Experience with software engineering best practices including coding standards, code reviews, source control management, build processes, testing, and operations
• Experience with building systems with multiple layers of redundancy to withstand failures in software, hardware, and network infrastructure

Amazon is an Equal Opportunity-Affirmative Action Employer – Minority / Female / Disability / Veteran / Gender Identity / Sexual Orientation