AWS Tagging is a set of highly available and scalable services that help customers organize, manage and control resources across the entire AWS enterprise. We operate some of the largest distributed systems in the world, and we need smart people to help us make and keep them excellent.
To assist with that mission, Amazon Web Services is seeking an experienced DevOps Engineer to work as part of our engineering team in Seattle, Washington. Are you excited about complex automation, self-healing fleets, tight monitoring and health checks? Then, we should talk!
We are looking for a seasoned DevOps Engineer to join our energetic, fast-moving and passionate team. The ideal candidate will have experience with, and a talent for, solving complex problems of scalability and availability in massively distributed systems, working as an integral part of the engineering team as a DevOps Engineer responsible for automation at scale.
This is a unique opportunity to join a fast-paced team, help drive design decisions for every feature and help us in deploying, operating, monitoring and further scaling a massive always-on distributed system that is core to all of AWS.
In this role, you will:
- Have significant impact on the design, development, testing, deployment and operation of these services end-to-end.
- Draw from your deep and broad technical expertise to hire and mentor engineers, complete hands-on technical work and provide leadership on complex operational and design issues.
- Be responsible for delivering automation for some of our most strategic technical projects in AWS and work on systems at the cutting edge of distributed storage and database technologies.
- Have a significant bottom-line impact on our business and competitive position by accelerating expansion to new AWS regions.
- Lead the running and maintenance of a 24x7 Internet-oriented production environment, preferably one spanning multiple data centers and thousands of machines.
- Drive the specification, design, and implementation of system health and performance monitoring tools and software management tools for 24x7 environments.
- Be responsible for solving challenges surrounding efficient operations and failure mode analysis in large, complex distributed systems.
- Develop new and improve existing application and system management tools and processes that reduce manual effort and increase overall efficiency.
- Participate in the design and execution of production acceptance tests and new hardware evaluations.
- Monitor the health of the fleet, automating system health checks, maintenance tasks, and reporting as needed.