The Site Reliability Engineering Manager for Alexa Services owns the end-to-end availability and reliability of our critical cloud services. You will proactively identify the factors that affect the availability of our critical endpoints and engineer solutions to keep those services reliable. You will go deep into cloud technologies and extend the limits of existing infrastructure to solve unique scaling and efficiency problems. You will work with services that operate at microsecond-scale latencies and handle millions of transactions per day. You should be comfortable and knowledgeable in troubleshooting issues, and interested in developing tools that automate that knowledge.
- Be responsible for the overall uptime and performance of critical Alexa cloud services.
- Manage departmental resources and staffing, mentor engineers, and build and maintain a best-in-class engineering team.
- Manage and execute against project plans and delivery commitments.
- Work with internationally distributed teams and manage 24x7 on-call resources.
- Design, write, and deliver software to improve the reliability, scalability, capacity, and latency of Alexa services.
- Identify recurring problems and build the tools and processes to prevent them from happening again.
- Build the tools and processes to help quickly triage issues and identify the component(s) that need to be fixed.
- Identify and build monitoring and alarming solutions.
- Work with distributed teams to ensure that components are properly instrumented to be reliably used, monitored, and debugged in the service.
- Conduct periodic on-call duties.