Site Reliability Engineering Manager – Alexa

Amazon Corporate LLC
Position Category: Software Development

Job Description

The Site Reliability Engineering Manager for Alexa Services owns the end-to-end availability and reliability of our critical cloud services. You will proactively identify the factors that affect the availability of our critical endpoints and engineer solutions to ensure the reliability of those services. You will go deep into cloud technologies and extend the limits of existing infrastructure to solve unique scaling and efficiency problems. You will work with services that operate at microsecond-scale latency and handle millions of transactions per day. You should be comfortable and knowledgeable troubleshooting issues, and interested in developing tools to automate that knowledge.


  • Be responsible for the overall uptime and performance of critical Alexa cloud services.
  • Manage departmental resources and staffing, and mentor, grow, and maintain a best-in-class engineering team.
  • Manage and execute against project plans and delivery commitments.
  • Work with internationally distributed teams and manage 24x7 on-call resources.
  • Design, write, and deliver software to improve the reliability, scalability, capacity, and latency of Alexa services.
  • Identify recurring problems and build the tools and processes that prevent them from happening again.
  • Build the tools and processes to help quickly triage issues and identify the component(s) that need to be fixed.
  • Identify and build monitoring and alarming solutions.
  • Work with distributed teams to ensure that components are properly instrumented to be reliably used, monitored, and debugged in the service.
  • Conduct periodic on-call duties.

Basic Qualifications

  • Bachelor’s degree in Computer Science or related field
  • 7+ years of experience building production software systems
  • 5+ years of people management experience

Preferred Qualifications

  • Experience using AWS cloud services for compute, networking, storage, and load balancing.
  • Love of analytics and focus on metrics.
  • Understanding of web services, web application development, SQL, REST/JSON.
  • Knowledge and understanding of network theory and concepts such as TCP/IP, UDP, DNS, and load balancing.
  • Working knowledge of both scripting languages (Bash, Python, etc.) and high-level programming languages (such as Java and C/C++).
  • Strong Unix/Linux fundamentals.
  • Understanding of distributed systems and how to deal with very large datasets.
  • Ability to systematically troubleshoot issues.

Amazon is an Equal Opportunity-Affirmative Action Employer – Minority / Female / Disability / Veteran / Gender Identity / Sexual Orientation