At Fortinet, we strive to provide a supportive, collaborative environment where people are empowered to do the best work of their careers. Our team members enjoy solving complex problems and obsess over getting the details right. We love what we do and are proud of our work to secure clouds and container environments for thousands of B2B customers worldwide.

We are looking for a highly skilled Site Reliability Engineering (SRE) Manager to lead our SRE team in building scalable, reliable, and secure infrastructure that ensures the highest levels of availability and performance.

Job Summary:
As an SRE Manager, you will be responsible for leading a team of Site Reliability Engineers who design, build, and maintain resilient systems. You will play a critical role in enhancing system reliability, improving incident response, automating operations, and driving best practices in infrastructure management. The ideal candidate will have a strong background in software engineering, cloud infrastructure, and operational excellence.

Key Responsibilities:
- Lead, mentor, and grow a team of Site Reliability Engineers.
- Develop and implement strategies to improve system reliability, observability, and automation.
- Establish and maintain SLIs, SLOs, and SLAs to ensure high availability and performance.
- Drive incident response, root cause analysis, and postmortem processes.
- Collaborate with software engineering teams to improve application architecture and resiliency.
- Manage cloud-based infrastructure (AWS) and ensure best practices for security and scalability.
- Collaborate with cross-functional teams, including developers, security, and product teams.
- Stay updated with industry trends and introduce new tools and methodologies to enhance reliability and efficiency.
Required Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 7+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles.
- 3+ years of experience in a leadership or managerial role within an SRE or DevOps team.
- Extensive experience with Infrastructure as Code (Terraform, etc.), as well as supporting tooling (Atlantis, ArgoCD, etc.).
- Extensive experience with Kubernetes and supporting tooling (Helm, operators, etc.).
- Extensive experience with a variety of cloud-managed services and providers.
  - AWS: EKS, EC2, S3, RDS, Secrets Manager, etc.
- Experience building production-quality cloud infrastructure that enables reliable and rapid deployment of microservices with effective monitoring and built-in high availability and/or fault tolerance.
- Strong cross-team communication skills.
- Experience with the building blocks of large-scale systems, including load balancing, distributed/cloud computing, containers, instrumentation, and monitoring.
- Knowledge of cloud networking, including VPC configuration and cross-cloud connectivity.
- Familiarity with one or more programming languages (Python, Golang, etc.).
- Deep understanding of observability tools (Prometheus, Grafana, Splunk, ELK Stack).
- Excellent communication and collaboration abilities.