SRE Team Lead
Location: Chennai
Experience: 8–12 years
Domain: Fintech | Cloud-native | Microservices
Role Summary
- We are looking for a hands-on SRE Team Lead to own the reliability, scalability, and operational excellence of a cloud-native fintech platform built on microservices
- This role combines technical leadership, architecture ownership, and deep hands-on execution
- You will lead a small SRE team while remaining actively involved in design, coding, incident response, and reliability engineering
Reliability & Architecture
- Own platform availability, latency, scalability, and resilience across environments
- Define and enforce SLOs, SLIs, error budgets, and operational KPIs
- Design and review resilience patterns: circuit breakers, retries, rate limiting, graceful degradation
- Drive chaos engineering, fault-injection, and disaster-recovery readiness
Hands-on Engineering
- Actively contribute code (Java / Node.js) for reliability tooling, platform automation, and observability integrations
- Review microservice architecture with engineering teams to eliminate single points of failure
Cloud & DevOps Leadership
- Own AWS architecture (VPCs, IAM, EKS, RDS, ALB/NLB, autoscaling)
- Drive Kubernetes best practices (resource tuning, HPA, pod disruption budgets)
- Improve CI/CD pipelines for reliability, speed, and safety
Incident & Operations
- Lead production incident response, root cause analysis (RCA), and postmortems
- Establish a blameless postmortem culture
- Reduce MTTR through automation and better observability
- Help shape the escalation and on-call strategy (this is not a 24×7 firefighting role)
People & Process
- Mentor SRE DevOps and SRE Full-Stack engineers
- Define operational standards, runbooks, and SRE practices
- Work closely with product, security, and engineering leaders
Required Skills & Experience
- 8+ years of experience in SRE / Platform / DevOps engineering
- Strong hands-on experience with AWS (EKS, EC2, RDS, IAM, CloudWatch, ALB), Kubernetes and Docker, and microservices architectures
- Strong programming background in Java and/or Node.js
- Deep understanding of distributed systems, production debugging, and capacity planning
- Experience in fintech or regulated environments is a strong plus
Nice to Have
- Experience with chaos engineering tools
- Security & compliance exposure (PCI DSS, SOC 2, ISO)
- Prior experience building or scaling SRE teams