SRE Team Lead

Location: Chennai
Experience: 8–12 Years
Domain: Fintech | Cloud-native | Microservices
Role Summary
  • We are looking for a hands-on SRE Team Lead to own the reliability, scalability, and operational excellence of a cloud-native fintech platform built on microservices
  • This role combines technical leadership, architecture ownership, and deep hands-on execution
  • You will lead a small SRE team while remaining actively involved in design, coding, incident response, and reliability engineering
Reliability & Architecture
  • Own platform availability, latency, scalability, and resilience across environments
  • Define and enforce SLOs, SLIs, error budgets, and operational KPIs
  • Design and review resilience patterns: circuit breakers, retries, rate limiting, graceful degradation
  • Drive chaos engineering, fault injection, and disaster-recovery readiness
Hands-on Engineering
  • Actively contribute code (Java / Node.js) for reliability tooling, platform automation, and observability integrations
  • Review microservice architecture with engineering teams to eliminate single points of failure
Cloud & DevOps Leadership
  • Own AWS architecture (VPCs, IAM, EKS, RDS, ALB/NLB, autoscaling)
  • Drive Kubernetes best practices (resource tuning, HPA, pod disruption budgets)
  • Improve CI/CD pipelines for reliability, speed, and safety
Incident & Operations
  • Lead production incident response, root cause analysis (RCA), and postmortems
  • Establish a blameless postmortem culture
  • Reduce MTTR through automation and better observability
  • Participate in defining the escalation and on-call strategy (this is not a 24×7 firefighting role)
People & Process
  • Mentor SRE DevOps and SRE Full-Stack engineers
  • Define operational standards, runbooks, and SRE practices
  • Work closely with product, security, and engineering leaders
Required Skills & Experience
  • 8+ years of experience in SRE / Platform / DevOps engineering
  • Strong hands-on experience with AWS (EKS, EC2, RDS, IAM, CloudWatch, ALB), Kubernetes and Docker, and microservices architectures
  • Strong programming background in Java and/or Node.js
  • Deep understanding of distributed systems, production debugging, and capacity planning
  • Experience in fintech or regulated environments is a strong plus
Nice to Have
  • Experience with chaos engineering tools
  • Security & compliance exposure (PCI-DSS, SOC 2, ISO)
  • Prior experience building or scaling SRE teams