JobDir — Jobs Directory

Incident Management & Response
1. Act as part of the first-response team during production incidents (24x7 readiness as required)
2. Lead incident resolution efforts, including coordination across multiple teams
3. Drive root cause analysis (RCA) and post-mortem documentation
4. Ensure proper incident tracking, communication, and closure

Reliability Engineering & Improvements
1. Proactively identify and implement system reliability improvements (e.g., circuit breakers, failover strategies, resiliency patterns)
2. Optimize traffic routing, load balancing, and service availability
3. Design and implement automation to reduce manual operational effort

System Observability & Monitoring
1. Define and maintain monitoring, alerting, and logging strategies
2. Analyze system performance metrics (latency, throughput, error rates)
3. Improve observability using tools such as Datadog, Splunk, or Dynatrace

Capacity Planning & System Analysis
1. Conduct service profiling (latency, resource utilization, throughput)
2. Develop capacity planning models and scaling strategies
3. Analyze system “blast radius” and implement risk mitigation strategies

Documentation & Knowledge Management
1. Maintain up-to-date technical documentation (architecture, runbooks, SOPs)
2. Document incident learnings and operational best practices
3. Contribute to knowledge sharing across teams

Technical Skills
1. Hands-on experience with at least one major cloud platform such as Amazon Web Services, Google Cloud Platform, or Huawei Cloud
2. Strong experience with container orchestration platforms like Kubernetes
3. Familiarity with DevSecOps practices and CI/CD pipelines
4. Experience with monitoring, logging, and observability tools
5. Solid understanding of traffic management, load balancing, and traffic pattern analysis

Engineering & Operations
1. Proven experience in incident management, including troubleshooting and post-mortem analysis
2. Background in software development (any major programming language)
3. Strong understanding of distributed systems and microservices architecture
4. Experience with infrastructure automation (e.g., scripting, Infrastructure as Code)

Documentation & Communication
1. Ability to produce clear and structured technical documentation
2. Strong analytical and problem-solving skills
3. Effective communication skills for cross-team collaboration

Site Reliability Engineer

