
Site Reliability Engineer
Dans Multi Pro
South Jakarta, DKI Jakarta, Indonesia
Contract, Senior, Engineering
Posted
May 5, 2026
Source
Kalibrr
Skills & Technologies
Go, Google Cloud, Kubernetes, CI/CD, Microservices
Job Description
Incident Management & Response
1. Act as part of the first-response team during production incidents (24x7 readiness as required)
2. Lead incident resolution efforts, including coordination across multiple teams
3. Drive root cause analysis (RCA) and post-mortem documentation
4. Ensure proper incident tracking, communication, and closure
Reliability Engineering & Improvements
1. Proactively identify and implement system reliability improvements (e.g., circuit breakers, failover strategies, resiliency patterns)
2. Optimize traffic routing, load balancing, and service availability
3. Design and implement automation to reduce manual operational effort
System Observability & Monitoring
1. Define and maintain monitoring, alerting, and logging strategies
2. Analyze system performance metrics (latency, throughput, error rates)
3. Improve observability using tools such as Datadog, Splunk, or Dynatrace
Capacity Planning & System Analysis
1. Conduct service profiling (latency, resource utilization, throughput)
2. Develop capacity planning models and scaling strategies
3. Analyze system “blast radius” and implement risk mitigation strategies
Documentation & Knowledge Management
1. Maintain up-to-date technical documentation (architecture, runbooks, SOPs)
2. Document incident learnings and operational best practices
3. Contribute to knowledge sharing across teams
Technical Skills
1. Hands-on experience with at least one major cloud platform such as Amazon Web Services, Google Cloud Platform, or Huawei Cloud
2. Strong experience with container orchestration platforms like Kubernetes
3. Familiarity with DevSecOps practices and CI/CD pipelines
4. Experience with monitoring, logging, and observability tools
5. Solid understanding of traffic management, load balancing, and traffic pattern analysis
Engineering & Operations
1. Proven experience in incident management, including troubleshooting and post-mortem analysis
2. Background in software development (any major programming language)
3. Strong understanding of distributed systems and microservices architecture
4. Experience with infrastructure automation (e.g., scripting, Infrastructure as Code)
Documentation & Communication
1. Ability to produce clear and structured technical documentation
2. Strong analytical and problem-solving skills
3. Effective communication skills for cross-team collaboration
Interested in this position?
Apply directly on Kalibrr to submit your application.