About the Role

Senior Site Reliability Engineer (SRE)
Who We Are
Softensity is a US-based IT outsourcing company with global software teams. We are headquartered in Atlanta, GA, USA, with development teams in LATAM, Eastern Europe, and Türkiye. When you have better teams, you build better software. Let’s do this together!
The Opportunity
Location: Hyderabad, India (Hybrid)
Hiring Method: Contractor

The Role
Key Responsibilities:
Reliability and Performance Management
Design, implement, and maintain highly available, scalable, and resilient cloud-native architectures for mission-critical SaaS products.
Develop and implement SLOs, SLIs, and SLAs to measure and improve service reliability.
Continuously optimize system performance and resource utilization across multiple cloud platforms.
Fine-tune and optimize application performance by analyzing code, traces, and database queries.
Incident Management and Troubleshooting
Lead incident response efforts, effectively troubleshooting complex issues to minimize downtime and impact.
Reduce Mean Time to Recover (MTTR) through proactive monitoring, automated alerting, and efficient problem-solving techniques.
Conduct thorough Root Cause Analysis (RCA) for all major incidents and implement preventive measures.
Observability and Monitoring
Design and implement end-to-end observability solutions across our distributed systems.
Develop and maintain comprehensive monitoring strategies using tools such as the ELK Stack, Prometheus, and Grafana.
Create and optimize product status dashboards to provide real-time visibility into system health and performance.
Automation and Infrastructure as Code (IaC)
Implement Infrastructure as Code practices using tools like Terraform.
Develop and maintain automated deployment pipelines and CI/CD workflows.
Create self-healing systems and automate routine operational tasks to reduce manual intervention.
Cloud-Agnostic Architecture
Design and implement cloud-agnostic solutions that can operate efficiently across multiple cloud providers.
Develop expertise in event-driven architectures and related technologies (e.g., Apache Kafka/Azure Event Hubs, Redis, MongoDB Atlas, Azure IoT Hub).
Implement and manage containerized applications using Kubernetes across different cloud environments.
Continuous Improvement
Regularly review and refine operational practices to enhance efficiency and reliability.
Stay updated with the latest industry trends and technologies in SRE, cloud computing, and DevOps.
Contribute to the development of internal tools and frameworks to support SRE practices.
Main Qualifications
7+ years of experience in cloud infrastructure management and optimization.
Solid experience with automating infrastructure processes.
Hands-on technical expertise in CI/CD pipeline implementation.
Strong background in microservices and Docker.
Mid-level experience supporting the deployment of Java or .NET applications.
Expertise in cloud platforms (AWS, Azure, GCP) and their associated services.
Strong understanding of networking concepts, load balancing, and security practices.
Why Join Us
We are passionate about top-quality talent and about giving our employees the tools they need to keep growing and learning.
The sky is truly the limit, and we want you to feel challenged and motivated in every project you are part of, all while working with cutting-edge technologies and amazing clients.
What to expect?
• Measurable goals
• Remote work
• Paid time off
• Coursera Credentials
