TRUGlobal

Senior Site Reliability Engineer

Bengaluru, Karnataka, India | On-site | Full-time

Skills

Python (Programming Language), Site Reliability Engineering, Reliability, Infrastructure Automation, Incident Management, Power System Reliability, System Monitoring

About the Role

Job Title: Site Reliability Engineer (SRE) with Python Development Expertise

Position Overview: We are seeking a skilled Site Reliability Engineer (SRE) with strong Python development experience to join our team. The ideal candidate will be responsible for ensuring the reliability, availability, and performance of our services across both on-premises and cloud platforms. The role involves building scalable infrastructure automation and monitoring solutions, and working closely with development teams to improve system reliability.

Key Responsibilities:

Infrastructure Automation and Monitoring:
Design and implement scalable, robust service-oriented infrastructure automation and monitoring solutions for both on-premises and cloud environments.
Develop and maintain tools to enhance system reliability and performance.
Incident Management and Support:
Provide internal customer support, promptly resolving issues within established Service Level Agreements (SLAs).
Participate in on-call rotations to address incidents impacting service availability.
System Reliability and Performance:
Ensure platform reliability and 99.99% availability by proactively identifying and addressing potential issues.
Carry out performance tuning, disaster recovery planning, break-fix work, and security patching as needed.
Collaboration and Continuous Improvement:
Work closely with platform users and development teams to identify issues proactively and develop code-based solutions.
Participate in sprint planning sessions and deliver projects within planned timelines.
Document processes, system configurations, and troubleshooting procedures to enhance team knowledge and system transparency.

Qualifications:

Technical Expertise:
Strong experience in Python development, particularly in automating infrastructure tasks and developing monitoring tools.
Proficiency with infrastructure automation tools and monitoring systems.
Experience with cloud platforms (e.g., AWS, Azure, GCP) and on-premises infrastructure.
Problem-Solving Skills:
Demonstrated ability to analyze complex systems, identify issues, and implement effective solutions.
Strong debugging and troubleshooting skills.
Collaboration and Communication:
Excellent communication skills, with the ability to work effectively in cross-functional teams.
Experience with agile development processes and sprint planning.
Education and Experience:
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Previous experience in a Site Reliability Engineering role or similar position.

Preferred Qualifications:

Experience with configuration management tools (e.g., Ansible, Chef, Puppet).
Familiarity with containerization technologies (e.g., Docker, Kubernetes).
Knowledge of networking concepts and protocols.
Understanding of CI/CD pipelines and related tools.

Why Join Us:

Opportunity to work on cutting-edge technologies and complex systems.
Collaborative and inclusive work environment.
Professional development opportunities and career growth.
Competitive compensation and benefits package.
