Site Reliability Operations Engineer III - San José, Costa Rica - Zuora

Zuora

Verified company

San José, Costa Rica

1 week ago

Posted by:

Andrea Rodríguez

beBee Recruiter

Description

Company Overview
At Zuora, we do Modern Business.

We're helping people subscribe to new ways of doing business that are better for people, companies and ultimately the planet.

It's an approach resulting from the shift to the Subscription Economy that puts customers first by building recurring relationships instead of one-time product sales and focuses on sustainable growth.

Through our leading expertise and multi-product suite, we are transforming all industries and working with the world's most innovative companies to monetize new business models, nurture subscriber relationships and optimize their digital experiences.

THE TEAM

Responsible For:

Service operations and restoration of impacting issues
Driving Command Center Incident Bridges for customer issues to resolution
Responding to Observability Alerts/Alarms
Responding to escalated issues from Customer support
Writing and automating runbooks, and reducing alerts, incidents, and service requests through automation
Being a liaison for a service and partnering with the service owner to make the service rock solid and efficient

WHAT YOU'LL ACHIEVE

As an SRO, you will be a member of a team that understands the configuration, technical dependencies, and overall behavioral characteristics of production services.

In partnership with developers, you have the responsibility to ensure services are designed and delivered with focus on security, resiliency, scale, and performance.

SROs are the ultimate authority and are accountable for end-to-end performance and operability of the services they own.

Champion service reliability operations and incident prevention

You will be part of the team whose mission is the shared ownership of a collection of services and technology areas, in partnership with developer teams.
You are a key escalation point for issues that have been documented as Standard Operating Procedures (SOPs) or that need in-depth troubleshooting and analysis. You will help maintain up-to-date documentation on deployments, processes, and SOP runbooks.
You are a key escalation point in leading incidents and working with Subject Matter Experts (SMEs) to perform real-time incident handling tasks in support of operations. You will help develop and implement the incident management process.
You will have the deep understanding of service topology and dependencies required to troubleshoot issues and define mitigations. Once you have expertly mitigated an incident, you will immediately work with SMEs on how to resolve the issue more quickly next time, with the goal of preventing the problem from recurring. You will help develop and implement the problem management process.
You will manage the full lifecycle of infrastructure and change management, including planned maintenance and standard, normal, and emergency changes. You will help develop and implement change management processes to ensure developers and SROs can easily manage system configurations, deploy new code quickly, and fix incidents faster.

Service design and implementation

You will partner with development Scrum teams in defining and implementing improvements to service architecture, both current and future. You will be an expert at articulating the technical characteristics of services and their dependencies, and will guide development teams to engineer highly reliable and performant services.
You will frequently partner with developer Scrum teams and actively participate in the execution of tasks required to meet milestones and deliverables set by the team throughout a release cycle.

Operations Engineering

You will take part in a shared on-call rotation that won't cripple your life or kill your soul.

Job Involves:

Resolution of complex and critical issues, and participation in major incidents as an SME
Acting as a service expert, ensuring that expertise is reflected in SOP documentation and shared
Instrumentation and metrics that clearly describe the service behaviors
Scaling requirements and patterns
Resiliency and recoverability, ensuring that backup / restore and disaster recovery capabilities are implemented, tested and maintained
Driving and escalating gaps in automation, solutions and documentation

WHAT YOU'LL NEED TO BE SUCCESSFUL

SROs are a rare mix of sysadmins and development engineers, and as such you have the ability to understand and explain the effect of product architecture decisions on the ability to run as distributed systems.

You are driven by professional curiosity and a desire to develop a deep understanding of the services and the technologies they depend upon.

You demonstrate competence in shell scripting and high-level programming languages and tools such as Bash, Ansible, Python, and Terraform, and in low-code / no-code programming languages and solutions such as Google Apps Script, Jenkins Pipeline Groovy scripts, Jira Automation, and Rundeck.

You are proactive, self-motivated, customer-focused, organized, and a good communicator.

You have over 4 years experience r