Staff Site Reliability Engineer - Confluent Incident Management & Reliability

IBM

Software Engineering

California, USA · Texas City, TX, USA

Posted on May 17, 2026

Introduction

At IBM Software, we transform client challenges into solutions, building the world’s leading AI-powered, cloud-native products that shape the future of business and society. Our legacy of innovation creates endless opportunities for IBMers to learn, grow, and make an impact on a global scale. Working in Software means joining a team fueled by curiosity and collaboration. You’ll work with diverse technologies, partners, and industries to design, develop, and deliver solutions that power digital transformation. With a culture that values innovation, growth, and continuous learning, IBM Software places you at the heart of IBM’s product and technology landscape. Here, you’ll have the tools and opportunities to advance your career while creating software that changes the world.

With Confluent, data doesn’t sit still. We put information in motion, streaming in near real time so organizations can react faster, build smarter, and deliver experiences as dynamic as the world around them.

Your role and responsibilities

About the Role:

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale—data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices.

You'll be part of a global team with follow-the-sun coverage and clean handoffs that keep everyone working sustainable hours. This role sits within Cloud Architecture and Reliability - Supportability, a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.

What You Will Do:

  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence

  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack

  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments

  • Own standards, practices, and continuous improvement of incident response across engineering

  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity

  • Develop and deliver training programs; coach teams through post-mortems

  • Partner with engineering leaders to elevate reliability practices org-wide

What You Will Bring:

  • Deep experience with observability: metrics, logging, tracing

  • Kubernetes and container orchestration experience

  • Understanding of CI/CD pipelines and release processes

  • Strong written communication (design docs, runbooks, post-mortems)

  • Experience driving org-wide process and cultural changes

Required education

Bachelor's Degree

Preferred education

Master's Degree

Required technical and professional expertise

  • 10+ years of relevant experience in SRE, incident management, or reliability engineering

  • Cloud experience with at least one of AWS, GCP, or Azure (we run all three)

  • Experience navigating reliability/incident programs at 500+ engineer organizations

  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)

  • Strong understanding of distributed systems and failure modes at scale

  • Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Preferred technical and professional experience

  • Advanced Cloud Knowledge: Experience with cloud-based infrastructure and its application in reliability and resiliency engineering.

  • Specialized Scripting Skills: Proficiency in scripting languages and automation tools to optimize system reliability and performance.

ABOUT BUSINESS UNIT

IBM Software infuses core business operations with intelligence, from machine learning to generative AI, to help make organizations more responsive, productive, and resilient. IBM Software helps clients put AI into action now to create real value with trust, speed, and confidence across digital labor, IT automation, application modernization, security, and sustainability. Critical to this is the ability to make use of all data, because AI is only as good as the data that fuels it. In most organizations, data is spread across multiple clouds, on premises, in private datacenters, and at the edge. IBM’s AI and data platform scales and accelerates the impact of AI with trusted data, and provides leading capabilities to train, tune, and deploy AI across the business. IBM’s hybrid cloud platform is one of the most comprehensive and consistent approaches to development, security, and operations across hybrid environments: a flexible foundation for leveraging data, wherever it resides, to extend AI deep into a business.

YOUR LIFE @ IBM

In a world where technology never stands still, we understand that dedication to our clients’ success, innovation that matters, and trust and personal responsibility in all our relationships live in what we do as IBMers as we strive to be the catalyst that makes the world work better.

Being an IBMer means you’ll be able to learn and develop yourself and your career. You’ll be encouraged to be courageous and experiment every day, all while having continuous trust and support in an environment where everyone can thrive, whatever their personal or professional background.

Our IBMers are growth-minded, always staying curious, open to feedback, and learning new information and skills to constantly transform themselves and our company. They are trusted to provide ongoing feedback to help other IBMers grow, and to collaborate with colleagues using a team-focused approach that includes different perspectives to drive exceptional outcomes for our customers. The courage our IBMers have to make critical decisions every day is essential to IBM becoming the catalyst for progress, always embracing challenges with the resources they have to hand, a can-do attitude, and an outcome-focused approach in everything that they do.

Are you ready to be an IBMer?

ABOUT IBM

IBM’s greatest invention is the IBMer. We believe that through the application of intelligence, reason and science, we can improve business, society and the human condition, bringing the power of an open hybrid cloud and AI strategy to life for our clients and partners around the world.

Restlessly reinventing since 1911, we are not only one of the largest corporate organizations in the world but also one of the biggest technology and consulting employers, with many of the Fortune 500 companies relying on the IBM Cloud to run their business.

At IBM, we pride ourselves on being an early adopter of artificial intelligence, quantum computing and blockchain. Now it’s time for you to join us on our journey to being a responsible technology innovator and a force for good in the world.

IBM is proud to be an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, gender, gender identity or expression, sexual orientation, national origin, genetics, pregnancy, disability, neurodivergence, age, or other characteristics protected by the applicable law. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.

OTHER RELEVANT JOB DETAILS

Must have the ability to work in Canada without sponsorship.

This role will involve working with technology that is covered by Export Regulations sanctions. If you are a Foreign National from any of the following US sanctioned countries (Cuba, Iran, North Korea, Syria, and the Crimea, Luhansk, Donetsk, Kherson, and Zaporizhia regions of Ukraine) on a work permit, you are not eligible for employment in this position.

The salary range for the position is based on a full-time schedule. Your ultimate salary within this range may vary depending on your job-related skills and experience for this position.