Senior Site Reliability Engineer - Storage Operations

IBM

Software Engineering, Operations
Heredia Province, Heredia, Costa Rica
Posted on Feb 19, 2025
Introduction

IBM Cloud Computing is a one-stop shop that provides the cloud solutions and tools industries need. The IBM Cloud portfolio includes infrastructure as a service (IaaS), software as a service (SaaS), and platform as a service (PaaS), offered through public, private, and hybrid cloud delivery models, in addition to the components that make up those clouds.

IBM Cloud ensures seamless integration into public and private cloud environments. The infrastructure is secure, scalable, and flexible, providing customized enterprise solutions that have made IBM Cloud the hybrid cloud market leader with our market-leading IaaS and PaaS platforms. The IBM Cloud platform is the public cloud offering from IBM, providing services to global enterprises. IBM Cloud is the Cloud for Smarter Business, built on open technology with developer tools and support for industry-specific solutions. We run the services and workloads from Watson, Blockchain, Services, Security, and IoT.

Ready to help drive IBM's success in the cloud market? This is your chance to research and learn new cloud-related technology products and services, and to design and implement quick cloud-based prototypes while advancing your career in leading-edge technology.

Your role and responsibilities

As a Site Reliability Engineering (SRE) and DevOps Engineer in Storage, you will ensure that the designed solution meets non-functional requirements such as reliability, availability, performance, security, and maintainability. You will work closely with the development team and the related Release and L2 teams.

As a storage operations lead, you will ensure that the storage fleet maintains reliability, availability, performance, and security. You will work closely with vendors, development teams, datacenter staff, and support staff to keep the storage environment stable and growing. This includes performing expansions and upgrades and assisting vendors with the installation of new hardware.

Responsibilities:

  • Diagnosing and troubleshooting technical issues related to NetApp storage hardware and software
  • Providing timely solutions to customers through phone, email, and remote sessions
  • Acting as a primary point of contact for resolving complex technical problems, and collaborating with other teams to deliver optimal customer support
  • Applying in-depth knowledge of NetApp's ONTAP operating system, RAID concepts, and the Ethernet, FC, and iSCSI protocols, along with familiarity with NetApp hardware such as FAS and AFF arrays
  • Keeping the service up and running or getting it back up and running quickly when failure occurs
  • Working closely with internal partners and teams to ensure that our infrastructure meets security, SLA, and performance requirements
  • Writing, updating, and using documentation, including runbooks/playbooks
  • Debugging complex problems across an entire stack and creating solid solutions
  • Persistently testing application and infrastructure resiliency under a variety of error conditions
  • Partnering with security engineers and developing plans and automation to respond aggressively and safely to new risks and vulnerabilities
  • Developing, communicating, and monitoring standard processes to promote the long-term health and sustainability of operational development tasks
  • Providing mentorship to junior engineers and contributing to knowledge-sharing across teams

Required education
Bachelor's Degree
Preferred education
Bachelor's Degree
Required technical and professional expertise

  • 6+ years of total experience
  • Extensive experience with administering NetApp ONTAP storage clusters
  • A solid understanding of Cloud infrastructure/operations is a must
  • Knows their way around a Unix/Linux shell, can write shell scripts, and understands Linux internals
  • Experience debugging complex problems
  • Experience with DevOps engineering or SRE
  • Experience with standard industry tools for monitoring and observability
  • Experience automating infrastructure, configuration management, testing, and deployments using tools such as Ansible or Chef, and the ability to explain the Infrastructure as Code paradigm
  • A strong understanding of diverse infrastructure platforms and infrastructure concepts
  • Has hands-on experience using source control and feature branching strategies
  • Understands networking and messaging, especially between services
  • Solid experience in infrastructure operations automation and IT service management, with hands-on exposure to data center administration, configuration, incident management, and support
  • Strong communication skills
  • Advocates for DevOps/SRE best practices, including blameless postmortems, incident retrospectives, and operational readiness reviews

Preferred technical and professional experience

  • Strong familiarity with one of C, C++, Go, Python, or Java
  • PHP and Perl development experience
  • IBM Cloud API knowledge
  • Experience with monitoring applications such as Grafana, the ELK stack, Prometheus, Nagios, and Sysdig
  • Familiarity with cloud deployment tooling such as Razee and LaunchDarkly