Site Reliability Engineering Intern

IBM

Software Engineering
Spain · Heredia Province, Heredia, Costa Rica
Posted on Feb 28, 2025
Introduction

Working in IBM Cloud gives you the platform to learn, develop, and use your skills every day by working on the latest cloud-related technology products and services. You'll be working in an environment where we understand that we thrive best when we play to our strengths. That's why developing our people is key to our success, and the door is always open for those ready to advance their careers.
Curiosity and courageous thinking are both vital when working in IBM Cloud, as we continue our commitment to staying at the forefront of cloud technology. Our renowned legacy means we are leading the way in everything from analytics and security to unmatched hardware and software designs. We provide our clients with full end-to-end transformation as we build IBM's next-generation cloud platform, which is focused on delivering performance and predictability at a global scale. IBM's product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.

Your role and responsibilities

As a Site Reliability Engineer Intern, you will play a crucial role in supporting, maintaining, and operationally improving the cloud infrastructure. Working closely with various teams, your focus will be on ensuring the health and reliability of production and test systems. Your proactive approach will be essential in responding promptly to issues and alerts, contributing to the development of new capabilities, and collaborating with other SRE teams and program managers to deliver mission-critical services to the market.

Key Duties:
* 24x7 System Monitoring: Monitor the health of production and test systems around the clock, ensuring continuous reliability.
* Rapid Issue Response: Respond promptly to production issues and alerts, providing swift resolution and maintaining system availability.
* Capability Development: Support the development of new and existing capabilities for compute, storage, and network services.
* Collaborative Partnership: Partner with other SRE teams and program managers, contributing to the seamless delivery of mission-critical services to the market.
* Automation Execution: Execute changes in the production environment through automation, ensuring efficiency and minimizing downtime.
* Cross-Functional Troubleshooting: Collaborate with engineering teams to provide initial assessments and possible workarounds for production issues. Troubleshoot and resolve production issues effectively.
* Integration Planning: Work with support and development teams to identify and resolve issues. Discuss and plan integration tasks to enhance overall system performance.

Required education
High School Diploma/GED
Preferred education
None
Required technical and professional expertise

* Currently pursuing a university degree, with a history of academic success, in a field such as Computer Engineering, Systems Engineering, Software Engineering, or a related discipline;
* Availability to complete an internship;
* Knowledge of Python or other programming languages;
* Knowledge of the English language;
* System Monitoring and Troubleshooting: knowledge of monitoring/observability, issue response, and troubleshooting for optimal system performance;
* Automation Proficiency: knowledge of automation for production environment changes, streamlining processes for efficiency, and reducing toil;
* Linux: knowledge of Linux operating systems;
* Operation and Support Experience: understanding of day-to-day operations, alert management, incident support, migration tasks, and break-fix support.

Preferred technical and professional experience

Knowledge of:

• Kubernetes/OpenShift: knowledge or experience of Kubernetes/OpenShift environments.
• Automation/Scripting: knowledge or experience of Ansible, Python, Terraform, and CI/CD tools such as Jenkins, IBM Continuous Delivery, ArgoCD.
• Monitoring/Observability: knowledge or experience crafting alerts and dashboards using tools such as Instana, New Relic, Grafana/Prometheus.
• DBA: interest or experience configuring and maintaining SQL, NoSQL, and data streaming technologies (e.g. PostgreSQL, CouchDB, Redis, Kafka, Spark).