Solving complexity. Accelerating results.

At Penguin Solutions, we understand the boundless potential of technology and support our customers in turning cutting-edge ideas into outcomes—faster, and at any scale.

With over two decades of experience as trusted advisors, Penguin Solutions is an end-to-end technology company solving the industry’s most complex challenges in computing, memory, and LED solutions. Penguin designs, builds, deploys, and manages high-performance, high-availability enterprise solutions, allowing customers to achieve their breakthrough innovations.

Sr. DevOps Engineer

Date Posted: Mar 7, 2025
Requisition ID: 1507
Location: US
Brand: Penguin Solutions

Overview

We are seeking a Senior DevOps Engineer to join our Software Managed Services team. In this role, you will design, automate, optimize, and maintain large-scale AI and HPC clusters. You will collaborate with industry leaders, AI researchers, and enterprise customers to enhance infrastructure performance and ensure scalability, reliability, and security. You will also contribute to CI/CD automation, advanced networking, and distributed system monitoring in highly dynamic, high-compute environments.

 

Responsibilities

  • Optimize and manage large-scale HPC & AI cluster environments, including NVIDIA GPUs, InfiniBand, and RDMA networking.
  • Design and maintain scalable automation pipelines using Ansible, Terraform, and CI/CD frameworks to streamline cluster deployment and updates.
  • Monitor and troubleshoot cluster performance across compute, networking, and storage layers, leveraging tools such as Grafana, Prometheus, and ELK.
  • Automate infrastructure management with Infrastructure as Code (IaC), ensuring scalability, resilience, and consistency.
  • Enhance security and compliance in HPC/AI environments, including support for SELinux, FIPS 140-2, and STIGs.
  • Collaborate with AI researchers and engineering teams to fine-tune cluster resources for distributed AI workloads.
  • Optimize file systems and data pipelines to accelerate large-scale dataset processing.
  • Work with data center teams to deploy, maintain, and troubleshoot physical infrastructure.
  • Document configurations, processes, and troubleshooting procedures for knowledge sharing and operational efficiency.
  • Participate in an on-call rotation to provide critical support for AI and HPC operations.

 

Qualifications

  • Bachelor’s Degree in Computer Science, Computer/Electrical Engineering, or equivalent education and experience.
  • 8+ years of hands-on experience in Linux-based DevOps, HPC, or AI infrastructure environments.
  • US Citizenship is required for this role.
  • Expertise in Ansible & Infrastructure as Code (IaC) – Experience with Terraform, automation pipelines, and large-scale system orchestration.
  • Deep Linux administration skills, including kernel tuning, system security, and package management.
  • Strong networking knowledge – OSI layers, InfiniBand, RDMA, Ethernet, and low-latency/high-bandwidth networking performance optimization.
  • Proficiency in Bash & Python (40% Bash, 60% Python) – Experience automating system operations and debugging.
  • Experience in GPU-based environments, including CUDA, TensorRT, NCCL, and AI workload acceleration.
  • Familiarity with HPC storage (Lustre, Ceph) and distributed compute frameworks.
  • Strong troubleshooting skills, with the ability to diagnose complex infrastructure issues at scale.
  • Excellent communication and documentation skills, with the ability to work effectively in customer-facing and internal technical collaborations.

Preferred Qualifications

  • Experience working with AI training frameworks (PyTorch, TensorFlow, Horovod) and distributed compute environments.
  • Kubernetes and containerized workload management experience.
  • Cloud computing expertise (AWS, GCP, Azure).
  • Linux certifications (RHCSA, RHCE) and cloud certifications (AWS Solutions Architect, GCP Professional Cloud Engineer) are highly desirable.

Location

This is a remote position in the United States.

 

Travel

Occasional travel as needed (10-25%).

 

Compensation & Benefits

The base pay range that the Company reasonably expects to pay for this position in the United States is $140,000 - $165,000; the pay ultimately offered may vary based on business considerations, including job-related knowledge, skills, experience, and education. The position is bonus-eligible, and medical, dental, and vision benefits are available. There is also a 401(k) savings plan and other benefits, such as Paid Time Off, Life Insurance, and an Employee Assistance Plan.

 

Diversity and Inclusion Statement

We are committed to creating a diverse environment that embraces differences and fosters inclusion.

 

Equal Opportunity Statement

We are an Affirmative Action/Equal Opportunity Employer and are strongly committed to all policies that afford equal employment opportunity to all qualified persons without regard to age, national origin, race, ethnicity, creed, gender, disability, veteran status, or any other characteristic protected by law.