Location
- Remote
Job types
- Remote Work
Industry
- Technology, Media, & Telecommunication
Salary
Market related
Functions
- Engineer
Seniority
- Mid-level
- Senior
Technologies
- C
Job reference
113424
Senior GPU Cluster Engineer
Location: Remote (Europe / North America / Israel)
Industry: AI | HPC | Cloud Computing
Why Work With Us
We’re building the next generation of cloud infrastructure, purpose-built for the global AI economy. Our platform enables organizations to tackle complex, real-world challenges without the cost of legacy infrastructure or the need to maintain large in-house AI/ML teams.
You’ll join a team of experienced engineers and innovators working at the forefront of high-performance computing, distributed systems, and AI cloud infrastructure.
Our Global Presence
Headquartered in Amsterdam and listed on Nasdaq, we operate R&D hubs across Europe, North America, and Israel. Our global team includes over 800 employees, more than 400 of whom are highly skilled engineers working across hardware design, systems software, networking, and AI/ML infrastructure.
The Role
We’re looking for a Senior GPU Cluster Engineer to join our GPU & InfiniBand Engineering Team. You’ll work on the core components of our hyperscale platform, with a focus on GPU computing, InfiniBand networking, and hardware virtualization technologies such as KVM/QEMU.
This role is highly technical and hands-on. You’ll be responsible for integrating new hardware, tuning performance, resolving complex system issues, and building automated monitoring and fault-resolution systems for large-scale GPU clusters.
Responsibilities
- Optimize GPU cluster and InfiniBand network performance for HPC and AI workloads
- Analyze and resolve low-level hardware/software issues in GPU and InfiniBand environments
- Integrate new GPU hardware and support it through the Kubernetes, QEMU, and KVM stacks
- Build and enhance automation tools for system monitoring, diagnostics, and fault recovery
- Configure and maintain GPU devices and InfiniBand fabrics for reliability and scale
Requirements
- 5+ years of experience in system-level software engineering (performance, infrastructure, low-level development)
- 3+ years of hands-on experience with Linux systems (tuning, debugging, administration)
- Solid understanding of server hardware architecture, including PCIe, NICs, and Linux internals
- Proficiency in systems and infrastructure languages such as C/C++, Go, or Python
Preferred Qualifications
- Experience with GPU cluster validation and testing over InfiniBand
- Proven performance tuning of HPC or AI/ML workloads
- Knowledge of RDMA, RoCE, and InfiniBand protocols
- Familiarity with Software-Defined Networking (SDN) and high-performance networking
- Understanding of QEMU/KVM, virtualization technologies, and driver integration
- Experience with PyTorch, TensorFlow, or other deep learning frameworks
- Familiarity with MPI, NCCL, or other collective communication libraries
What We Offer
- Competitive compensation and full benefits package
- Clear technical career path and professional development support
- A collaborative, high-impact engineering environment focused on innovation and growth