Regions
Location
  • Greater London
Job types
  • Hybrid Working
  • Office Based
  • Permanent
Industry
  • IT Services 
  • Telecommunications
Salary

£120k - £180k per year + bonus

Functions
  • IT Networks & Infrastructures
  • Network Engineer
Seniority
  • Mid-level
Technologies
  • Arista Networks
  • CISCO Switches
  • Go
  • GoLang
Posted

1 month ago

Job reference

118783

Benefits

BONUS

Network Site Reliability Engineer – Python/Go, Observability, Monitoring, HPC
 
Within the Network Engineering Team, this role is critical in ensuring our client's High-Performance Computing (HPC) environments are supported by a resilient, data-driven, and software-defined network foundation.
 
We are seeking a network-focused Site Reliability Engineer (SRE) with an emphasis on Observability, Telemetry, and Monitoring. In this role, you will apply a software engineering mindset to network operations, bridging the gap between traditional networking and modern SRE practice.
You will be responsible for ensuring our high-performance network infrastructure is not just functional, but deeply visible. You will build the tooling and automation that allow the team to move from reactive troubleshooting to proactive, automated remediation and “self-healing” infrastructure.
Key Responsibilities:
  • Reliability Engineering: Apply SRE principles to the network; define and maintain SLIs, SLOs, and Error Budgets for network latency, packet loss, and availability.
  • HPC Connectivity & Performance: Support low-latency, high-throughput network architectures (e.g., RDMA, RoCE) designed for intensive HPC and financial data workloads.
  • Advanced Telemetry: Design and manage high-cardinality telemetry pipelines to collect and analyze flow logs, metrics, and traces at scale.
  • Network Automation (Python/Go): Build and maintain internal software tools, APIs, and “self-healing” scripts to automate routine operations and complex failure recoveries.
  • Infrastructure-as-Code (IaC): Use Terraform to manage complex network configurations and observability stacks (Prometheus, Grafana, OpenSearch) as code.
  • Observability & Monitoring: Implement automated alerting and dashboarding that provide real-time insights into network health and traffic patterns.
  • Incident Management & Post-Mortems: Lead technical troubleshooting for complex outages and conduct “blameless post-mortems” to drive systemic improvements.
 
Your Present Skillset
  • 3+ years of experience in a Network Reliability Engineering (NRE), SRE, or Network Operations role within a high-performance environment.
  • Software Engineering Mindset: Strong proficiency in Python and Go for building automation, custom exporters, or network management tools.
  • Observability Stack Expertise: Hands-on experience with Prometheus, Grafana, OpenSearch/Elasticsearch, and distributed tracing.
  • Networking Fundamentals: Deep knowledge of TCP/IP, BGP, EVPN, and routing/switching concepts in a high-bandwidth environment.
  • Infrastructure as Code: Proven experience using Terraform to ensure scalable, repeatable, and version-controlled network deployments.
  • HPC Awareness: Familiarity with the networking requirements of high-performance computing, such as non-blocking fabrics and low-latency interconnects.
 
Desirable Experience
  • Streaming Telemetry: Experience with gNMI, gRPC, or Kafka for real-time network data streaming.
  • CI/CD for Networking: Familiarity with “NetDevOps” workflows, including automated testing (pytest/go test) and pipeline validation for network changes.
  • Container Networking: Knowledge of Kubernetes networking, CNI plugins, and Service Mesh (e.g., Istio or Cilium).
  • Traffic Engineering: Experience with segment routing or advanced load-balancing strategies for high-performance workloads.

