Site Reliability Engineer

PolarGrid

Software Engineering

Toronto, ON, Canada · Ottawa, ON, Canada · Edmonton, AB, Canada · Winnipeg, MB, Canada · Halifax, NS, Canada · Saskatoon, SK, Canada · Montreal, QC, Canada · Vancouver, BC, Canada · Remote

Posted on May 8, 2026

The role is responsible for scaling PolarGrid's edge inference network from a small cluster to a geographically distributed system, owning the full lifecycle of node provisioning and readiness. This includes standardizing how new GPU nodes are brought online through our Helm-based deployment flow, ensuring consistent environments across heterogeneous hardware (today: Blackwell-class GPUs running Triton multipod deployments), and defining clear criteria for when a node is safe to receive production traffic. The SRE will automate cluster bring-up, validate network and model health, and ensure new locations integrate cleanly into the global network without introducing latency or reliability regressions.
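
The "safe to receive production traffic" criteria mentioned above could take the shape of a readiness gate the node agent evaluates before promotion. The sketch below is illustrative only; the field names, thresholds, and signals are assumptions, not PolarGrid's actual checks:

```python
from dataclasses import dataclass

@dataclass
class NodeStatus:
    # Snapshot of a candidate node's health signals (all fields hypothetical).
    gpu_count: int           # GPUs detected by the node agent
    models_ready: bool       # all expected Triton models report READY
    p95_probe_ms: float      # p95 latency of synthetic inference probes
    packet_loss_pct: float   # loss on probes to the routing layer

def ready_for_traffic(status: NodeStatus,
                      min_gpus: int = 1,
                      max_p95_ms: float = 250.0,
                      max_loss_pct: float = 0.5) -> bool:
    """A node is promoted only when every readiness gate passes."""
    return (status.gpu_count >= min_gpus
            and status.models_ready
            and status.p95_probe_ms <= max_p95_ms
            and status.packet_loss_pct <= max_loss_pct)
```

Expressing readiness as an explicit predicate over measured signals, rather than ad-hoc manual checks, is what lets bring-up be automated across heterogeneous hardware.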

A core responsibility is operating and continuously improving the global routing and inference layer to optimize for time to first token (TTFT) and tail latency. This involves maintaining real-time signals such as network RTT, GPU utilization, and model availability, and feeding them into our autorouter to make dynamic routing decisions that outperform static region-based approaches.
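
To make the autorouter's job concrete: one simple way to combine those signals is a weighted cost function, picking the lowest-cost node per request. The weights, signal names, and node labels below are purely illustrative assumptions, not the production routing policy:

```python
def route_score(rtt_ms: float, gpu_util: float, model_available: bool,
                rtt_weight: float = 1.0, util_weight: float = 200.0) -> float:
    """Lower is better. A node without the model loaded is never eligible.

    gpu_util is a fraction in [0, 1]; util_weight converts it into the
    same rough scale as RTT milliseconds (a tunable assumption).
    """
    if not model_available:
        return float("inf")
    return rtt_weight * rtt_ms + util_weight * gpu_util

def pick_node(candidates: dict[str, tuple[float, float, bool]]) -> str:
    # candidates: node -> (rtt_ms, gpu_util, model_available)
    return min(candidates, key=lambda node: route_score(*candidates[node]))
```

Note how this already beats static region routing: a nearby but saturated node (low RTT, high utilization) can lose to a slightly farther, idle one.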

The role also owns model serving reliability across our Triton-based stack. This includes deployment consistency across LLM and voice pods, concurrency control on shared GPUs that co-locate multi-stage voice pipelines (STT → LLM → TTS) with LLM workloads, and mitigation of cold starts and memory contention from on-demand model loading, so that performance remains predictable under varying load conditions.
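
One common pattern for the concurrency-control piece is a per-GPU slot limiter: cap in-flight requests on a shared GPU and shed overflow to the router instead of queuing. This is a generic sketch under assumed slot counts, not PolarGrid's mechanism:

```python
import threading

class GpuSlotLimiter:
    """Caps concurrent requests on a shared GPU so co-located voice and
    LLM workloads don't oversubscribe GPU memory (slot count is a
    hypothetical tuning parameter)."""

    def __init__(self, max_slots: int):
        self._sem = threading.BoundedSemaphore(max_slots)

    def try_acquire(self, timeout: float = 0.0) -> bool:
        # Non-blocking by default: when the GPU is full, the caller
        # should shed the request to another node rather than queue it
        # behind work with unpredictable latency.
        return self._sem.acquire(timeout=timeout)

    def release(self) -> None:
        self._sem.release()
```

Shedding at admission keeps tail latency bounded; queuing on a contended GPU would instead let p99 grow with load.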

The SRE will build the observability, resilience, and operational systems required to run a distributed GPU network at scale. This includes instrumenting end-to-end latency across the full request path (LiveKit ingress through TTS egress for voice; gateway through Triton for LLM), diagnosing performance regressions, handling partial failures, and implementing automated recovery and traffic shifting between regions. They will also develop internal tooling and deployment workflows, extending our S3-driven edge-agent pipeline and AWS CDK stacks to safely roll out changes across all nodes, manage capacity and cost efficiency, and ensure the network can scale to 20+ locations without operational overhead growing linearly.
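
Instrumenting a multi-stage request path usually means timing each stage of a request and attributing tail latency to the slowest hop. A minimal sketch, with illustrative stage names (the real pipeline stages and metric sink would differ):

```python
import time
from contextlib import contextmanager

# Per-request stage timings in milliseconds; a real system would emit
# these to a metrics backend rather than a module-level dict.
STAGE_TIMINGS: dict[str, float] = {}

@contextmanager
def timed_stage(name: str):
    """Record wall-clock duration of one pipeline stage
    (e.g. 'stt', 'llm', 'tts' -- names here are illustrative)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        STAGE_TIMINGS[name] = (time.perf_counter() - start) * 1000.0
```

Wrapping each hop this way gives per-stage histograms, which is what makes a p99 regression attributable to, say, TTS egress rather than the whole path.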

Requirements

  • Experience operating distributed systems across multiple regions or edge locations
  • Strong understanding of networking fundamentals (latency, routing, TCP/UDP behavior, load balancing, anycast vs custom application-layer routing)
  • Hands-on experience with Kubernetes in production, including cluster bring-up, node lifecycle management, and multi-cluster coordination; Helm-based GitOps is a strong plus
  • Familiarity with GPU-based workloads (CUDA, model serving frameworks such as Triton / vLLM / TensorRT-LLM, memory constraints, concurrency tradeoffs on shared GPUs)
  • Experience running or supporting inference systems (LLMs, ASR, TTS, or similar), especially around TTFT, throughput, and cold start behavior
  • Deep observability skills: building metrics, tracing, and debugging systems focused on p95/p99 latency, not just uptime
  • Experience designing failure handling and resilience systems (graceful degradation, failover, circuit breaking, cross-region traffic shifting)
  • Strong automation and infrastructure-as-code mindset (Terraform or AWS CDK, scripting, CI/CD pipelines for infra)
  • Ability to reason about capacity planning and performance under constrained edge compute resources
  • Experience building internal tooling or control planes to manage distributed infrastructure at scale

Nice to Have

  • Real-time voice / WebRTC stack experience (LiveKit, Pipecat, or equivalent)
  • Triton Inference Server, multipod deployments, or on-demand model loading
  • CDN or edge compute background (Cloudflare, Fastly, custom anycast)
  • ACME / Let's Encrypt wildcard cert automation across many subdomains