Site Reliability Engineer - Network Team

  • Full Time Job
  • On-site
  • $130,000 - $150,000 NZD
Halter
At Halter, we’re building more than software - we’re transforming the way the world farms. Our smart collars let farmers shift, monitor, and care for their cattle via deep integrations & insights. Behind it all is the Network Team, powering one of New Zealand’s largest private IoT networks with 400,000+ connected devices and counting.
 
We’re looking for a Site Reliability Engineer to help scale our systems to a million animals and beyond. You’ll apply cloud-scale NRE practices to a wildly distributed, rural IoT network across multiple countries.
 
Our vision is to become the OS for farming globally. This isn’t your average backend gig - this one moos.
 
You’re not just writing code — you’re ensuring uptime for hundreds of thousands of animals and farmers who rely on Halter every single day.
 
What you'll do
    • Build & run observability for gateways, towers, and backend/edge services (metrics, logs, tracing, alerts; strong signal / low noise).
    • Automate ops: golden configs, zero-touch provisioning, safe canaries/rollbacks, scheduled maintenance, and self-healing where sensible.
    • Lead incidents end-to-end (runbooks, comms, mitigation, post-mortems) and drive fixes into code, configs, and process.
    • Harden deploys: progressive rollouts for firmware/agent/service changes across thousands of devices and multi-region backends.
    • Performance tuning: reduce command/telemetry latency, smooth OTA pipelines, and de-risk noisy/unreliable links with back-pressure & retries.
    • Capacity & readiness: plan headroom for spikes and growth; game-days/chaos for failover paths (cellular ↔ satellite, region failover).
    • Own runbooks & SOPs that enable field teams and on-call to respond quickly and consistently.
    • Partner with Network/RF engineers on coverage/capacity changes, interference hunts, and carrier/satellite escalations.
    • Champion observability: better logs, metrics, tracing, and signal-to-noise alerting.
    • Mentor teammates on NRE mindset, tools, and operational excellence.
Who we're looking for
    • SRE/NRE/large-scale ops experience (cloud + distributed systems).
    • Strong automation & scripting (Python/Go/etc.) and IaC (Terraform/Ansible/etc.).
    • Solid networking fundamentals (TCP/IP, routing, VPNs, firewalls) + RF awareness (LoRa/LTE/sat a plus).
    • Hands-on with observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).
    • Proven incident management for high-availability systems.
    • Performance tuning for latency-sensitive, unreliable-link environments.
    • Comfortable in Linux across cloud and edge devices.
    • Data-driven: able to turn noisy telemetry into decisions (SQL or notebooks a plus).
    • Pragmatic problem-solver who balances reliability, speed, and cost.
    • Bonus: IoT/off-grid/field deployments experience.
    • Network awareness (baseline, not deep-dive). You don’t need to be a routing/RF guru - we have those. You should be comfortable with:
        • Basic L3 troubleshooting: ping/traceroute, IP/subnetting, DNS/DHCP/NAT basics, reading simple routes.
        • Reading link health: interpreting RSSI/SNR (LoRa) or RSRP/SINR (LTE) at a high level; spotting “link looks bad vs service is bad.”
        • Backhaul pragmatics: understanding failover states (cellular ↔ satellite), cost/perf trade-offs, and safe config rollout patterns.
        • Topology literacy: knowing what a gateway/tower/backhaul path looks like and where to put probes and alerts.
Our Office First Approach
 
There’s a reason you visit your friends in person, live with your family and don’t do dinners over Zoom. Humans are wired for connection. We believe a world-class, in-person office culture is the best way for high-performing teams to work.
Being office first is a core pillar of our culture.
 
We believe in-person connections are key to driving your own growth, learning, impact, and building genuine long-lasting relationships. Strong relationships make it easier to disagree, give feedback, and do meaningful and aligned work. We don’t like having heaps of rules or policies, but this means having strong, trusted relationships is critical.
 
We’re office first, not office only. This means working from the office every day is our default setting, but we flex when we need to. We have a high-trust culture, so everyone is trusted to do what’s best for Halter.
 
Our office vibe is something special. It’s hard to describe until you’re here, but people at Halter who have come from fully remote or hybrid companies say they could never go back - the high energy and spectacular people they are now surrounded by every day make work so enjoyable.
 
Your growth, your learning and your impact are truly unlimited here, and a big part of that comes from being together solving problems, innovating, building context, and constantly learning from each other.