Partner Company
Senior DevOps / Platform Engineer
The company is focused on building data and AI infrastructure solutions for enterprise and government clients.
Location
Onsite
Employment Type
Full-time
About This Role
- Manage RKE2 Kubernetes clusters across multiple node roles (control plane, compute workers, GPU workers, storage nodes), including cluster bootstrapping, upgrades, etcd backup and restore, and node lifecycle management
- Build and maintain GitOps pipelines using ArgoCD and GitLab CI/CD: Helm chart packaging, image promotion workflows (dev to staging to production), and environment-specific configuration management
- Operate and configure platform services: Kong (API gateway), Keycloak (identity), Vault (secrets), and GitLab (CI/CD and container registry)
- Implement observability: Prometheus for metrics, Grafana for dashboards, centralized logging (Loki or Elasticsearch), and alerting with on-call routing
- Manage Ceph integration at the Kubernetes level: CSI driver, StorageClass definitions, and PVC troubleshooting
- Manage GPU workload scheduling: NVIDIA device plugin, resource quotas for GPU pods, and utilization monitoring
- Automate operational tasks: certificate rotation, secret rotation, backup verification, and capacity planning
What We're Looking For
- 5+ years in DevOps, SRE, or Platform engineering, with at least 2 years managing self-hosted production Kubernetes clusters (not managed K8s)
- Has operated stateful workloads on Kubernetes in production (databases, Kafka, Ceph, or equivalent)
- Production experience operating at least one of: Kong, Keycloak, Vault, or equivalent service in each category (API gateway, identity provider, secret manager)
- Has built CI/CD pipelines from scratch for a team of 5+ engineers using GitLab CI or equivalent
- Has authored Helm charts and deployed applications through ArgoCD or equivalent GitOps tooling
Preferred
- Experience with Thanos or Cortex for long-term Prometheus metric storage
- Direct experience with Ceph CSI or Rook-Ceph on Kubernetes
- Has managed GPU workloads on Kubernetes (NVIDIA device plugin)
- NetworkPolicy design experience with Calico or Cilium
- Terraform experience for infrastructure-as-code