•Manage RKE2 Kubernetes clusters across multiple node roles (control plane, compute workers, GPU workers, storage nodes), including cluster bootstrapping, upgrades, etcd backup and restore, and node lifecycle management
•Build and maintain GitOps pipelines using ArgoCD and GitLab CI/CD: Helm chart packaging, image promotion workflows (dev to staging to production), and environment-specific configuration management
•Operate and configure platform services: Kong (API gateway), Keycloak (identity), Vault (secrets), and GitLab (CI/CD and container registry)
•Implement observability: Prometheus for metrics, Grafana for dashboards, centralized logging (Loki or Elasticsearch), and alerting with on-call routing
•Manage Ceph integration at the Kubernetes level: CSI driver, StorageClass definitions, and PVC troubleshooting
•Manage GPU workload scheduling: NVIDIA device plugin, resource quotas for GPU pods, and utilization monitoring
✓5+ years in DevOps, SRE, or platform engineering, including at least 2 years managing self-hosted production Kubernetes clusters (not a managed Kubernetes service)
✓Has operated stateful workloads on Kubernetes in production (databases, Kafka, Ceph, or equivalent)
✓Production experience operating at least one service in each category: API gateway (Kong or equivalent), identity provider (Keycloak or equivalent), and secrets manager (Vault or equivalent)
✓Has built CI/CD pipelines from scratch for a team of 5+ engineers using GitLab CI or equivalent
✓Has authored Helm charts and deployed applications through ArgoCD or equivalent GitOps tooling
Preferred
✓Experience with Thanos or Cortex for long-term Prometheus metric storage
✓Direct experience with Ceph CSI or Rook-Ceph on Kubernetes
✓Has managed GPU workloads on Kubernetes (NVIDIA device plugin)
✓NetworkPolicy design experience with Calico or Cilium