Partner Company

Senior DevOps / Platform Engineer

The company is focused on building data and AI infrastructure solutions for enterprise and government clients.

Location

Onsite

Employment Type

Full-time

About This Role

  • Manage RKE2 Kubernetes clusters across multiple node roles (control plane, compute workers, GPU workers, storage nodes), including cluster bootstrapping, upgrades, etcd backup and restore, and node lifecycle management
  • Build and maintain GitOps pipelines using ArgoCD and GitLab CI/CD: Helm chart packaging, image promotion workflows (dev to staging to production), and environment-specific configuration management
  • Operate and configure platform services: Kong (API gateway), Keycloak (identity), Vault (secrets), and GitLab (CI/CD and container registry)
  • Implement observability: Prometheus for metrics, Grafana for dashboards, centralized logging (Loki or Elasticsearch), and alerting with on-call routing
  • Manage Ceph integration at the Kubernetes level: CSI driver, StorageClass definitions, and PVC troubleshooting
  • Manage GPU workload scheduling: NVIDIA device plugin, resource quotas for GPU pods, and utilization monitoring
  • Automate operational tasks: certificate rotation, secret rotation, backup verification, and capacity planning

What We're Looking For

  • 5+ years in DevOps, SRE, or platform engineering, including at least 2 years managing self-hosted production Kubernetes clusters (not managed Kubernetes services)
  • Has operated stateful workloads on Kubernetes in production (databases, Kafka, Ceph, or equivalent)
  • Production experience operating at least one of Kong, Keycloak, or Vault, or an equivalent service in the same category (API gateway, identity provider, or secrets manager)
  • Has built CI/CD pipelines from scratch for a team of 5+ engineers using GitLab CI or equivalent
  • Has authored Helm charts and deployed applications through ArgoCD or equivalent GitOps tooling

Preferred

  • Experience with Thanos or Cortex for long-term Prometheus metric storage
  • Direct experience with Ceph CSI or Rook-Ceph on Kubernetes
  • Has managed GPU workloads on Kubernetes (NVIDIA device plugin)
  • NetworkPolicy design experience with Calico or Cilium
  • Terraform experience for infrastructure-as-code