Senior DevOps / Platform Engineer
RainTech
Location: Onsite
Employment Type: Full-time
About This Role
- Manage RKE2 Kubernetes clusters across multiple node roles (control plane, compute workers, GPU workers, storage nodes), including cluster bootstrapping, upgrades, etcd backup and restore, and node lifecycle management
- Build and maintain GitOps pipelines using ArgoCD and GitLab CI/CD: Helm chart packaging, image promotion workflows (dev to staging to production), and environment-specific configuration management
- Operate and configure platform services: Kong (API gateway), Keycloak (identity), Vault (secrets), and GitLab (CI/CD and container registry)
- Implement observability: Prometheus for metrics, Grafana for dashboards, centralized logging (Loki or Elasticsearch), and alerting with on-call routing
- Manage Ceph integration at the Kubernetes level: CSI driver, StorageClass definitions, and PVC troubleshooting
- Manage GPU workload scheduling: NVIDIA device plugin, resource quotas for GPU pods, and utilization monitoring
- Automate operational tasks: certificate rotation, secret rotation, backup verification, and capacity planning
- Collaborate closely with backend engineering teams to design, deploy, troubleshoot, and optimize applications running on Kubernetes and the supporting infrastructure stack
What We're Looking For
- 5+ years in DevOps, SRE, or Platform engineering, including at least 2 years of hands-on experience operating self-managed Kubernetes clusters in production (RKE2, kubeadm, or equivalent), not only managed Kubernetes services such as EKS, GKE, or AKS
- Strong software engineering background, preferably in backend development, with a solid understanding of application architecture and the software development lifecycle (SDLC)
- Strong understanding of core infrastructure components, including databases, message brokers, container registries, storage systems, and deployment automation
- Able to troubleshoot application and infrastructure issues across the full platform stack
- Has operated stateful workloads on Kubernetes in production (databases, Kafka, Ceph, or equivalent)
- Production experience operating at least one of Kong, Keycloak, or Vault, or an equivalent service in the same category (API gateway, identity provider, secret manager)
- Has built CI/CD pipelines from scratch for a team of 5+ engineers using GitLab CI or equivalent
- Has authored Helm charts and deployed applications through ArgoCD or equivalent GitOps tooling
Preferred
- Experience with Thanos or Cortex for long-term Prometheus metric storage
- Direct experience with Ceph CSI or Rook-Ceph on Kubernetes
- Has managed GPU workloads on Kubernetes (NVIDIA device plugin)
- NetworkPolicy design experience with Calico or Cilium
- Terraform experience for infrastructure-as-code
- Experience working closely with backend engineering teams and supporting production application deployments
- Experience administering self-hosted Kubernetes environments rather than primarily managed cloud Kubernetes services