What You’ll Do
Reliability & Operations
- Own availability, latency, and scalability across SaaS and AI systems
- Define and enforce SLOs, SLIs, and error budgets
- Participate in a global on-call rotation (~1 week every 4 weeks)
- Lead incident response and drive blameless postmortems with systemic fixes
Platform & Infrastructure
- Architect and operate on-premises and multi-region, multi-cloud environments
- Manage large-scale Kubernetes workloads
- Build and evolve infrastructure using Terraform and Ansible
- Improve system resilience, fault isolation, and capacity planning
AI/ML & Automation
- Build and scale agentic AI systems for triage, anomaly detection, and self-healing
- Ensure reliability of model serving infrastructure
- Operate, optimize, and scale distributed systems
What You Bring ...