Site Reliability Engineer at Cognition
📋 Description
- Define and own SLOs, SLIs, and error budgets for Devin and Windsurf.
- Build monitoring, alerting, and observability for service health.
- Lead fast incident response and run blameless postmortems.
- Create runbooks and tooling for sustainable on-call.
- Own CI/CD pipelines and deployment infrastructure.
- Reduce toil with automation and developer tooling.
🎯 Requirements
- Deep experience running production systems at scale, including SLOs, on-call, and incident command.
- Strong software fundamentals; SREs write real code, not just configure tools.
- Experience with cloud infrastructure (AWS, GCP, or Azure), Kubernetes, and Terraform.
- Experience building and owning CI/CD pipelines and deployment infrastructure.
- Strong observability instincts: you instrument systems and design useful alerts.
- Proven track record of reducing toil through automation.
🎁 Benefits
- Small, selective team shipping products used by thousands of developers.
- High ownership and trust; you set the reliability bar.
- An environment that rewards treating reliability as a proactive, systematic craft.
