Site Reliability Engineer at Cognition

Job Description

📋 Description

Define and own SLOs, SLIs, and error budgets for Devin and Windsurf.
Build monitoring, alerting, and observability for service health.
Lead fast incident response and run blameless postmortems.
Create runbooks and tooling for sustainable on-call.
Own CI/CD pipelines and deployment infrastructure.
Reduce toil with automation and developer tooling.

🎯 Requirements

Deep experience running production systems at scale: SLOs, on-call, incident command.
Strong software fundamentals; SREs here write real code rather than just configuring tools.
Experience with cloud infrastructure (AWS, GCP, or Azure), Kubernetes, and Terraform.
Experience building and owning CI/CD pipelines and deployment infrastructure.
Strong observability instincts: you know how to instrument systems and design useful alerts.
Proven track record of reducing toil through automation.

🎁 Benefits

Small, selective team shipping products used by thousands of developers.
High ownership and trust; you set the reliability bar.
An environment that rewards proactive, systematic reliability work as a craft.
