Principal Site Reliability Engineer at SecurityScorecard
Job Description
About SecurityScorecard:
SecurityScorecard is the global leader in cybersecurity ratings, with over 12 million companies continuously rated, operating in 64 countries. Founded in 2013 by security and risk experts Dr. Alex Yampolskiy and Sam Kassoumeh and funded by world-class investors, SecurityScorecard’s patented rating technology is used by over 25,000 organizations for self-monitoring, third-party risk management, board reporting, and cyber insurance underwriting, making all organizations more resilient by allowing them to easily find and fix cybersecurity risks across their digital footprint.
Headquartered in New York City, our culture has been recognized by Inc Magazine as a "Best Workplace," by Crain’s NY as a "Best Places to Work in NYC," and as one of the 10 hottest SaaS startups in New York for two years in a row. Most recently, SecurityScorecard was named to Fast Company’s annual list of the World’s Most Innovative Companies for 2023 and to the Achievers 50 Most Engaged Workplaces in 2023 award recognizing “forward-thinking employers for their unwavering commitment to employee engagement.” SecurityScorecard is proud to be funded by world-class investors including Silver Lake Waterman, Moody’s, Sequoia Capital, GV and Riverwood Capital.
Role Overview – Principal Site Reliability Engineer, ML/AI Infrastructure
As a Principal Site Reliability Engineer (SRE) focused on ML and AI initiatives, you will play a critical role in designing and scaling infrastructure that powers advanced machine learning workloads. You will lead the development of highly reliable, observable, and automated Kubernetes-based platforms that support model training, inference, and continuous delivery of ML applications. Working closely with ML engineers, data scientists, and platform teams, you will help operationalize machine learning workflows and bring cutting-edge AI capabilities into production with confidence and speed.
Key Responsibilities
- Design and scale Kubernetes infrastructure purpose-built for ML/AI workloads, including GPU scheduling, autoscaling, and secure multi-tenant clusters.
- Enhance CI/CD pipelines for ML applications and model delivery (MLOps), including support for reproducible training, model versioning, and shadow testing.
- Implement progressive delivery strategies (e.g., canary, A/B testing) for machine learning models to ensure safe and incremental rollout of experiments.
- Partner with ML teams to operationalize ML workflows with tools like MLflow, Kubeflow, or Vertex AI, and integrate these into the broader platform architecture.
- Integrate and support Apache Kafka for streaming data ingestion and real-time feature delivery for ML pipelines.
- Deploy and maintain Airflow pipelines for orchestrating complex ML workflows and data preparation tasks.
- Build and optimize infrastructure and workflows for Langsmith/Langfuse to support observability and tracing of LLM-based applications and agents in production.
- Lead improvements in Infrastructure as Code using Terraform, Helm, and Argo CD, while establishing reusable and secure infrastructure patterns for AI applications.
- Support YugabyteDB as a high-performance, distributed database backend for ML and AI services requiring strong consistency and scale.
- Define and enforce automated testing strategies tailored to ML environments, such as data validation, model performance regression, and pipeline integration tests.
- Drive observability and alerting across ML pipelines and services, including monitoring data drift, model latency, and system-level metrics using tools like Prometheus, OpenTelemetry, New Relic, and Datadog.
- Actively support incident response for ML systems and infrastructure, focusing on root cause analysis and resilient remediation strategies.
- Mentor engineers and champion best practices across ML, platform, and infrastructure teams.
Qualifications
- 6+ years in SRE, DevOps, or infrastructure roles, including 2+ years supporting machine learning or data-intensive workloads in production
- Deep experience running Kubernetes in production, especially with ML workloads (GPU scheduling, autoscaling, pod optimization)
- Proven track record building CI/CD pipelines for ML systems using tools like GitHub Actions, GitLab CI, or Jenkins
- Strong command of cloud-native infrastructure (EKS, GKE, or AKS), including GPU provisioning and autoscaling for AI workloads
- Familiarity with MLOps and workflow orchestration tools, such as MLflow, Kubeflow, Airflow, and Argo Workflows
- Proficiency in Infrastructure as Code and GitOps with Terraform, Helm, and Argo CD
- Experience managing event streaming (Kafka), distributed databases (YugabyteDB), and LLM observability (Langsmith / Langfuse)
- Programming/scripting ability in Python, Go, or Bash for automating infrastructure or ML pipelines
- Solid knowledge of monitoring and observability tools (Prometheus, OpenTelemetry, New Relic, Datadog)
- Strong communication and mentoring skills; able to influence cross-functional teams
Benefits:
Specific to each country, we offer a competitive salary, stock options, health benefits, unlimited PTO, parental leave, tuition reimbursement, and much more!
The estimated total compensation range for this position is $220,000 - $290,000 (base plus bonus). Actual compensation for the position is based on a variety of factors, including, but not limited to, affordability, skills, qualifications and experience, and may vary from the range. In addition to base salary, employees may also be eligible for annual performance-based incentive compensation awards and equity, among other company benefits.
SecurityScorecard is committed to Equal Employment Opportunity and embraces diversity. We believe that our team is strengthened through hiring and retaining employees with diverse backgrounds, skill sets, ideas, and perspectives. We make hiring decisions based on merit and do not discriminate based on race, color, religion, national origin, sex or gender (including pregnancy), gender identity or expression (including transgender status), sexual orientation, age, marital, veteran, or disability status, or any other protected category in accordance with applicable law.
We also consider qualified applicants regardless of criminal histories, in accordance with applicable law. We are committed to providing reasonable accommodations for qualified individuals with disabilities in our job application procedures. If you need assistance or accommodation due to a disability, please contact talentacquisitionoperations@securityscorecard.io.
Any information you submit to SecurityScorecard as part of your application will be processed in accordance with the Company’s privacy policy and applicable law.
SecurityScorecard does not accept unsolicited resumes from employment agencies. Please note that we do not provide immigration sponsorship for this position.