DISQO’s mission is to build the world’s most trusted ad measurement platform that fuels brand growth. The world’s largest brands, agencies, and media companies trust DISQO for expert insight and AI-driven intelligence about their advertising performance across all platforms. We capture people’s sentiments and journeys, connecting them with the brands they value and the media they consume. With this identity-based approach, brands gain more accurate and authentic insight so they can create more meaningful interactions.
When you join DISQO Nation, you join a community that values trust, transparency and innovation. We invest in our employees and apply a bottom-up management approach, rooted in the concept of servant leadership. We approach each day eager to learn, grow, and make a lasting impact. Best of all, we have fun while doing it!
About the Role:
We are seeking an experienced Lead Site Reliability Engineer to join our engineering team and drive the reliability, scalability, and performance of our production systems through innovative use of AI and automation. In this role, you will lead SRE initiatives, mentor team members, and leverage AI technologies to enhance operational excellence, predictive maintenance, and intelligent automation across our infrastructure.
Key Responsibilities:
Technical Leadership:
·Design and implement comprehensive monitoring, alerting, and observability solutions, leveraging AI for intelligent anomaly detection and root cause analysis
·Lead incident response efforts using AI-assisted diagnostics and automated remediation, conduct post-mortems, and drive systemic improvements
·Develop and maintain service level objectives (SLOs) and error budgets with AI-powered predictive analytics to forecast reliability risks
·Architect and implement intelligent automation solutions for deployment, scaling, and infrastructure management using machine learning models
·Drive capacity planning and performance optimization using AI forecasting models and predictive analytics
AI-Enhanced SRE Leadership:
·Implement and maintain AI-powered incident prediction and prevention systems
·Design intelligent alerting systems that reduce noise and provide contextual insights using natural language processing and machine learning
·Develop AI-driven capacity planning models that predict resource needs and optimize cost efficiency
·Build and maintain chatbots and AI assistants for operational tasks, documentation search, and incident triage
·Implement automated root cause analysis using AI correlation engines and log analysis
Team Leadership & Collaboration:
·Mentor junior SREs on integrating AI tools and practices into traditional SRE workflows
·Partner with engineering teams to embed AI-enhanced reliability principles into the software development lifecycle
·Lead cross-functional initiatives to implement AI-driven operational improvements
·Collaborate with data science teams to develop custom AI models for operational use cases
·Participate in on-call rotations while developing AI systems to minimize toil and improve response efficiency
Strategic Initiatives:
·Develop and execute an SRE roadmap aligned with business objectives and technological advancement
·Evaluate and implement new AI tools and technologies to improve system reliability, security, and operational efficiency
·Drive adoption of AI-powered engineering and predictive failure testing
·Establish metrics and reporting using AI analytics to demonstrate the business value of intelligent reliability investments
Required Qualifications:
·6+ years of experience in Site Reliability Engineering, DevOps, or similar infrastructure-focused roles
·2+ years of experience leading technical teams or initiatives
·Strong experience with AI/ML tools and frameworks applied to operational use cases (anomaly detection, predictive analytics, NLP)
·Hands-on experience implementing AI-powered monitoring, alerting, and automation solutions
·Strong programming skills in Python with experience in AI/ML libraries
·Extensive experience with cloud platforms (AWS, GCP) and their AI/ML services
·Knowledge of prompt engineering, LLM integration, and building AI-powered operational tools
·Proficiency with infrastructure as code and configuration management with AI-enhanced workflows
·Experience with time series analysis, statistical modeling, and predictive analytics for infrastructure metrics
·Deep understanding of monitoring and observability tools enhanced with AI capabilities
·Experience with CI/CD pipelines incorporating AI-driven quality gates and automated decision making
·Strong knowledge of networking, distributed systems, and database technologies
·Expert-level knowledge in the following domains: AWS (core services, networking, compute, databases, storage, etc.), Terraform, Kubernetes / Karpenter / Helm
·Strong experience building in-house observability platforms, including: OpenTelemetry, Loki, Grafana, Prometheus, AWS CloudWatch, AWS X-Ray or Jaeger
·Experience with Argo CD / Argo Workflows is a big plus
·Bachelor’s degree in Computer Science, Engineering, Data Science, or equivalent practical experience
Preferred Qualifications:
·Advanced experience with large language models (LLMs) for operational documentation, code generation, and incident response
·Experience with automated incident response systems using AI decision engines
·Experience with microservices architecture and intelligent service mesh management
·Familiarity with AI-powered security tools and anomaly detection for infrastructure protection
·Experience building and maintaining AI-driven dashboards and reporting systems
·Experience with AI-powered cost optimization and resource right-sizing tools
·Certification in relevant cloud platforms
At DISQO, we pride ourselves on having a positive, performance-oriented workplace that includes a flexible hybrid approach, competitive medical benefits, and an amazing vacation policy. Read more about our culture on Glassdoor.
Perks & Benefits:
·100% covered Medical/Dental/Vision for employee, competitive dependent coverage
·Equity
·401K
·Generous PTO policy
·Flexible workplace policy
·Team offsites, social events & happy hours
·Life Insurance
·Health FSA
·Commuter FSA (for hybrid employees)
·Catered lunch and fully stocked kitchen
·Paid Maternity/Paternity leave
·Disability Insurance
·Travel Assistance Program
·24/7 Counseling Services offered to Employees
Note: The benefits noted above are for full-time, US-based employees only.
DISQO is an equal opportunity employer. Discovery, innovation, and growth are possible when we open ourselves to new possibilities, perspectives, and approaches. That’s why, at DISQO, we welcome, support, and empower individuals from diverse backgrounds. Exceptional teams are rooted in extraordinary people, each with a unique story and a compelling set of skills. DISQO does not discriminate against employees based on race, color, religion, sex, national origin, gender identity or expression, age, disability, pregnancy (including childbirth, breastfeeding, or related medical condition), genetic information, protected military or veteran status, sexual orientation, or any other characteristic protected by applicable federal, state or local laws.
*Recruiting firms that submit resumes to DISQO without first entering into a written contract will not be entitled to any compensation on candidates referred by that firm.