What You'll Do
- Design representative evaluation datasets by refining sampling methods and query diversity to reflect actual user behavior.
- Develop and manage large-scale evaluation frameworks that assess AI performance across thousands of real queries, ensuring reliable measurement of response quality.
- Implement LLM-driven evaluation models to score key attributes such as factual accuracy, completeness, and clarity, calibrating them with human judgment data.
- Conduct pre-deployment assessments of new models and product updates, establishing quality thresholds that guide release decisions.
- Construct observability systems that enhance traceability and data visibility for AI agents, enabling deeper analysis of behavior patterns.
- Use evaluation outcomes and user feedback to drive iterative improvements, including automated prompt refinement and model tuning.
- Partner with engineering teams across the organization to integrate evaluation practices into development workflows, ensuring quality remains central to product evolution.
What We're Looking For
- Minimum of two years of software engineering experience with strong programming skills
- Proficiency in Go and Python, with experience in distributed data processing systems
- Background in LLM evaluation, reinforcement learning from human feedback, or natural language processing
- Ability to reason critically about how offline metrics correlate with real user outcomes
- Commitment to product quality and experience working in collaborative, cross-functional environments
- Strong ownership mindset with a focus on delivering measurable impact
Technology Environment
Go, Python, LLM evaluation frameworks, reinforcement learning from human feedback, natural language processing, distributed data pipelines
Benefits
- Competitive salary and potential equity or variable compensation
- Medical, vision, and dental insurance
- Generous paid time off
- 401(k) contribution options
- Stipends for home office setup, education, and wellness
- Daily healthy lunches and regular team events
Work Environment
This role is hybrid, requiring 3–4 days per week in one of our offices in the San Francisco Bay Area. We foster a collaborative, inclusive culture that values diverse perspectives and customer-centric innovation.
Our Commitment
We are dedicated to building a diverse and equitable workplace. We do not discriminate on the basis of race, gender, age, religion, sexual orientation, disability, or any other protected status. All are welcome to contribute to our mission of creating intelligent, reliable AI systems.


