You will help shape how AI coding systems are evaluated by creating meaningful programming challenges grounded in real-world development scenarios. Your work will directly influence how AI performance is measured, with a focus on reasoning, implementation accuracy, and the handling of complex requirements.
What You'll Do
- Develop and refine coding tasks based on realistic production codebases, ensuring they reflect authentic development challenges
- Write detailed functional tests that assess full behavior, including edge cases and integration points
- Design problems that are fair but demanding, requiring the AI to synthesize information across files and external sources
- Examine AI-generated solutions to identify patterns in success and failure
- Improve tasks based on structured feedback from expert reviewers using defined quality benchmarks
What We're Looking For
- Degree in Computer Science or a related field
- At least 5 years of hands-on software development with strong Python experience (pytest, async/await, subprocess, file operations)
- Full-stack background with practical work in both React front-ends and backend systems
- Proven experience writing tests, not just running them
- Familiarity with Docker for running isolated evaluations
- Working knowledge of CI/CD, particularly GitHub Actions (triggers, labels, interpreting results)
- Functional English (B2 level or higher)
Work Environment
This is a freelance, project-based position open to candidates worldwide. You'll have full control over your schedule as long as deadlines are met. Projects vary in scope and complexity, and compensation scales accordingly, equivalent to a base rate of $50 per hour.
Technology Stack
Python, pytest, async/await, subprocess, file operations, React, Docker, GitHub Actions, CI/CD


