As a Principal Software Engineer focused on AI/ML, you will lead the design and implementation of advanced inference systems that power next-generation applications. Your work will center on building robust, scalable solutions for multi-modal and distributed AI, translating cutting-edge research into reliable production deployments.
Key Responsibilities
- Design and deploy high-performance AI/ML applications using modern inference platforms, with emphasis on agent architectures and multi-modal integration
- Optimize model serving performance through techniques such as KV cache tuning, batching strategies, and distributed execution
- Contribute improvements to open-source inference frameworks including vLLM, Text Generation Inference (TGI), PyTorch, and OpenVINO
- Lead technical direction across teams, aligning engineering efforts around scalable AI deployment patterns
- Evaluate and adopt advancements in inference technology to keep systems performant and competitive
- Measure and refine inference efficiency at scale, using empirical data to guide optimization
- Share technical insights through publications, conference talks, and community engagement
What We’re Looking For
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- Minimum of five years in AI/ML engineering, with a focus on production inference pipelines
- Strong command of PyTorch and deep learning model deployment workflows
- Proven experience in inference optimization—model serving, quantization, and memory management
- Proficiency in Python and C++ for performance-sensitive applications
- Track record of active open-source contributions and community leadership
- Ability to work independently and collaboratively across distributed teams
- Excellent communication skills in English, both written and verbal
- Openness to sharing ideas and fostering technical collaboration
Preferred Experience
- Hands-on work with vLLM, TGI, or similar inference engines
- Contributions to core AI frameworks such as PyTorch or OpenVINO
- Experience deploying models in distributed GPU environments
- Familiarity with Kubernetes and cloud-native AI architectures
- Knowledge of quantization methods including GPTQ, AWQ, and FP8
- Background in CUDA, Triton, or GPU programming models
- Work with diffusion models and transformer-based generative systems
- Development of AI agents and autonomous decision-making pipelines
Technology Environment
Our stack includes PyTorch, vLLM, Text Generation Inference, OpenVINO, Python, C++, KV cache optimization, model quantization, distributed serving, CUDA, Triton, Kubernetes, GPU acceleration, diffusion architectures, and agent-based systems.
Work Environment
This role supports flexible arrangements—fully remote, office-flex, or in-office—based on individual and team needs. We value autonomy, transparency, and inclusive collaboration, encouraging contributions from all team members regardless of background or seniority. Innovation thrives here through open dialogue and shared ownership.
Equal Opportunity
We are committed to a workplace that values diversity and inclusion. All qualified applicants will be considered without regard to race, color, religion, gender, sexual orientation, national origin, age, disability, or any other protected characteristic. We provide reasonable accommodations for candidates who need support during the application process. If assistance is required, please contact application-assistance@redhat.com.