✨ About The Role
- Responsible for scaling critical inference infrastructure to efficiently serve customer requests for state-of-the-art AI models like GPT-4 and DALL-E
- Improve reliability, security, scalability, and observability of distributed inference infrastructure
- Design and implement solutions to address bottlenecks and sources of instability in the infrastructure
- Ensure efficient use of a variety of AI models deployed on a large fleet of Azure VMs under Kubernetes
- Work towards OpenAI's goal of safely deploying broadly beneficial Artificial General Intelligence (AGI) through responsible use of powerful AI tools
⚡ Requirements
- Experienced engineer with a background in high-throughput scheduling at supercomputing scale
- Proficient in low-level systems languages like C++ or Go, with a strong understanding of modern AI architectures
- Skilled in testing, debugging, and maintaining systems, with at least 3 years of professional software engineering experience
- Deep knowledge of network technologies from L1 to L7, able to own problems end-to-end and adapt quickly in a fast-paced environment
- Ability to build tools and solutions to address complex issues, ensuring the reliability, security, and scalability of distributed infrastructure