Software Engineer I - AI/ML, AWS Neuron Distributed Training
Job Description
Annapurna Labs (U.S.) Inc is seeking a Software Engineer I to contribute to AI and ML distributed training initiatives, onsite in Cupertino. This role centers on optimizing large-scale models on AWS Trainium, including mixed-precision training and framework extensions, in collaboration with hardware and AWS teams. The position offers a salary range of USD 127,100 - 185,000 per year.
Responsibilities
- Help design and implement distributed training solutions for large-scale ML models running on Trainium instances.
- Extend and optimize distributed training frameworks such as FSDP, torchtitan, and Hugging Face libraries within the Neuron ecosystem.
- Develop and optimize mixed-precision and low-precision training techniques, including BF16, FP8, and emerging numerical formats, to boost training throughput while preserving accuracy.
- Implement precision-aware training strategies, such as loss scaling and careful gradient management, to maintain stability across reduced-precision formats.
- Profile, analyze, and tune end-to-end training pipelines to maximize performance on Trainium hardware.
- Collaborate with hardware, compiler, and runtime teams to understand system constraints and unlock new capabilities.
- Work with AWS solution architects and customers to support the deployment and optimization of training workloads at scale.
Requirements
- Bachelor's degree or higher in computer science, computer engineering, or a related field, or equivalent
- 1+ years of programming experience in at least one software language (including academic projects, internships, or research)
- Experience with software development practices including code reviews, source control, testing, and build processes
- Experience with machine learning concepts and at least one ML framework (PyTorch, JAX, or TensorFlow)
Technologies
- FSDP
- torchtitan
- Hugging Face libraries
- PyTorch
- JAX
- TensorFlow
- Trainium
- AWS Neuron
- BF16
- FP8
Benefits
- Health insurance
- 401(k) matching
- Paid time off
- Parental leave
- Sign-on payments
- Restricted stock units (RSUs)
- Flexible Spending Accounts
- Employee Assistance Program (EAP)
- Mental Health Support
- Adoption and Surrogacy Reimbursement