DeveloperJobs.io
Annapurna Labs (U.S.) Inc.

Software Engineer I - AI/ML, AWS Neuron Distributed Training

Cupertino, CA | $127k - $185k/yr | Full time | Posted 8d ago

Job Description

Annapurna Labs (U.S.) Inc. is seeking a Software Engineer I to contribute to AI and ML distributed training initiatives, working onsite in Cupertino. This role centers on optimizing large-scale models on AWS Trainium, including mixed-precision training and framework extensions, in collaboration with hardware and AWS teams. The position offers a salary range of USD 127,100 - 185,000 per year.

Responsibilities

  • Help design and implement distributed training solutions for large-scale ML models running on Trainium instances.
  • Extend and optimize distributed training frameworks such as FSDP, torchtitan, and Hugging Face libraries within the Neuron ecosystem.
  • Develop and optimize mixed-precision and low-precision training techniques, including BF16, FP8, and emerging numerical formats to boost training throughput while preserving accuracy.
  • Implement precision-aware training strategies, loss scaling, and careful gradient management to maintain stability across reduced precision formats.
  • Profile, analyze, and tune end-to-end training pipelines to maximize performance on Trainium hardware.
  • Collaborate with hardware, compiler, and runtime teams to understand system constraints and unlock new capabilities.
  • Work with AWS solution architects and customers to support the deployment and optimization of training workloads at scale.

Requirements

  • Bachelor's degree or higher in computer science, computer engineering, or a related field, or equivalent experience
  • 1+ years of experience programming in at least one language (including academic projects, internships, or research)
  • Experience with software development practices including code reviews, source control, testing, and build processes
  • Experience with machine learning concepts and at least one ML framework (PyTorch, JAX, or TensorFlow)

Technologies

  • FSDP
  • torchtitan
  • Hugging Face libraries
  • PyTorch
  • JAX
  • TensorFlow
  • Trainium
  • AWS Neuron
  • BF16
  • FP8

Benefits

  • Health insurance
  • 401(k) matching
  • Paid time off
  • Parental leave
  • Sign-on payments
  • Restricted stock units (RSUs)
  • Flexible Spending Accounts
  • Employee Assistance Program (EAP)
  • Mental Health Support
  • Adoption and Surrogacy Reimbursement
