Position Overview
Description
Join our team building the scale-out networking backbone that powers the world's largest AI training clusters. We're developing high-performance RDMA and RoCE solutions that enable distributed training of trillion-parameter models across thousands of compute nodes on AWS infrastructure.
Our team builds the networking software that connects massive AI accelerator clusters, focusing on SmartNIC integration, collective communication optimization, and ultra-high-bandwidth inter-rack connectivity. You'll work at the intersection of cloud infrastructure and state-of-the-art AI hardware, solving some of the most challenging networking problems in distributed computing.
Key job responsibilities
* Design and develop high-performance networking software using RDMA and RoCE for large-scale AI clusters
* Integrate SmartNIC acceleration hardware with EC2 control plane systems and APIs
* Implement and opt...