Job Description
We are looking for a highly skilled engineer to build, optimize, and maintain high-performance inference services for large language models (LLMs) and multimodal models. You will work closely with algorithm, systems, and product teams to deliver best-in-class performance, stability, and efficiency in production environments, ensuring low-latency, highly available AI services for tens of millions of users.
Key Responsibilities
High-Performance Computing & Kernel Optimization
- Perform deep GPU/CUDA kernel optimization, including memory access pattern tuning, instruction-level parallelism, and warp-level optimization to fully utilize hardware capabilities
- Develop and optimize custom high-performance operators using advanced DSLs or compiler frameworks such as Triton and TVM
- Identify and resolve performance bottlenecks in areas such as operator fusion and quantization