Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development.
Define and govern data selection and sampling strategy: establish the criteria that determine which production conversations have the highest training value, using diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication.
Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction.
Define annotation pipeline architecture: establish requirements for data labeling—intent annotation, entity tagging, dialog act classification, task completion scorin...