Member of Technical Staff (Data Acquisition)
Sanas
About the Role

Your mission is to build and operate the ingestion systems that turn the open web and large-scale audio sources into reliable, well-structured corpora for training Sanas's frontier speech models. You'll own the machinery that acquires, extracts, filters, versions, and delivers audio data to our training pipelines — and you'll work directly with our research scientists to close the loop between what we collect and how it moves model quality.

Job Description

- Own and lead engineering projects across the full data acquisition stack — web crawling, audio ingestion, source discovery, and dataset delivery to training pipelines.
- Build and operate large-scale distributed crawling infrastructure capable of continuously discovering and ingesting audio at scale across languages, accents, domains, and recording environments.
- Develop specialized crawlers for high-priority audio sources with source-specific extraction and normalization logic.
- Run experiments to evaluate crawling strategies, extraction methods, and ingestion tradeoffs; analyze results to identify gaps, redundancy, and coverage improvements across speaker demographics and language pairs.
- Build ingestion pipelines that scale reliably across large data campaigns, with automated audio quality filtering — SNR estimation, clipping detection, codec artifact identification — as a first-class pipeline stage.

Systems & Infrastructure

- Design and deploy highly scalable distributed systems capable of handling petabytes of audio data — from raw acquisition through quality filtering, deduplication, segmentation, and versioned dataset generation.
- Architect and implement indexing and search capabilities over large audio corpora — enabling fast lookup by language, speaker, acoustic condition, duration, and quality tier.
- Build and maintain backend services for data storage, including key-value databases, metadata synchronization, and manifest management across dataset versions.
- Deploy and operate acquisition infrastructure in a Kubernetes / Infrastructure-as-Code environment; perform routine system health checks and respond to production issues quickly.
- Collaborate closely with data processing, architecture, and ML platform teams to ensure smooth data flow from acquisition through to training-ready outputs.

Compliance & Data Governance

- Work closely with legal to handle compliance, data privacy, and licensing matters across all acquisition sources — maintaining a clear audit trail of provenance, permitted use, and commercial training rights for every dataset.
- Enforce speaker consent documentation, GDPR requirements, robots.txt and ToS adherence, and audio retention policies across all ingestion pipelines.
- Manage relationships with third-party data vendors — writing precise acquisition briefs, evaluating quality on delivery, and ensuring sourced data meets Sanas's licensing and quality standards.

Qualifications

- 4+ years of experience in data engineering, ML data infrastructure, or backend systems engineering — with direct experience building large-scale data ingestion or crawling systems.
- Strong Python and systems engineering skills — you build robust, maintainable infrastructure, not just one-off scripts.
- Hands-on experience with distributed systems design: you've built systems that handle failure gracefully, scale horizontally, and recover cleanly.
- Experience with web crawling infrastructure at scale, including handling rate limiting, deduplication, and content extraction.
- Proficiency with cloud platforms (AWS or GCP), object storage (S3/GCS), and container orchestration (Kubernetes).
- Comfort working with audio processing tooling — ffmpeg, librosa, torchaudio, sox — and experience handling large volumes of audio files.
- Strong data quality instincts: you instrument pipelines, surface issues proactively, and treat data correctness with the same rigor as software correctness.

Bonus

- Experience building speech or audio datasets for ASR, TTS, speech enhancement, or speaker verification model training.
- Familiarity with major open speech corpora — Common Voice, LibriSpeech, VoxPopuli, AISHELL — and their sourcing and quality characteristics.
- Experience with data versioning tools.
- Background in multilingual or low-resource language data collection.
- Experience with annotation and labeling platforms.
- Familiarity with speaker diarization, language identification, or automated audio quality estimation models used for data filtering at scale.


