Member of Technical Staff (Data Acquisition)

Sanas

About the Role

Your mission is to build and operate the ingestion systems that turn the open web and large-scale audio sources into reliable, well-structured corpora for training Sanas's frontier speech models. You'll own the machinery that acquires, extracts, filters, versions, and delivers audio data to our training pipelines, and you'll work directly with our research scientists to close the loop between what we collect and how it moves model quality.

Job Description
  • Own and lead engineering projects across the full data acquisition stack: web crawling, audio ingestion, source discovery, and dataset delivery to training pipelines.
  • Build and operate large-scale distributed crawling infrastructure capable of continuously discovering and ingesting audio at scale across languages, accents, domains, and recording environments.
  • Develop specialized crawlers for high-priority audio sources with source-specific extraction and normalization logic.
  • Run experiments to evaluate crawling strategies, extraction methods, and ingestion tradeoffs; analyze results to identify gaps, redundancy, and coverage improvements across speaker demographics and language pairs.
  • Build ingestion pipelines that scale reliably across large data campaigns, with automated audio quality filtering (SNR estimation, clipping detection, codec artifact identification) as a first-class pipeline stage.

Systems & Infrastructure
  • Design and deploy highly scalable distributed systems capable of handling petabytes of audio data, from raw acquisition through quality filtering, deduplication, segmentation, and versioned dataset generation.
  • Architect and implement indexing and search capabilities over large audio corpora, enabling fast lookup by language, speaker, acoustic condition, duration, and quality tier.
  • Build and maintain backend services for data storage, including key-value databases, metadata synchronization, and manifest management across dataset versions.
  • Deploy and operate acquisition infrastructure in a Kubernetes / Infrastructure-as-Code environment; perform routine system health checks and respond to production issues quickly.
  • Collaborate closely with data processing, architecture, and ML platform teams to ensure smooth data flow from acquisition through to training‑ready outputs.

Compliance & Data Governance
  • Work closely with legal to handle compliance, data privacy, and licensing matters across all acquisition sources, maintaining a clear audit trail of provenance, permitted use, and commercial training rights for every dataset.
  • Enforce speaker consent documentation, GDPR requirements, robots.txt and ToS adherence, and audio retention policies across all ingestion pipelines.
  • Manage relationships with third‑party data vendors: writing precise acquisition briefs, evaluating quality on delivery, and ensuring sourced data meets Sanas's licensing and quality standards.

Qualifications
  • 4+ years of experience in data engineering, ML data infrastructure, or backend systems engineering, with direct experience building large‑scale data ingestion or crawling systems.
  • Strong Python and systems engineering skills: you build robust, maintainable infrastructure, not just one‑off scripts.
  • Hands‑on experience with distributed systems design: you've built systems that handle failure gracefully, scale horizontally, and recover cleanly.
  • Experience with web crawling infrastructure at scale, including handling rate limiting, deduplication, and content extraction.
  • Proficiency with cloud platforms (AWS or GCP), object storage (S3/GCS), and container orchestration (Kubernetes).
  • Comfort working with audio processing tooling (ffmpeg, librosa, torchaudio, sox) and experience handling large volumes of audio files.
  • Strong data quality instincts: you instrument pipelines, surface issues proactively, and treat data correctness with the same rigor as software correctness.

Bonus
  • Experience building speech or audio datasets for ASR, TTS, speech enhancement, or speaker verification model training.
  • Familiarity with major open speech corpora (Common Voice, LibriSpeech, VoxPopuli, AISHELL) and their sourcing and quality characteristics.
  • Experience with data versioning tools.
  • Background in multilingual or low‑resource language data collection.
  • Experience with annotation and labeling platforms.
  • Familiarity with speaker diarization, language identification, or automated audio quality estimation models used for data filtering at scale.

Vacancy posted 14 hours ago