LLM Dataset Engineer

Sciforium

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding and direct sponsorship from AMD, with hands-on support from AMD engineers, the team is scaling rapidly to build the full stack powering frontier AI models and real-time applications.

Role Overview

Sciforium is seeking a highly technical and visionary LLM Dataset Engineer to lead the strategy, creation, and curation of the massive datasets that power our foundation models. We believe that in the era of LLMs, data is the primary competitive advantage. In this role, you will own the end-to-end data lifecycle—from raw web-scale crawling to the fine-grained human-alignment datasets that define model behavior.

This position is ideal for a scientist who views data as a high-scale engineering challenge and an analytical puzzle. You will not just "provide" data; you will design the taxonomies, filtering heuristics, and post-training pipelines that ensure our models are world-class in reasoning, safety, and multimodal understanding.

Key Responsibilities
  • Foundation Dataset Strategy: Own the end-to-end creation of pre-training datasets for LLMs. This includes defining the mix of web data, code, books, and technical papers to optimize for downstream model performance.

  • Petabyte-Scale Curation: Design and implement sophisticated pipelines for data cleaning, exact/fuzzy deduplication, and high-quality signal extraction from petabytes of raw, unstructured data.

  • Post-Training & Alignment Data: Lead the development of high-quality post-training datasets, including Supervised Fine-Tuning (SFT) instructions, multi-turn dialogues, and preference modeling data (RLHF/DPO).

  • Multimodal Expansion: Drive the acquisition and processing of vision and video data, navigating the complexities of multimodal alignment, video compression, and temporal data consistency.

  • High-Performance Engineering: Develop high-throughput data processing scripts using Python, leveraging multiprocessing and multithreading to handle massive-scale ingestion and transformation without bottlenecks.

  • Data Profiling & Analysis: Conduct deep-dive statistical analysis on training corpora to identify biases, gaps in knowledge, and quality regressions, ensuring the "diet" of the model is mathematically balanced.

  • Synthetic Data Generation (added value): Design pipelines that generate reasoning-heavy synthetic data to fill gaps in natural datasets, using existing models for data labeling and refinement.

Must-Haves
  • 5+ years of industry experience in Data Science or Machine Learning, with a proven track record of building and managing datasets for foundation models.

  • Deep Proficiency in Python: Expert-level skills with a focus on high-performance code, including multiprocessing, multithreading, and efficient memory management for large-scale data tasks.

  • Petabyte-Scale Experience: Demonstrated experience working with petabyte-scale datasets that have been directly used to train production-grade LLMs or Large Vision Models.

  • Dataset Reconstruction: Experience building massive LLM training sets from scratch, including raw web crawls (e.g., Common Crawl) and specialized domain data.

  • Post-Training Expertise: Hands-on experience building datasets for RLHF, DPO, and multi-turn instruction following, including the management of human-labeling workflows and quality gold-sets.

  • Data Tooling: Mastery of data-at-scale frameworks such as Spark or Ray, or of high-performance data-loading formats (e.g., WebDataset, Parquet).

Nice-to-Haves
  • Computer Vision (CV) Curation: Experience building large-scale image or video datasets from scratch (e.g., LAION-style pipelines).

  • Multimodal Crawling: Familiarity with large-scale crawling of multimodal data and the associated challenges of video processing, codecs, and compression.

  • Taxonomy Design: Experience in designing complex labeling schemas for reasoning, coding, and mathematical benchmarks.

  • Research Background: A Master's or PhD in a quantitative field with a focus on data-centric AI or information retrieval.

Benefits include
  • Medical, dental, and vision insurance

  • 401k plan

  • Daily lunch, snacks, and beverages

  • Flexible time off

  • Competitive salary and equity

Equal opportunity

Sciforium is an equal opportunity employer. All applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability status.

Vacancy posted 4 days ago