LLM Dataset Engineer
Sciforium
Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding and direct sponsorship from AMD, with hands-on support from AMD engineers, the team is scaling rapidly to build the full stack powering frontier AI models and real-time applications.
Role Overview
Sciforium is seeking a highly technical and visionary LLM Dataset Engineer to lead the strategy, creation, and curation of the massive datasets that power our foundation models. We believe that in the era of LLMs, data is the primary competitive advantage. In this role, you will own the end-to-end data lifecycle—from raw web-scale crawling to the fine-grained human-alignment datasets that define model behavior.
This position is ideal for a scientist who views data as a high-scale engineering challenge and an analytical puzzle. You will not just "provide" data; you will design the taxonomies, filtering heuristics, and post-training pipelines that ensure our models are world-class in reasoning, safety, and multimodal understanding.
Key Responsibilities
Foundation Dataset Strategy: Own the end-to-end creation of pre-training datasets for LLMs. This includes defining the mix of web data, code, books, and technical papers to optimize for downstream model performance.
Petabyte-Scale Curation: Design and implement sophisticated pipelines for data cleaning, exact/fuzzy deduplication, and high-quality signal extraction from petabytes of raw, unstructured data.
Post-Training & Alignment Data: Lead the development of high-quality post-training datasets, including Supervised Fine-Tuning (SFT) instructions, multi-turn dialogues, and preference modeling data (RLHF/DPO).
Multimodal Expansion: Drive the acquisition and processing of vision and video data, navigating the complexities of multimodal alignment, video compression, and temporal data consistency.
High-Performance Engineering: Develop high-throughput data processing scripts using Python, leveraging multiprocessing and multithreading to handle massive-scale ingestion and transformation without bottlenecks.
Data Profiling & Analysis: Conduct deep-dive statistical analysis on training corpora to identify biases, gaps in knowledge, and quality regressions, ensuring the "diet" of the model is mathematically balanced.
Synthetic Data Generation (Added Value): Design pipelines to generate reasoning-intensive synthetic data that fills gaps in natural datasets, using existing models for data labeling and refinement.
Must-Haves
5+ years of industry experience in Data Science or Machine Learning, with a proven track record of building and managing datasets for foundation models.
Deep Proficiency in Python: Expert-level skills with a focus on high-performance code, including multiprocessing, multithreading, and efficient memory management for large-scale data tasks.
Petabyte-Scale Experience: Demonstrated experience working with petabyte-scale datasets that have been directly used to train production-grade LLMs or Large Vision Models.
Dataset Reconstruction: Experience building massive LLM training sets from scratch, including raw web crawls (e.g., Common Crawl) and specialized domain data.
Post-Training Expertise: Hands-on experience building datasets for RLHF, DPO, and multi-turn instruction following, including the management of human-labeling workflows and quality gold-sets.
Data Tooling: Mastery of data-at-scale frameworks such as Spark or Ray, or of high-performance data-loading formats (e.g., WebDataset, Parquet).
Nice-to-Haves
Computer Vision (CV) Curation: Experience building large-scale image or video datasets from scratch (e.g., LAION-style pipelines).
Multimodal Crawling: Familiarity with large-scale crawling of multimodal data and the associated challenges of video processing, codecs, and compression.
Taxonomy Design: Experience in designing complex labeling schemas for reasoning, coding, and mathematical benchmarks.
Research Background: A Master's or PhD in a quantitative field with a focus on data-centric AI or information retrieval.
Benefits include
Medical, dental, and vision insurance
401(k) plan
Daily lunch, snacks, and beverages
Flexible time off
Competitive salary and equity
Equal opportunity
Sciforium is an equal opportunity employer. All applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability status.