Scientific Lead - Scientific Data Engineer

$166.5k - $266.2k

BioSpace, Inc.

At Lilly, we unite caring with discovery to make life better for people around the world. We are a global healthcare leader headquartered in Indianapolis, Indiana. Our employees around the world work to discover and bring life-changing medicines to those who need them, improve the understanding and management of disease, and give back to our communities through philanthropy and volunteerism. We give our best effort to our work, and we put people first. We’re looking for people who are determined to make life better for people around the world.

The Opportunity

We are building something unprecedented: an AI foundation that will push the frontier on what is possible today across drug discovery research, from target identification and disease biology through translational science.

AI4D Team

The Applied Intelligence for Discovery (AI4D) team is a newly formed group within Lilly Research Laboratories that operates at the intersection of scientific delivery and core platform development. AI4D’s mission is to connect scientists to petabyte‑scale data through natural language interfaces, automated analysis workflows, and intelligent search, and to convert early deployments into repeatable system standards and evaluation practices that scale across therapeutic areas.

As a Scientific Data Engineer, you will close that gap. You will build the semantic layer, data harmonization infrastructure, AI‑ready data products, and lakehouse architecture that bridge how data is stored and how AI systems need to consume it. You will work at the intersection of the data infrastructure team and the generative AI engineers who build the systems scientists interact with.

Responsibilities

Data Harmonization and Lakehouse Architecture

Design and build the data architecture that transforms raw and processed omics data into harmonized, AI‑consumable layers
Build and optimize ETL/ELT pipelines that produce denormalized views, pre‑computed aggregations, embedding‑ready text representations, and feature stores optimized for AI system consumption
Implement data quality monitoring, automated profiling, and validation checks across harmonization layers
Create versioned, reproducible data snapshots that support model training, evaluation, and audit requirements in a regulated environment
Partner with the teams to extend harmonization patterns as data modalities expand beyond genomics and proteomics into spatial transcriptomics, perturbational data (Perturb‑Seq), single‑cell, and digital pathology

Semantic Layer and Schema Engineering

Design and maintain a semantic layer over Lilly’s multi‑omics databases that enables AI systems
Create comprehensive schema documentation: table descriptions, column‑level annotations, relationship mappings, business logic rules, and domain‑specific constraints (e.g., statistical thresholds, unit conventions, experimental design metadata)
Develop gold‑standard question/SQL pairs for each major database, in collaboration with computational biologists and Generative AI Engineers, to serve as training data, few‑shot examples, and evaluation benchmarks
Build and maintain a data dictionary and ontology mapping layer that translates how scientists think and speak about data (gene names, pathway terms, assay types) into how the data is physically stored

AI‑Ready Data Products

Build and manage vector embedding pipelines for scientific documents, study metadata, and structured data descriptions to power RAG‑based retrieval
Build integration pipelines that connect heterogeneous data sources (omics databases, internal publications, electronic lab notebooks, assay results, and clinical annotations) into a unified, queryable layer
Develop and enforce metadata standards that ensure new data sources are AI‑accessible from the point of ingestion, not retroactively
Design data products that serve multiple consumption patterns: direct SQL access for computational biologists, structured feeds for ML training pipelines, and semantic interfaces for LLM‑powered tools

Qualifications

Bachelor’s degree in Computer Science, Data Engineering, Bioinformatics, or a related field and 8 years of data engineering experience, OR Master’s degree and 5 years of data engineering experience

Additional Skills/Preferences

PhD in data or related field
Demonstrated expertise in building data pipelines, ETL/ELT workflows, and data products that serve downstream AI/ML systems
Strong SQL skills and experience with complex relational database schemas (hundreds of tables, multi‑level joins, domain‑specific conventions)
Experience with modern data platform technologies, including at least one of: Databricks, Snowflake, or equivalent lakehouse platforms
Experience with modern data engineering tools: dbt, Spark, Airflow, or similar orchestration and transformation frameworks
Proficiency in Python for data processing, scripting, and pipeline development
Experience with cloud data platforms (AWS preferred: Redshift, Athena, Glue, S3, or similar)
Familiarity with at least one of: vector databases, embedding pipelines, or semantic layer tooling
Strong communication skills: you can work effectively with both engineers who think in schemas and scientists who think in biology
Experience with biomedical or scientific data: omics datasets (RNA‑seq, proteomics, GWAS), clinical data, or laboratory information management systems
Experience in pharmaceutical, biotech, or life sciences environments
Familiarity with biomedical ontologies and controlled vocabularies (Gene Ontology, MeSH, ChEBI, HGNC) and their application to data integration
Experience building data products that serve AI/ML systems: feature stores, training datasets, evaluation benchmarks, or semantic annotations for text‑to‑SQL
Knowledge of data governance practices in regulated industries: data lineage, access controls, versioning, and auditability
Experience with knowledge graph technologies (Neo4j, Amazon Neptune, RDF/SPARQL) or graph‑based data modeling
Deep experience with the Databricks ecosystem: Unity Catalog for data governance, Delta Lake for ACID transactions, MLflow integration, and Databricks SQL for analytics workloads
Experience designing data architectures that bridge traditional bioinformatics workflows (Nextflow, R/Bioconductor) with modern lakehouse consumption patterns

Lilly is dedicated to helping individuals with disabilities to actively engage in the workforce, ensuring equal opportunities when vying for positions. If you require accommodation to submit a resume for a position at Lilly, please complete the accommodation request form for further assistance. Please note this is for individuals to request an accommodation as part of the application process and any other correspondence will not receive a response.

Lilly is proud to be an EEO Employer and does not discriminate on the basis of age, race, color, religion, gender identity, sex, gender expression, sexual orientation, genetic information, ancestry, national origin, protected veteran status, disability, or any other legally protected status.

Our employee resource groups (ERGs) offer strong support networks for their members and are open to all employees. Our current groups include: Africa, Middle East, Central Asia Network, Black Employees at Lilly, Chinese Culture Network, Japanese International Leadership Network (JILN), Lilly India Network, Organization of Latinx at Lilly (OLA), PRIDE (LGBTQ+ Allies), Veterans Leadership Network (VLN), Women’s Initiative for Leading at Lilly (WILL), enAble (for people with disabilities). Learn more about all of our groups.

Actual compensation will depend on a candidate’s education, experience, skills, and geographic location. The anticipated wage for this position is $166,500 - $266,200. Full‑time equivalent employees also will be eligible for a company bonus (depending, in part, on company and individual performance). In addition, Lilly offers a comprehensive benefit program to eligible employees, including eligibility to participate in a company‑sponsored 401(k); pension; vacation benefits; eligibility for medical, dental, vision and prescription drug benefits; flexible benefits (e.g., healthcare and/or dependent day care flexible spending accounts); life insurance and death benefits; certain time off and leave of absence benefits; and well‑being benefits (e.g., employee assistance program, fitness benefits, and employee clubs and activities). Lilly reserves the right to amend, modify, or terminate its compensation and benefit programs in its sole discretion and Lilly’s compensation practices and guidelines will apply regarding the details of any promotion or transfer of Lilly employees.

#WeAreLilly


Vacancy posted 4 days ago

Similar jobs that could be interesting for you
Based on the Scientific Lead - Scientific Data Engineer in San Francisco, CA vacancy

Scientific Lead - Scientific Data Engineer
$166.5k - $266.2k
...translational science. Responsibilities Data Harmonization and Lakehouse Architecture... ...digital pathology Semantic Layer and Schema Engineering Design and maintain a semantic layer... ...and manage vector embedding pipelines for scientific documents, study metadata, and...
Scientific
Full time
Flexible hours
Initial Therapeutics, Inc.
San Francisco, CA
4 days ago
Lead Data Engineer
$149.3k - $200.2k
...audiences experience sports, entertainment & news. The Product & Data Engineering team is responsible for end to end development for Disney’s... ...personalization, search, messaging and data. Job Summary As a Lead Data Engineer in the Data Platforms team, you will be partnering...
Suggested
Work experience placement
Worldwide
The Walt Disney Company
San Francisco, CA
4 days ago
Lead Data Engineer for Sleep Tech Hypergrowth
...A leading sleep technology company is looking for a Data Engineer to drive the construction of data infrastructure that supports millions of users. This role requires 6+ years of experience building data platforms and mastery of tools like SQL and Python. You will work...
Suggested
Eight Sleep
San Francisco, CA
4 days ago
Lead Data Engineer
$215.2k - $245.6k
...Lead Data Engineer Do you love building and pioneering in the technology space? Do you enjoy solving complex business problems in a fast‑paced, collaborative, inclusive, and iterative delivery environment? At Capital One, you’ll be part of a big group of makers, breakers...
Suggested
Internship
Local area
Capital One National Association
San Francisco, CA
4 days ago
Lead Data Engineer Enterprise Reporting & Analytics
$191.52k - $212.8k
...belong to something beautiful. Ready for a career glow up? As a Lead Engineer you will design and implement innovative analytical solutions... ...and architecture. Reporting to the Director, Engineering, Data & AI you will work closely with other team members like data architects...
Suggested
Full time
Sephora USA, Inc.
San Francisco, CA
4 days ago
Lead Data Engineer - Enterprise Analytics & AI Reporting
$191.52k - $212.8k
...Sephora USA, Inc is seeking a Lead Engineer based in San Francisco, CA. The successful candidate will design and implement analytical solutions... ...8 years of experience in software development, strong SQL and data warehousing skills, and experience with AI/ML tools. A...
Sephora USA, Inc.
San Francisco, CA
4 days ago
Lead Data Engineer
...Position Title We are seeking a Lead Data Engineer to architect, build, and lead the development of scalable, cloud‑based data platforms that support enterprise analytics, operational reporting, and advanced data use cases. This role provides technical leadership in designing...
Q-Cells
San Francisco, CA
4 days ago
Senior Data Engineer, Scientific Data Ingestion & AI
...A cutting-edge biotech company in San Francisco is seeking a Data Engineer to build AI-powered data ingestion pipelines from various sources. The role demands strong expertise in data engineering and Python, with a focus on data normalization and quality control. As part...
Scientific
Mithrl
San Francisco, CA
3 days ago
Lead Data Engineer
...Lead Data Engineer The Office of Information Technology (IT) is responsible for enabling State Bar's internal and external stakeholders by the management, implementation, and maintenance of an organization's technology to support of State Bar's mission and goals. The...
Work at office
State Bar CA
San Francisco, CA
1 day ago
Lead Data Engineer
...tools • Write SQL for processing raw data, kafka ingestions, adf pipelines, data validation... ...protect sensitive information. Lead, design and implement innovative... ...technologies Work with product and engineering team to understand requirements, evaluate...
BayOne Solutions
San Francisco, CA
1 day ago
Lead Data Engineer - Only W2
...Role: Lead Data Engineer Location: San Francisco, CA (1-2 days a week onsite is must) Job Type: W2 Contract Contract Length: 9 months with possible extension Experience Level: Senior/Lead Main Skills SQL Databricks Azure Data Factory (ADF) DataStage (or other ETL tools...
Contract work
2 days per week
1 day per week
Saransh
San Francisco, CA
4 days ago
Lead Data Engineer
...Lead Data Engineer RADIUMONE IS A GLOBAL PROGRAMMATIC AD BUYING PLATFORM RadiumOne is the 6th largest web property in the U.S. according to comScore We build intelligent software that automates media buying, making big data actionable for marketers and connects...
Stepping Up Solutions
San Francisco, CA
1 day ago
Lead Data Engineer
$180k - $225k
...principles: a rigorous understanding of data, modern technology, and most importantly,... ...actuarial science, and research. The Data Engineer team is a core part of the broader Data... ...of industry experience with technical lead experience of running a data platform for...
Shift work
Nuna Inc
San Francisco, CA
4 days ago
Lead Data Engineer
...Job Title Mandatory Skills: (Oracle or PostgreSQL) and ETL Pipelines and Big Data and AWS Responsibilities · Uses structured tools for analysis and presentation of concepts and models to enhance the BRD · Develop, maintain and deliver training materials to the...
Work experience placement
Omega Solutions Inc
San Francisco, CA
4 days ago
Lead Data Engineer
$215.2k - $245.6k
...Lead Data Engineer Do you love building and pioneering in the technology space? Do you enjoy solving complex business problems in a fast-paced, collaborative, inclusive, and iterative delivery environment? At Capital One, you'll be part of a big group of makers, breakers...
Full time
Part time
Internship
H1b
Local area
Capital One Financial Corp
San Francisco, CA
1 day ago
Lead Data Engineer: Data Architecture & Pipelines
...A consumer fintech startup in San Francisco is seeking a Lead Data Engineer to build and optimize data engineering functions. You will establish data architecture, mentor junior engineers, and collaborate with cross-functional teams. The position requires extensive data...
Full time
Cerebras
San Francisco, CA
1 day ago
Lead Data Engineer - ETL, Data Lake, & Analytics
...Hebbia, Inc. is seeking its first Data Engineer to refine data infrastructure and drive best practices for data pipelines in San Francisco. The ideal candidate has over 5 years of software development experience focused on data engineering, alongside a Bachelor's or Master...
Hebbia, Inc.
San Francisco, CA
3 days ago
Agentic Analytics Engineer
$80 per hour
...Job Title: Agentic Analytics Engineer (contract) PR: $80/hr Location: Hybrid onsite in South San Francisco Hours: 9-5 PST Start... ...) in Genentech, you will be responsible for integrating scientific and business data from multiple sources to generate agentic analytics product...
Scientific
Contract work
Work experience placement
Immediate start
Medasource
San Francisco, CA
3 days ago
Founding Data Scientist / Machine Learning Engineer
...Seeking Founding Data Scientists and Machine Learning Engineers Imagine Multiplying Your Impact You've unlocked major wins in your career - you'... ...mindset. 6+ years in production ML/DS; you balance scientific rigor with "it ships today, iteration on the way" pragmatism...
Scientific
Remote work
Palladio AI, Inc
San Francisco, CA
2 days ago
Lead Data Engineer with MarTech
...Key Responsibilities Lead end-to-end MarTech engineering initiatives across orchestration, data processing, and activation pipelines. Architect scalable, event-driven systems that power real-time marketing experiences and automated customer journeys...
ALIS Software LLC
San Francisco, CA
14 days ago
Lead Data Engineer
...Job title: Lead Data Engineer Work Location: San Francisco, CA. Type: Contract Tech Stack & Skills She's Looking For: Core (Must-Have): Backend / Data Engineering (Primary focus) End-to-end data pipeline experience Strong SQL...
Contract work
VBeyond
San Francisco, CA
3 days ago
Scientific Data Engineer: AI-Ready Data Lakes for Discovery
...Lilly is looking for a Scientific Data Engineer to build a data architecture that bridges data storage with AI systems. This role involves designing and optimizing ETL pipelines for harmonized, AI-ready data products. You will work within the Applied Intelligence for...
Scientific
BioSpace, Inc.
San Francisco, CA
3 days ago
Lead Data Engineer
Job Description Title: Lead Data Engineer Location: Hybrid in SF (Tuesdays onsite) Openings: 1 Work Schedule: (available until 12 am PT time to overlap with onshore team) Follow-Up Meeting: After each interview is scheduled. Contract Type: 12 months contract extensions...
Contract work
Insight Global
San Francisco, CA
1 day ago
Lead Data Engineer: Databricks Migrations & Cloud Pipelines
A dynamic technology company is seeking a Lead Data Engineer for a hybrid role in San Francisco, emphasizing expertise in Databricks and Datastage. The ideal candidate will lead an offshore engineering team and drive migration from SQL to Databricks while developing cloud...
Insight Global
San Francisco, CA
1 day ago
ML Infrastructure Engineer Agent Systems & Data Pipelines
...Xterraai is looking for an ML Software Engineer to help build innovative AI agents capable of tackling complex scientific challenges. The position involves designing and developing systems that support cutting-edge research in geospatial and geophysics intelligence. The...
Scientific
Xterraai
San Francisco, CA
4 days ago
Scientific Data Engineer
...Thank you for your interest in Uncountable Engineering! Uncountable is seeking recent graduates interested in a career in data engineering to help manage customers datasets. Our goal is to revolutionize industrial research and development. We’re looking for motivated software...
Scientific
Uncountable Inc
San Francisco, CA
3 days ago
Data Engineer, Knowledge Graphs
...patients in months, not years, and where scientific breakthroughs happen at the speed... ...Co-Scientist. It is a discovery engine that transforms messy biological data into insights in minutes.... ...year revenue growth ~ Trusted by leading biotechs and big pharma across three...
Scientific
Work at office
Mithrl
San Francisco, CA
4 days ago
Data Engineer, Scientific Data Ingestion
...AI Co-Scientist—a discovery engine that empowers life science teams... ...to go from messy biological data to novel insights in minutes.... ...revenue growth Trusted by leading biotechs and big pharma across... ...ETL/ELT pipelines, ideally for scientific or lab‑derived data. Ability...
Scientific
Work at office
Mithrl
San Francisco, CA
4 days ago
ML Systems Engineer Scale AI for Science (Remote)
$250k - $400k
...A leading AI research firm in San Francisco seeks experienced professionals to build and scale systems for AI-driven scientific discovery. The role involves developing training pipelines, supporting... ..., with opportunities for ML Engineers, ML Infra, Research Engineers,...
Scientific
Remote work
Trades Workforce Solutions
San Francisco, CA
3 days ago
Senior Data Engineer (5+ years)
$185k - $221.4k
...You’ll Do Build and own production data infrastructure. Design, implement... ...; ingest clinical, financial, scientific, and commercial data from REST APIs... ...at the right time. Uphold high engineering standards and collaborate broadly. Lead code and design reviews, establish...
Scientific
Remote work
Flexible hours
Foresite Labs
San Francisco, CA
4 days ago
