Research Scientist - Vision Language Model

Institute of Foundation Models

Job Description

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy. 

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

Position Summary

As a Research Scientist in the Vision Language Model (VLM) team, you will play a central role in advancing state-of-the-art multimodal foundation models that integrate visual understanding, reasoning, and agentic capabilities. You will work on the research and development of large-scale VLM systems, spanning model architectures, data recipes for pre-training and post-training, and evaluation benchmarks. The role combines cutting-edge research with practical engineering, with emphasis on large-scale data processing, filtering, and weighting pipelines; distributed training systems; and reinforcement learning algorithms and systems for multimodal reasoning and agent development.

Key Responsibilities

  • Research and develop next-generation Vision Language Models across pre-training, instruction tuning, reasoning, and agents.

  • Develop novel architectures and training methodologies for integrating visual understanding, language reasoning, and tool-use capabilities.

  • Research efficient multimodal learning techniques, including data-efficient training, long-context modeling, model modularity, and inference optimization.

  • Build and improve large-scale multimodal datasets, synthetic data generation pipelines, and evaluation benchmarks for VLM capabilities.

  • Investigate multimodal reasoning, agentic behavior, OCR, grounding, document understanding, chart understanding, and visual question answering capabilities.

  • Contribute to technical reports, research publications, and open-source software.

  • Represent MBZUAI at research conferences and industry events, showcasing advancements in multimodal foundation models and large-scale AI systems.

  • Mentor junior researchers and collaborate across teams to drive impactful research initiatives.

Academic Qualifications

PhD or equivalent research experience in Machine Learning, Computer Vision, Natural Language Processing, or Multimodal AI.

Salary Range

 

The posted salary range represents the company’s good faith estimate of the compensation for this position upon hire. The actual compensation offered may vary within this range depending on individual qualifications, including but not limited to relevant skills, experience, education, certifications, geographic location, and specific business needs.  


Minimum Professional Experience

  • Experience working with large language models and/or vision-language models, including pre-training, fine-tuning, evaluation, or inference.

  • Strong Python and PyTorch development skills for large-scale machine learning research.

  • Experience with distributed training systems and large-scale model optimization.

  • Familiarity with multimodal datasets and data processing pipelines involving images, text, and video.

  • Understanding of modern deep learning architectures, including Transformers, attention mechanisms, and multimodal fusion techniques.

  • Experience with ML infrastructure, including model evaluation, debugging, optimization, and large-scale experimentation.

  • Problem-solving and research skills with the ability to independently drive research/engineering projects.

  • Effective communication and collaboration skills for working across research and engineering teams.

Preferred Skills

  • Hands-on experience training or fine-tuning large Vision Language Models or multimodal foundation models at scale.

  • Experience with distributed learning frameworks and infrastructure such as PyTorch Distributed, Megatron, Triton, or CUDA.

  • Research experience in multimodal reasoning, agentic systems, tool use, OCR, grounding, document understanding, or multimodal retrieval.

  • Experience with synthetic data generation, multimodal data curation, or automated evaluation frameworks for VLMs.

  • Familiarity with efficient training and inference techniques such as FlashAttention, quantization, tensor parallelism, pipeline parallelism, or memory optimization.

  • Experience contributing to open-source ML software and large-scale research codebases.

  • Strong publication record in leading AI conferences such as NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, ACL, EMNLP, or related venues.

  • Experience collaborating across research, infrastructure, and product-oriented teams to deliver state-of-the-art multimodal systems.

Vacancy posted 2 days ago
