Site Reliability Engineer (SRE) - AI Platform & Cloud

Full-time

Morgan Stanley

In the Technology division, we leverage innovation to build the connections and capabilities that power our Firm, enabling our clients and colleagues to redefine markets and shape the future of our communities.

This is a Software Engineering position at Director level, which is part of the job family responsible for developing and maintaining software solutions that support business needs.

Since 1935, Morgan Stanley has been known as a global leader in financial services, always evolving and innovating to better serve our clients and our communities in more than 40 countries around the world.

Our mission is to develop a firmwide Artificial Intelligence (AI) Development Platform that aligns with the firm’s Technology principles; drives efficiency, consistency, controls, security, and strong governance; and promotes innovation, enabling teams to build applications that leverage AI capabilities and accelerate the adoption of AI across our businesses.

This role is for an experienced and driven Site Reliability Engineer (SRE) to join our AI Platform team to help support, scale and harden the infrastructure that powers our AI/ML systems. You will collaborate closely with infrastructure engineering, cloud engineering, data engineering, and security teams to ensure availability, reliability, performance, and security of production AI workloads (training, inference, data pipelines) in a regulated, high-stakes financial environment.

As an SRE on the AI platform, you will bring deep operations, automation, and systems engineering skills to enable our models and pipelines to run reliably at scale, while balancing cost, security, and compliance constraints.

The ideal candidate will have strong hands-on experience supporting software platforms built on any combination of the following: Kubernetes, cloud (AWS, Azure, and/or Google), API-based development, REST frameworks, data engineering, and large-scale API gateway environments. Knowledge of AI/ML and hands-on experience implementing solutions using Generative AI are also preferred. The candidate will have great communication skills, a team-based mentality, and a strong passion for using AI to increase productivity and to help generate new ideas for product and technical improvements.

What you'll do in the role:

  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)

  • Design and build automation for core platform capabilities, reducing manual toil

  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.

  • Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards

  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation

  • Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting

  • Optimize cost vs. performance tradeoffs in large-scale compute environments

  • Harden systems for security, compliance, auditability, and data governance

  • Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems

  • Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms

  • Maintain runbooks, operational playbooks, documentation, and training materials

  • Participate in on-call rotations and respond to production incidents 24/7 as needed

  • Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

What you'll bring to the role:

  • Bachelor’s or Master’s degree in Computer Science or related field, or equivalent job experience 

  • 5 years of production experience in SRE / Infrastructure / ops for large-scale systems

  • Strong programming/scripting skills (Python, Go, Java, or equivalent)

  • Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)

  • Experience with infrastructure-as-code tools (Terraform, Helm, CloudFormation, Ansible, etc.)

  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures

  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)

  • Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

  • Solid experience in capacity planning, performance tuning, scaling, and incident response

  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

  • Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

  • Excellent communication, documentation, and cross-team collaboration skills

  • Proven track record of reducing operational toil via automation

Nice to have

  • Understanding of SRE techniques.

  • Proficiency with OpenTelemetry-compatible tooling, including Grafana, Loki, Prometheus, and Cortex.

  • Good knowledge of microservice-based architecture and industry standards for both public and private cloud.

  • Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)

  • Good knowledge of various data stores (SQL, Redis, Kafka, Snowflake, etc.) for cloud application storage.

  • Experience working with Generative AI development, embeddings, and fine-tuning of Generative AI models.

  • Experience in high-performance computing (HPC), distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)

  • Understanding of ModelOps / MLOps / LLMOps.

  • Experience with chaos engineering, canary deployments, blue/green rollouts

With a track record of innovation and a passion for unlocking new opportunities, we help our clients raise, manage and allocate capital. We do this by offering a wide range of investment banking, securities, wealth management and asset management services.

All that we do at Morgan Stanley is driven by our five core values: do the right thing, put clients first, lead with exceptional ideas, commit to diversity and inclusion, and give back. These aren’t just beliefs, they guide the decisions we make every day, ensuring we do what's best for our clients, communities and more than 80,000 employees around the world. And at the core of our success are the people who drive it - relentless collaborators and creative thinkers who are fueled by diverse thinking and experiences.

Wherever you are in our 1,200 global offices, you’ll have the opportunity to work alongside the best and the brightest in an environment where you are empowered to achieve your full potential. We are proud to support our employees and their families at every point along their work-life journey, offering some of the most attractive and comprehensive employee benefits and perks in the industry.

At Morgan Stanley Alpharetta, we support the Firm’s global business and functions, from Wealth Management and Institutional Securities to Technology and Operations, Finance and Human Resources. With the 2020 acquisition of E-TRADE, Morgan Stanley Alpharetta grew significantly and expanded its role in our Wealth Management business, helping deliver a premier experience for the digitally inclined investor and trader. Learn more about our work and culture in Morgan Stanley Alpharetta.

Morgan Stanley's goal is to build and maintain a workforce that is diverse in experience and background but uniform in reflecting our standards of integrity and excellence. Consequently, our recruiting efforts reflect our desire to attract and retain the best and brightest from all talent pools. We want to be the first choice for prospective employees.

It is the policy of the Firm to ensure equal employment opportunity without discrimination or harassment on the basis of race, color, religion, creed, age, sex, sex stereotype, gender, gender identity or expression, transgender, sexual orientation, national origin, citizenship, disability, marital and civil partnership/union status, pregnancy, veteran or military service status, genetic information, or any other characteristic protected by law.

Morgan Stanley is an equal opportunity employer committed to diversifying its workforce (M/F/Disability/Vet).

WHAT YOU CAN EXPECT FROM MORGAN STANLEY:

At Morgan Stanley, we raise, manage and allocate capital for our clients – helping them reach their goals. We do it in a way that’s differentiated – and we’ve done that for 90 years. Our values - putting clients first, doing the right thing, leading with exceptional ideas, committing to diversity and inclusion, and giving back - aren’t just beliefs, they guide the decisions we make every day to do what's best for our clients, communities and more than 80,000 employees in 1,200 offices across 42 countries. At Morgan Stanley, you’ll find an opportunity to work alongside the best and the brightest, in an environment where you are supported and empowered. Our teams are relentless collaborators and creative thinkers, fueled by their diverse backgrounds and experiences. We are proud to support our employees and their families at every point along their work-life journey, offering some of the most attractive and comprehensive employee benefits and perks in the industry. There’s also ample opportunity to move about the business for those who show passion and grit in their work.

Morgan Stanley is an equal opportunity employer committed to building and maintaining a workforce that is diverse in experience and background. Our recruiting efforts reflect our strong commitment to a culture of inclusion, where individuals are hired, developed, and advanced based on their skills and talents.

Our workforce reflects a broad cross-section of the global communities in which we operate, bringing a variety of backgrounds, talents, perspectives, and experiences.

Vacancy posted 17 hours ago