Site Reliability Engineer - AI Agents

Kraken

Building the Future of Open Finance

Payward - the parent company behind Kraken, NinjaTrader, Breakout, xStocks, Payward Services and CF Benchmarks - has spent the last 15 years building one of the industry's most modern and globally accessible financial infrastructure platforms to advance an open, global financial system.

The Team

Founded in 2011, Kraken is one of the world's longest-standing crypto platforms, trusted by over 10 million individuals and institutions across the globe. It offers spot trading, margin, futures, staking, and OTC services, with products built for both individual investors and institutional clients.

The AI Infrastructure team sits within the Data organization and is responsible for building, operating, and scaling the systems that power AI agents in production — both internal tools and external-facing products. Working closely with the AI and Agent Systems teams, this group ensures that the orchestration, execution, and model-serving layers underpinning agentic workflows are reliable, observable, and built to scale.

This team operates at the intersection of data infrastructure and applied AI — a space that moves fast and demands engineers who can bring production discipline to emerging technology. You'll partner across Data Engineering, ML, and product-facing teams to harden agent infrastructure and keep it running at the standards our users expect.

Importantly, this is a platform engineering team. Beyond operating infrastructure, the team is responsible for building the APIs, SDKs, and platform capabilities that enable AI, Data, and Engineering teams to safely and efficiently consume agent infrastructure as a service. Success in this role requires thinking beyond infrastructure operations and toward developer experience, platform adoption, and long-term scalability.

The Opportunity
  • Design, build, and operate the infrastructure layer supporting AI agent workflows in production

  • Ensure reliability, scalability, and observability of agentic systems across internal and external products

  • Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services

  • Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution

  • Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads

  • Use Infrastructure as Code (IaC) tools such as Terraform to provision and manage AWS cloud infrastructure components

  • Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows

  • Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems

  • Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems

  • Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services

  • Implement access controls and security best practices across AI infrastructure environments

  • Document architecture, runbooks, and best practices to support knowledge sharing across the team

What You Bring
  • 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, or Platform Engineer, or in a similar role, in a production environment

  • Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production

  • Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale

  • Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design

  • Proficiency with Infrastructure as Code tools, particularly Terraform

  • Experience with containerization and orchestration, particularly Kubernetes and Docker

  • Solid understanding of cloud infrastructure, preferably AWS

  • Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)

  • Experience designing and operating observability, monitoring, and alerting systems

  • Experience implementing incident response procedures and participating in on-call rotations

  • Strong collaboration skills working across data, AI, and engineering teams

  • High ownership mindset in a fast-moving, high-stakes production environment

Nice to Haves
  • Experience building or operating infrastructure for agent-based or LLM-powered systems

  • Familiarity with agent orchestration frameworks (e.g., LangGraph, CrewAI, or similar)

  • Background in data infrastructure, including familiarity with Airflow, Kafka, Spark, or data lake tooling

  • Experience with CI/CD pipelines and deployment automation for AI/ML workloads

  • Exposure to evaluation frameworks and model performance monitoring at scale

  • Experience working in fast-moving 0→1 environments or platform-building teams

  • Experience building SDKs, developer tooling, or internal platform products with a strong focus on usability and adoption

  • Experience with Cloudflare's cloud platform and product ecosystem, including networking, security, performance, and Zero Trust solutions

Unless a specific application deadline is stated in the job posting, applications are accepted on an ongoing basis.

Please note, applicants are permitted to redact or remove information on their resume that identifies age, date of birth, or dates of attendance at or graduation from an educational institution.

We consider qualified applicants with criminal histories for employment on our team, assessing candidates in a manner consistent with the requirements of the San Francisco Fair Chance Ordinance.

Our Commitment

Payward is powered by people from around the world and we celebrate the diverse talents, backgrounds, contributions, and unique perspectives that everyone brings to the table. We hire based on merit, seeking out people with the right abilities, knowledge, and skills for the job. We encourage you to apply for roles where you don't fully meet the listed requirements, especially if you're passionate or knowledgeable about crypto.

We may ask candidates to complete job-related skills or work-style assessments as part of our hiring process. These assessments evaluate competencies relevant to the role and are applied consistently across candidates for similar positions. Results are considered alongside experience and interviews, and are not the sole basis for any employment decision.

As an equal opportunity employer, we don't tolerate discrimination or harassment of any kind, whether based on race, ethnicity, age, gender identity, citizenship, religion, sexual orientation, disability, pregnancy, veteran status, or any other protected characteristic as outlined by federal, state, or local laws.

Vacancy posted 5 days ago