Site Reliability Engineer - AI Agents

Kraken

Building the Future of Open Finance

Payward - the parent company behind Kraken, NinjaTrader, Breakout, xStocks, Payward Services and CF Benchmarks - has spent the last 15 years building one of the most modern and globally accessible financial infrastructure platforms in the industry, designed to advance an open, global financial system.

The Team

Founded in 2011, Kraken is one of the world's longest-standing crypto platforms, trusted by over 10 million individuals and institutions across the globe. It offers spot trading, margin, futures, staking, and OTC services, with products built for both individual investors and institutional clients.

The AI Infrastructure team sits within the Data organization and is responsible for building, operating, and scaling the systems that power AI agents in production — both internal tools and external-facing products. Working closely with the AI and Agent Systems teams, this group ensures that the orchestration, execution, and model-serving layers underpinning agentic workflows are reliable, observable, and built to scale.

This team operates at the intersection of data infrastructure and applied AI — a space that moves fast and demands engineers who can bring production discipline to emerging technology. You'll partner across Data Engineering, ML, and product-facing teams to harden agent infrastructure and keep it running at the standards our users expect.

Importantly, this is a platform engineering team. Beyond operating infrastructure, the team is responsible for building the APIs, SDKs, and platform capabilities that enable AI, Data, and Engineering teams to safely and efficiently consume agent infrastructure as a service. Success in this role requires thinking beyond infrastructure operations and toward developer experience, platform adoption, and long-term scalability.

The Opportunity

Design, build, and operate the infrastructure layer supporting AI agent workflows in production
Ensure reliability, scalability, and observability of agentic systems across internal and external products
Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services
Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
Utilize Infrastructure as Code (IaC) tools such as Terraform to provision and manage cloud (AWS) infrastructure components
Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
Implement access controls and security best practices across AI infrastructure environments
Document architecture, runbooks, and best practices to support knowledge sharing across the team

What You Bring

5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
Proficiency with Infrastructure as Code tools, particularly Terraform
Experience with containerization and orchestration, particularly Kubernetes and Docker
Solid understanding of cloud infrastructure, preferably AWS
Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
Experience designing and operating observability, monitoring, and alerting systems
Experience implementing incident response procedures and participating in on-call rotations
Strong collaboration skills working across data, AI, and engineering teams
High ownership mindset in a fast-moving, high-stakes production environment

Nice to Haves

Experience building or operating infrastructure for agent-based or LLM-powered systems
Familiarity with agent orchestration frameworks (e.g., LangGraph, CrewAI, or similar)
Background in data infrastructure, including familiarity with Airflow, Kafka, Spark, or data lake tooling
Experience with CI/CD pipelines and deployment automation for AI/ML workloads
Exposure to evaluation frameworks and model performance monitoring at scale
Experience working in fast-moving 0→1 environments or platform-building teams
Experience building SDKs, developer tooling, or internal platform products with a strong focus on usability and adoption
Experience with Cloudflare's cloud platform and product ecosystem, including networking, security, performance, and Zero Trust solutions

Unless a specific application deadline is stated in the job posting, applications are accepted on an ongoing basis.

Please note, applicants are permitted to redact or remove information on their resume that identifies age, date of birth, or dates of attendance at or graduation from an educational institution.

We consider qualified applicants with criminal histories for employment on our team, assessing candidates in a manner consistent with the requirements of the San Francisco Fair Chance Ordinance.

Our Commitment

Payward is powered by people from around the world and we celebrate the diverse talents, backgrounds, contributions, and unique perspectives that everyone brings to the table. We hire based on merit, seeking out people with the right abilities, knowledge, and skills for the job. We encourage you to apply for roles where you don't fully meet the listed requirements, especially if you're passionate or knowledgeable about crypto.

We may ask candidates to complete job-related skills or work-style assessments as part of our hiring process. These assessments evaluate competencies relevant to the role and are applied consistently across candidates for similar positions. Results are considered alongside experience and interviews, and are not the sole basis for any employment decision.

As an equal opportunity employer, we don't tolerate discrimination or harassment of any kind, whether based on race, ethnicity, age, gender identity, citizenship, religion, sexual orientation, disability, pregnancy, veteran status, or any other protected characteristic as outlined by federal, state, or local laws.


Vacancy posted 5 days ago

