Site Reliability Engineer - AI Agents
Kraken
Building the Future of Open Finance
Payward, the parent company behind Kraken, NinjaTrader, Breakout, xStocks, Payward Services, and CF Benchmarks, has spent the last 15 years building one of the most modern and globally accessible financial infrastructure platforms in the industry, designed to advance an open, global financial system.
The Team
Founded in 2011, Kraken is one of the world's longest-standing crypto platforms, trusted by over 10 million individuals and institutions across the globe. It offers spot trading, margin, futures, staking, and OTC services, with products built for both individual investors and institutional clients.
The AI Infrastructure team sits within the Data organization and is responsible for building, operating, and scaling the systems that power AI agents in production — both internal tools and external-facing products. Working closely with the AI and Agent Systems teams, this group ensures that the orchestration, execution, and model-serving layers underpinning agentic workflows are reliable, observable, and built to scale.
This team operates at the intersection of data infrastructure and applied AI — a space that moves fast and demands engineers who can bring production discipline to emerging technology. You'll partner across Data Engineering, ML, and product-facing teams to harden agent infrastructure and keep it running at the standards our users expect.
Importantly, this is a platform engineering team. Beyond operating infrastructure, the team is responsible for building the APIs, SDKs, and platform capabilities that enable AI, Data, and Engineering teams to safely and efficiently consume agent infrastructure as a service. Success in this role requires thinking beyond infrastructure operations and toward developer experience, platform adoption, and long-term scalability.
The Opportunity
Design, build, and operate the infrastructure layer supporting AI agent workflows in production
Ensure reliability, scalability, and observability of agentic systems across internal and external products
Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services
Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
Use Infrastructure as Code (IaC) tools such as Terraform to provision and manage AWS cloud infrastructure components
Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
Implement access controls and security best practices across AI infrastructure environments
Document architecture, runbooks, and best practices to support knowledge sharing across the team
What You Bring
5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
Proficiency with Infrastructure as Code tools, particularly Terraform
Experience with containerization and orchestration, particularly Kubernetes and Docker
Solid understanding of cloud infrastructure, preferably AWS
Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
Experience designing and operating observability, monitoring, and alerting systems
Experience implementing incident response procedures and participating in on-call rotations
Strong collaboration skills working across data, AI, and engineering teams
High ownership mindset in a fast-moving, high-stakes production environment
Nice to Haves
Experience building or operating infrastructure for agent-based or LLM-powered systems
Familiarity with agent orchestration frameworks (e.g., LangGraph, CrewAI, or similar)
Background in data infrastructure, including familiarity with Airflow, Kafka, Spark, or data lake tooling
Experience with CI/CD pipelines and deployment automation for AI/ML workloads
Exposure to evaluation frameworks and model performance monitoring at scale
Experience working in fast-moving 0→1 environments or platform-building teams
Experience building SDKs, developer tooling, or internal platform products with a strong focus on usability and adoption
Experience with Cloudflare's cloud platform and product ecosystem, including networking, security, performance, and Zero Trust solutions
Unless a specific application deadline is stated in the job posting, applications are accepted on an ongoing basis.
Please note, applicants are permitted to redact or remove information on their resume that identifies age, date of birth, or dates of attendance at or graduation from an educational institution.
We consider qualified applicants with criminal histories for employment on our team, assessing candidates in a manner consistent with the requirements of the San Francisco Fair Chance Ordinance.
Our Commitment
Payward is powered by people from around the world and we celebrate the diverse talents, backgrounds, contributions, and unique perspectives that everyone brings to the table. We hire based on merit, seeking out people with the right abilities, knowledge, and skills for the job. We encourage you to apply for roles where you don't fully meet the listed requirements, especially if you're passionate or knowledgeable about crypto.
We may ask candidates to complete job-related skills or work-style assessments as part of our hiring process. These assessments evaluate competencies relevant to the role and are applied consistently across candidates for similar positions. Results are considered alongside experience and interviews, and are not the sole basis for any employment decision.
As an equal opportunity employer, we don't tolerate discrimination or harassment of any kind, whether based on race, ethnicity, age, gender identity, citizenship, religion, sexual orientation, disability, pregnancy, veteran status, or any other protected characteristic as outlined by federal, state, or local laws.


