AI Systems Engineer: Agent Orchestration & Runtime

AI Chopping Block, Inc.

The role involves designing and building execution environments for AI agents, including sandboxing, isolation, and reproducibility. It includes developing systems for agent orchestration across multi-step, tool-using workflows and building infrastructure for running, testing, and debugging code generated by models. Responsibilities also include creating state and memory systems that allow agents to persist context across long-running tasks; optimizing tokens, latency, reliability, and cost across Codex’s production fleet; and supporting model rollouts and capacity planning while managing tradeoffs between quality, speed, and economics to maintain a fleet of frontier agents at scale. Additionally, the job entails building shared platform capabilities that unblock product teams, partner teams, and open source Codex.

The Software Engineer in the Defence team will build and extend critical components of client deliverables across diverse software domains, deliver robust technical artefacts in both compiled and non-compiled languages, implement defined engineering patterns and practices tailored for the Defence sector, collaborate closely with Machine Learning Engineering and Data Science teams to integrate and refine technical solutions, apply rigorous software engineering best practices to enhance scalability and quality of codebases, and execute CI/CD processes while managing application deployments on Kubernetes and bare metal environments.

Engineering Manager, Distillation & Detection Platform

Lead a team of software engineers building detection and mitigation systems for frontier model misuse, focusing on model IP protection, distillation detection, and emerging risks from autonomous agents. Set the technical roadmap and execution strategy including prioritization, design, shipping, iteration, and impact measurement.
Build production systems such as services, pipelines, tooling, instrumentation, and automation that can scale with frontier model usage. Partner with Research and Product teams to translate evolving model capabilities into scalable tests, signals, and mitigations. Drive strong engineering fundamentals including architecture, reliability, monitoring, performance, and operational excellence. Hire and grow a team across backend, data systems, and applied ML engineering domains. Anticipate and address scalability challenges as agentic workflows advance.

The Staff Engineer is responsible for making hard architectural tradeoffs and owning the outcomes, such as choices between Durable Object SQLite and shared Postgres for session state, Cloudflare Workers CPU limits versus longer-running workloads, and single-tenant sandboxes versus multi-tenant pools. They design the system that handles concurrent agent sessions across integrations with consistent state. The role includes defining reliability and observability standards for the team, including SLOs, error budgets, tracing strategies, and incident response patterns. The Staff Engineer reviews every significant pull request to set technical direction without blocking velocity and ships code daily, actively contributing architectural leadership alongside output. They work closely with the CTO on all architectural decisions that significantly impact the system.

Develop AI agents and software to automatically diagnose and repair hardware faults across massive NVIDIA and AMD GPU clusters. Create deep-level observability and diagnostic tools to monitor the health of high-density compute systems. Build testing suites using PyTorch and NCCL to ensure systems are ready for production. Develop automation for critical facility systems, including power management and advanced liquid cooling. Take full ownership of tools from initial code to deployment and operational support.

The Senior Systems Performance Engineer at Crusoe is responsible for leading the evaluation and establishment of New Product Introduction (NPI) across varied hardware architectures with a focus on Bare Metal and VM environments. They conduct deep-dive performance evaluations and workload characterizations across compute, memory, storage, and networking. They develop sophisticated multi-variable projection models and frameworks to analyze system design options through tradeoffs such as Power and Total Cost of Ownership (TCO). The role involves collaborating with external vendors to drive platform customization and optimize server and AI architectures for maximum performance-per-TCO. They design and implement performance methodologies to scale evaluation processes for large-scale GPU/AI data centers. Additionally, they engage in industry research and contribute technical insights to consortiums and standards committees to influence future hardware roadmaps.

$172,500 – $210,000 per year (USD)

San Francisco or Sunnyvale, United States

Design, build, and optimize the core native runtime powering LM Studio and the C++ libraries powering the app and APIs. Work across runtime, LLM engines, llama.cpp/MLX integrations, build infrastructure, and on-device AI software. Focus on system and library integration by wiring the C++ runtime to GPU backends, vendor SDKs, and operating-system services to support user-facing applications. Implement and harden system-level code involving threading, memory, files, IPC, and scheduling. Integrate platform acceleration paths such as Metal, CUDA, and Vulkan across macOS, Windows, and Linux. Profile, debug, and tune execution paths to ensure fast, dependable local AI and maintainable software. Contribute to the C++ runtime powering LM Studio, extend LLM engine integrations, and build platform-aware performance features for desktop OS. Implement resilient IPC, resource management, and scheduling logic to support concurrent model execution. Improve build, packaging, and release infrastructure for native components. Collaborate with the team to deliver cohesive and recognizable user experiences.

As an Engineering Leader at Ema, you will build and lead a high-performance engineering organization by recruiting, hiring, and developing senior engineers across multiple sub-teams including cloud infrastructure, data platform, ML operations, and developer experience. You will establish engineering standards, a code review culture, on-call expectations, and promote a bias-toward-shipping mentality balanced with production rigor. You will coach and grow senior and staff engineers into technical leaders and manage engineering managers as the organization scales.
Your responsibilities include setting the 6–18 month platform roadmap in partnership with engineering teams, making critical architectural decisions such as build versus buy and migration strategies, and driving cross-functional alignment with product, ML/AI research, and go-to-market teams. You will own production health for all platform services, including incident response, postmortems, SLO tracking, and capacity planning. Additionally, you will establish and refine engineering practices to maintain fast shipping without compromising reliability, and participate in executive-level reviews related to infrastructure spend, system health, and engineering velocity.

Software Engineer, Architecture, Reliability, & Compute

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications and support end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies and oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and full-stack components. You will build automated systems to monitor model performance and data drift across geographically dispersed environments, manage the technical lifecycle within diverse regulatory frameworks, and lead the response to production issues in mission-critical environments, ensuring rapid resolution and prevention. You will also translate technical performance metrics into clear insights for senior international government officials and partner with Engineering and ML teams to ensure field lessons influence future technical architecture and decisions.

San Francisco or St. Louis or New York or Washington, United States

The Head of Internal Tools Engineering is responsible for owning the end-to-end strategy and roadmap for all internal tools, platforms, and automation, treating internal technology as a product. They make strategic build-vs-buy decisions, map current and next-state process flows, and lead systems transformation for internal teams. They architect and maintain the full engineering lifecycle of internal platforms, build seamless API-first ecosystems integrating various internal systems, ensure system reliability and operational resilience, and design scalable, secure architectures using cloud-native principles and microservices. They lead AI strategy by integrating AI and LLMs into internal workflows and deploying intelligent automation tools. They reduce cognitive load for internal users by providing standardized workflows and self-service capabilities, measure platform success by adoption, satisfaction, and productivity impact, and build, lead, and mentor a high-performing engineering team. They cultivate a collaborative culture, provide technical mentorship, foster psychological safety, partner cross-functionally with leadership across departments, and align internal platform investments with company strategy while demonstrating measurable ROI.

Vacancy posted more than 2 months ago
