Member of Technical Staff
xAI
Memphis, TN
About xAI
xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
About The Role:
We are seeking a highly skilled member of technical staff to join our team in managing and enhancing reliability across a multi-data center environment. This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure. The ideal candidate will combine strong coding abilities with hands-on data center experience to build scalable reliability services, optimize system performance, and minimize downtime—including close partnership with facility operations to address physical infrastructure impacts. If you thrive in lightning-fast, distributed environments and are passionate about leveraging automation to drive efficiency, this is an opportunity to make a significant impact on our infrastructure's resilience and scalability.
In an era where AI workloads demand near-zero downtime, this position plays a pivotal role in bridging software engineering principles with physical data center realities. By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation, based on industry benchmarks from high-scale environments like those at hyperscale cloud providers.
The primary objective of this team is to mitigate downtime and minimize impact to end-users from both scheduled and unscheduled maintenance, as well as events affecting onsite data centers. This is achieved through proactive automation, robust observability, and integrated software-physical reliability strategies, ensuring our AI infrastructure remains resilient, scalable, and at the cutting edge of innovation.
Responsibilities:
- Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning. We value adaptability to new tools and paradigms in the fast-evolving AI space.
- Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers—open to innovative stacks beyond traditional ones like ELK.
- Collaborate with cross-functional teams, including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management), to identify reliability bottlenecks and automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration). This role encourages broad skill sets from diverse technical backgrounds to foster innovation.
- Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs. Key insight: By applying SWE rigor to troubleshooting, team members can create reusable diagnostic tools that accelerate resolution, turning unscheduled events (e.g., hardware faults) into opportunities for system hardening and reducing overall end-user impact through targeted SLAs that prioritize critical AI services. We seek versatile problem-solvers who adapt to bleeding-edge challenges.
- Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation.
- Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues; integrate observability into data center interconnects and facility-level controls for rapid diagnosis and automation. Key insight: In multi-site setups, network insights allow for automated failover mechanisms that handle both digital and physical disruptions, ensuring seamless continuity for end-users during events like fiber cuts or power outages. This attracts candidates from varied networking and systems backgrounds to drive forward-thinking solutions.
- Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous improvement initiatives to enhance overall site reliability, including joint exercises with facility teams for physical failover and recovery scenarios. We prioritize growth-minded individuals who embrace evolving practices.
- Mentor junior team members and document processes to foster a culture of automation, knowledge sharing, and adaptability to new technologies.
Basic Qualifications:
- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).
- 5+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments.
- Strong programming skills with proven production experience in Python (required for automation and tooling), built on solid coding fundamentals in at least one general-purpose or systems-level language (e.g., Python, Go, C++). Experience with Rust, or willingness to work in Rust, is a plus.
- Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.
- Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).
- Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards.
- Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors.
- Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.
- Experience participating in on-call rotations, incident response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs.
- Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).
Preferred Skills And Experience:
- 7+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups.
- Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high availability.
- Proficiency in Rust for systems programming and performance-critical components.
- Direct experience integrating software reliability tools with physical data center infrastructure (e.g., power, cooling, environmental monitoring, facility controls) and automating responses to physical events.
- Exposure to advanced or innovative observability stacks beyond traditional tools (e.g., exploring cutting-edge alternatives for metrics, logs, and tracing).
- Experience building automated remediation, fault tolerance, disaster recovery, capacity planning, or predictive failure detection systems.
- Background in optimizing Linux-based systems for AI workloads, GPU clusters, or high-throughput compute environments.
- Demonstrated success reducing downtime, MTTR, or improving resource efficiency (e.g., through automation or observability) in high-stakes production settings.
- Prior work with bare-metal provisioning, data center interconnects, or hybrid/multi-site failover mechanisms.
- Mentoring experience, strong documentation skills, and a track record of fostering knowledge sharing and automation culture.
- Comfort with rapid technology adaptation in fast-evolving domains like AI infrastructure.
xAI is an equal opportunity employer. For details on data processing, view our Recruitment Privacy Notice.


