Compensation: $200K–$250K + Equity
Full-Time | Remote | Infrastructure Team
We’re hiring a Staff Reliability Engineer to help scale and maintain the GPU infrastructure that powers our AI systems. If you're passionate about building robust systems and solving deep infrastructure challenges at scale, this role is for you.
Work closely with engineers and researchers to define and meet system performance, availability, and efficiency requirements.
Operate and manage thousands of GPUs distributed across multiple cloud providers and clusters.
Design scalable solutions to support rapid growth in compute demands for AI model training, data processing, and inference.
Build resilient, fault-tolerant systems to ensure continuous uptime and seamless performance.
Develop automation tools to eliminate toil and streamline infrastructure operations.
Set up and maintain monitoring systems to proactively detect issues and drive performance improvements.
Define and track SLOs and SLIs that uphold system reliability standards.
Participate in an on-call rotation to ensure 24/7 system availability.
7+ years of proven experience as a reliability engineer, infrastructure engineer, or production engineer in fast-paced, high-growth environments.
Deep knowledge of GPU infrastructure, including scheduling, scaling, cloud networking, storage, and security.
Proficiency in one or more scripting or programming languages.
Strong experience with Kubernetes or similar container orchestration systems.
Familiarity with Infrastructure-as-Code tools like Terraform or CloudFormation.
Experience working with observability tools like Prometheus, Grafana, DataDog, ELK, or Splunk.
Excellent troubleshooting, debugging, and systems-thinking skills.
Strong communication skills and a collaborative mindset.
Bonus: Experience with AI/ML infrastructure or managing large-scale GPU clusters.
We're developing highly complex infrastructure to support advanced AI research and production systems running on thousands of GPUs. This is an opportunity to work on some of the most demanding reliability and performance challenges in tech today. You’ll have a direct impact on how infrastructure supports foundation model development and deployment.
Base Salary: $200K–$250K/year
Competitive equity package (stock options)
Comprehensive health benefits
Generous PTO and flexible work policies
Support for ongoing professional development