Careers

Principal Site Reliability Engineer, Cloud AI

Google | Sunnyvale, CA, USA | Director+

Minimum qualifications:

  • Bachelor’s degree in Computer Science, related field, or equivalent practical experience.
  • 15 years of experience in software engineering.
  • 10 years of experience working on reliability, scalability, and security of large-scale distributed systems.
  • Experience with ML systems, infrastructure, or a related AI/ML field.

Preferred qualifications:

  • PhD in Electrical Engineering, Computer Science, or related field.
  • Experience in reliability/performance engineering at a hyperscaler or a company known for managing datasets and large teams of data scientists.
  • Knowledge of enterprise security principles.
  • Deep expertise in managing large-scale resource pools, such as GPU/TPU clusters.
  • A track record of success working on products across a broad portfolio, including platforms like agent-building tools, or AI search.

About the job

We are seeking a Principal Engineer for Cloud AI Reliability, Resiliency, and Scalability. This technical individual contributor sits within the Cloud AI SRE team, where they will be a key voice in the architectural design and operational strategy of Google's Cloud AI portfolio. One of the many critical components of the Cloud AI portfolio is the Vertex AI platform; this platform runs both first-party models like Gemini and external third-party models, so a focus on breadth over depth is essential. In this role, you will be responsible for ensuring the availability and scalability of our most impactful AI products, working on a planetary scale.

A secondary part of the role will involve managing enterprise risk, with a key focus on security for the products and platforms we support. You will serve as a counterpart to senior leaders and domain experts, advising on the architectural and security considerations required to launch our next generation of AI and AI agent platforms.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability across everything we do, from developing our latest TPUs to running a global network, while shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud’s Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

The US base salary range for this full-time position is $294,000-$414,000 + bonus + equity + benefits. Our salary ranges are determined by role, level, and location. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training. Your recruiter can share more about the specific salary range for your preferred location during the hiring process.

Please note that the compensation details listed in US role postings reflect the base salary only, and do not include bonus, equity, or benefits. Learn more about benefits at Google.

Responsibilities

  • Provide expert-level guidance on the architectural design of highly available, scalable, and secure AI and ML systems.
  • Advise on and manage the overall enterprise risk for our AI products and platforms, with a significant focus on identifying and mitigating security vulnerabilities.
  • Partner with engineering and product teams to architect, launch, and operate the next generation of Google's AI and AI agent platforms, built from the ground up for the future of AI.
  • Represent the SRE perspective in highly technical discussions with other senior leaders and domain experts, focusing on the infrastructure and underlying systems that power our models.
  • Influence the platform's long-term strategy, ensuring it can support a wide range of first- and third-party models for all GCP and AI Studio enterprise customers.

Information collected and processed as part of your Google Careers profile, and any job applications you choose to submit, is subject to Google's Applicant and Candidate Privacy Policy.

Google is proud to be an equal opportunity and affirmative action employer. We are committed to building a workforce that is representative of the users we serve, creating a culture of belonging, and providing an equal employment opportunity regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), expecting or parents-to-be, criminal histories consistent with legal requirements, or any other basis protected by law. See also Google's EEO Policy, Know your rights: workplace discrimination is illegal, Belonging at Google, and How we hire.

If you have a need that requires accommodation, please let us know by completing our Accommodations for Applicants form.

Google is a global company and, in order to facilitate efficient collaboration and communication globally, English proficiency is a requirement for all roles unless stated otherwise in the job posting.

To all recruitment agencies: Google does not accept agency resumes. Please do not forward resumes to our jobs alias, Google employees, or any other organization location. Google is not responsible for any fees related to unsolicited resumes.
