Join Our Team as Senior DevOps Engineer (San Francisco)
At Hyperion360, we believe in empowering our engineers to shape the future of technology from the comfort of their own homes. We are a premier software outsourcing company, partnering with some of the world’s most successful businesses to build and manage dedicated, remote teams of top-tier software engineers and other technical talent.
We are looking for a talented Senior DevOps Engineer (San Francisco) to join our global team.
About This Role
We are seeking a Senior DevOps Engineer to design, build, and operate the cloud infrastructure that powers a fleet of grid-monitoring devices and the data and applications built on top of it.
Why Do We Need You?
• We’re scaling the deployment of critical infrastructure monitoring devices to detect real-world fault events that lead to wildfires. The platform you’ll build and operate ingests millions of events per day from devices in the field, powers customer-facing dashboards and alerting, and supports the data science work that turns raw signals into grid intelligence.
• You will own AWS infrastructure, Kubernetes (EKS), CI/CD, and observability end-to-end, partnering with our Cloud Security team to keep the platform safe and compliant, and with backend, firmware, and data teams to keep them shipping fast. As an early member of the DevOps team, you’ll have a direct hand in shaping how Gridware builds, deploys, and runs production systems for years to come.
Responsibilities
• Design, build, and maintain scalable, secure, and highly available infrastructure on AWS (EKS, EC2, RDS / Aurora Postgres, MSK, S3, VPC, IAM).
• Manage and optimize Kubernetes clusters (EKS) across multiple environments, and deploy applications using Argo CD with GitOps best practices.
• Implement and maintain CI/CD pipelines using GitHub Actions, including reusable workflows, build/push/scan flows for ECR, and frontend deployment pipelines.
• Operate and tune Kafka-based event streaming on Amazon MSK for high-throughput, low-latency device data pipelines.
• Define and manage Infrastructure as Code with Terraform and Terragrunt, with reusable modules, sensible environment separation, and review-friendly plans.
• Manage identity and access across platforms with Auth0 / Entra ID integrations, IAM roles for service accounts (IRSA), and short-lived credentials.
• Build and maintain observability with Grafana, Loki, Prometheus / Mimir, and related tooling so on-call engineers can quickly find and fix issues.
• Monitor and optimize infrastructure cost across environments, partnering with engineering teams on right-sizing, capacity planning, and waste reduction.
• Partner with our Cloud Security team to enforce security standards, integrate with SIEM tooling, and respond to vulnerabilities and incidents.
• Debug complex production issues across infrastructure, deployment, and networking layers, and turn the lessons learned into automation and runbooks.
Requirements
• 5+ years in DevOps, SRE, or Platform Engineering with production experience operating AWS infrastructure.
• Deep hands-on experience administering Kubernetes (EKS or equivalent) and deploying via GitOps (Argo CD or Flux).
• Proficiency with Infrastructure as Code using Terraform; comfort with Terragrunt or a similar wrapper.
• Hands-on experience designing and maintaining CI/CD pipelines, preferably with GitHub Actions and reusable workflows.
• Production experience operating distributed systems such as Kafka (MSK).
• Strong understanding of networking, DNS, TLS, and security best practices, including IdP-driven access control (Auth0, Entra ID, or similar).
• Solid experience with monitoring and logging stacks such as Grafana, Loki, Prometheus, Mimir, or equivalents.
• Ability to debug complex production issues across infrastructure, deployment, and networking layers.
• Comfortable working in Linux environments with strong scripting skills (Python or Bash preferred for automation).
• Knowledge of version control workflows, automated testing, and release management.
Bonus Skills
Your application will have a higher chance of standing out if you have one (or more) of the following skills or experiences.
• Experience operating Apollo Router / GraphQL federation gateways in production.
• Experience operating Argo Workflows or similar Kubernetes-native job / pipeline runners in production.
• Familiarity with Databricks or ML Ops pipelines for data and model deployment.
• Experience designing, operating, and exercising Disaster Recovery (DR) environments, including cross-region replication, backups, and tested failover runbooks.
• Experience with Tailscale or other zero-trust networking tools.
• Experience supporting IoT / embedded fleets at scale, including secure device-to-cloud connectivity.
• Experience in high-growth startup environments where you must wear many hats.
Location & Schedule
This is a hybrid role requiring a minimum of 2–3 days per week in the office. We are located in San Francisco.
Benefits
We offer competitive benefits that help employees thrive, grow, and enjoy their lives. These include:
• Health, dental, and vision insurance
• Commuter allowance
• Stock option plan
• Conveniently located office directly across the street from Pleasant Hill BART, close to the highway, with free parking provided
• 401(k) plan
Why Choose Hyperion360?
- Remote-First Culture: Work from anywhere with flexible hours
- Top-Tier Clients: Partner with Fortune 500 companies and top startups
- Professional Growth: Continuous learning and development opportunities
- Competitive Compensation: Market-leading salaries and benefits
- Global Team: Collaborate with talented professionals worldwide
Ready to take your career to the next level? Apply today and become part of Hyperion360’s elite team!