Site Reliability Engineer – Storage

Location: Sacramento, California, United States

THE ROLE:

A quickly maturing startup is seeking a like-minded Site Reliability Engineer! The technical team is a small, talented, close-knit group, and we need development and systems help to keep business and development operations flowing smoothly.

As a well-rounded site reliability engineer, you should be the type who appreciates variety in your day and challenges outside your comfort zone!

WHAT YOU'LL BE DOING:

Managing and automating the care of Linux systems and a large number of disks at scale.
Extending the server configuration management systems with new features using Salt.
Refactoring existing system management in Ansible as needed, or migrating to Salt.
Working autonomously, or with the software engineering team, to troubleshoot and solve complex or unintuitive system issues.
Working with the software engineers to achieve 100% self-service automation of build pipelines.

WHAT YOU BRING:

You are a well-rounded systems engineer and scripter with a diverse set of skills, which makes you one of the very best people to troubleshoot, monitor the platform, and stay on top of releases.

Experience working in an environment that relies on remote communication and collaboration tools such as Slack and Zoom across multiple time zones
Experience with Git in a multi-contributor/team environment
High degree of drive to improve and automate your environment with minimal guidance
Ability to solve immediate problems while planning for future ones
Experience in automating tasks through scripting. You should be able to use Python and be familiar with a variety of packages.
Extensive experience administering a variety of Linux distributions
Extensive experience with Ansible, Salt, and Terraform
Experience with bare metal hardware including physical servers, JBODs, physical cabling, and networking equipment.
Experience with ZFS, XFS, GPFS, Ceph, or other local and distributed file systems
Solid understanding of web protocols and technologies such as HTTP, HTTP/2, TLS, Server-Sent Events, and CDNs
Solid understanding of nginx and SSL/TLS

PREFERRED EXPERIENCE:

Experience with Grafana
Experience managing Cassandra installations
Experience with PXE-based deployments
Experience with a message queue system like RabbitMQ or Kafka
Experience with build pipelines, integration testing, Jenkins, and GitHub Actions

REQUIREMENTS:

You can be located anywhere in the world, but we keep a balanced distribution across time zones. Currently this role is open only to those who can work standard North American working hours (workday starting somewhere between UTC-5 and UTC-8).
