Service Reliability Engineer - Southern California - Permanent
SERVICE RELIABILITY ENGINEER/SYSTEMS ADMINISTRATOR
Our SREs focus on three things: overall ownership of production, production code quality, and deployments.
We expect our SREs to have opinions on the state of our network, what we are doing right, and what we can do better. They are empowered to decide when new features are ready for production, and they work with other teams to make sure our requirements are met as early in the lifecycle as possible.
• 5+ years in either Software Development or Systems Administration (or both!): We expect you to be knowledgeable in one or two core fields and open to coming up to speed quickly in everything else.
• Strong interpersonal and communication skills: You will interact with other teams on a daily basis.
• A strong sense of responsibility: SREs are largely self-directed and are key decision makers, so they must take pride in the part(s) of production they own.
• Availability for on-call duty: There will be times when your expertise is needed outside of core hours.
• Development experience in one or more languages: SRE tools are primarily written in Python and Bash, but SREs often need to inspect code in other languages such as Java, Node.js, C++, and Ruby.
• Comfortable at a Bash prompt: The ideal candidate will also be familiar with *nix debugging tools, both at the system level (lsof, strace, tcpdump) and at the code level (gdb, jvisualvm, etc.).
• One or more of the following areas of expertise:
o SQL (ideally MySQL or Postgres)
o NoSQL at scale (ideally Hadoop, MongoDB clusters, and/or sharded Redis)
o Event Aggregation (e.g. Graphite, Zenoss, Flume, Splunk)
o Virtualization (ideally in-house clouds using OpenStack or Eucalyptus)
o Release Engineering (Package management and distribution at scale)
o Load Testing (QA or SDET experience is a big plus)
Skills: administration, debugging, engineering, management, QA