Senior Staff Site Reliability Engineer, Dremio

$166.3-225k

AWS

Kubernetes

GCP

Python

Java

Terraform

Azure

Flux

Senior and Expert level

Remote from US

Open data lakehouse

Job no longer available

201-500 employees

B2B, Enterprise, Big data, Analytics, SaaS, Data Analysis

Company mission

To shorten the distance to data by removing barriers, accelerating time to insight, and putting control in the hands of the user.

Role

Who you are

7+ years of relevant experience in the following areas: SRE, DevOps, Distributed Systems, Cloud Operations, Software Engineering
Expertise in Kubernetes, Istio, Terraform, ArgoCD/Flux
Expertise with software defined networking infrastructure: dedicated and partner interconnects, VPNs, BGP
Excellent command of cloud services on GCP/AWS/Azure, CI/CD pipelines
Moderate to advanced experience in Python/Go, and at least a reading knowledge of Java
Interested in designing, analyzing and troubleshooting large-scale distributed systems
Have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
Great ability to debug and optimize code and automate routine tasks
Solid background in software development and architecting resilient and reliable applications

Desirable

Hands-on experience with large-scale production Kubernetes clusters (>=1000 nodes)
Developed SLIs/SLOs for production systems
Hands-on experience using Honeycomb for OpenTelemetry trace analysis
Engagement with the Learning from Incidents community and familiarity with the seminal work of Allspaw, Woods, Cook et al.

What the job involves

Drive continuous improvements to our usage of Kubernetes, our Operators, and the GitOps deployment paradigm
Extend our networking, service mesh and Kubernetes systems to support connectivity between GCP, AWS and Azure
Collaborate with Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, monitoring/alerting, capacity planning, production readiness and service reviews
Help define and instrument service level indicators and objectives (SLIs/SLOs) with service owners in the Engineering teams. Develop SLO-based on-call strategies for service owners and their teams
Collaborate within our virtual Observability team: develop and improve observability (tracing, events, metrics, profiling, logging and exceptions) of the Dremio Cloud product
Debug and optimize code written by others and automate routine tasks. You recognize complexity and know multiple techniques for managing it, but you also see the folly in complete rewrites
Evangelize and advocate for resilience engineering and reliability practices across our organization
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity
Join an on-call rotation for systems and services that the SRE team owns
Practice sustainable incident response and post-incident investigation and analysis, using techniques developed in and around the Learning from Incidents community
Drive the cultural, technical, and process changes to move towards a true continuous delivery model within the company

Insights

-18% employee growth in 12 months

Glassdoor (4)

Company

Company benefits

Medical, dental and vision insurance
401(k) Plan
Short term / long term disability and life insurance
Pre-IPO stock options
Flexible PTO
16 hours of volunteer time off
12 company paid holidays, including Juneteenth
Hybrid workplace
Monthly “Get Stuff Done” (GSD) Days
Paid parental leave
Employee Assistance Program (EAP)
Quarterly swag surprise

Funding (last 2 of 6 rounds)

Jan 2022: $160m, Series E

Jan 2021: $135m, Series D

Total funding: $410m

Our take

Data lakehouse platform Dremio bridges the gap between data warehouses and data lakes, helping data engineers, analysts and scientists streamline, curate, and query raw data that has not been shaped for a specific purpose, so they can quickly build analytics stacks. The self-service platform enables users to create datasets from a multitude of sources and can be deployed on Kubernetes clusters, AWS, and Azure.

The company is tackling the longstanding, time-draining problem of having to manually extract raw data and load it into data warehouses, providing customers with low-cost, sub-second SQL queries for BI on the data lakehouse. Its offering has captured the attention of some big-name clients including Samsung, Bose, HSBC, and Microsoft, and the company has raised substantial funding.

Dremio is well placed for continued growth as it innovates on its data offerings, with a focus on security and privacy. A partnership with PlainID, provider of a central access control platform, will allow clients to manage and control access to data via Dremio, and the company has also achieved HIPAA certification.

Kirsty

Company Specialist at Welcome to the Jungle