Senior Staff Site Reliability Engineer, Dremio

$166.3-225k

AWS
Kubernetes
GCP
Python
Java
Go
Terraform
Azure
Flux
Senior and Expert level
Remote from US
Dremio

Open data lakehouse

Job no longer available

201-500 employees

B2B, Enterprise, Big data, Analytics, SaaS, Data Analysis

Company mission

To shorten the distance to data by removing barriers, accelerating time to insight, and putting control in the hands of the user.

Role

Who you are

  • 7+ years of relevant experience in the following areas: SRE, DevOps, Distributed Systems, Cloud Operations, Software Engineering
  • Expertise in Kubernetes, Istio, Terraform, ArgoCD/Flux
  • Expertise with software defined networking infrastructure: dedicated and partner interconnects, VPNs, BGP
  • Excellent command of cloud services on GCP/AWS/Azure, CI/CD pipelines
  • Moderate to advanced experience in Python/Go, and at least a reading knowledge of Java
  • Interested in designing, analyzing and troubleshooting large-scale distributed systems
  • Have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
  • Great ability to debug and optimize code and automate routine tasks
  • Solid background in software development and architecting resilient and reliable applications

Desirable

  • Hands-on experience with large-scale production Kubernetes clusters (>=1000 nodes)
  • Developed SLIs/SLOs for production systems
  • Hands-on experience using Honeycomb for OpenTelemetry trace analysis
  • Engagement with the Learning from Incidents community and familiarity with the seminal work of Allspaw, Woods, Cook et al.

What the job involves

  • Drive continuous improvements to our usage of Kubernetes, our Operators, and the GitOps deployment paradigm
  • Extend our networking, service mesh and Kubernetes systems to support connectivity between GCP, AWS and Azure
  • Collaborate with Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, monitoring/alerting, capacity planning, production readiness and service reviews
  • Help define and instrument Service Level indicators and objectives (SLIs/SLOs) with service owners in the Engineering teams. Develop SLO-based on-call strategies for service owners and their teams
  • Collaborate within our virtual Observability team: develop and improve observability (tracing, events, metrics, profiling, logging and exceptions) of the Dremio Cloud product
  • Debug and optimize code written by others and automate routine tasks. You recognize complexity and are familiar with multiple techniques to manage it, but you also recognize the folly of complete rewrites
  • Evangelize and advocate for resilience engineering and reliability practices across our organization
  • Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity
  • Join an on-call rotation for systems and services that the SRE team owns
  • Practice sustainable incident response and post-incident investigation and analysis, using techniques developed in and around the Learning from Incidents community
  • Drive the cultural, technical, and process changes to move towards a true continuous delivery model within the company

Insights

-18% employee growth in 12 months

Company

Company benefits

  • Medical, dental and vision insurance
  • 401(k) Plan
  • Short term / long term disability and life insurance
  • Pre-IPO stock options
  • Flexible PTO
  • 16 hours of volunteer time off
  • 12 company paid holidays, including Juneteenth
  • Hybrid workplace
  • Monthly “Get Stuff Done” (GSD) Days
  • Paid parental leave
  • Employee Assistance Program (EAP)
  • Quarterly swag surprise

Funding (last 2 of 6 rounds)

  • Jan 2022: $160m (Series E)
  • Jan 2021: $135m (Series D)

Total funding: $410m

Our take

Data lakehouse platform Dremio bridges the gap between data warehouses and data lakes, helping data engineers, analysts and scientists streamline, curate, and query raw data that has not been shaped for any specific purpose, so they can quickly build analytics stacks. The self-service platform lets users create datasets from a multitude of sources and can be deployed on Kubernetes clusters, AWS, and Azure.

The company is tackling the longstanding, time-draining problem of having to manually extract raw data and load it into data warehouses, providing customers with low-cost, sub-second SQL queries for BI on the data lakehouse. Its offering has captured the attention of some big-name clients including Samsung, Bose, HSBC, and Microsoft, and has raised substantial funding.

Dremio is well placed for continued growth as it keeps innovating on its data offerings, with a focus on security and privacy. A partnership with PlainID, provider of a central access control platform, will allow clients to manage and control access to data via Dremio, and the company has also achieved HIPAA certification.

Kirsty

Company Specialist at Welcome to the Jungle