Site Reliability Engineer, Scaleway

High Performance Computing / Artificial Intelligence

Salary not provided
Python
Elasticsearch
Bash
Linux
Go
Kibana
Sentry
Ansible
MySQL
Prometheus
CUDA
Grafana
GSuite
Ubuntu
JIRA
Confluence
Slack
Junior, Mid and Senior level
Paris
Scaleway

Complete cloud ecosystem

Open for applications

501-1000 employees

B2B, API, Cloud Computing

Company mission

To power our fast-growing global digital infrastructure in a smart, responsible and renewable way.

Role

Who you are

  • We expect you to have a strong background in HPC environments and system administration, along with some DevOps experience and knowledge of SRE best practices
  • Experience in system programming using at least one of these languages: Python, Bash, Go, etc.
  • Demonstrated ability to troubleshoot production system failures
  • A positive mindset and desire to work with a team
  • Passion for automation and incremental improvements on tooling
  • Experience with Linux systems based on Debian and CentOS derivatives
  • Experience with batch job schedulers like Slurm, OAR, SGE
  • Good understanding of computer networks: TCP/IP, DNS, load balancing, IPv6, firewall, network, InfiniBand, VLAN/partition, …
  • Storage knowledge: large pools, NAS, S3
  • Experience with Nvidia, CUDA, MPI
  • Good command of English

Desirable

  • Ability to meticulously identify and solve any kind of bug in any codebase
  • Experience with infrastructure-as-code and continuous deployment
  • Experience dealing with physical hardware automation
  • Experience with monitoring & logging systems
  • Experience handling account management (LDAP)
  • Knowledge of at least one cloud platform and related use-cases
  • Experience as an OSS contributor and/or maintainer
  • Knowledge of AI / LLM / ML / neural networks

What the job involves

  • Reporting to our Engineering Manager Emerick Mounoury, you will be responsible for ensuring the deployment and health of the components of our multiple HPC clusters built on Nvidia hardware
  • Create new tools and documentation, or optimize existing ones, to help identify, diagnose, and solve production incidents, automating as much as possible
  • Troubleshoot high-impact issues by working with multiple Engineering teams (Storage, Network, Hardware)
  • Take on-call responsibilities, mitigate issues encountered in production and answer our customers in real time
  • Ensure a high quality of service for our customers by leveraging observability and monitoring technologies
  • Manage the life cycle of HPC clusters in production and take part in escalating hardware and software issues to our suppliers
  • Empower your teammates to swiftly integrate and deploy software components across our systems
  • Help implement best practices for stability, resiliency, scalability, security, and performance across our systems

Application process

  • Screening call - 30 mins with the recruiter
  • Manager Interview - 45 mins
  • Technical Interviews - 1h 30 mins
  • HR Interview - 45 mins
  • Offer sent - 48 hours
  • On average our process lasts 2-3 weeks and offers usually follow within 48 hours 🤞

Otta's take

Sam Franklin

CEO of Otta

Cloud computing is growing quickly in popularity and use, and many companies are turning to the cloud to improve their operations. Scaleway enables developers and businesses to build, deploy and scale their applications for the cloud, with a focus on providing a range of options so that clients can access exactly what they need.

Scaleway, which labels itself “the cloud of choice”, delivers fully managed offerings for bare metal, containerization, and serverless architectures. It is especially attractive to startups, as it offers more flexibility than alternative cloud providers, and it boasts a smooth developer experience, carbon-neutral data centers, and native tools for managing multi-cloud architectures.

Scaleway is used by over 25,000 businesses, including Golem.ai and Aternos. In line with its focus on bringing the cloud to startups, the company has launched a “Startup Program” through which participants can earn cloud credits and gain access to technical expertise and masterclasses. Scaleway has also partnered with Clever Cloud to deliver an efficient alternative for the automation of app deployment, enriching the product catalogues of both companies.

Insights

Led by a woman

Many candidates hear back within 2 weeks

19% employee growth in 12 months

Company

Company benefits

  • Yummy: Whether you're working at La Maison or in a Datacenter, you get fresh breakfasts and snacks every day to provide the fuel you need to innovate.
  • Events, Hackathons, & Tech talks: We do believe that peer-to-peer sharing is essential, so we provide media-training courses, event planning support and conference tickets. No more excuses!
  • Company outings: Work hard, play hard! We have an in-house bar with a large choice of beverages and games. Plus, we organize memorable & engaging (virtual or not) team events at least twice a month.
  • MacBook Pro or ThinkPad laptop: Choose your laptop, a MacBook Pro or a ThinkPad 15", on your first day to get your job done.
  • Flexrything & Remote: We choose to work in a flex office environment to grow new collaborative ideas and practices. We work in small product-oriented teams to focus & execute faster. We don't track working hours; what matters is what you get done.
  • No diplomas required: At Scaleway, we don't care about your degrees; we only care about what you know. Your skillset will be evaluated based on your technical experience and skills.

Company values

  • Singularity: Seek singularity. We look out for people who stand out, no matter their professional or personal background.
  • Community: When we agree to disagree, we do so with passion and respect, but we’re always rowing in the same direction toward the same goal.
  • Adventure: Scaleway is a challenge nobody’s ready for but we go for it anyway. Move, reinvent. The only constant is change. We never stop at the status quo.
  • Leadership: Leadership is about action, not just a position. As effective leaders, we have a major impact not only on the team members we manage, but also on the company as a whole.
  • Excellence: We want to be our customers' first choice. We never stop iterating and looking for that 1% improvement, every day.
  • Rock-Solid: Everything is connected. We take actions as a team, never in silos. We give credit where credit is due and we can always count on each other.

Company HQ

Madeleine, Paris, France

Founders

Albane Bruyas

(CCO and COO (Not Founder))

Previously Advisory Board Member at Alyx. Previously Operations Manager at Toshiba, Consultant at Harpagon and Purchaser at LVMH.
