Software Engineer, CentML

Tools and Infrastructure

Salary not provided

+ Employee stock options

AWS
Docker
Kubernetes
GCP
Python
Bash
Azure
Pulumi
Mid and Senior level
San Francisco Bay Area
Toronto
CentML

Machine learning training platform

Job no longer available

21-100 employees

B2C
Artificial Intelligence
Machine Learning
SaaS

Company mission

To pioneer novel technology to enhance computing efficiency, making AI accessible for innovation and to benefit the global community.

Role

Who you are

  • 3+ years of experience in Site Reliability Engineering, DevOps, or related roles with extensive experience in performance testing and managing large-scale infrastructure
  • Proven track record of building and operating highly reliable, scalable, and secure systems in a production environment
  • Deep expertise in CI workflows (e.g., GitHub Actions), cloud platforms (e.g., AWS, GCP, Azure), containerization (e.g., Docker, Kubernetes), and infrastructure-as-code (e.g., Pulumi)
  • Advanced proficiency in scripting and automation using languages such as Python, Bash, or similar
  • Strong understanding of distributed systems, networking, and storage solutions, with the ability to architect complex systems from the ground up
  • Excellent problem-solving skills, with a proactive approach to identifying and resolving issues before they impact the business
  • Strong communication and collaboration skills, with the ability to work effectively across different teams and stakeholders
  • Ability to operate effectively in a fast-paced, dynamic startup environment, with a focus on delivering results

What the job involves

  • As a Site Reliability Engineer, you will play a pivotal role in shaping the infrastructure and reliability practices at CentML
  • You will be responsible for working on complex projects and collaborating with cross-functional teams to ensure our systems meet the highest standards of reliability, performance, and security
  • We’re looking for a Site Reliability Engineer with a strategic focus on performance optimization and testing
  • Build large-scale, distributed systems that support complex workloads, ensuring high availability and fault tolerance
  • Contribute towards efforts in automation, configuration management, and infrastructure-as-code, minimizing manual operations and ensuring consistency
  • Optimize the performance and scalability of our systems, identifying and addressing bottlenecks before they impact users
  • Participate in incident response efforts, including real-time troubleshooting, root cause analysis, and postmortem reviews
  • Develop and maintain comprehensive monitoring, alerting, and logging systems that provide deep visibility into system health and performance
  • Drive continuous improvement in system reliability, performance, and scalability through the adoption of new technologies, tools, and methodologies
  • Stay current with industry trends and innovations in SRE and ML infrastructure, bringing new ideas and approaches to the team

Company

Company benefits

  • An open and inclusive culture and work environment
  • Fully stocked kitchen at the office
  • Full health and dental benefits
  • Parental leave top-up for 6 months
  • Continuous education budget
  • Generous vacation - we're not saying unlimited, but if you need extra time to recharge, just ask

Funding (2 rounds)

Sep 2023

$27.1m

SEED

Jun 2022

$3.5m

SEED

Total funding: $30.6m

Our take

In an increasingly AI- and ML-driven world, demand for these technologies is skyrocketing along with their costs, leaving many companies without access to tools that could enhance their operations. CentML aims to democratize AI and ML by making them more accessible and cost-effective for all.

Backed by a team with extensive expertise in AI, ML compilers, and ML hardware, CentML possesses a deep understanding of the inefficiencies prevalent in the industry. Among the challenges it addresses is the scarcity of AI chips. By meticulously analyzing clients' AI/ML requirements, CentML advises on suitable hardware options to optimize performance and minimize costs.

Since its inception in 2022, CentML has quickly attracted attention and funding, underscoring the market's appetite for its offerings. Recent funding will enable the company to further refine its product and conduct pivotal research in the field, solidifying its position as a pioneering force in democratizing AI and ML technologies.

Kirsty

Company Specialist at Welcome to the Jungle