Software Engineer, Anyscale

Model Training Infrastructure

$170.1k-$237k

+ Stock Options

AWS
Kubernetes
GCP
TensorFlow
PyTorch
Senior and Expert level
San Francisco Bay Area

3+ days a week in office

Anyscale

Open-source framework for AI software development

Open for applications

201-500 employees

B2B, Artificial Intelligence, Enterprise, Machine Learning, SaaS

Company mission

To remove distributed systems expertise from the critical path of realizing the business potential of AI.

Role

Who you are

  • We’re particularly interested in engineers who can help shape and execute a vision for the future of ML training infrastructure
  • We welcome both Individual Contributors and technically inclined individuals with experience managing small teams
  • 5+ years of experience building, scaling, and maintaining software systems in production environments
  • Strong fundamentals in algorithms, data structures, and system design
  • Proficiency with machine learning frameworks and libraries (e.g., PyTorch, TensorFlow, XGBoost)
  • Experience designing fault-tolerant distributed systems
  • Solid architectural skills

Desirable

  • Experience with cloud technologies (AWS, GCP, Kubernetes)
  • Hands-on experience building ML training platforms in production
  • Background in managing and maintaining open-source libraries
  • Experience leading small teams to achieve ambitious technical goals
  • Familiarity with Ray

What the job involves

  • We’re looking for passionate, motivated engineers excited to build infrastructure and tools for the next generation of machine learning applications
  • We’re hiring exceptional Software Engineers for our distributed training team, which develops and maintains widely adopted open-source machine learning libraries
  • The Distributed Training team drives the development and optimization of Ray’s distributed training libraries, focusing on features and performance enhancements for large-scale machine learning workloads
  • They are responsible for building and maintaining core libraries like Ray Train (for distributed model training) and Ray Tune (for distributed hyperparameter tuning); a brief Ray Train sketch follows this list
  • You’ll collaborate closely with the Ray Core and Ray Data teams to create impactful, end-to-end solutions, and have the exciting opportunity to work directly with Machine Learning teams around the globe, shaping products that are transforming the AI landscape
  • Develop scalable, fault-tolerant distributed machine learning libraries that power leading ML platforms
  • Create an exceptional end-to-end experience for training machine learning models
  • Solve complex architectural challenges and transform them into practical solutions
  • Contribute to and engage with the open-source community, collaborating with ML researchers, engineers, and data scientists to build new scalable machine learning abstractions
  • Share your work and expertise with a broader audience through talks, tutorials, and blog posts
  • Collaborate with a team of experts in distributed systems and machine learning
  • Work directly with end-users to iterate on and enhance the product based on their feedback
  • Partner with engineering and product managers to nurture a talented team of software engineers
  • Play a key role in building and shaping a world-class company

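To give a flavour of what these libraries look like in practice, here is a minimal sketch of a Ray Train job. It assumes a recent Ray 2.x with the Torch integration installed (for example `pip install "ray[train]" torch`); the tiny model, synthetic data, and hyperparameters are illustrative assumptions, not details from the posting.

```python
# Minimal Ray Train sketch (assumes a recent Ray 2.x; illustrative only).
import torch
import torch.nn as nn

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Each Ray Train worker runs this loop; prepare_model wraps the model
    # for data-parallel training across the workers.
    model = ray.train.torch.prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()

    for epoch in range(config["epochs"]):
        # Toy synthetic batch; a real job would use a DataLoader
        # (via ray.train.torch.prepare_data_loader) or Ray Data.
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Report per-epoch metrics back to the trainer.
        ray.train.report({"epoch": epoch, "loss": loss.item()})


if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 1e-2, "epochs": 3},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    )
    result = trainer.fit()
    print(result.metrics)
```

Ray Tune, mentioned above, can wrap a trainer like this one to run distributed hyperparameter sweeps over values such as the learning rate.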

Insights

Top investors

133% employee growth in 12 months

Company

Company benefits

  • We offer a wide range of health, dental, and vision coverage options for you and your family — including many that are 99% covered by Anyscale
  • Whether you’re hopping on public transit or driving in yourself, Anyscale covers a portion of your commuting costs each month
  • Lunch is served every day in our San Francisco office. Dinner, too, if you’re ever working late. And did we mention daily boba runs?
  • Give back to the causes and communities you love with paid time off specifically for volunteer work
  • Anyscalers can take advantage of 12 weeks of paid leave following the arrival of a new little one
  • Paid time off at Anyscale is flexible and unlimited. We encourage all Anyscalers to rest up and recharge when needed

Funding (last 2 of 4 rounds)

  • Aug 2022: $99m (Series C)
  • Dec 2021: $100m (Series C)

Total funding: $259.6m

Our take

Anyscale is a platform, built on its open-source Ray framework, that facilitates the development of distributed applications with high computing demands. It answers a growing requirement for higher-level tech strategies among companies incorporating AI and machine learning into their products, and a similarly broad and growing demand for easier access to cloud programming.

Anyscale launched its first commercial offering in 2021, and intends to continue developing products that lower the skill and resource threshold for cloud programming. In 2023 it released Aviary, a project to simplify open-source large language model (LLM) deployment. What is promising, and lucrative, about its proposition is that it offers a general-use distributed system, which sidesteps the cumbersome process of stitching together disparate distributed systems that novel applications previously required. It is currently finding use in a range of applications, such as supply chain, environmental restoration and retail, by organisations including Amazon and Uber.

Anyscale has raised considerable funding led by Andreessen Horowitz and Addition. This is being used to scale its team and further develop the Ray platform. As the demand for AI apps rapidly increases, Anyscale is poised to disrupt a $13 trillion market.

Kirsty

Company Specialist at Welcome to the Jungle