Sr Software Engineer - Compute Data Plane

Netflix

United States Remote

Full time

Software Engineering / Software Developer

Sep 1

About the Team:

We build and support products such as Titus, our multi-tenant Kubernetes-based container platform and runtime, which enable and enhance fleet-wide agility, efficiency, and reliability while empowering engineers through valuable abstractions and reduced operational burden. We take an active role in driving our compute primitives forward, evolving them to meet our customers’ needs.

The Compute Platform’s products power workloads across our Data Platforms, Stream Processing, Studio and Content, Encoding, Streaming, Content Delivery, Machine Learning, and Engineering Tooling. We provide a worldwide, highly available compute fabric that launches over a million containers every day. As a critical component of our streaming service, the container management platform is a tier-one system. The team not only designs and develops this tier-one system but also operates and supports it 24x7.

About the Role:

We are looking for a Senior Software Engineer to join us in growing our Compute Platform built on our Kubernetes-based container platform. In this role, you will work on our container orchestration system, with a focus on enabling a best-in-class experience for machine learning and big data processing workloads, and batch-enabled platforms running on top of our products.

People who do well in this role are self-motivated engineers experienced at building and supporting distributed systems, who love to delight the customer both by regularly delivering value and by providing stellar support for our products. A proven ability to tackle complex, ambiguous problems and deliver quality results quickly is essential for this role.

We believe talent is equally geographically distributed but opportunities are not. Our US-based team is happy to embrace remote work, and our general support hours are 10am - 4pm Pacific Time. We believe safe spaces where everyone can be their authentic selves are the key to a strong team, so we welcome and embrace all identities, cultures, and backgrounds.

What we are building

  • A globally available and extensible container runtime and orchestration platform, built on Kubernetes
  • Advanced and industry-leading ML-based scheduling across service and batch jobs, including capacity management, bin packing, over-subscription, fault tolerance, and cross-workload optimization
  • An operationally resilient and global-scale control plane
  • An intelligent and full-featured batch system supporting optimized scheduling and management of all workload types
  • Linux and container runtimes providing industry-leading security and multi-tenant isolation, with deep integration with AWS EC2 networking and security as well as Netflix platform infrastructure systems

Primary Responsibilities

  • Creating clarity within ambiguity to produce and execute designs and plans
  • Participating in the creation and curation of a fantastic team culture
  • Championing projects, managing and communicating impact, and delivering results
  • Collaborating with the team, Product Managers, partners, and stakeholders on our roadmap
  • Operating our systems and responding to incidents, issues, and user support requests as part of an on-call rotation
  • Evolving the platform to solve novel challenges while handling web-scale load

Skills we are looking for

  • Ability to break down abstract problems into concrete solutions
  • Demonstrated experience in improving the reliability and operational automation of complex, multi-tier systems
  • Experience beyond the usage of container management platforms and/or container runtimes. Specifically, we are looking for engineers who have extended and improved these platforms, not just operated them.
  • Experience with addressing performance issues across the whole stack from applications to operating systems
  • Ability to program across the core project languages Java and Golang

What sets you apart

  • Experience building a business-critical large-scale distributed system with extreme availability
  • Understanding of systemic security challenges within infrastructure-as-a-service offerings
  • Demonstrated community advocacy or open source contribution
  • Deep experience with batch systems such as Luigi, Airflow, AWS Batch, Chronos, or other big data infrastructure platforms

What might be interesting to you

  • Identifying and implementing improvements to systems and architecture with Netflix-wide impact
  • Working with stunning colleagues at the top of their field
  • A demonstrated commitment (including funding and time) to creating a more inclusive working environment and diverse workforce
  • The ability to choose between working remotely or in the office

Does this sound interesting? Or does this sound interesting-but-intimidating? Please don’t self-select out; let’s figure it out together. We’d love to talk to you!
