We are looking for a skilled Lead DevOps Engineer to join our team and help keep our platform secure, scalable, and highly reliable. In this role, you will work closely with engineering leadership to set the technical direction of our infrastructure so that it scales to support our growth. You will play a key role in building a DevOps culture across the company, empowering developers and maintaining operational excellence. This is an exciting opportunity for someone with deep technical knowledge who enjoys mentoring others and thrives in a dynamic, fast-paced environment.
Key Responsibilities
· Platform Reliability & Availability: Collaborate with the Engineering Director and Principal Engineer to define the technical direction of our infrastructure, keeping it reliable and highly available while ensuring it scales cost-effectively to support our growth.
· Infrastructure Management: Use tools such as Terraform, GitHub Actions, and scripting languages to manage and optimize our infrastructure and CI/CD systems.
· Cloud Infrastructure Expertise: Become an expert in our technology stack, including AWS (RDS, ECS, EC2, S3, Lambda), Cloudflare, Redis, DNS, Docker, and the rest of the infrastructure platform.
· Observability & Incident Response: Use observability tools such as Datadog and Amazon CloudWatch to monitor platform health, troubleshoot performance issues, and identify root causes. Participate in the on-call rotation to respond to incidents.
· Disaster Recovery & Incident Management: Ensure disaster recovery and incident response plans are regularly exercised and improved, using industry practices like gamedays and chaos engineering.
· Developer Experience: Own and improve the developer experience by refining the development, testing, and continuous deployment processes so that shipping changes is safer, faster, and easier for engineers.
· CI/CD Leadership: Bring deep expertise in CI/CD principles, empowering engineers to deliver high-quality services to production continuously.
· Mentorship & Collaboration: Support software engineers by pairing, mentoring, and demonstrating effective engineering practices. Help engineers understand the production deployment process and how to debug performance issues.
· DevOps Practice Leadership: Lead and own platform engineering decisions, and ensure that every engineer on the on-call rotation is well prepared for incident response.
Requirements
· Experience architecting and supporting cloud-native web application infrastructure, ideally using AWS services like RDS, ECS, EC2, S3, and Lambda.
· Hands-on experience with containers and schedulers (e.g., Amazon ECS) and expertise with infrastructure-as-code tools such as Terraform.
· Strong understanding of Linux, networking, and security.
· Experience supporting database administration and performance, with a focus on scalability and maintainability.
· Passion for automating processes and improving the developer experience.
· Experience working in a DevOps environment, closely collaborating with software engineers.
· Proficiency in version control (e.g., Git) and the ability to use it effectively to structure and communicate your work.
Good to Have
· Programming experience in Ruby, JavaScript, or Go.
· Experience managing relationships with third-party suppliers, such as AWS and Cloudflare.
· Familiarity with gamedays, chaos engineering, and other industry practices to enhance platform resilience.
· Experience with disaster recovery and business continuity planning.