Lead Site Reliability Engineer (SRE) - Based in London

Job description:

We’re currently looking for a Lead Site Reliability Engineer (SRE) to join our Platform team and help build our SRE functions. This is a leadership role, but it is also very hands-on.

We’re looking for SREs who are software engineers at heart - you’re as comfortable writing software to solve problems as you are operating AWS or Kubernetes. If you’re a software engineer who has some good cloud infrastructure experience already, or you’re eager to get really familiar with systems, tooling and libraries, this could be the role for you.

As a team, we’re responsible for designing, building, and operating the services we consume from AWS, along with the software we run on top, such as Kubernetes, Kafka, Redis, PostgreSQL and more. We’re also responsible for operating our network and for being on call for the things we own and run.

To achieve this, we’re organised into three squads within the Platform Universe: Platform Engineering, Data Engineering, and Operations. Each squad is responsible for solving a specific set of problems for our customers and our engineers. In this role, you will be joining our Operations SRE squad.

We’re investing heavily in modernising our platform and moving to a more sustainable architecture. Help us build a state-of-the-art microservices platform to power the future’s biggest brands.

Key Responsibilities:

  • Design and implement automation tools and frameworks to streamline our operations and deployment processes. This will involve creating new tools as well as improving existing ones.
  • Participate in architecture and design reviews to ensure that our systems are scalable, reliable, and secure. 
  • Build, maintain and continuously improve our monitoring, alerting, and logging systems. 
  • Identify and troubleshoot production issues and resolve them quickly, working with other teams where needed.
  • Collaborate with development teams to ensure that our systems are designed and built to be reliable, robust, and scalable.
  • Define and improve our incident response processes and procedures. 
  • Monitor and report on system performance and availability. 
  • Mentor and train other engineers on SRE best practices.

At MangoPay you will get to work with a lot of exciting new technology.

We rely heavily on the following tools and technologies:

  • AWS Cloud
  • Kubernetes (EKS)
  • Terraform
  • Git (GitLab)
  • TeamCity, Octopus Deploy
  • .NET Core, .NET Framework 4.8 (migrating to .NET 6)
    • NHibernate
    • Entity Framework
  • Java
    • Micronaut
    • GraalVM (native compilation)
  • RabbitMQ (MassTransit)
  • SQL Server
  • Elasticsearch
  • Redis
  • Kafka (MSK)

