Location: 100% Remote (Europe-based)
Preferred Markets: Poland, Bulgaria, Kosovo, North Macedonia, Ukraine, Romania, or Türkiye
Eligibility: Must be located in Europe with valid work permissions. No sponsorship or visa support available.
Language: English (C1+)
The Role
As a Site Reliability Engineer (SRE), you act as the critical bridge between software development and operations. Your mission is to enable "reliable speed" for our clients, empowering them to leverage the full benefits of continuous deployment without compromising customer experience.
You will embed with multidisciplinary teams in a DevOps environment, ensuring a laser focus on production stability while building the facilities required to maintain it.
We are looking for a technical leader who can mobilise and motivate teams. In this role, you will be the go-to expert for determining production robustness, defining reliable deployment procedures, analysing failure scenarios, and engineering solutions to mitigate them.
Key Responsibilities
- Reliability Engineering: Collaborate with product and engineering teams to define and implement SLIs and SLOs.
- Observability: Design and build comprehensive observability systems that give deep visibility into application health.
- Failure Analysis: Lead the analysis of failure scenarios and develop potential mitigations.
- Resilience: Create and maintain runbooks to remediate or proactively prevent failure scenarios.
- Toil Reduction: Identify and automate repetitive work that does not add value, freeing up time for engineering challenges.
- Incident Management: Participate in and facilitate incident management processes, including taking part in the on-call rotation.
Qualifications & Experience
Essential Background
- Communication: Excellent command of English (C1 or above) with strong assertive communication skills.
- Experience: 5+ years in Software Engineering, DevOps, QA, or Cloud Engineering, with at least 2 years specifically as a dedicated Site Reliability Engineer.
- Leadership: Proven ability to take the lead, make decisions, and coach development teams to make architectural choices that favour reliability.
- Context: Experience working in large corporate environments and international contexts involving both onshore and offshore teams.
Technical Expertise
- Cloud & Infrastructure: Basic to intermediate knowledge of serverless services in public clouds (AWS, Azure, GCP). Deep experience with AWS is highly preferred.
- Containerisation: Extensive knowledge of microservices architecture and container tooling, specifically Docker and Kubernetes/EKS.
- CI/CD & GitOps: Experience with CI/CD pipeline tools (GitHub Actions, Azure DevOps, GitLab, Jenkins) and, in particular, Argo CD.
- Observability: Expert-level knowledge of monitoring systems, particularly APM tools such as Datadog, New Relic, and Dynatrace, as well as Prometheus and Grafana.
- Development: Strong programming and scripting skills. Familiarity with Java/Spring Boot environments is a significant plus.
- Data Streaming: Familiarity with Kafka.
Operational Focus
- Experience managing incidents in a high-traffic, 24x7 public-facing production environment.
- Background working with highly available eCommerce platforms.
- Strong conceptual understanding of software architecture and systems thinking.