This job has expired

This position was posted on February 26, 2026 and is likely no longer accepting applications. We've kept it here for historical reference.

Senior Reliability Engineer - Technology

TrueLogic Company

Job Overview

Location: Santo Domingo

Job Type: Full-time

Full Job Description

📋 Description

• As a Senior Reliability Engineer at TrueLogic Company, you will be instrumental in ensuring the operational excellence and unwavering reliability of sophisticated distributed systems. Your primary focus will be on enhancing the existing infrastructure, specifically within AWS and Kubernetes environments, with a keen emphasis on observability, operational maturity, and the implementation of automated responses to system behaviors. This role is not about building infrastructure from the ground up, but rather about deeply understanding how services perform in production, proactively identifying potential issues, and developing automated solutions for scaling, recovery, and remediation leveraging established platforms and tools.
• You will be a critical partner to backend and platform engineering teams, collaborating to refine observability practices, establish clear reliability signals, and optimize how the platform responds to operational and performance challenges. Your contributions will directly impact the overall resilience, stability, and performance of our client's cutting-edge technology solutions.
• A core aspect of your role will involve designing, implementing, and continuously refining comprehensive observability strategies. This encompasses the meticulous management of metrics, logs, traces, alerts, and dashboards, ensuring that every facet of system behavior is captured and understood.
• You will dive deep into understanding system behavior within production environments, proactively identifying failure modes, pinpointing performance bottlenecks, and assessing potential reliability risks before they impact users.
• Your expertise will be crucial in evolving and maintaining shared AWS CDK and CDK8s constructs. The focus here is on enhancing observability, implementing robust autoscaling mechanisms, and embedding operational safeguards, rather than the initial provisioning of infrastructure.
• You will be responsible for the maintenance and operation of essential core platform components, including VPCs, EKS clusters, RDS databases, OpenSearch clusters, and MSK services, ensuring that these critical elements expose meaningful and actionable operational signals.
• Operating and enhancing Kubernetes cluster add-ons will be a key responsibility. This includes managing ingress controllers, cert-manager, autoscalers, and the entire suite of monitoring, logging, and tracing stacks.
• You will define and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and sophisticated alerting strategies. The goal is to create a clear distinction between system symptoms, root causes, and genuinely actionable operational events, minimizing alert fatigue and maximizing response effectiveness.
• A significant part of your role will be dedicated to improving automated operational responses. This includes refining autoscaling policies, developing and enhancing self-healing mechanisms, and implementing runbook-driven remediation processes to ensure swift and efficient issue resolution.
• You will champion high reliability through the implementation and maintenance of structured alerting systems, such as Prometheus and CloudWatch. This involves diligent noise reduction, continuous improvement of alert quality, and the development of robust recovery mechanisms.
• Close collaboration with engineering teams will be essential for investigating production incidents, conducting thorough root cause analyses, and driving long-term, systemic reliability improvements.
• You will own the CI/CD pipelines for Infrastructure as Code (IaC) and for observability-related platform components, ensuring efficient and reliable deployments.
• You will actively apply Site Reliability Engineering (SRE) principles, including observability-first design, the strategic use of error budgets, and a strong focus on operational readiness, to all shared platform services.
• Supporting IAM roles, implementing secure secrets management practices, and ensuring robust tenant isolation are also key aspects of this role, contributing to the overall security and integrity of the platform.

🎯 Requirements

• Minimum of 5 years of experience in Site Reliability Engineering, Platform Engineering, or similar infrastructure-focused roles, with a substantial track record of operating and supporting production systems.
• Proven expertise in observability operations, including the definition and implementation of metrics, logs, traces, dashboards, alerts, and reliability indicators for complex distributed systems.
• Hands-on experience with core AWS services such as VPC, IAM, RDS, MSK, S3, and CloudWatch, alongside proficiency in Kubernetes components like Helm, RBAC, and ServiceAccounts.
• Fluency in Python and practical experience with Infrastructure-as-Code using AWS CDK, CDK8s, or comparable frameworks.
• Strong understanding of Prometheus, Grafana, alert tuning, alert fatigue reduction strategies, and leveraging incident data to improve monitoring.
• Demonstrated experience in optimizing and improving existing systems rather than building entirely new greenfield infrastructure, with a clear focus on operational excellence and system reliability.
• A proven ability to utilize observability data to drive automation initiatives, inform scaling decisions, and implement significant operational improvements.
• Experience in designing reusable infrastructure or observability patterns, or contributing to internal developer or platform tooling.
• Nice-to-have: experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines.

🏖️ Benefits

• 100% Remote Work: Enjoy the flexibility and autonomy of working from any location that suits you best, requiring only a laptop and a stable internet connection.
• Highly Competitive USD Pay: Receive excellent, market-leading compensation in USD, exceeding typical industry offerings.
• Paid Time Off: Benefit from comprehensive paid time off policies designed to support your well-being and provide opportunities for rest and rejuvenation.
• Work with Autonomy: Manage your own schedule and focus on achieving results, with an emphasis on outcomes rather than strict adherence to a clock.
• Work with Top American Companies: Gain valuable experience and professional growth by collaborating on innovative, high-impact projects with leading U.S. companies.


About TrueLogic Company

TrueLogic Company is a digital marketing agency founded in 2001 in the Philippines. It provides SEO, PPC, social media marketing, web design and development services to local and international clients. The agency focuses on data-driven strategies to improve online visibility, traffic and conversions for businesses across retail, finance, healthcare and technology sectors. With offices in Makati and Cebu, it serves small to large enterprises seeking measurable digital growth.
