
Job Overview
Location
LatAm
Job Type
Full-time
Category
Software Engineering
Date Posted
February 26, 2026
Full Job Description
📋 Description
- Are you a seasoned Site Reliability Engineer with a passion for making complex distributed systems not just run, but *thrive*? Do you excel at diving deep into production environments, understanding the intricate dance of metrics, logs, and traces to ensure peak performance and unwavering reliability? If so, Truelogic is seeking a Senior Reliability Engineer (AWS) with a sharp focus on Observability & Operations to join our dynamic technology team, working with a cutting-edge data-driven client.
- At Truelogic, we pride ourselves on being a premier nearshore staff augmentation service, connecting top-tier Latin American tech talent with innovative U.S. companies for over two decades. Our client, a leader in optimizing customer acquisition and retention for high-growth brands through data enrichment and audience targeting, is looking for an expert to enhance the operational resilience of their sophisticated platform. This is your chance to step into a role where your expertise in observability and operational maturity will directly impact the success of a company that partners with giants like Shopify, Experian, and TransUnion.
- This isn't a role focused on building infrastructure from the ground up. Instead, your mission will be to deeply understand how our client's services behave in production. You'll be the detective, identifying subtle performance bottlenecks, potential failure modes, and critical reliability risks. Your primary objective is to enhance the existing systems, ensuring they are not only stable but also intelligently responsive. You will work hand-in-hand with backend and platform engineering teams, acting as a crucial partner in evolving observability practices, meticulously defining reliability signals (SLIs/SLOs), and refining how the platform automatically scales, recovers, and remediates issues. Your contributions will be instrumental in driving overall system resilience and ensuring the platform's unwavering reliability.
- Your responsibilities will span the entire lifecycle of observability and operational excellence. You will design, implement, and continuously refine comprehensive observability strategies, encompassing metrics, logs, traces, alerts, and dashboards that provide clear, actionable insights into system health. A significant part of your role will involve maintaining and operating core AWS and Kubernetes platform components. This includes services like VPC, EKS clusters, RDS, OpenSearch, and MSK, ensuring each component exposes meaningful operational signals that facilitate proactive monitoring and rapid response. You'll also be responsible for operating and enhancing critical Kubernetes cluster add-ons, such as ingress controllers, cert-manager, autoscalers, and the entire monitoring, logging, and tracing stack.
- Defining and maintaining robust Service Level Indicators (SLIs), Service Level Objectives (SLOs), and alerting strategies will be a cornerstone of your work. You'll be adept at distinguishing between mere symptoms, underlying root causes, and truly actionable operational events, ensuring that alerts are precise and effective. Furthermore, you will significantly improve automated operational responses, developing and enhancing self-healing mechanisms, autoscaling capabilities, and runbook-driven remediation processes. Your efforts will directly contribute to reducing Mean Time To Recovery (MTTR) and ensuring high system reliability through meticulously tuned alerting systems (leveraging tools like Prometheus and CloudWatch), aggressive noise reduction, and continuous improvements in alert quality.
- Collaboration is key in this role. You will partner closely with diverse engineering teams to investigate production incidents, conduct thorough root cause analyses, and champion the implementation of long-term reliability improvements. You will own the CI/CD pipelines for your Infrastructure as Code (IaC) and observability-related platform components, ensuring a smooth and automated deployment process. Applying core Site Reliability Engineering (SRE) principles (such as observability-first design, error budgets, and a strong focus on operational readiness) to shared platform services will be fundamental to your approach. Additionally, you will support best practices in IAM roles, secrets management, and tenant isolation, ensuring a secure and well-governed environment.
- This role offers a unique opportunity to leverage your expertise in a high-impact environment, working with a client that is at the forefront of data-driven marketing technology. You will be instrumental in shaping the reliability and operational excellence of a platform that powers significant revenue growth for leading brands. If you are driven by a desire to build and maintain highly reliable, observable, and resilient systems, and you thrive in a collaborative, forward-thinking environment, we encourage you to apply.
Skills & Technologies
Python
AWS
Kubernetes
Kafka
Apache Spark
Senior
Remote
About TrueLogic Company
TrueLogic Company is a digital marketing agency founded in 2001 in the Philippines. It provides SEO, PPC, social media marketing, web design and development services to local and international clients. The agency focuses on data-driven strategies to improve online visibility, traffic and conversions for businesses across retail, finance, healthcare and technology sectors. With offices in Makati and Cebu, it serves small to large enterprises seeking measurable digital growth.