Senior Cloud Architect / Principal Cloud Platform Engineer

Overview

We are seeking a Senior Cloud Architect / Principal Cloud Platform Engineer with deep hands-on experience designing, debugging, and operating large-scale cloud-native systems. This role is not for someone who only understands high-level cloud concepts or relies on AI-assisted tooling to reason through problems. We need a real engineer who can analyze a complex environment, trace problems across infrastructure, Kubernetes, networking, ingress, application services, observability data, and client behavior, then identify root causes and recommend practical fixes.

The ideal candidate has experience working with large AWS environments, Kubernetes clusters, AWS CDK/IaC projects, complex ingress and networking patterns, NAT gateways, load balancers, DNS, monitoring pipelines, and high-volume test or production workloads. This person should be comfortable being dropped into an existing architecture, reverse-engineering how it works, identifying weaknesses, and improving reliability, scalability, security, and operability.

We are looking for a senior engineer who thinks from first principles, asks hard questions, and validates assumptions with evidence. This person should be comfortable challenging poor architecture decisions respectfully and backing recommendations with data. They should not require a perfectly documented environment to be effective. They should be able to inspect the system, build a mental model, test hypotheses, and explain findings clearly. This role is best suited for someone who has operated real systems under pressure, not someone who has only followed cloud tutorials or deployed standard reference architectures.

Responsibilities

Analyze and reverse-engineer complex AWS and Kubernetes environments, including VPCs, NAT gateways, route tables, load balancers, ingress controllers, DNS, IAM, security groups, CDK stacks, and deployed workloads.
Investigate scale, performance, and reliability problems across the full system path, including client behavior, ingress points, Kubernetes services, pods, cluster autoscaling, application services, databases, identity services, and external dependencies.
Trace traffic and system behavior across multiple inputs and outputs to identify where bottlenecks occur, including client-side timeouts, slow ingress handling, pod readiness delays, service saturation, networking constraints, NAT exhaustion, DNS problems, cluster resource pressure, and application-level failures.
Operate and debug Kubernetes platforms at a senior level, including deployments, services, ingress, config maps, secrets, probes, resource requests and limits, autoscaling behavior, pod scheduling, node pressure, CrashLoopBackOff conditions, failed readiness/liveness checks, service discovery, and cluster-level observability.
Evaluate whether current architecture decisions are appropriate, including large single-cluster designs, excessive NAT gateway patterns, overcomplicated ingress models, improperly scoped shared services, and cloud patterns that create unnecessary operational or compliance burden.
Use AWS-native monitoring and diagnostic tools to investigate and explain system behavior, including CloudWatch, CloudTrail, VPC Flow Logs, AWS Config, ALB/NLB metrics, EKS metrics, container logs, and application logs.
Work with AWS CDK and infrastructure-as-code projects to understand deployed resources, propose clean improvements, and help ensure cloud architecture is reproducible, maintainable, and aligned with operational needs.
Identify gaps in observability and propose practical monitoring, logging, alerting, and dashboard improvements that make future problems easier to detect and diagnose.
Collaborate with DevOps, security, application development, QA, and program leadership to explain technical findings clearly and recommend realistic remediation paths.
Produce clear written technical assessments, root-cause summaries, diagrams, implementation plans, and architectural recommendations.

Requirements

10+ years of hands-on infrastructure, platform engineering, cloud engineering, DevOps, or systems engineering experience.
5+ years of hands-on AWS experience in production or mission-critical environments.
Strong Kubernetes operations and debugging experience, preferably with EKS or a similar managed Kubernetes platform.
Deep understanding of AWS networking, including VPCs, subnets, route tables, NAT gateways, security groups, NACLs, load balancers, DNS, private endpoints, transit routing, and network troubleshooting.
Strong ability to troubleshoot complex distributed systems under load.
Experience diagnosing scale and performance issues using logs, metrics, traces, packet/flow data, cluster diagnostics, and application behavior.
Experience with AWS CDK, CloudFormation, Terraform, or similar infrastructure-as-code frameworks.
Strong Linux systems knowledge, including networking, process behavior, resource utilization, service diagnostics, and log analysis.
Experience with observability platforms and AWS monitoring tools, including CloudWatch metrics/logs, container logs, dashboards, alarms, and distributed tracing concepts.
Ability to read unfamiliar code, scripts, infrastructure definitions, manifests, and deployment pipelines to understand how a system actually works.
Ability to communicate technical findings to both senior engineers and non-technical leadership.

Preferred Qualifications

Experience with high-scale test environments, load testing, browser/client automation, synthetic test harnesses, or large concurrent-user simulations.
Experience debugging bottlenecks involving client performance, cluster scaling, application services, authentication/SSO, database access, and network egress.
Experience with AWS GovCloud, regulated cloud environments, CMMC, FedRAMP, IL4/IL5, or similar compliance-driven architectures.
Experience with multi-account AWS environments, shared services accounts, centralized logging, security tooling, and controlled network egress.
Experience with GitLab CI/CD, Kubernetes-based deployments, Helm, Kustomize, container registries, and runner infrastructure.
Experience evaluating whether proposed technologies create unnecessary operational risk, security burden, or compliance scope.
Experience with service mesh, ingress gateways, API gateways, reverse proxies, and internal/external traffic routing patterns.

Technical Skills

AWS: VPC, EC2, EKS, IAM, CloudWatch, CloudTrail, ALB/NLB, Route 53, NAT Gateway, VPC Flow Logs, AWS Config, S3, RDS, Secrets Manager, SSM, KMS.
Kubernetes: Pods, Deployments, Services, Ingress, Autoscaling, Probes, Resource Limits, Scheduling, Networking, DNS, Logs, Metrics, Cluster Events, Helm, kubectl.
Infrastructure as Code: AWS CDK, CloudFormation, Terraform, Git-based deployment workflows.
Debugging: log analysis, metric correlation, root-cause analysis, bottleneck isolation, latency tracing, failure reproduction, performance tuning.
Systems: Linux, networking fundamentals, DNS, TLS, HTTP, TCP/IP, containers, load balancing, process/resource diagnostics.
