Job Summary
We are seeking a Site Reliability Engineer (SRE) with a passion for performance optimization to join our dynamic Infrastructure Team. In this role, you will be instrumental in designing, implementing, and maintaining scalable and resilient infrastructure solutions that align with our business objectives. You will leverage your expertise in cloud technologies and automation to enhance our infrastructure’s performance, resilience, and cost-effectiveness.
Job Responsibilities
- Collaborate with service engineering teams to design, implement, and maintain scalable and resilient infrastructure solutions optimized for performance, resilience, and cost.
- Ensure infrastructure aligns with business requirements and industry standards.
- Leverage Terraform to automate infrastructure provisioning and configuration.
- Implement Site Reliability Engineering (SRE) principles to improve system reliability and reduce downtime.
- Improve developer workflows by creating self-service tools, optimizing CI/CD pipelines, and enhancing deployment processes to remove friction.
- Develop and maintain robust monitoring and alerting systems to proactively identify and resolve issues.
- Lead incident response, manage on-call rotations, and facilitate post-incident reviews to drive continuous improvement and resilience.
- Automate everything: drive adoption of Infrastructure as Code (IaC) and build automated pipelines for testing, monitoring, and deployments.
- Use strong written and verbal communication skills to inform teams of upcoming changes and how those changes affect them.
Basic Qualifications
- Proven experience in building, scaling, and monitoring cloud infrastructure on AWS, especially EKS, S3, RDS, API Gateway, Load Balancers, VPC, Lambda, DocumentDB, and DynamoDB.
- Proven experience using Terraform to update and maintain cloud infrastructure.
- Proven experience with containerized applications, Kubernetes, and microservice deployments.
- Strong knowledge of GitHub Actions and CI/CD best practices.
- Experience with developer productivity tools: designing CI/CD workflows, building internal tools, and creating self-service solutions to streamline software development.
- Knowledge of monitoring and observability tools and frameworks; working knowledge of Datadog is a plus.
- Familiarity with networking concepts (DNS, load balancing, firewalls, VPNs).
- Strong collaboration skills with the ability to work effectively across teams and communicate technical ideas clearly.
- Experience writing and reading code in at least one industry-standard language, such as Java, Python, or TypeScript.
Target Start Date: ASAP
Engagement Length: 6 to 11 months, with the possibility of extension based on performance.
Time Zone: PST
Working Hours: Important: must overlap by 5 to 6 hours with PST.
Country Restrictions: Yes. Avoid Venezuela, Cuba, and Mexico due to laptop shipment considerations.
Holidays (Local / US): Local