[Remote] Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Seneca Resources is a client-driven provider of strategic information technology consulting services and workforce solutions to government and industry. They are seeking a Senior Cloud Engineer to design and develop cloud solutions and reliability tools for the Cloud Foundation Services platform, enhancing platform reliability across the Federal Reserve System.
Responsibilities
- Design, develop, and maintain reliability solutions and SRE utilities to reduce toil, improve cloud platform reliability, and industrialize SRE practices across the system
- Build and optimize Infrastructure as Code (IaC) using Terraform to manage AWS resources related to SRE solutions, incorporating cost-efficient design principles
- Develop CI/CD pipelines and automated testing to ensure code quality, reliability, and rapid delivery of the solutions
- Define SRE standards, best practices, and guidelines for adoption across teams; establish SRE metrics such as SLIs and SLOs
- Apply software engineering best practices, including version control, code reviews, test-driven development, and documentation, to all development work
- Participate in incident management and on-call rotation, providing technical support for SRE tools, troubleshooting production issues, and collaborating with teams to reduce incident recurrence through proactive detection and pattern analysis
- Stay current with emerging AWS services, SRE methodologies, and cloud-native development technologies, and drive adoption of innovative solutions
- Collaborate within Agile and Scaled Agile frameworks with cross-functional teams to deliver integrated cloud automation solutions
- Produce clear, blameless postmortems with actionable items and documented failure scenarios
Skills
- Bachelor's degree in Computer Science or Information Systems, or equivalent background or experience
- 7+ years of experience in software development with a focus on reliability and platform engineering
- 5+ years of advanced Python development with proven experience building enterprise-grade, highly available tools, APIs, and utilities
- 3+ years of hands-on experience developing solutions in AWS environments, with a deep understanding of core services (EC2, VPC, S3, Lambda, IAM, CloudFormation, EventBridge, Step Functions, etc.) and resource cost optimization
- 3+ years of experience applying SRE principles, including observability, toil automation, SLIs/SLOs, and reliability engineering
- Expert-level proficiency with Infrastructure as Code (IaC) using Terraform, including module development and state management
- Strong experience with CI/CD pipelines, automated testing frameworks, and DevOps practices
- Experience with observability tools and practices, including Grafana, AWS CloudWatch, and AWS Canary
- Experience defining, implementing, and managing SLOs/SLIs and error budgets; familiarity with conducting RCAs and producing postmortem documentation
- Working experience in Agile and Scaled Agile environments; familiarity with ITSM processes (incident, change, and problem management), resilience testing, and chaos engineering practices
- Experience with Go or additional programming languages is a plus
Company Overview