Senior Manager, Infrastructure Operations
The Senior Manager of Infrastructure Operations will lead and manage the Infrastructure Operations team and the engineers supporting colocation and AWS infrastructure and automation for applications across FFN as part of the migration and Cloud First journey. The Senior Manager will be a key contributor on stability, scalability, governance, and project decisions, and will participate in a cloud operations decision group, both collaboratively within Intel and as an externally regarded leader in the space.
This position requires a highly motivated, proactive leader with the leadership, collaboration, and communication skills necessary to forge a partnership with the senior leaders in the business units and within Technology. Additionally, this leader should bring broad knowledge of IT and Freedom’s business, along with key relationships across the company.
The Senior Manager will be a member of the IT leadership team and report directly to the SVP, Technology Operations.
The Senior Manager, Infrastructure Operations is responsible for:
- Lead the teams supporting cloud application infrastructure for various cloud initiatives at a large technology or fintech organization
- Lead a team of 10+ talented Network, Systems, Cloud Infrastructure and DevOps engineers
- Ensure team delivers with high quality and predictability
- Partner with the DevSecOps, Architecture, API, Delivery, and Security organizations while building highly scalable, secure AWS cloud infrastructure as code
- Partner closely with peer Engineering & Technology leaders to ensure we operate as a single team
- Proven leadership with ability to lead multiple teams in a fast-paced multi-disciplinary environment
- A willingness to mentor people inside/outside of the Information Technology department on best practices, system design principles, and computing industry trends
- Continuously manage, monitor, and update architecture models as business needs evolve and additional cloud services become available.
- Have managed production infrastructure sites for front and back-end services
- Good knowledge of Linux internals and administration
- Deep knowledge of infrastructure as code principles; knowledge of Terraform is a must
- Deep experience with AWS (Cloud Computing: EC2, S3, RDS, VPC, Security Groups, ELB...)
- Able to define actionable monitoring and alerting for systems
- On-call experience dealing with production incident management and resolution
- Cloud Expert: Well versed in AWS services for monitoring, logging, metrics, high availability, and automation
- Operationally Focused: Passionate about monitoring, resiliency, uptime, performance and automation
- Effective Communication: Excellent listener; proven collaborator with superiors, peers and staff
- Automation Driver: Constantly look for automation opportunities
- Curious: Hands-on, "roll up your sleeves" collaborative style of working
- Passionate: Bring energy and enthusiasm to the job and organization
- Achiever: Consistently attain/exceed individual and team goals
- Multitasker: Ability to juggle multiple work items
- Enjoy problem solving: Ability to find creative and reliable solutions to complex problems
- Define Service Level Objectives (SLOs) and perform the work required to ensure we meet them
- Knowledge of networking and monitoring
- Strong communication skills with an ability to relay incident details expeditiously, concisely, and accurately
- Proficient in leading remote online collaborative meetings while adhering to project management principles and documentation
- Strong organizational skills with extremely high level of attention to detail
- Highly motivated, quality-conscious self-starter who requires little to no supervision and is able to own tasks from start to finish
- Customer focused: investigates and resolves customer issues and inquiries, both emergency and non-emergency
- Identify, receive, triage and act upon events and incidents coming from various SaaS services
- Consistently meets or exceeds established Command Center key performance indicators (KPIs)
- Work per escalation, notification and incident practices
- Monitor the availability of the CI/CD environments
- Working under pressure in production environments running production customer workloads and services
- Previous knowledge of, or a strong desire to learn about, crisis management
- Primarily focus on 24x7x365 eyes-on-glass monitoring, alerting, requests, and troubleshooting to include:
  - Alert verification and validation of false positives in alignment with SOPs
  - Performing daily system monitoring, verifying the integrity and availability of cloud infrastructure, server resources, systems and key processes, reviewing system and application logs, and verifying completion of scheduled jobs such as backups, live data feeds, and batch processing
  - Managing internal and external access requests, including approvals and general user administration in alignment with user access control policy
  - Facilitating scheduled and ad-hoc requests including, but not limited to, application restarts and instance resizing
  - Sending internal and external communications for scheduled maintenance and high-priority major incidents
  - Triaging all support requests and performing preliminary investigation for all reported issues
  - Attempting to provide first-call resolution for all reported issues by researching documentation and knowledge base
  - Performing root cause analysis (RCA) and drafting customer-facing summary of events and preventative measures
- SVP, TechOps (and reports to the SVP) — daily
- IT Leadership - regularly
- Business unit and functional executives — regularly
- Outside vendors and technology leaders in other companies — regularly
- Bachelor's or Master's degree in computer science, information systems, business administration or related field.
- 10 or more years of experience in IT and business/industry
- Five to seven years of leadership responsibility in managing multiple, large, cross-functional teams or projects and influencing senior-level management and key stakeholders
- Proven experience in working with external service providers
- Demonstrated effective leadership, teamwork and influencing skills
- Very strong budgeting, planning, and financial management skills (prior experience in IT budgeting and forecasting)
- Exceptional project management skills, including the ability to effectively deploy resources and manage multiple projects of diverse scopes in a cross-functional environment
- Excellent oral and written communication skills, including the ability to explain technology solutions in business terms, establish rapport and persuade others
- Excellent interpersonal and communication skills (written, verbal, presentation, negotiation), including the ability to communicate effectively with people at different job levels within the organization.
- 5+ years of experience managing Cloud Operations and support teams
- Hands-on experience with typical project and system/customer support. This includes planning, coordinating, customer education and support, troubleshooting, problem resolution, product evaluation, and documentation. Additional required experience includes:
- Implementation, management, and administration of Enterprise systems tools and processes
- Granting SSH and RDP access
- Network configuration of Firewalls, VPN, Routers/Switches, and Load Balancers
- Troubleshooting and resolving individual customer issues with Windows, Mac, Linux, VPN, and permissions, and owning a wide variety of account administration tasks
- Following ITIL processes (Incident, Change and Problem Management)
- Experience with AWS Managed Services (EC2, DynamoDB, RDS, Lambda)
- Experience with AWS Networking & Security Groups and their underlying technologies (Route53, VPC, ALB, Security Groups)
- Experience in Linux environments (Ubuntu, Amazon Linux)
- Experience in Infrastructure as Code (Terraform, GitLab CI/CD)
- Knowledge of one or more programming/scripting languages (Python, Go, Bash)
- Knowledge of container platforms (Docker, Kubernetes, ECS)
- Knowledge of configuration management and automation tools (Puppet, Chef, Ansible, SaltStack)
- Knowledge of agile software development practices and release management
- Good teamwork skills and attention to detail